QSAR-Ready Pipeline
The QSAR-Ready Pipeline prepares chemical datasets for machine learning through a standardized, multi-step curation workflow. It ensures that molecules are clean, canonical, and deduplicated before they are used in QSAR/QSPR modeling.
Pipeline Steps
The pipeline applies the following steps in order:
| Step | Description | Details |
|---|---|---|
| Standardization | ChEMBL-compatible structure normalization | Fixes nitro groups, metal disconnection, charge normalization |
| Salt Stripping | Remove counterions and salts | Extracts the parent molecule using MolVS fragment patterns |
| Neutralization | Neutralize charged species | Converts charged forms to neutral where chemically appropriate |
| Tautomer Canonicalization | Canonicalize tautomeric forms | Uses RDKit's tautomer enumerator for a canonical representation |
| Duplicate Removal | Remove duplicates by InChIKey | Identifies and flags duplicate structures after standardization |
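The duplicate-removal step in the table above can be sketched in plain Python once every molecule has a standardized InChIKey. This is a minimal illustration rather than the pipeline's own code: the `flag_duplicates` helper is hypothetical, and keeping the first occurrence as `ok` is an assumption, since the documentation does not say which copy is retained.

```python
from typing import Dict, List


def flag_duplicates(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Mark repeats of a standardized InChIKey as 'duplicate'.

    Assumption: the first molecule seen with a given key keeps its status.
    Records that are not 'ok' (rejected, error) are passed through unchanged.
    """
    seen = set()
    flagged = []
    for record in records:
        out = dict(record)  # copy so the caller's data is left untouched
        if out.get("status") == "ok":
            key = out["standardized_inchikey"]
            if key in seen:
                out["status"] = "duplicate"
            else:
                seen.add(key)
        flagged.append(out)
    return flagged


# Aspirin entered twice (sodium salt and free acid) plus ethanol.
# After curation both aspirin entries share one standardized InChIKey.
records = [
    {"smiles": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]",
     "standardized_inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "status": "ok"},
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O",
     "standardized_inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "status": "ok"},
    {"smiles": "CCO",
     "standardized_inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "status": "ok"},
]
statuses = [r["status"] for r in flag_duplicates(records)]
print(statuses)
```

Comparing keys only after standardization is what lets the salt and the free acid collapse into a single entry.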
Each molecule receives a result status indicating its outcome.
Result Status
| Status | Meaning |
|---|---|
| ok | Successfully curated — molecule passed all pipeline steps |
| rejected | Failed a pipeline step (see rejection_reason for details) |
| duplicate | Duplicate of another molecule by standardized InChIKey |
| error | Processing error (e.g., unparseable SMILES) |
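As a rough sketch of how these four statuses might be tallied on the client side, the snippet below counts them over a list of per-molecule results. The `summarize` helper and the sample records are illustrative assumptions; only the `status` and `rejection_reason` field names come from this page.

```python
from collections import Counter
from typing import Dict, Iterable

# The four statuses documented in the table above.
STATUSES = ("ok", "rejected", "duplicate", "error")


def summarize(results: Iterable[Dict]) -> Dict[str, int]:
    """Count results per status, reporting zero for statuses that never occur."""
    counts = Counter(r["status"] for r in results)
    return {status: counts.get(status, 0) for status in STATUSES}


# Hypothetical results; the rejection reason text is made up for illustration.
sample = [
    {"status": "ok", "rejection_reason": None},
    {"status": "ok", "rejection_reason": None},
    {"status": "duplicate", "rejection_reason": None},
    {"status": "rejected", "rejection_reason": "example reason"},
]
summary = summarize(sample)
print(summary)
```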
Using the Web Interface
- Navigate to QSAR-Ready under the Data Preparation dropdown in the header
- Choose your input method:
  - Paste SMILES: Enter SMILES strings, one per line
  - Upload file: Drag and drop a CSV or SDF file
- Configure pipeline options if needed
- Click Process
- Monitor real-time progress
- Review results:
  - Summary statistics (ok / rejected / duplicate / error counts)
  - Per-molecule details with step-by-step provenance
  - InChIKey change tracking (original vs. standardized)
- Download the curated dataset in CSV, SDF, or JSON format
Batch Export
Use the download buttons to export only the successfully curated molecules (ok status) for direct use in ML pipelines.
API Reference
Single Molecule
Process one molecule through the pipeline:
curl -X POST http://localhost:8001/api/v1/qsar-ready/single \
-H "Content-Type: application/json" \
-d '{
"smiles": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]",
"config": {}
}'
Response:
{
  "original_smiles": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]",
  "original_inchikey": "XAKUHBCMIFQRLG-UHFFFAOYSA-M",
  "curated_smiles": "CC(=O)Oc1ccccc1C(=O)O",
  "standardized_inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
  "inchikey_changed": true,
  "status": "ok",
  "rejection_reason": null,
  "steps": [
    {
      "step_name": "standardization",
      "applied": true,
      "changes": ["Applied standardization rules"]
    },
    {
      "step_name": "salt_stripping",
      "applied": true,
      "changes": ["Removed fragment: [Na+]"]
    }
  ]
}
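The response above carries both the outcome and the per-step provenance. The following sketch shows one way to turn it into a short human-readable report. The `describe` helper is hypothetical; the field names are taken from the example response, and the snippet assumes every response has the same shape.

```python
import json
from typing import Dict, List

# The example response from this section, as it would arrive over the wire.
response_text = """
{
  "original_smiles": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]",
  "original_inchikey": "XAKUHBCMIFQRLG-UHFFFAOYSA-M",
  "curated_smiles": "CC(=O)Oc1ccccc1C(=O)O",
  "standardized_inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
  "inchikey_changed": true,
  "status": "ok",
  "rejection_reason": null,
  "steps": [
    {"step_name": "standardization", "applied": true,
     "changes": ["Applied standardization rules"]},
    {"step_name": "salt_stripping", "applied": true,
     "changes": ["Removed fragment: [Na+]"]}
  ]
}
"""


def describe(result: Dict) -> List[str]:
    """Build one line for the outcome, one for the key change, one per applied step."""
    lines = [
        f"{result['original_smiles']} -> {result['curated_smiles']} [{result['status']}]"
    ]
    if result["inchikey_changed"]:
        lines.append(
            f"InChIKey: {result['original_inchikey']} -> {result['standardized_inchikey']}"
        )
    for step in result["steps"]:
        if step["applied"]:
            lines.append(f"{step['step_name']}: {'; '.join(step['changes'])}")
    return lines


report = describe(json.loads(response_text))
print("\n".join(report))
```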
Batch Upload
# Upload a file
curl -X POST http://localhost:8001/api/v1/qsar-ready/batch/upload \
-F "file=@molecules.csv" \
-F 'config={}'
# Or paste SMILES text
curl -X POST http://localhost:8001/api/v1/qsar-ready/batch/upload \
-F "smiles_text=CCO
c1ccccc1
CC(=O)Oc1ccccc1C(=O)O"
Check Status
# Replace {job_id} with the job_id returned by the batch upload call
curl http://localhost:8001/api/v1/qsar-ready/batch/{job_id}/status
Get Results
curl "http://localhost:8001/api/v1/qsar-ready/batch/{job_id}/results?page=1&per_page=50"
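The status and results endpoints are typically used together: poll until the job finishes, then walk the result pages. The sketch below illustrates that flow under several assumptions that this page does not confirm: that the status payload has a `status` field whose terminal values are `completed` or `failed`, that the results payload holds its items in a `results` list, and that a page shorter than `per_page` is the last one. The `wait_for_job` and `iter_results` helpers are hypothetical, and HTTP access is passed in as a `get_json` callable so the logic can be exercised without a running server.

```python
import time
from typing import Callable, Dict, Iterator

BASE_URL = "http://localhost:8001/api/v1/qsar-ready"


def wait_for_job(
    job_id: str,
    get_json: Callable[[str], Dict],
    poll_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 300,
) -> Dict:
    """Poll the status endpoint until an assumed terminal state is reported."""
    url = f"{BASE_URL}/batch/{job_id}/status"
    for _ in range(max_polls):
        payload = get_json(url)
        # Assumption: 'completed' and 'failed' are the terminal status values.
        if payload.get("status") in ("completed", "failed"):
            return payload
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


def iter_results(
    job_id: str, get_json: Callable[[str], Dict], per_page: int = 50
) -> Iterator[Dict]:
    """Yield per-molecule results page by page, using page and per_page."""
    page = 1
    while True:
        url = f"{BASE_URL}/batch/{job_id}/results?page={page}&per_page={per_page}"
        # Assumption: items are returned under a 'results' key.
        items = get_json(url).get("results", [])
        yield from items
        if len(items) < per_page:  # a short page is treated as the last one
            return
        page += 1
```

With the requests library, `get_json` could be `lambda url: requests.get(url).json()`. A two-second polling interval stays well under the 60 req/min limit listed for the status endpoint.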
Download Curated Dataset
# CSV format
curl http://localhost:8001/api/v1/qsar-ready/batch/{job_id}/download/csv -o curated.csv
# SDF format
curl http://localhost:8001/api/v1/qsar-ready/batch/{job_id}/download/sdf -o curated.sdf
# JSON format
curl http://localhost:8001/api/v1/qsar-ready/batch/{job_id}/download/json -o curated.json
The single-molecule and batch-upload calls can also be made from Python with the requests library:
import requests

# Single molecule
response = requests.post(
    "http://localhost:8001/api/v1/qsar-ready/single",
    json={"smiles": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]", "config": {}},
)
response.raise_for_status()  # stop early on HTTP errors
result = response.json()
print(f"Status: {result['status']}, Curated: {result['curated_smiles']}")

# Batch upload
with open("molecules.csv", "rb") as f:
    response = requests.post(
        "http://localhost:8001/api/v1/qsar-ready/batch/upload",
        files={"file": f},
        data={"config": "{}"},
    )
response.raise_for_status()
job_id = response.json()["job_id"]
Rate Limits
| Endpoint | Limit |
|---|---|
| POST /qsar-ready/single | 30 req/min |
| POST /qsar-ready/batch/upload | 3 req/min |
| GET /qsar-ready/batch/*/status | 60 req/min |
| GET /qsar-ready/batch/*/results | 30 req/min |
| GET /qsar-ready/batch/*/download/* | 10 req/min |
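A client can stay under these limits by spacing its own requests. The sketch below is one simple way to do that, assuming an even spacing of 60 / limit seconds between calls is acceptable. The `Throttle` class is hypothetical, and how the server actually counts requests within a minute is not specified here, so even spacing is a conservative choice rather than a description of the server's behavior.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls, derived from a per-minute limit."""

    def __init__(
        self,
        max_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call

    def wait(self) -> None:
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# One throttle per endpoint, using the limits from the table above.
single_throttle = Throttle(max_per_minute=30)  # POST /qsar-ready/single
upload_throttle = Throttle(max_per_minute=3)   # POST /qsar-ready/batch/upload
```

Calling `single_throttle.wait()` immediately before each single-molecule request keeps consecutive requests at least two seconds apart.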
Use Cases
ML Dataset Preparation
- Upload your raw compound collection
- The pipeline standardizes, deduplicates, and cleans all structures
- Download the curated set and use it directly for model training
Compound Registration
- Process incoming compounds through the pipeline
- Use the standardized InChIKey for unique registration
- Flag duplicates against your existing collection
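The registration workflow above amounts to a lookup keyed on the standardized InChIKey. The sketch below illustrates it with an in-memory dictionary standing in for the existing collection. The `register` helper, the compound IDs, and the registry layout are all hypothetical; a real registration system would use its own database.

```python
from typing import Dict, List, Tuple


def register(
    incoming: List[Tuple[str, str]], registry: Dict[str, str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Register new compounds and report those already in the collection.

    incoming: (compound_id, standardized_inchikey) pairs for curated molecules.
    registry: maps standardized InChIKey to the registered compound ID;
              it is updated in place with the newly registered compounds.
    Returns (new_ids, duplicates), where duplicates pairs each incoming ID
    with the ID already registered under the same key.
    """
    new_ids: List[str] = []
    duplicates: List[Tuple[str, str]] = []
    for compound_id, key in incoming:
        if key in registry:
            duplicates.append((compound_id, registry[key]))
        else:
            registry[key] = compound_id
            new_ids.append(compound_id)
    return new_ids, duplicates


# Existing collection already holds aspirin under a made-up ID.
registry = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": "CMPD-0001"}
incoming = [
    ("NEW-A", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),  # aspirin again
    ("NEW-B", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),  # ethanol
]
new_ids, duplicates = register(incoming, registry)
print(new_ids, duplicates)
```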
Next Steps
- Structure Filter — Apply property and substructure filters to curated molecules
- Batch Processing — Full validation and scoring for large datasets
- Standardization — Learn about the underlying standardization pipeline