Batch Processing
Process large datasets efficiently with ChemAudit's batch validation system. Depending on the deployment profile, a single job can handle anywhere from a few hundred to one million molecules, with real-time progress updates over WebSocket.
Specifications
Batch processing capabilities vary by deployment profile:
| Feature | Range | Notes |
|---|---|---|
| Max Molecules | 1,000 - 1,000,000 | Configurable per profile |
| Max File Size | 100 MB - 1 GB | Depends on deployment |
| Supported Formats | SDF, CSV, TSV, TXT | Auto-detected |
| Progress Updates | Real-time | Via WebSocket |
| Worker Queues | Default + Priority | Separate queues for responsiveness |
Your deployment limits are available at /api/v1/config. The frontend displays these automatically.
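As a rough sketch, a script could read these limits before uploading and check a file against them. The document does not show the shape of the `/api/v1/config` response, so the keys `max_file_size_mb` and `max_molecules` below are hypothetical placeholders; inspect your deployment's actual response before relying on them.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8001"  # same host as the curl examples in this guide


def fetch_limits(base_url: str = BASE_URL) -> dict:
    """Return the JSON body of /api/v1/config as a dict."""
    with urlopen(f"{base_url}/api/v1/config", timeout=10) as resp:
        return json.load(resp)


def check_against_limits(file_size_bytes: int, molecule_count: int, limits: dict) -> list:
    """Return a list of problems; an empty list means the file fits.

    The keys "max_file_size_mb" and "max_molecules" are hypothetical:
    they are not taken from the ChemAudit documentation.
    """
    problems = []
    max_mb = limits.get("max_file_size_mb")
    max_mols = limits.get("max_molecules")
    if max_mb is not None and file_size_bytes > max_mb * 1024 * 1024:
        problems.append(f"file is larger than {max_mb} MB")
    if max_mols is not None and molecule_count > max_mols:
        problems.append(f"{molecule_count} molecules exceeds the limit of {max_mols}")
    return problems
```

In practice you would call `check_against_limits(os.path.getsize(path), count, fetch_limits())` and abort the upload if the returned list is not empty.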
Supported File Formats
SDF Files
Structure-Data Files are the preferred format for batch processing:
Aspirin
RDKit 2D
13 13 0 0 0 0 0 0 0 0999 V2000
...atom coordinates...
M END
$$$$
Caffeine
RDKit 2D
...
$$$$
SDF files preserve 2D/3D coordinates, can include properties, and are widely supported by chemistry software.
CSV/TSV Files
Delimited text files must have a column containing SMILES strings:
Name,SMILES,Activity,MW
Aspirin,CC(=O)Oc1ccccc1C(=O)O,Active,180.16
Caffeine,Cn1cnc2c1c(=O)n(c(=O)n2C)C,Active,194.19
Ethanol,CCO,Inactive,46.07
CSV Requirements:
- Must have a header row
- SMILES column is required
- Optional name/ID column
- UTF-8 encoding recommended
Supported SMILES column names:
- SMILES, smiles, Smiles
- CANONICAL_SMILES, canonical_smiles
- Or select manually during upload
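To check a file before uploading, the header row can be matched against the column names listed above. This is a minimal sketch of such a client-side check, not ChemAudit's own detection logic, and `find_smiles_column` is a hypothetical helper name.

```python
import csv
import io
from typing import Optional

# Header names listed above as recognized SMILES columns.
KNOWN_SMILES_COLUMNS = ("SMILES", "smiles", "Smiles", "CANONICAL_SMILES", "canonical_smiles")


def find_smiles_column(text: str, delimiter: str = ",") -> Optional[str]:
    """Return the first header that matches a known SMILES column name, or None."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)  # the first row must be the header row
    if header is None:
        return None
    for column in header:
        if column.strip() in KNOWN_SMILES_COLUMNS:
            return column
    return None
```

For TSV input, pass `delimiter="\t"`. If the function returns `None`, the column has to be selected manually during upload.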
How to Process Batch Files
Web Interface
- Navigate to the Batch Processing page
- Drag and drop your file or click to browse
- For CSV/TSV: Select the SMILES column and optional Name column
- Configure options:
- Extended safety filters (NIH, ZINC)
- ChEMBL alerts
- Standardization pipeline
- Scoring profile (expand the Scoring Profile sidebar to select a preset or custom profile)
- Click "Upload and Process"
- Monitor real-time progress via WebSocket
- View results with sorting, filtering, and pagination
- Export results in your preferred format
ChemAudit automatically suggests likely SMILES and Name columns based on content analysis.
API
Upload and Start Processing
# Upload SDF file
curl -X POST http://localhost:8001/api/v1/batch/upload \
  -F "file=@molecules.sdf"

# Upload CSV with column selection
curl -X POST http://localhost:8001/api/v1/batch/upload \
  -F "file=@molecules.csv" \
  -F "smiles_column=SMILES" \
  -F "name_column=Name" \
  -F "include_extended_safety=true"
Response:
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "total_molecules": 10000,
  "message": "Job submitted. Processing 10000 molecules."
}
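The same upload can be sketched in Python using only the standard library. The endpoint and form field names come from the curl examples above; `encode_multipart` and `upload_batch` are hypothetical helper names, and the sketch reads the whole file into memory, so it is only suitable for files well below the deployment's size limit.

```python
import json
import os
import uuid
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8001"  # host used by the curl examples


def encode_multipart(fields: dict, filename: str, file_bytes: bytes) -> tuple:
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    chunks.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
    )
    chunks.append(file_bytes)
    chunks.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def upload_batch(path: str, **options) -> dict:
    """POST a file to /api/v1/batch/upload and return the parsed JSON reply."""
    with open(path, "rb") as fh:
        file_bytes = fh.read()  # reads the whole file into memory
    # Send booleans as lowercase "true"/"false", as in the curl example.
    fields = {k: str(v).lower() if isinstance(v, bool) else str(v)
              for k, v in options.items()}
    body, content_type = encode_multipart(fields, os.path.basename(path), file_bytes)
    request = Request(f"{BASE_URL}/api/v1/batch/upload", data=body,
                      headers={"Content-Type": content_type}, method="POST")
    with urlopen(request, timeout=60) as resp:
        return json.load(resp)
```

A call such as `upload_batch("molecules.csv", smiles_column="SMILES", name_column="Name", include_extended_safety=True)` mirrors the second curl command.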
Check Progress
curl http://localhost:8001/api/v1/batch/{job_id}/status
Response:
{
  "job_id": "550e8400-...",
  "status": "processing",
  "progress": 45.5,
  "processed": 455,
  "total": 1000,
  "eta_seconds": 68
}
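When WebSocket is not an option, the status endpoint can be polled until the job finishes. The sketch below assumes the endpoint reports the same terminal `status` values as the WebSocket messages (`complete`, `failed`, `cancelled`); `wait_for_job` is a hypothetical helper, and the polling interval is an arbitrary choice.

```python
import json
import time
from urllib.request import urlopen

BASE_URL = "http://localhost:8001"  # host used by the curl examples
# Assumed terminal states, taken from the WebSocket status values in this guide.
TERMINAL_STATUSES = {"complete", "failed", "cancelled"}


def get_status(job_id: str) -> dict:
    """Fetch the current status document for a batch job."""
    with urlopen(f"{BASE_URL}/api/v1/batch/{job_id}/status", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(job_id: str, fetch=get_status, interval: float = 2.0,
                 max_polls: int = 1000, sleep=time.sleep) -> dict:
    """Poll until the job reaches a terminal status and return the last reply.

    `fetch` and `sleep` are injectable so the loop can be tested offline.
    """
    for _ in range(max_polls):
        status = fetch(job_id)
        if status.get("status") in TERMINAL_STATUSES:
            return status
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `eta_seconds` field could be used to pick a smarter interval, but a fixed delay keeps the sketch simple.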
Get Results
# Get paginated results
curl "http://localhost:8001/api/v1/batch/{job_id}?page=1&page_size=50"
# Get statistics only
curl http://localhost:8001/api/v1/batch/{job_id}/stats
Processing Options
| Option | Description | Default |
|---|---|---|
| include_extended_safety | Screen against NIH and ZINC filters | false |
| include_chembl_alerts | Screen against ChEMBL pharma filters | false |
| include_standardization | Run ChEMBL standardization pipeline | false |
| profile_id | Apply a scoring profile to score each molecule | None |
| notification_email | Email address for completion notification | None |
Enabling all options increases processing time. For large batches, consider running with basic options first, then re-process specific molecules if needed.
Filtering and Sorting Results
Filter and sort results to focus on molecules of interest:
Available Filters
| Filter | Type | Description |
|---|---|---|
| Status | success or error | Processing outcome |
| Min/Max Score | 0-100 | Validation score range |
| Sort By | Various | See sort fields below |
Sort Fields
- index: Original file order
- name: Molecule name (if provided)
- smiles: SMILES string alphabetically
- score: Validation score
- qed: QED drug-likeness score
- safety: Safety filter score
- status: Success/error status
- issues: Number of validation issues
- profile_score: Profile desirability score (when a profile is applied)
Sort Direction
- asc: Ascending order
- desc: Descending order
Example:
# Get molecules with score >= 80, sorted by QED descending
curl "http://localhost:8001/api/v1/batch/{job_id}?min_score=80&sort_by=qed&sort_dir=desc"
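For scripted access, the query string can be assembled programmatically. This sketch uses only the parameters that appear in the curl examples of this guide (`page`, `page_size`, `min_score`, `sort_by`, `sort_dir`) and validates the sort options against the lists above; `results_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8001"  # host used by the curl examples
SORT_FIELDS = {"index", "name", "smiles", "score", "qed", "safety",
               "status", "issues", "profile_score"}
SORT_DIRECTIONS = {"asc", "desc"}


def results_url(job_id: str, *, page: int = 1, page_size: int = 50,
                min_score=None, sort_by=None, sort_dir=None,
                base_url: str = BASE_URL) -> str:
    """Build the paginated results URL, skipping parameters left as None."""
    if sort_by is not None and sort_by not in SORT_FIELDS:
        raise ValueError(f"unknown sort field: {sort_by}")
    if sort_dir is not None and sort_dir not in SORT_DIRECTIONS:
        raise ValueError(f"unknown sort direction: {sort_dir}")
    params = {"page": page, "page_size": page_size, "min_score": min_score,
              "sort_by": sort_by, "sort_dir": sort_dir}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}/api/v1/batch/{job_id}?{query}"
```

Calling `results_url(job_id, min_score=80, sort_by="qed", sort_dir="desc")` reproduces the curl example with explicit pagination added.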
Real-Time Progress Updates
ChemAudit provides real-time progress via WebSocket:
JavaScript Example
const ws = new WebSocket('ws://localhost:8001/ws/batch/' + jobId);

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(`Progress: ${data.progress}% (${data.processed}/${data.total})`);
  console.log(`ETA: ${data.eta_seconds} seconds`);
};

// Send keep-alive pings every 30 seconds, but only while the socket is open
const keepAlive = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) ws.send('ping');
}, 30000);

ws.onclose = () => {
  clearInterval(keepAlive); // stop pinging once the socket has closed
  console.log('Job complete or connection closed');
};
Message Format
{
  "job_id": "550e8400-...",
  "status": "processing",
  "progress": 45.5,
  "processed": 455,
  "total": 1000,
  "eta_seconds": 68
}
Status values:
- processing: Job in progress
- complete: Job finished successfully
- failed: Job encountered fatal error
- cancelled: Job was cancelled
For Python integration, see the WebSocket API documentation.
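Whichever WebSocket client library is used, each incoming frame still has to be decoded and checked for a terminal status. The following is a minimal sketch of that step, based on the message format shown above; `ProgressMessage` and `parse_progress` are hypothetical names, and treating missing numeric fields as zero is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional

TERMINAL_STATUSES = frozenset({"complete", "failed", "cancelled"})


@dataclass(frozen=True)
class ProgressMessage:
    job_id: str
    status: str
    progress: float
    processed: int
    total: int
    eta_seconds: Optional[float]

    @property
    def is_terminal(self) -> bool:
        """True once the job is complete, failed, or cancelled."""
        return self.status in TERMINAL_STATUSES


def parse_progress(raw: str) -> ProgressMessage:
    """Decode one JSON progress message into a typed record."""
    data = json.loads(raw)
    return ProgressMessage(
        job_id=data["job_id"],
        status=data["status"],
        progress=float(data.get("progress", 0.0)),  # assumed default
        processed=int(data.get("processed", 0)),    # assumed default
        total=int(data.get("total", 0)),            # assumed default
        eta_seconds=data.get("eta_seconds"),
    )
```

A receive loop would call `parse_progress` on each frame and stop reading once `is_terminal` is true.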
Understanding Results
Statistics Summary
Each batch job includes aggregate statistics:
- Total molecules: Count processed
- Successful/Errors: Success rate
- Average scores: Validation, ML-readiness, QED, SA (synthetic accessibility)
- Pass rates: Lipinski, safety filters
- Score distribution: Histogram of validation scores
- Alert summary: Count by catalog (PAINS, BRENK, etc.)
- Issue summary: Count by check type
- Processing time: Total duration
Individual Results
Each molecule result includes:
- SMILES: Canonical SMILES
- Name: Molecule name (if provided)
- Index: Original file position
- Status: success or error
- Validation: All check results and overall score
- Alerts: Matched structural alerts (if screening enabled)
- Scoring: ML-readiness, drug-likeness, ADMET scores
- Standardization: Cleanup results (if enabled)
Handling Partial Failures
ChemAudit gracefully handles molecules that fail to parse or validate:
- Failed molecules are marked with status: "error"
- Error messages explain the failure reason
- Processing continues for remaining molecules
- Statistics show success/failure breakdown
- Filter by status to review only errors
Export error molecules separately, fix them manually, and re-upload. The batch index helps track which molecules failed.
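The export-and-fix workflow could be scripted roughly as follows. The sketch takes a list of result records and writes the failed ones to a CSV that is ready for editing and re-upload. The JSON keys `index`, `name`, `smiles`, and `error` are assumptions about the result payload; only `status: "error"` is confirmed by this guide.

```python
import csv
import io


def errors_to_csv(results: list) -> str:
    """Return CSV text listing failed molecules in original file order.

    Keys "index", "name", "smiles" and "error" are assumed field names.
    """
    failed = sorted(
        (r for r in results if r.get("status") == "error"),
        key=lambda r: r.get("index", 0),
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Header row with a SMILES column, as required for CSV uploads.
    writer.writerow(["Index", "Name", "SMILES", "Error"])
    for r in failed:
        writer.writerow([r.get("index", ""), r.get("name") or "",
                         r.get("smiles") or "", r.get("error") or ""])
    return buffer.getvalue()
```

Keeping the original index in the first column makes it easy to match corrected molecules back to their position in the source file.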
Performance Tips
- Use SDF when possible: Faster parsing than CSV
- Split very large files: Process in chunks if near size limits
- Monitor worker utilization: Scale workers for better throughput
- Enable caching: Results are cached by InChIKey for 1 hour
- Use filters wisely: Basic validation is fastest; add options as needed
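Splitting an SDF file is straightforward because every record ends with a line containing only `$$$$`. The sketch below groups records into chunks of a fixed molecule count; `chunk_sdf` is a hypothetical helper, and the chunk size should be chosen from your own deployment limits.

```python
from typing import Iterable, Iterator


def iter_sdf_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one SDF record at a time, including its $$$$ terminator line."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield "".join(record)
            record = []
    # Keep a trailing record even if its terminator is missing.
    if any(line.strip() for line in record):
        yield "".join(record)


def chunk_sdf(lines: Iterable[str], molecules_per_chunk: int) -> Iterator[str]:
    """Yield SDF text blocks holding at most `molecules_per_chunk` records."""
    if molecules_per_chunk < 1:
        raise ValueError("molecules_per_chunk must be at least 1")
    batch = []
    for record in iter_sdf_records(lines):
        batch.append(record)
        if len(batch) == molecules_per_chunk:
            yield "".join(batch)
            batch = []
    if batch:
        yield "".join(batch)
```

Because both functions are generators, an open file object can be passed directly (`chunk_sdf(open("molecules.sdf"), 50000)`) and each chunk written to its own file without loading the full dataset into memory.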
Analytics
After batch processing completes, automatic analytics run immediately (deduplication and statistics). On-demand analytics are available for deeper exploration:
- Scaffold analysis — Murcko scaffold extraction, diversity metrics
- Chemical space mapping — PCA or t-SNE projections
- Matched molecular pairs — BRICS fragmentation, activity cliffs
- Interactive visualizations — Score histograms, property scatter plots, treemaps, chemical space scatter with linked brushing
See Batch Analytics for full details.
Subset Actions
Select molecules from the results table to perform targeted operations:
- Re-validate — Create a new job from selected molecules
- Re-score — Apply a different scoring profile to the selection
- Export subset — Export only selected molecules
- Compare — Side-by-side comparison of 1–2 molecules with property radar overlay
See Subset Actions & Sharing for full details.
Sharing & Notifications
- Permalinks: Share batch results via short-lived URLs (30-day expiry)
- Email notifications: Receive completion emails by providing notification_email during upload
- Webhooks: Integrate with pipelines via HMAC-signed HTTP callbacks
See Subset Actions & Sharing for configuration details.
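On the receiving side, an HMAC-signed callback is typically verified by recomputing the signature over the raw request body and comparing it in constant time. The sketch below assumes HMAC-SHA256 with a hex-encoded signature; the actual algorithm, encoding, and header name used by ChemAudit are not stated here and should be taken from the configuration documentation.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature (assumed scheme) for a request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Return True if the received signature matches the recomputed one."""
    expected = sign_payload(secret, body)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_signature.encode("utf-8"))
```

Verification has to run on the exact bytes received, before any JSON parsing or re-serialization, since a change in whitespace or key order would alter the signature.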
Next Steps
- Batch Analytics — Interactive analytics and visualizations
- Subset Actions — Work with molecule selections
- Exporting Results — Export in multiple formats
- Scoring Profiles — Custom property scoring for batches
- Structural Alerts — Understanding alert screening
- WebSocket API — Integrate real-time progress