Batch Processing

Process large datasets efficiently with ChemAudit's batch validation system. A single job can handle anywhere from hundreds of molecules up to one million, depending on your deployment profile, with real-time progress updates over WebSocket.

Specifications

Batch processing capabilities vary by deployment profile:

Feature	Range	Notes
Max Molecules	1,000 - 1,000,000	Configurable per profile
Max File Size	100 MB - 1 GB	Depends on deployment
Supported Formats	SDF, CSV, TSV, TXT	Auto-detected
Progress Updates	Real-time	Via WebSocket
Worker Queues	Default + Priority	Separate queues for responsiveness

Check Your Limits

Your deployment limits are available at /api/v1/config. The frontend displays these automatically.

Supported File Formats

SDF Files

Structure-Data Files are the preferred format for batch processing:

Aspirin
  RDKit          2D

 13 13  0  0  0  0  0  0  0  0999 V2000
   ...atom coordinates...
M  END
$$$$
Caffeine
  RDKit          2D
   ...
$$$$

SDF Benefits

SDF files preserve 2D/3D coordinates, can include properties, and are widely supported by chemistry software.
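
Before uploading, it can be useful to count the records in an SDF file and compare the number against your deployment's molecule limit. The following is a minimal sketch, assuming every record ends with a `$$$$` terminator line as in the example above; `count_sdf_records` is a hypothetical helper name, not part of ChemAudit.

```python
from typing import Iterable


def count_sdf_records(lines: Iterable[str]) -> int:
    """Count records in SDF text by counting '$$$$' terminator lines."""
    count = 0
    for line in lines:
        # Each SDF record ends with a line containing only '$$$$'.
        if line.strip() == "$$$$":
            count += 1
    return count


sample = """Aspirin
  RDKit          2D

M  END
$$$$
Caffeine
  RDKit          2D

M  END
$$$$
"""
print(count_sdf_records(sample.splitlines()))
```

For a file on disk, pass the open file object directly, since iterating it yields lines. A trailing record that lacks its `$$$$` terminator is not counted by this sketch.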

CSV/TSV Files

Delimited text files must have a column containing SMILES strings:

Name,SMILES,Activity,MW
Aspirin,CC(=O)Oc1ccccc1C(=O)O,Active,180.16
Caffeine,Cn1cnc2c1c(=O)n(c(=O)n2C)C,Active,194.19
Ethanol,CCO,Inactive,46.07

CSV Requirements:

Must have a header row
SMILES column is required
Optional name/ID column
UTF-8 encoding recommended

Supported SMILES column names:

SMILES, smiles, Smiles
CANONICAL_SMILES, canonical_smiles
Or select manually during upload
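
A client-side pre-check of the header row can catch a missing SMILES column before upload. The sketch below only matches the header names listed above; it is not ChemAudit's own column detection, and `find_smiles_column` is a hypothetical helper name.

```python
import csv
import io
from typing import Optional

# Header names listed above as recognized SMILES columns.
SMILES_COLUMN_NAMES = {
    "SMILES", "smiles", "Smiles", "CANONICAL_SMILES", "canonical_smiles",
}


def find_smiles_column(csv_text: str, delimiter: str = ",") -> Optional[str]:
    """Return the first header that matches a recognized SMILES column name."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])  # empty input yields an empty header
    for name in header:
        if name.strip() in SMILES_COLUMN_NAMES:
            return name.strip()
    return None


sample = "Name,SMILES,Activity,MW\nAspirin,CC(=O)Oc1ccccc1C(=O)O,Active,180.16\n"
print(find_smiles_column(sample))
```

When the function returns `None`, the column has to be chosen manually during upload (or passed as `smiles_column` through the API).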

How to Process Batch Files

Web Interface

1. Navigate to the Batch Processing page
2. Drag and drop your file or click to browse
3. For CSV/TSV: Select the SMILES column and optional Name column
4. Configure options:
   - Extended safety filters (NIH, ZINC)
   - ChEMBL alerts
   - Standardization pipeline
   - Scoring profile (expand the Scoring Profile sidebar to select a preset or custom profile)
5. Click "Upload and Process"
6. Monitor real-time progress via WebSocket
7. View results with sorting, filtering, and pagination
8. Export results in your preferred format

Column Detection

ChemAudit automatically suggests likely SMILES and Name columns based on content analysis.

API

Upload and Start Processing

# Upload SDF file
curl -X POST http://localhost:8001/api/v1/batch/upload \
  -F "file=@molecules.sdf"

# Upload CSV with column selection
curl -X POST http://localhost:8001/api/v1/batch/upload \
  -F "file=@molecules.csv" \
  -F "smiles_column=SMILES" \
  -F "name_column=Name" \
  -F "include_extended_safety=true"

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "total_molecules": 10000,
  "message": "Job submitted. Processing 10000 molecules."
}
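
The same upload can be sketched in Python using only the standard library. This assumes the endpoint accepts ordinary `multipart/form-data` with the file part named `file`, as the curl commands above suggest; `build_multipart` and `upload_batch` are hypothetical helper names, and error handling is omitted.

```python
import json
import os
import urllib.request
import uuid
from typing import Dict, Tuple


def build_multipart(fields: Dict[str, str], filename: str, content: bytes) -> Tuple[bytes, str]:
    """Encode form fields plus one file part named 'file' as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload_batch(base_url: str, path: str, fields: Dict[str, str]) -> dict:
    """POST a file to the batch upload endpoint and return the parsed JSON response."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(fields, os.path.basename(path), handle.read())
    request = urllib.request.Request(
        f"{base_url}/api/v1/batch/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `upload_batch("http://localhost:8001", "molecules.csv", {"smiles_column": "SMILES", "name_column": "Name"})` would mirror the second curl command; the `job_id` in the returned dictionary is what the status and results endpoints expect.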

Check Progress

curl http://localhost:8001/api/v1/batch/{job_id}/status

Response:

{
  "job_id": "550e8400-...",
  "status": "processing",
  "progress": 45.5,
  "processed": 455,
  "total": 1000,
  "eta_seconds": 68
}
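
Where a WebSocket connection is not practical, the status endpoint can be polled instead. The sketch below assumes the terminal states are `complete`, `failed`, and `cancelled`, as listed for progress messages in this guide; the function names, polling interval, and poll budget are illustrative choices rather than ChemAudit defaults.

```python
import json
import time
import urllib.request
from typing import Callable

# Terminal states, taken from the status values documented for progress messages.
TERMINAL_STATUSES = {"complete", "failed", "cancelled"}


def fetch_status(base_url: str, job_id: str) -> dict:
    """GET the status document for one job."""
    with urllib.request.urlopen(f"{base_url}/api/v1/batch/{job_id}/status") as response:
        return json.load(response)


def wait_for_job(get_status: Callable[[], dict], interval: float = 2.0, max_polls: int = 1000) -> dict:
    """Call get_status until the job reaches a terminal state or the budget runs out."""
    for _ in range(max_polls):
        status = get_status()
        if status.get("status") in TERMINAL_STATUSES:
            return status
        time.sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")
```

Passing the fetcher as a callable, for example `wait_for_job(lambda: fetch_status("http://localhost:8001", job_id))`, keeps the loop independent of the HTTP layer and easy to exercise with canned responses.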

Get Results

# Get paginated results
curl "http://localhost:8001/api/v1/batch/{job_id}?page=1&page_size=50"

# Get statistics only
curl http://localhost:8001/api/v1/batch/{job_id}/stats

Processing Options

Option	Description	Default
include_extended_safety	Screen against NIH and ZINC filters	`false`
include_chembl_alerts	Screen against ChEMBL pharma filters	`false`
include_standardization	Run ChEMBL standardization pipeline	`false`
profile_id	Apply a scoring profile to score each molecule	None
notification_email	Email address for completion notification	None

Performance Impact

Each enabled option adds processing time. For large batches, consider running with basic options first, then re-processing only the molecules of interest with additional options.

Filtering and Sorting Results

Filter and sort results to focus on molecules of interest:

Available Filters

Filter	Type	Description
Status	`success` or `error`	Processing outcome
Min/Max Score	0-100	Validation score range
Sort By	Various	See sort fields below

Sort Fields

index: Original file order
name: Molecule name (if provided)
smiles: SMILES string alphabetically
score: Validation score
qed: QED drug-likeness score
safety: Safety filter score
status: Success/error status
issues: Number of validation issues
profile_score: Profile desirability score (when a profile is applied)

Sort Direction

asc: Ascending order
desc: Descending order

Example:

# Get molecules with score >= 80, sorted by QED descending
curl "http://localhost:8001/api/v1/batch/{job_id}?min_score=80&sort_by=qed&sort_dir=desc"
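
Query strings like the one above can be assembled programmatically, with the sort field and direction checked against the lists in this section before any request is sent. This is a sketch using only the query parameters that appear in this guide's examples (`page`, `page_size`, `min_score`, `sort_by`, `sort_dir`); `results_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

# Sort fields and directions as listed above.
SORT_FIELDS = {
    "index", "name", "smiles", "score", "qed",
    "safety", "status", "issues", "profile_score",
}
SORT_DIRECTIONS = {"asc", "desc"}


def results_url(base_url: str, job_id: str, *, page: int = 1, page_size: int = 50,
                min_score: Optional[int] = None, sort_by: Optional[str] = None,
                sort_dir: Optional[str] = None) -> str:
    """Build a results URL, rejecting sort options not listed in the docs."""
    if sort_by is not None and sort_by not in SORT_FIELDS:
        raise ValueError(f"unknown sort field: {sort_by}")
    if sort_dir is not None and sort_dir not in SORT_DIRECTIONS:
        raise ValueError(f"unknown sort direction: {sort_dir}")
    params = {"page": page, "page_size": page_size}
    if min_score is not None:
        params["min_score"] = min_score
    if sort_by is not None:
        params["sort_by"] = sort_by
    if sort_dir is not None:
        params["sort_dir"] = sort_dir
    return f"{base_url}/api/v1/batch/{job_id}?{urlencode(params)}"


print(results_url("http://localhost:8001", "JOB_ID", min_score=80, sort_by="qed", sort_dir="desc"))
```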

Real-Time Progress Updates

ChemAudit provides real-time progress via WebSocket:

JavaScript Example

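// jobId is the "job_id" value returned by the upload endpoint
const ws = new WebSocket('ws://localhost:8001/ws/batch/' + jobId);

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(`Progress: ${data.progress}% (${data.processed}/${data.total})`);
  console.log(`ETA: ${data.eta_seconds} seconds`);
};

// Send keep-alive pings, but only while the socket is open
const pingTimer = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) ws.send('ping');
}, 30000);

ws.onclose = () => {
  clearInterval(pingTimer); // stop pinging once the connection is gone
  console.log('Job complete or connection closed');
};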

Message Format

{
  "job_id": "550e8400-...",
  "status": "processing",
  "progress": 45.5,
  "processed": 455,
  "total": 1000,
  "eta_seconds": 68
}

Status values:

processing: Job in progress
complete: Job finished successfully
failed: Job encountered fatal error
cancelled: Job was cancelled
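
Whatever client library delivers the messages, handling one comes down to parsing the JSON and checking for a terminal status. The sketch below is transport-agnostic and assumes every message carries the fields shown in the message format above; `handle_progress_message` is a hypothetical name.

```python
import json
from typing import Tuple

# Statuses after which no further progress messages are expected.
TERMINAL_STATUSES = {"complete", "failed", "cancelled"}


def handle_progress_message(raw: str) -> Tuple[str, bool]:
    """Parse one progress message; return (summary line, whether the job is finished)."""
    data = json.loads(raw)
    summary = (
        f"{data['status']}: {data['progress']:.1f}% "
        f"({data['processed']}/{data['total']})"
    )
    return summary, data["status"] in TERMINAL_STATUSES


message = (
    '{"job_id": "550e8400-...", "status": "processing", "progress": 45.5, '
    '"processed": 455, "total": 1000, "eta_seconds": 68}'
)
print(handle_progress_message(message))
```

A receive loop would call this for each incoming message and close the connection once the second element of the returned tuple is `True`.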

Developer Tip

For Python integration, see the WebSocket API documentation.

Understanding Results

Statistics Summary

Each batch job includes aggregate statistics:

Total molecules: Count processed
Successful/Errors: Success rate
Average scores: Validation, ML-readiness, QED, SA
Pass rates: Lipinski, safety filters
Score distribution: Histogram of validation scores
Alert summary: Count by catalog (PAINS, BRENK, etc.)
Issue summary: Count by check type
Processing time: Total duration

Individual Results

Each molecule result includes:

SMILES: Canonical SMILES
Name: Molecule name (if provided)
Index: Original file position
Status: success or error
Validation: All check results and overall score
Alerts: Matched structural alerts (if screening enabled)
Scoring: ML-readiness, drug-likeness, ADMET scores
Standardization: Cleanup results (if enabled)

Handling Partial Failures

ChemAudit gracefully handles molecules that fail to parse or validate:

Failed molecules are marked with status: "error"
Error messages explain the failure reason
Processing continues for remaining molecules
Statistics show success/failure breakdown
Filter by status to review only errors

Debugging Failures

Export error molecules separately, fix them manually, and re-upload. The batch index helps track which molecules failed.
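
Collecting the failed entries from a page of results might look like the sketch below. It assumes each result is a dictionary with lowercase `status`, `index`, and `smiles` keys, plus a hypothetical `error` key for the message; the real response field names may differ, and the sample rows are made up for illustration.

```python
from typing import Iterable, List


def failed_rows(results: Iterable[dict]) -> List[dict]:
    """Return failed entries ordered by their original file position."""
    failures = [row for row in results if row.get("status") == "error"]
    return sorted(failures, key=lambda row: row.get("index", 0))


# Made-up sample data; key names are assumptions, not the documented schema.
sample = [
    {"index": 2, "smiles": "CCO", "status": "success"},
    {"index": 1, "smiles": "C1CC", "status": "error", "error": "illustrative message"},
    {"index": 0, "smiles": "not_a_smiles", "status": "error", "error": "illustrative message"},
]
for row in failed_rows(sample):
    print(row["index"], row["smiles"], row["error"])
```

Sorting by index keeps the failures in file order, which makes it easier to locate and correct the corresponding rows in the input before re-uploading.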

Performance Tips

Use SDF when possible: Faster parsing than CSV
Split very large files: Process in chunks if near size limits
Monitor worker utilization: Scale workers for better throughput
Enable caching: Results are cached by InChIKey for 1 hour
Use filters wisely: Basic validation is fastest; add options as needed
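
Splitting a large CSV file into smaller uploads can be sketched as follows. Each chunk repeats the header row, since a header row is required for CSV uploads; `split_csv` and the chunk size are illustrative, and the sketch works on in-memory text for brevity.

```python
import csv
import io
from typing import Iterator, List


def _render(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a header and data rows back to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def split_csv(csv_text: str, rows_per_chunk: int) -> Iterator[str]:
    """Yield CSV documents of at most rows_per_chunk data rows, each with the header."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return  # empty input produces no chunks
    chunk: List[List[str]] = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        chunk.append(row)
        if len(chunk) == rows_per_chunk:
            yield _render(header, chunk)
            chunk = []
    if chunk:
        yield _render(header, chunk)


sample = (
    "Name,SMILES\n"
    "Aspirin,CC(=O)Oc1ccccc1C(=O)O\n"
    "Caffeine,Cn1cnc2c1c(=O)n(c(=O)n2C)C\n"
    "Ethanol,CCO\n"
)
for part in split_csv(sample, 2):
    print(part)
```

Each yielded string can be written to its own file and uploaded as a separate job. Note that molecule indices in the results then refer to positions within each chunk rather than the original file.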

Analytics

After batch processing completes, automatic analytics run immediately (deduplication and statistics). On-demand analytics are available for deeper exploration:

Scaffold analysis — Murcko scaffold extraction, diversity metrics
Chemical space mapping — PCA or t-SNE projections
Matched molecular pairs — BRICS fragmentation, activity cliffs
Interactive visualizations — Score histograms, property scatter plots, treemaps, chemical space scatter with linked brushing

See Batch Analytics for full details.

Subset Actions

Select molecules from the results table to perform targeted operations:

Re-validate — Create a new job from selected molecules
Re-score — Apply a different scoring profile to the selection
Export subset — Export only selected molecules
Compare — Side-by-side comparison of 1–2 molecules with property radar overlay

See Subset Actions & Sharing for full details.

Sharing & Notifications

Permalinks: Share batch results via URLs that expire after 30 days
Email notifications: Receive completion emails by providing notification_email during upload
Webhooks: Integrate with pipelines via HMAC-signed HTTP callbacks

See Subset Actions & Sharing for configuration details.
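
On the receiving side, an HMAC-signed callback is normally verified by recomputing the signature over the raw request body with the shared secret. The sketch below assumes HMAC-SHA256 with a hex-encoded signature, which this page does not specify; confirm the actual algorithm, encoding, and header name in the webhook configuration documentation before relying on it.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature of the raw request body (assumed scheme)."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak how much matched.
    return hmac.compare_digest(expected, signature_hex)


# Illustrative values only: the secret and payload are made up.
secret = b"example-shared-secret"
payload = b'{"job_id": "550e8400-...", "status": "complete"}'
signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
print(verify_signature(secret, payload, signature))
```

Verification should run on the exact bytes received, before any JSON parsing or re-serialization, because even whitespace changes alter the digest.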

Next Steps

Batch Analytics — Interactive analytics and visualizations
Subset Actions — Work with molecule selections
Exporting Results — Export in multiple formats
Scoring Profiles — Custom property scoring for batches
Structural Alerts — Understanding alert screening
WebSocket API — Integrate real-time progress
