Batch Analytics & Visualizations

After batch processing completes, ChemAudit automatically computes analytics and provides interactive visualizations for exploring your dataset. Additional on-demand analyses are available for deeper investigation.

Automatic Analytics

These analyses run immediately after batch completion at no additional cost:

Deduplication

Identifies duplicate molecules across four comparison levels:

| Level | Method | What It Catches |
|---|---|---|
| Exact | Canonical SMILES | Identical structures |
| Tautomeric | Canonical tautomer SMILES | Tautomeric forms (e.g., keto/enol) |
| Stereo-insensitive | InChI without stereo layers | Enantiomers and diastereomers |
| Salt-form | ChEMBL parent compound | Different salt forms of the same drug |

Each level reports the number of unique compounds and groups duplicate molecules together with a representative structure.
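The grouping step can be sketched as follows, assuming each molecule already carries a precomputed comparison key (for example its canonical SMILES). The record fields, the function name, and the choice of the first member as representative are illustrative assumptions, not ChemAudit's actual schema or behavior.

```python
from collections import defaultdict


def group_duplicates(records, key_field):
    """Group records that share the same comparison key.

    `records` is a list of dicts and `key_field` names a precomputed key.
    Both are hypothetical; ChemAudit's internal representation may differ.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record[key_field]].append(index)

    # Only keys seen more than once form a duplicate group; the first
    # member stands in as the representative (an assumption).
    duplicate_groups = [
        {"representative": members[0], "members": members}
        for members in groups.values()
        if len(members) > 1
    ]
    return {"total_unique": len(groups), "groups": duplicate_groups}


records = [
    {"name": "ethanol", "canonical_smiles": "CCO"},
    {"name": "ethanol (duplicate)", "canonical_smiles": "CCO"},
    {"name": "benzene", "canonical_smiles": "c1ccccc1"},
]
result = group_duplicates(records, "canonical_smiles")
```

Running the same function with a different key field (tautomer key, stereo-free InChI, parent compound) would give the other three comparison levels.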

Statistics

Comprehensive property statistics computed across all successful molecules:

  • Per-property stats: mean, median, standard deviation, Q1/Q3, IQR, min/max for validation score, QED, SA score, ML-readiness, and Fsp3
  • Pairwise correlations: Pearson correlation between all property pairs (requires ≥10 values)
  • Outlier detection: IQR-fence method (below Q1 − 1.5×IQR or above Q3 + 1.5×IQR)
  • Composite quality score (0–100): weighted combination of validity rate (40%), scaffold diversity via Shannon entropy (35%), and Lipinski pass rate (25%)
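As a rough illustration of the last two items, the sketch below applies the IQR fences and the stated weights. The quartile method (`inclusive`), the function names, and the assumption that scaffold diversity is already normalized to the 0–1 range are choices made for this example; ChemAudit may compute these differently.

```python
import statistics


def iqr_outliers(values):
    """Return values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # The quartile interpolation method is an assumption; other methods
    # can shift the fences slightly on small samples.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


def composite_quality_score(validity_rate, scaffold_diversity, lipinski_pass_rate):
    """Combine three 0-1 inputs into a 0-100 score using the 40/35/25 weights."""
    weighted = (
        0.40 * validity_rate
        + 0.35 * scaffold_diversity  # assumed to be normalized to 0-1
        + 0.25 * lipinski_pass_rate
    )
    return round(100 * weighted, 1)


outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
score = composite_quality_score(0.9, 0.8, 0.5)
```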

On-Demand Analytics

These computationally expensive analyses are triggered manually and run in the background:

Scaffold Analysis

Extracts Murcko scaffolds (ring systems + linkers) and generic scaffolds (all atoms replaced with carbon):

  • Unique scaffold count and Shannon entropy as a diversity metric
  • Frequency distribution: top 50 scaffolds plus an "Other" bucket
  • Acyclic molecules have no ring system, so they are all grouped under a single empty scaffold
  • Click any scaffold in the treemap to select all molecules containing it

# Trigger scaffold analysis
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/scaffold
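The diversity metric and the frequency bucketing described above can be sketched in a few lines, assuming scaffolds arrive as SMILES strings with the empty string standing for acyclic molecules. The base-2 logarithm and the function name are assumptions for this example.

```python
import math
from collections import Counter


def scaffold_summary(scaffolds, top_n=50):
    """Summarize scaffold frequencies: unique count, entropy, top-N plus 'Other'."""
    counts = Counter(scaffolds)
    total = sum(counts.values())

    # Shannon entropy of the scaffold frequency distribution, in bits
    # (base 2 is an assumption; the source does not state the base).
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

    top = counts.most_common(top_n)
    distribution = dict(top)
    remainder = total - sum(count for _, count in top)
    if remainder:
        distribution["Other"] = remainder

    return {
        "unique_scaffolds": len(counts),
        "shannon_entropy": entropy,
        "distribution": distribution,
    }


# "" represents the shared empty scaffold of acyclic molecules.
summary = scaffold_summary(["c1ccccc1", "c1ccccc1", "c1ccncc1", ""], top_n=2)
```

Higher entropy means molecules are spread more evenly across scaffolds; a dataset dominated by one scaffold scores close to zero.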

Chemical Space Mapping

Projects molecules into 2D space using molecular fingerprints (Morgan ECFP4, 2048-bit):

| Method | Algorithm | Limit | Best For |
|---|---|---|---|
| PCA | Randomized SVD (pure numpy) | No limit | Fast overview, variance analysis |
| t-SNE | openTSNE | ≤ 2,000 molecules | Cluster visualization |

Both methods require at least 3 molecules. PCA also reports variance explained per component.

# Trigger PCA chemical space mapping
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/chemical_space \
-H "Content-Type: application/json" \
-d '{"method": "pca"}'

# Trigger t-SNE (for smaller datasets)
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/chemical_space \
-H "Content-Type: application/json" \
-d '{"method": "tsne"}'
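A client can check the documented limits before sending a request. The helper below is a minimal sketch of such a pre-flight check; the function name and constants are hypothetical, and the server presumably enforces the same limits itself.

```python
import json

MIN_MOLECULES = 3           # both methods need at least 3 molecules
TSNE_MAX_MOLECULES = 2000   # t-SNE limit from the table above


def chemical_space_payload(n_molecules, method="pca"):
    """Validate the request against the documented limits and build the JSON body."""
    if method not in ("pca", "tsne"):
        raise ValueError(f"unknown method: {method!r}")
    if n_molecules < MIN_MOLECULES:
        raise ValueError("chemical space mapping needs at least 3 molecules")
    if method == "tsne" and n_molecules > TSNE_MAX_MOLECULES:
        raise ValueError("t-SNE is limited to 2,000 molecules; use PCA instead")
    return json.dumps({"method": method})


payload = chemical_space_payload(5000, "pca")
```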

Matched Molecular Pairs (MMP)

Detects structurally similar molecule pairs that differ by a single chemical transformation:

  • BRICS single-cut fragmentation identifies core + R-group pairs
  • Core heuristic: core must have more heavy atoms than the R-group
  • Activity cliffs (optional): SALI index = |Δactivity| / (1 − Tanimoto) for highly similar pairs with large activity differences
  • Lipophilic Ligand Efficiency (optional): LLE = pActivity − LogP

# Trigger MMP analysis with activity cliff detection
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/mmp \
-H "Content-Type: application/json" \
-d '{"activity_column": "validation_score"}'
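The two optional metrics follow directly from the formulas listed above. This sketch assumes activities are already on a p-scale (such as pIC50) and picks a convention for identical fingerprints, where the SALI denominator is zero; how ChemAudit handles that case is not stated in the source.

```python
import math


def sali(activity_a, activity_b, tanimoto):
    """SALI = |delta activity| / (1 - Tanimoto similarity)."""
    delta = abs(activity_a - activity_b)
    if tanimoto >= 1.0:
        # Identical fingerprints make the denominator zero. Returning
        # infinity (or 0.0 when activities match) is an assumed convention.
        return math.inf if delta > 0 else 0.0
    return delta / (1.0 - tanimoto)


def lle(p_activity, logp):
    """Lipophilic Ligand Efficiency: LLE = pActivity - LogP."""
    return p_activity - logp


cliff = sali(7.5, 5.5, tanimoto=0.9)   # large value: similar pair, big activity gap
efficiency = lle(7.5, 3.0)
```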

Batch Size Limit

MMP analysis is limited to 5,000 molecules. For larger datasets, select a subset first.

Similarity Search

Find molecules similar to a query structure using ECFP4 Tanimoto similarity:

# Search by SMILES
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/similarity_search \
-H "Content-Type: application/json" \
-d '{"query_smiles": "c1ccccc1", "top_k": 10}'

# Search by molecule index in the batch
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/similarity_search \
-H "Content-Type: application/json" \
-d '{"query_index": 42, "top_k": 10}'
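For intuition about the ranking, the sketch below computes Tanimoto similarity on sets of fingerprint on-bit indices and returns the top matches. The toy bit sets stand in for real ECFP4 fingerprints, and the function names are illustrative only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def top_k_similar(query_bits, library, top_k=10):
    """Return (index, similarity) pairs for the top_k most similar entries."""
    scored = [(index, tanimoto(query_bits, bits)) for index, bits in enumerate(library)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy fingerprints; real ECFP4 fingerprints would have up to 2048 bit positions.
library = [{1, 2, 3, 4}, {1, 2}, {5, 6}, {1, 2, 3, 9}]
hits = top_k_similar({1, 2, 3, 4}, library, top_k=2)
```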

R-Group Decomposition

Decompose molecules around a user-supplied core SMARTS pattern:

curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/rgroup \
-H "Content-Type: application/json" \
-d '{"core_smarts": "[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1"}'

Interactive Visualizations

The analytics panel below the batch results table provides two tabs of interactive charts with linked selection.

Distributions Tab

Score Histogram — Color-coded bars showing the distribution of validation scores across four quality bands:

| Band | Score Range | Color |
|---|---|---|
| Excellent | 80–100 | Amber |
| Good | 60–80 | Dark amber |
| Moderate | 40–60 | Orange |
| Poor | 0–40 | Red |

Click any bar to select all molecules in that score range.
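The band lookup can be expressed as a small function. Because the ranges in the table share their endpoints, this sketch assumes a boundary score (for example exactly 80) belongs to the higher band; the actual rule is not stated here.

```python
def score_band(score):
    """Map a 0-100 validation score to its quality band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Assumption: lower bounds are inclusive, so 80 counts as Excellent.
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Moderate"
    return "Poor"


bands = [score_band(s) for s in (95, 80, 61.5, 40, 12)]
```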

Property Scatter Plot — Configurable X/Y scatter plot for exploring property relationships. Choose axes from: MW, LogP, TPSA, QED, Overall Score, SA Score, Fsp3. Points are colored by a selectable property. Click individual points to toggle selection.

Alert Frequency Chart — Horizontal bar chart showing structural alert counts by type, sorted by frequency. Height scales dynamically with the number of alert types.

Validation Treemap — Hierarchical treemap grouping validation issues by category and count. Provides a quick visual overview of the most common problems in your dataset.

Chemical Space Tab

Scaffold Treemap — Scaffold frequency treemap showing the most common ring systems. Click any scaffold cell to select all molecules sharing that scaffold. Uses 16 distinct colors for scaffold groups.

Chemical Space Scatter — Canvas-rendered 2D scatter plot for PCA or t-SNE projections. Optimized for performance with large datasets (800+ points). Features:

  • Brush selection: Click and drag to draw a rectangle selecting all enclosed points
  • Click selection: Click individual points to toggle them
  • Color-by dropdown: Color points by overall score, QED, SA score, or Fsp3
  • PCA/t-SNE toggle: Switch between projection methods (t-SNE triggers a background computation)
  • PNG export: Download the current view as an image

Linked Brushing

Selections are synchronized across all charts. When you select molecules in one visualization, the selection is reflected everywhere:

  • Click a score histogram bar → those molecules highlight in the scatter plot
  • Brush-select in chemical space → those molecules highlight in score histogram
  • Click a scaffold treemap cell → those molecules appear selected in all charts

A selection toolbar appears showing the count of selected molecules with options to Compare (1–2 molecules), open Actions (subset operations), or Clear the selection.

Molecule Comparison

Select 1 or 2 molecules from the results table and click the Compare button to open the comparison panel.

Side-by-side view includes:

  • 2D structure images for both molecules
  • Property comparison table with color-coded highlighting (green for better, red for worse):
    • Overall Score, QED, SA Score, Fsp3, Lipinski Violations, Alert Count
  • Overlaid 6-axis radar chart showing normalized property profiles
  • ECFP4 Tanimoto similarity score between the pair (via POST /validate/similarity)

ECFP Similarity API

curl -X POST http://localhost:8001/api/v1/validate/similarity \
-H "Content-Type: application/json" \
-d '{
"smiles_a": "CC(=O)Oc1ccccc1C(=O)O",
"smiles_b": "CC(=O)Nc1ccc(O)cc1"
}'

Batch Timeline

A 4-phase horizontal timeline at the top of the batch view tracks processing progress:

  1. Upload — File received and parsed
  2. Validation — Molecules being validated
  3. Analytics — Post-processing analytics running
  4. Complete — All processing finished

Each phase shows a status indicator with live color updates.

API Reference

Get Analytics Status and Results

curl http://localhost:8001/api/v1/batch/{job_id}/analytics

Response:

{
  "job_id": "550e8400-...",
  "status": {
    "deduplication": {"status": "complete", "computed_at": "2026-02-26T10:35:00Z"},
    "statistics": {"status": "complete", "computed_at": "2026-02-26T10:35:01Z"},
    "scaffold": {"status": "pending"},
    "chemical_space": {"status": "pending"},
    "mmp": {"status": "pending"}
  },
  "deduplication": {
    "exact": {"total_unique": 950, "groups": [...]},
    "tautomeric": {"total_unique": 920, "groups": [...]},
    "stereo_insensitive": {"total_unique": 900, "groups": [...]},
    "salt_form": {"total_unique": 880, "groups": [...]}
  },
  "statistics": {
    "property_stats": [...],
    "correlations": [...],
    "outliers": [...],
    "quality_score": {"score": 78.5, "validity": 0.95, "diversity": 0.72, "druglikeness": 0.65}
  }
}

Trigger On-Demand Analytics

curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/{type}

Available types: scaffold, chemical_space, mmp, similarity_search, rgroup

Response:

{
  "job_id": "550e8400-...",
  "analysis_type": "scaffold",
  "status": "queued"
}

Poll GET /batch/{job_id}/analytics until the entry for that analysis under status changes to complete (it reads computing while the analysis is running).
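A polling loop might look like the sketch below. It assumes the response shape shown under "Get Analytics Status and Results"; the function names, the poll interval, and the timeout are illustrative defaults rather than documented values. The fetch and sleep functions are injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8001/api/v1"


def fetch_analytics(job_id):
    """GET the analytics payload for a batch job."""
    with urllib.request.urlopen(f"{BASE_URL}/batch/{job_id}/analytics") as response:
        return json.load(response)


def wait_for_analysis(job_id, analysis_type, fetch=fetch_analytics,
                      interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until the given analysis reports 'complete', then return the payload."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch(job_id)
        state = payload["status"].get(analysis_type, {}).get("status")
        if state == "complete":
            return payload
        sleep(interval)  # still pending or computing; try again later
    raise TimeoutError(f"{analysis_type} did not complete within {timeout} seconds")
```

A typical call would be `wait_for_analysis(job_id, "scaffold")` right after the POST that triggers the analysis.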

Analytics Caching

Analytics results are cached in Redis for 24 hours. Subsequent requests return cached results instantly without recomputation.
