Batch Analytics & Visualizations

After batch processing completes, ChemAudit automatically computes analytics and provides interactive visualizations for exploring your dataset. Additional on-demand analyses are available for deeper investigation.

Automatic Analytics

These analyses run immediately after batch completion at no additional cost:

Deduplication

Identifies duplicate molecules across four comparison levels:

Level	Method	What It Catches
Exact	Canonical SMILES	Identical structures
Tautomeric	Canonical tautomer SMILES	Tautomeric forms (e.g., keto/enol)
Stereo-insensitive	InChI without stereo layers	Enantiomers and diastereomers
Salt-form	ChEMBL parent compound	Different salt forms of the same drug

Each level reports the number of unique compounds and groups duplicate molecules together with a representative structure.
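
As a rough illustration of how one comparison level could produce a unique count and duplicate groups, the sketch below groups molecule indices by a precomputed key (for example, a canonical SMILES string). The function name, the `representative` and `members` fields, and the choice of the first member as representative are assumptions made for this example; only `total_unique` and `groups` appear in the API response shown on this page.

```python
from collections import defaultdict


def group_duplicates(keys):
    """Group molecule indices that share the same comparison key."""
    groups = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)
    # Only keys seen more than once form a duplicate group.
    duplicate_groups = [
        {"representative": members[0], "members": members}  # assumption: first member represents the group
        for members in groups.values()
        if len(members) > 1
    ]
    return {"total_unique": len(groups), "groups": duplicate_groups}


# Hypothetical canonical SMILES keys for five input records.
exact_keys = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "c1ccccc1"]
result = group_duplicates(exact_keys)
print(result["total_unique"])  # 3
```

The same function could be reused for the other levels by swapping in a different key per molecule (tautomer SMILES, stereo-free InChI, or parent compound).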

Statistics

Comprehensive property statistics computed across all successful molecules:

Per-property stats: mean, median, standard deviation, Q1/Q3, IQR, min/max for validation score, QED, SA score, ML-readiness, and Fsp3
Pairwise correlations: Pearson correlation between all property pairs (requires ≥10 values)
Outlier detection: IQR-fence method (below Q1 − 1.5×IQR or above Q3 + 1.5×IQR)
Composite quality score (0–100): weighted combination of validity rate (40%), scaffold diversity via Shannon entropy (35%), and Lipinski pass rate (25%)
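
The outlier rule and the composite score above can be sketched in a few lines. This is a minimal illustration, not ChemAudit's implementation: the quartile interpolation method, the assumption that all three score components are already normalized to the range 0 to 1, the rounding, and the function names are choices made for this example.

```python
import statistics


def iqr_outliers(values):
    """Return values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # Assumption: "inclusive" quartile interpolation; other methods shift the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def quality_score(validity_rate, scaffold_diversity, lipinski_pass_rate):
    """Weighted 0-100 score: 40% validity, 35% diversity, 25% Lipinski pass rate."""
    # Assumption: each input is a fraction between 0 and 1.
    weighted = (
        0.40 * validity_rate
        + 0.35 * scaffold_diversity
        + 0.25 * lipinski_pass_rate
    )
    return round(100 * weighted, 1)


qed_values = [0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.58, 0.98]
outliers = iqr_outliers(qed_values)    # [0.98]
score = quality_score(0.9, 0.8, 0.6)   # 79.0
```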

On-Demand Analytics

These computationally expensive analyses are triggered manually and run in the background:

Butina Clustering

Sphere-exclusion clustering groups structurally similar molecules using Morgan fingerprints (ECFP4, radius=2, 2048 bits) and Tanimoto distance:

Parameter	Value	Description
Fingerprint	Morgan ECFP4	radius=2, 2048 bits
Distance metric	1 − Tanimoto	Higher = more dissimilar
Distance cutoff	0.2–0.6 (default: 0.35)	Configurable in the UI
Max molecules	1,000	Larger datasets must be subsetted first

The algorithm selects centroids (molecules with the most neighbors within the cutoff), assigns other molecules to the nearest centroid, and leaves unassigned molecules as singletons.

Output: Cluster assignments, cluster count, singleton count, largest cluster size, and a SMILES map for 2D depictions.
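
The following sketch shows the standard Taylor-Butina procedure on a toy dataset, to make the centroid selection concrete. Fingerprints are represented as plain sets of on-bit indices standing in for ECFP4 bit vectors, and the function names are hypothetical; ChemAudit's own implementation may differ in tie-breaking and in how members are assigned.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # convention here: two empty sets score 0


def butina_clusters(fingerprints, cutoff=0.35):
    """Cluster by 1 - Tanimoto distance; each cluster lists its centroid first."""
    n = len(fingerprints)
    neighbors = [set() for _ in range(n)]
    # O(n^2) pairwise pass: record every pair within the distance cutoff.
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fingerprints[i], fingerprints[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit molecules with the most neighbors first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)  # a cluster of length 1 is a singleton
    return clusters


# Toy fingerprints: two groups of similar molecules and one unrelated molecule.
fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 5}, {10, 11}, {10, 11, 12}, {20}]
clusters = butina_clusters(fps, cutoff=0.35)
singleton_count = sum(1 for c in clusters if len(c) == 1)
largest_cluster = max(len(c) for c in clusters)
print(clusters)  # [[1, 0, 2], [3, 4], [5]]
```

The nested loop over all pairs is the quadratic step that motivates the molecule limit.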

Molecule Limit

Butina clustering is limited to 1,000 molecules due to the O(n²) pairwise distance computation. For larger datasets, select a subset first using the subset actions panel.

Chemical Taxonomy

Classifies molecules into chemical categories using ~50 curated SMARTS rules organized into three groups:

Group	Examples
Ring Systems	Benzene, pyridine, pyrimidine, indole, quinoline, furan, thiophene, imidazole, piperidine
Functional Groups	Amide, ester, carboxylic acid, sulfonamide, nitro, halide, ketone, aldehyde
Pharmacophoric / Drug-class	Beta-lactam, benzodiazepine, statin-like, kinase inhibitor-like, steroid

Each molecule can match multiple categories (multi-label classification). The output includes:

Per-molecule classifications: List of matched categories for each molecule
Category counts: Aggregate counts across the dataset
Classified/unclassified counts: How many molecules matched at least one rule
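
To show how the three outputs relate to each other, here is a small sketch that derives the aggregate counts from per-molecule labels. The SMARTS matching itself is omitted, and the function and field names are hypothetical.

```python
from collections import Counter


def summarize_classifications(per_molecule):
    """Aggregate multi-label classifications into dataset-level counts."""
    # Count each category at most once per molecule.
    category_counts = Counter(
        label for labels in per_molecule for label in set(labels)
    )
    classified = sum(1 for labels in per_molecule if labels)
    return {
        "category_counts": dict(category_counts),
        "classified": classified,
        "unclassified": len(per_molecule) - classified,
    }


# Hypothetical matches for four molecules; the last one matched no rule.
labels = [["Benzene", "Amide"], ["Benzene"], ["Pyridine", "Ester"], []]
summary = summarize_classifications(labels)
print(summary["category_counts"]["Benzene"])  # 2
```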

Registration Hashing

Computes canonical registration hashes for deduplication using RDKit's RegistrationHash with tautomer hash v2:

Tautomer-invariant: Tautomeric forms of the same molecule receive the same hash
Collision groups: Sets of molecules that hash identically (potential duplicates)
Unique count: Number of structurally unique molecules in the dataset

This is useful for identifying duplicates that would be missed by simple SMILES comparison (e.g., different tautomeric representations of the same compound).

Scaffold Analysis

Extracts Murcko scaffolds (ring systems plus the linkers that connect them) and generic scaffolds (Murcko scaffolds with every atom replaced by carbon):

Unique scaffold count and Shannon entropy as a diversity metric
Frequency distribution: top 50 scaffolds plus an "Other" bucket
Acyclic molecules share an empty scaffold
Click any scaffold in the treemap to select all molecules containing it

# Trigger scaffold analysis
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/scaffold
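
The diversity metric and the frequency bucketing described in this section can be sketched as follows. This assumes scaffolds are already available as SMILES strings (with the empty string for acyclic molecules) and uses base-2 logarithms for the entropy; the logarithm base, any normalization, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def top_scaffolds(scaffolds, top_n=50):
    """Return the top_n most frequent scaffolds plus an "Other" bucket."""
    ranked = Counter(scaffolds).most_common()
    head = ranked[:top_n]
    other = sum(count for _, count in ranked[top_n:])
    if other:
        head.append(("Other", other))
    return head


# Toy data: "" stands for the empty scaffold shared by acyclic molecules.
scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccncc1", "", "C1CCCCC1"]
distribution = top_scaffolds(scaffolds, top_n=2)
entropy = shannon_entropy(Counter(scaffolds).values())
even_split = shannon_entropy([2, 2])  # 1.0 bit for two equally common scaffolds
```

Higher entropy indicates that molecules are spread more evenly across scaffolds.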

Chemical Space Mapping

Projects molecules into 2D space using molecular fingerprints (Morgan ECFP4, 2048-bit):

Method	Algorithm	Limit	Best For
PCA	Randomized SVD (pure numpy)	No limit	Fast overview, variance analysis
t-SNE	openTSNE	≤ 2,000 molecules	Cluster visualization

Both methods require at least 3 molecules. PCA also reports variance explained per component.

# Trigger PCA chemical space mapping
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/chemical_space \
  -H "Content-Type: application/json" \
  -d '{"method": "pca"}'

# Trigger t-SNE (for smaller datasets)
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/chemical_space \
  -H "Content-Type: application/json" \
  -d '{"method": "tsne"}'

Matched Molecular Pairs (MMP)

Detects structurally similar molecule pairs that differ by a single chemical transformation:

BRICS single-cut fragmentation identifies core + R-group pairs
Core heuristic: core must have more heavy atoms than the R-group
Activity cliffs (optional): SALI index = |Δactivity| / (1 − Tanimoto) for highly similar pairs with large activity differences
Lipophilic Ligand Efficiency (optional): LLE = pActivity − LogP
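
The two formulas above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the handling of identical fingerprints (Tanimoto = 1, where SALI is undefined) is a convention chosen for this example.

```python
import math


def sali(activity_a, activity_b, tanimoto):
    """Structure-Activity Landscape Index: |delta activity| / (1 - Tanimoto)."""
    delta = abs(activity_a - activity_b)
    if tanimoto >= 1.0:
        # Assumption: identical fingerprints give infinity unless activities match.
        return math.inf if delta > 0 else 0.0
    return delta / (1.0 - tanimoto)


def lle(p_activity, logp):
    """Lipophilic Ligand Efficiency: pActivity - LogP."""
    return p_activity - logp


# A pair with 0.8 similarity and a 2 log-unit activity gap gives a SALI near 10.
cliff = sali(7.0, 5.0, 0.8)
efficiency = lle(7.5, 3.0)  # 4.5
```

A high SALI value flags a pair whose small structural change coincides with a large activity change.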

# Trigger MMP analysis with activity cliff detection
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/mmp \
  -H "Content-Type: application/json" \
  -d '{"activity_column": "validation_score"}'

Batch Size Limit

MMP analysis is limited to 5,000 molecules. For larger datasets, select a subset first.

Similarity Search

Find molecules similar to a query structure using ECFP4 Tanimoto similarity:

# Search by SMILES
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/similarity_search \
  -H "Content-Type: application/json" \
  -d '{"query_smiles": "c1ccccc1", "top_k": 10}'

# Search by molecule index in the batch
curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/similarity_search \
  -H "Content-Type: application/json" \
  -d '{"query_index": 42, "top_k": 10}'
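
Conceptually, a top-k similarity search ranks every molecule in the batch by Tanimoto similarity to the query. The sketch below illustrates that ranking with fingerprints represented as sets of on-bit indices, a stand-in for ECFP4 bit vectors. The function names are hypothetical and this is not the server's implementation.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_similar(query_fp, batch_fps, top_k=10):
    """Return (index, similarity) for the top_k most similar batch molecules."""
    scored = ((tanimoto(query_fp, fp), index) for index, fp in enumerate(batch_fps))
    best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
    return [(index, similarity) for similarity, index in best]


# Toy batch of four fingerprints, queried with the first molecule's bits.
batch = [{1, 2, 3}, {1, 2}, {7, 8}, {1, 2, 3, 4}]
hits = top_k_similar({1, 2, 3}, batch, top_k=2)
print(hits)  # [(0, 1.0), (3, 0.75)]
```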

R-Group Decomposition

Decompose molecules around a user-supplied core SMARTS pattern:

curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/rgroup \
  -H "Content-Type: application/json" \
  -d '{"core_smarts": "[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1"}'

Interactive Visualizations

The analytics panel below the batch results table provides two tabs of interactive charts with linked selection.

Distributions Tab

Score Histogram — Color-coded bars showing the distribution of validation scores across four quality bands:

Band	Score Range	Color
Excellent	80–100	Amber
Good	60–80	Dark amber
Moderate	40–60	Orange
Poor	0–40	Red

Click any bar to select all molecules in that score range.
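
The band table can be read as a simple lookup. Because adjacent ranges in the table share an endpoint, this sketch assumes each boundary value (40, 60, 80) belongs to the higher band; that tie-breaking rule and the function name are assumptions, not documented behavior.

```python
def score_band(score):
    """Map a 0-100 validation score to its quality band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Assumption: boundary values fall into the higher band.
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Moderate"
    return "Poor"


bands = [score_band(s) for s in (95, 80, 61.5, 40, 12)]
print(bands)  # ['Excellent', 'Excellent', 'Good', 'Moderate', 'Poor']
```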

Property Scatter Plot — Configurable X/Y scatter plot for exploring property relationships. Choose axes from: MW, LogP, TPSA, QED, Overall Score, SA Score, Fsp3. Points are colored by a selectable property. Click individual points to toggle selection.

Alert Frequency Chart — Horizontal bar chart showing structural alert counts by type, sorted by frequency. Height scales dynamically with the number of alert types.

Validation Treemap — Hierarchical treemap grouping validation issues by category and count. Provides a quick visual overview of the most common problems in your dataset.

Chemical Space Tab

Scaffold Treemap — Scaffold frequency treemap showing the most common ring systems. Click any scaffold cell to select all molecules sharing that scaffold. Uses 16 distinct colors for scaffold groups.

Chemical Space Scatter — Canvas-rendered 2D scatter plot for PCA or t-SNE projections. Optimized for performance with large datasets (800+ points). Features:

Brush selection: Click and drag to draw a rectangle selecting all enclosed points
Click selection: Click individual points to toggle them
Color-by dropdown: Color points by overall score, QED, SA score, or Fsp3
PCA/t-SNE toggle: Switch between projection methods (t-SNE triggers a background computation)
PNG export: Download the current view as an image

Linked Brushing

Selections are synchronized across all charts. When you select molecules in one visualization, the selection is reflected everywhere:

Click a score histogram bar → those molecules highlight in the scatter plot
Brush-select in chemical space → those molecules highlight in score histogram
Click a scaffold treemap cell → those molecules appear selected in all charts

A selection toolbar appears showing the count of selected molecules with options to Compare (1–2 molecules), open Actions (subset operations), or Clear the selection.

Molecule Comparison

Select 1 or 2 molecules from the results table and click the Compare button to open the comparison panel.

Side-by-side view includes:

2D structure images for both molecules
Property comparison table with color-coded highlighting (green for better, red for worse), covering Overall Score, QED, SA Score, Fsp3, Lipinski Violations, and Alert Count
Overlaid 6-axis radar chart showing normalized property profiles
ECFP4 Tanimoto similarity score between the pair (via POST /validate/similarity)

ECFP Similarity API

curl -X POST http://localhost:8001/api/v1/validate/similarity \
  -H "Content-Type: application/json" \
  -d '{
    "smiles_a": "CC(=O)Oc1ccccc1C(=O)O",
    "smiles_b": "CC(=O)Nc1ccc(O)cc1"
  }'

Batch Timeline

A 4-phase horizontal timeline at the top of the batch view tracks processing progress:

Upload — File received and parsed
Validation — Molecules being validated
Analytics — Post-processing analytics running
Complete — All processing finished

Each phase shows a status indicator with live color updates.

API Reference

Get Analytics Status and Results

curl http://localhost:8001/api/v1/batch/{job_id}/analytics

Response:

{
  "job_id": "550e8400-...",
  "status": {
    "deduplication": {"status": "complete", "computed_at": "2026-02-26T10:35:00Z"},
    "statistics": {"status": "complete", "computed_at": "2026-02-26T10:35:01Z"},
    "scaffold": {"status": "pending"},
    "chemical_space": {"status": "pending"},
    "mmp": {"status": "pending"}
  },
  "deduplication": {
    "exact": {"total_unique": 950, "groups": [...]},
    "tautomeric": {"total_unique": 920, "groups": [...]},
    "stereo_insensitive": {"total_unique": 900, "groups": [...]},
    "salt_form": {"total_unique": 880, "groups": [...]}
  },
  "statistics": {
    "property_stats": [...],
    "correlations": [...],
    "outliers": [...],
    "quality_score": {"score": 78.5, "validity": 0.95, "diversity": 0.72, "druglikeness": 0.65}
  }
}

Trigger On-Demand Analytics

curl -X POST http://localhost:8001/api/v1/batch/{job_id}/analytics/{type}

Available types: scaffold, chemical_space, mmp, similarity_search, rgroup

Response:

{
  "job_id": "550e8400-...",
  "analysis_type": "scaffold",
  "status": "queued"
}

Poll GET /batch/{job_id}/analytics until the entry for that analysis in the status object changes from queued or computing to complete.
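
A polling loop along these lines could wrap the status endpoint. It is a sketch under stated assumptions: the base URL matches the curl examples on this page, the polling interval and retry limit are arbitrary, the function names are hypothetical, and failure states are not handled because this page does not list them.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8001/api/v1"  # same host and prefix as the curl examples


def fetch_analytics(job_id):
    """GET the analytics status and results for a batch job."""
    with urllib.request.urlopen(f"{BASE_URL}/batch/{job_id}/analytics") as response:
        return json.load(response)


def wait_for_analysis(job_id, analysis_type, fetch=fetch_analytics,
                      interval=2.0, max_polls=30, sleep=time.sleep):
    """Poll until status[analysis_type].status is "complete", then return the payload."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        state = payload.get("status", {}).get(analysis_type, {}).get("status")
        if state == "complete":
            return payload
        sleep(interval)  # assumption: fixed delay between polls
    raise TimeoutError(f"{analysis_type} did not complete after {max_polls} polls")
```

For example, after triggering scaffold analysis, `wait_for_analysis(job_id, "scaffold")` would block until the scaffold entry reports complete. The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.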

Analytics Caching

Analytics results are cached in Redis for 24 hours. Subsequent requests within that window return the cached results without recomputation.

Next Steps

Subset Actions — Work with selected molecules from analytics
Exporting Results — Export analytics data and visualizations
Scoring Profiles — Score molecules against custom criteria
Structural Alerts — Understand the alert frequency chart
