Dataset Audit

The Dataset Audit tool provides comprehensive health auditing for chemical datasets. Upload a CSV or SDF file and receive a detailed health score, contradictory label detection, property distribution analysis, and a curation report with actionable recommendations.

Features

Feature	Description
Health Score	Overall 0–100 dataset quality score with weighted sub-scores
Contradictory Labels	Detects molecules with the same structure but different activity labels
Property Distributions	Statistical analysis of molecular properties across the dataset
Curation Report	Actionable recommendations for dataset improvement
Dataset Diff	Compare two dataset versions to find added, removed, and modified molecules
Curated CSV	Download a cleaned version of the dataset with audit columns appended

Using the Web Interface

Navigate to Dataset Audit under the Data Preparation dropdown
Upload a CSV or SDF file
- Optionally specify the SMILES column (auto-detected if not provided)
- Optionally specify the activity column for contradiction detection
Monitor audit progress through the stages:
- Health audit — Structural validity, property distributions, standardization coverage
- Contradiction detection — Same-structure/different-label analysis
- Curation — Generating recommendations and curated output
Review results:
- Health Score — Overall score with sub-score breakdown
- Issues — Categorized problems found in the dataset
- Contradictions — Molecules with conflicting labels
- Property Distributions — Histograms and statistics
- Treemap — Interactive drill-down visualization of issues by category
Download the curation report or curated CSV

Health Score

The health score is a weighted combination of sub-scores:

Sub-Score	What It Measures
Validity	Percentage of molecules that parse and pass basic validation
Diversity	Scaffold diversity via Shannon entropy
Property Coverage	Whether molecular properties span reasonable drug-like ranges
Standardization	Consistency of molecular representations

The overall score (0–100) provides a quick indicator of dataset quality for ML modeling.

Low Health Scores

A health score below 50 indicates significant data quality issues. Review the issues list and consider curating the dataset before using it for modeling.

Contradictory Label Detection

When an activity column is provided, the audit identifies molecules that have the same structure (by InChIKey) but different activity labels. These contradictions can introduce noise into ML models and should be reviewed.

Each contradiction includes:

The shared structure (SMILES and InChIKey)
All conflicting labels found
The number of occurrences of each label

Dataset Diff

Compare two versions of a dataset to understand what changed:

Complete the initial audit
Click Compare and upload a second file
View results categorized as:
- Added — Molecules in the new version but not the original
- Removed — Molecules in the original but not the new version
- Modified — Same molecule with changed properties or labels
- Unchanged — Identical across both versions

API Reference

Upload Dataset

curl -X POST http://localhost:8001/api/v1/dataset/upload \
  -F "file=@dataset.csv" \
  -F "smiles_column=SMILES" \
  -F "activity_column=Activity"

Response:

{
  "job_id": "550e8400-...",
  "filename": "dataset.csv",
  "file_type": "csv",
  "status": "pending",
  "message": "Dataset audit submitted"
}

Check Status

curl http://localhost:8001/api/v1/dataset/{job_id}/status

Response:

{
  "job_id": "550e8400-...",
  "status": "processing",
  "progress": 60.0,
  "current_stage": "health_audit",
  "eta_seconds": 15.0
}

Get Results

curl http://localhost:8001/api/v1/dataset/{job_id}/results

Returns 202 if the audit is still processing.

Dataset Diff

curl -X POST http://localhost:8001/api/v1/dataset/{job_id}/diff \
  -F "file=@dataset_v2.csv"

Response:

{
  "added": [...],
  "removed": [...],
  "modified": [...],
  "added_count": 50,
  "removed_count": 10,
  "modified_count": 25,
  "unchanged_count": 915
}

Download Outputs

# Curation report (JSON)
curl http://localhost:8001/api/v1/dataset/{job_id}/download/report -o report.json

# Curated CSV (with appended audit columns)
curl http://localhost:8001/api/v1/dataset/{job_id}/download/csv -o curated.csv

import requests
import time

# Upload dataset
with open("dataset.csv", "rb") as f:
    response = requests.post(
        "http://localhost:8001/api/v1/dataset/upload",
        files={"file": f},
        data={"smiles_column": "SMILES", "activity_column": "Activity"}
    )
job_id = response.json()["job_id"]

# Poll for completion
while True:
    status = requests.get(f"http://localhost:8001/api/v1/dataset/{job_id}/status").json()
    if status["status"] in ["complete", "error"]:
        break
    time.sleep(2)

# Get results
results = requests.get(f"http://localhost:8001/api/v1/dataset/{job_id}/results").json()
print(f"Health Score: {results['health_audit']['overall_score']}")
print(f"Contradictions: {len(results['contradictions'])}")

Rate Limits

Endpoint	Limit
`POST /dataset/upload`	3 req/min
`GET /dataset/*/status`	60 req/min
`GET /dataset/*/results`	30 req/min
`POST /dataset/*/diff`	10 req/min
`GET /dataset//download/`	10 req/min

Next Steps

QSAR-Ready Pipeline — Curate datasets for ML after auditing
Batch Processing — Full validation and scoring
Batch Analytics — Clustering, taxonomy, and chemical space analysis

Features​

Using the Web Interface​

Health Score​

Contradictory Label Detection​

Dataset Diff​

API Reference​

Upload Dataset​

Check Status​

Get Results​

Dataset Diff​

Download Outputs​

Rate Limits​

Next Steps​

Features

Using the Web Interface

Health Score

Contradictory Label Detection

Dataset Diff

API Reference

Upload Dataset

Check Status

Get Results

Dataset Diff

Download Outputs

Rate Limits

Next Steps