Monitoring
ChemAudit includes built-in monitoring with Prometheus metrics and optional Grafana dashboards.
Prometheus Metrics
Enabling Monitoring
# Start with monitoring profile
docker-compose --profile monitoring up -d
Access Points
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3001 | admin / (from .env) |
| Prometheus | http://localhost:9090 | None |
| Backend Metrics | http://localhost:8000/metrics | None |
Key Metrics
ChemAudit exposes these metrics:
Request Metrics
- validation_requests_total: Total validation requests
- validation_duration_seconds: Request duration histogram
- http_requests_total: Total HTTP requests by endpoint
- http_request_duration_seconds: HTTP request duration
Batch Processing
- batch_jobs_active: Currently processing batch jobs
- batch_jobs_total: Total batch jobs by status
- molecules_processed_total: Total molecules processed
- batch_job_duration_seconds: Batch job processing time
System Metrics
- celery_tasks_active: Active Celery tasks
- redis_connected_clients: Redis client count
- postgres_connections: Database connections
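The metrics listed above are served as plain text at the backend `/metrics` endpoint. As a rough sketch of how a script could read them, the following parses Prometheus text exposition output into a dictionary. The sample payload and its label names (`endpoint`, `status`) are illustrative assumptions, not captured ChemAudit output, and the parser is deliberately simplified compared with a full client library.

```python
from typing import Dict
from urllib.request import urlopen

# Hypothetical sample of what /metrics might return; values are made up.
SAMPLE = """\
# HELP validation_requests_total Total validation requests
# TYPE validation_requests_total counter
validation_requests_total 1027
batch_jobs_active 3
http_requests_total{endpoint="/validate",status="200"} 980
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (name plus label set) to its sample value.

    Simplified: skips comment lines and ignores optional timestamps.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or HELP/TYPE metadata
        close = line.rfind("}")
        if close != -1:
            # Labelled series: keep everything up to the closing brace as the key.
            series, rest = line[: close + 1], line[close + 1 :].split()
        else:
            series, *rest = line.split()
        if rest:
            samples[series] = float(rest[0])
    return samples


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> Dict[str, float]:
    """Fetch and parse the backend metrics endpoint (requires a running backend)."""
    with urlopen(url, timeout=10) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


metrics = parse_metrics(SAMPLE)
print(metrics["batch_jobs_active"])
```

For anything beyond a quick check, querying Prometheus itself is usually preferable, since it stores history and computes rates.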
Grafana Dashboards
Pre-built dashboards are available after enabling monitoring:
Application Overview
- Request rate by endpoint
- Average response time
- Error rate
- Active batch jobs
Batch Processing
- Job queue depth
- Processing rate (molecules/second)
- Job completion time distribution
- Success/failure ratio
Infrastructure
- Container resource usage
- Database connection pool
- Redis memory usage
- Celery worker status
Queries
Prometheus Query Examples
Request rate:
rate(validation_requests_total[5m])
95th percentile latency:
histogram_quantile(0.95, rate(validation_duration_seconds_bucket[5m]))
Active batch jobs:
batch_jobs_active
Error rate (HTTP 500 responses per second):
rate(http_requests_total{status="500"}[5m])
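The same queries can be run from a script through the Prometheus HTTP API, which accepts an expression at `/api/v1/query`. The sketch below builds the request URL and unpacks an instant-vector response. It assumes Prometheus is reachable at `http://localhost:9090` as in the access table; the helper names and the sample response are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # from the access table above


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def extract_values(response: Dict[str, Any]) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    # Each result carries "value": [unix_timestamp, "value_as_string"].
    return [(r["metric"], float(r["value"][1])) for r in response["data"]["result"]]


def instant_query(expr: str, base_url: str = PROMETHEUS_URL):
    """Run the query against a live Prometheus server."""
    with urlopen(build_query_url(expr, base_url), timeout=10) as resp:
        return extract_values(json.load(resp))


# Hypothetical response for `batch_jobs_active`, shaped like the API's JSON.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "batch_jobs_active"}, "value": [1700000000, "4"]}
        ],
    },
}
print(build_query_url("batch_jobs_active"))
print(extract_values(sample_response))
```

With monitoring running, `instant_query("rate(validation_requests_total[5m])")` would return one pair per matching series.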
Alerts
Set up alerts in Prometheus for critical conditions:
# prometheus/alerts.yml
groups:
  - name: chemaudit
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status="500"}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: BatchQueueBacklog
        expr: batch_jobs_active > 10
        for: 10m
        annotations:
          summary: "Batch queue backlog"
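The `for` clause in each rule means the expression must stay true for that long before the alert fires; until then it is pending, and a single false evaluation resets it. The following is a simplified model of that behavior, not Prometheus code. The function name and the sample readings are hypothetical, and the thresholds mirror the `BatchQueueBacklog` rule above.

```python
from typing import List, Optional, Tuple


def alert_states(
    samples: List[Tuple[int, float]], threshold: float, for_seconds: int
) -> List[str]:
    """Return the alert state after each evaluation.

    samples: (timestamp_in_seconds, value) pairs, one per rule evaluation,
    in chronological order.
    """
    states: List[str] = []
    active_since: Optional[int] = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# batch_jobs_active sampled every 5 minutes; rule: > 10 for 10m (600 s).
readings = [(0, 12), (300, 14), (600, 11), (900, 8), (1200, 15)]
print(alert_states(readings, threshold=10, for_seconds=600))
# → ['pending', 'pending', 'firing', 'inactive', 'pending']
```

A short `for` reacts faster but pages on brief spikes; a longer one suppresses noise at the cost of slower detection.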
Best Practices
- Monitor regularly: Check dashboards at least daily
- Set up alerts: Configure alerts for critical metrics
- Track trends: Monitor trends over time, not just current values
- Capacity planning: Use metrics to plan scaling
- Performance optimization: Use request duration metrics to identify slow endpoints
Next Steps
- Production - Production deployment
- Docker - Docker setup
- Troubleshooting - Debug issues