Monitoring Best Practices for AI Systems
A comprehensive guide to setting up monitoring, alerting, and observability for LLM-powered applications.
Effective monitoring is critical for AI systems, where costs can spiral, latency affects user experience, and quality degrades silently. This guide covers an end-to-end monitoring strategy for production LLM applications.
The Three Pillars of AI Observability
1. Performance Monitoring
Track how fast and reliably your system responds
2. Cost Monitoring
Track every dollar spent on tokens, infrastructure, and API calls
3. Quality Monitoring
Track whether responses are accurate, helpful, and safe
Essential Metrics Dashboard
Performance Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Request Latency (p50) | Median response time | >3 seconds |
| Request Latency (p95) | 95th percentile response time | >8 seconds |
| Request Latency (p99) | 99th percentile response time | >15 seconds |
| Error Rate | Failed requests / total requests | >5% |
| Cache Hit Rate | Requests served from cache | <30% |
| Queue Depth | Pending async tasks | >1000 |
Prometheus Queries:
# p95 latency by model
histogram_quantile(0.95, rate(llm_query_latency_seconds_bucket{model="GPT-5"}[5m]))
# Error rate by endpoint
rate(llm_queries_total{status="error"}[5m]) / rate(llm_queries_total[5m])
# Cache hit rate
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
Cost Metrics
| Metric | Description | Budget Threshold |
|---|---|---|
| Daily Token Spend | Total cost per day | >$500/day |
| Cost Per Query | Average cost per request | >$0.10/query |
| Embedding Costs | Vector search costs | >$50/day |
| Provider Spend Distribution | Spend by LLM provider | Any provider >80% |
Cost Tracking Query:
-- Daily cost breakdown from query logs
SELECT
DATE(created_at) as date,
model_used,
COUNT(*) as query_count,
SUM(prompt_tokens) as total_prompt_tokens,
SUM(completion_tokens) as total_completion_tokens,
SUM(prompt_tokens * 0.00001 + completion_tokens * 0.00003) as estimated_cost_usd
FROM query_logs
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at), model_used
ORDER BY date DESC, estimated_cost_usd DESC;
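The same per-token rates can be applied at request time to watch the Cost Per Query threshold from the table above. A minimal sketch; the prices mirror the placeholder rates in the SQL and should be replaced with your provider's actual pricing:
# Placeholder per-token prices (same rates as the SQL estimate above); adjust per provider/model
PROMPT_PRICE_PER_TOKEN = 0.00001
COMPLETION_PRICE_PER_TOKEN = 0.00003

def estimate_query_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single query from its token counts."""
    return (prompt_tokens * PROMPT_PRICE_PER_TOKEN
            + completion_tokens * COMPLETION_PRICE_PER_TOKEN)

# Example: 1,200 prompt tokens + 300 completion tokens is roughly $0.021
print(f"${estimate_query_cost(1200, 300):.3f} per query")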
Quality Metrics
| Metric | Description | Target |
|---|---|---|
| User Satisfaction | Thumbs up / total responses | >85% |
| Retrieval Accuracy | Relevant docs retrieved / total | >80% |
| Response Relevance | Human evaluation score (1-5) | >4.0 |
| Hallucination Rate | Factually incorrect responses | <5% |
| Guardrail Triggers | Content filtered / total requests | Track baseline |
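Unlike latency and cost, these quality metrics are not emitted by the serving stack and must be computed explicitly. A minimal sketch of two of them, assuming thumbs feedback counts (recorded by the feedback endpoint later in this guide) and per-document relevance judgments as inputs:
def satisfaction_rate(thumbs_up: int, thumbs_down: int) -> float:
    """User Satisfaction = thumbs up / total responses with feedback."""
    total = thumbs_up + thumbs_down
    return thumbs_up / total if total else 0.0

def retrieval_accuracy(judged_results: list[dict]) -> float:
    """Retrieval Accuracy = relevant docs retrieved / total docs retrieved.

    Each judged result is assumed to look like {"doc_id": ..., "relevant": bool}.
    """
    if not judged_results:
        return 0.0
    relevant = sum(1 for r in judged_results if r["relevant"])
    return relevant / len(judged_results)

# Targets from the table above: satisfaction > 0.85, retrieval accuracy > 0.80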
Implementation: Monitoring Stack
Architecture
[Application]
↓ (Metrics)
[Prometheus / Grafana]
↓ (Alerts)
[PagerDuty / Slack / Email]
Step 1: Application Metrics (Prometheus Client)
FastAPI with Prometheus:
from prometheus_client import Counter, Histogram, Gauge, make_asgi_app
from fastapi import FastAPI

app = FastAPI()

# Define metrics
query_counter = Counter(
    'llm_queries_total',
    'Total LLM queries',
    ['model', 'status', 'user_tier']
)
query_latency = Histogram(
    'llm_query_latency_seconds',
    'LLM query latency',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)
token_usage = Counter(
    'llm_tokens_total',
    'Total tokens used',
    ['model', 'token_type']
)
cache_performance = Counter(
    'cache_operations_total',
    'Cache operations',
    ['operation', 'cache_layer']
)
active_connections = Gauge(
    'db_active_connections',
    'Active database connections'
)

# Expose metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Usage in endpoints
@app.post("/api/v1/query")
async def query(request: QueryRequest):
    with query_latency.labels(model='GPT-5').time():
        try:
            response = await call_llm(request)
            query_counter.labels(model='GPT-5', status='success', user_tier='pro').inc()
            token_usage.labels(model='GPT-5', token_type='prompt').inc(response.prompt_tokens)
            token_usage.labels(model='GPT-5', token_type='completion').inc(response.completion_tokens)
            return response
        except Exception as e:
            query_counter.labels(model='GPT-5', status='error', user_tier='pro').inc()
            raise
Step 2: Grafana Dashboards
Panel Configuration:
{
  "dashboard": {
    "title": "AI Application Monitoring",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "targets": [
          {
            "expr": "rate(llm_queries_total[5m])",
            "legendFormat": "{{model}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "P95 Latency by Model",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(llm_query_latency_seconds_bucket[5m]))",
            "legendFormat": "{{model}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Daily Cost ($)",
        "targets": [
          {
            "expr": "sum(increase(llm_tokens_total{token_type=\"prompt\"}[1d])) * 0.00001 + sum(increase(llm_tokens_total{token_type=\"completion\"}[1d])) * 0.00003",
            "legendFormat": "Total Cost"
          }
        ],
        "type": "stat"
      }
    ]
  }
}
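Dashboards kept in version control can be pushed to Grafana's HTTP API instead of pasted by hand. A minimal sketch, assuming a service-account token with dashboard write permissions; GRAFANA_URL and the token are placeholders:
import json
import requests

GRAFANA_URL = "http://grafana:3000"           # placeholder
GRAFANA_TOKEN = "YOUR_SERVICE_ACCOUNT_TOKEN"  # placeholder

def push_dashboard(dashboard_json_path: str) -> None:
    """Create or update a dashboard via Grafana's /api/dashboards/db endpoint."""
    with open(dashboard_json_path) as f:
        payload = json.load(f)
    payload["overwrite"] = True  # update in place if the dashboard already exists
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()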
Step 3: Alert Configuration
Prometheus Alerting Rules:
groups:
  - name: ai_app_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(llm_queries_total{status="error"}[5m]) / rate(llm_queries_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Slow responses
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, rate(llm_query_latency_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 10 seconds"
          description: "{{ $value }}s P95 latency for the last 5 minutes"

      # Daily budget exceeded
      - alert: DailyBudgetExceeded
        expr: sum(increase(llm_tokens_total[1d])) * 0.00002 > 500
        labels:
          severity: critical
        annotations:
          summary: "Daily spend > $500"
          description: "Estimated daily cost is ${{ $value }}"

      # Low cache hit rate
      - alert: LowCacheHitRate
        expr: rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) < 0.3
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Cache hit rate below 30%"
          description: "Current hit rate is {{ $value | humanizePercentage }}"
Slack Alert Integration:
receivers:
  - name: "slack-alerts"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#ai-ops-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
        send_resolved: true
Logging Strategy
Structured Logging Best Practices
import structlog
from datetime import datetime

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    # Route events through stdlib logging so the stdlib processors above work
    logger_factory=structlog.stdlib.LoggerFactory()
)
logger = structlog.get_logger()

# Use in application
@app.post("/api/v1/query")
async def query_handler(request: QueryRequest, user_id: str):
    logger.info(
        "query_started",
        user_id=user_id,
        query_length=len(request.query),
        model=request.model
    )
    start_time = datetime.utcnow()
    try:
        response = await process_query(request)
        logger.info(
            "query_completed",
            user_id=user_id,
            latency_ms=(datetime.utcnow() - start_time).total_seconds() * 1000,
            prompt_tokens=response.prompt_tokens,
            completion_tokens=response.completion_tokens,
            estimated_cost_usd=response.estimated_cost
        )
        return response
    except Exception as e:
        logger.error(
            "query_failed",
            user_id=user_id,
            error=str(e),
            error_type=type(e).__name__
        )
        raise
Log Sampling
High-traffic applications should sample logs:
import random

def should_log_sample(sample_rate: float = 0.1) -> bool:
    """Log a fraction of requests; adjust the rate based on traffic."""
    return random.random() < sample_rate

# Only emit the detailed log line for sampled requests
if should_log_sample():
    logger.info(
        "query_completed",
        sample_rate=0.1,
        **log_context
    )
Quality Monitoring
Automated Quality Metrics
Response Relevance (Embedding-based):
async def measure_response_relevance(query: str, response: str) -> float:
    """Measure semantic similarity between query and response"""
    query_emb = await get_embedding(query)
    response_emb = await get_embedding(response)
    similarity = cosine_similarity(query_emb, response_emb)
    # Log for monitoring (relevance_score is a Prometheus Gauge defined with the other metrics)
    relevance_score.labels(model='GPT-5').set(similarity)
    return similarity

# Alert if average relevance drops below threshold
# Alert: avg(relevance_score) < 0.7 for 10 minutes
Hallucination Detection (Keyword-based):
HALLUCINATION_KEYWORDS = [
    "I don't have information",
    "I cannot confirm",
    "As an AI language model",
    "I'm not sure about"
]

def detect_potential_hallucination(response: str) -> bool:
    """Detect responses that may indicate uncertainty"""
    return any(keyword.lower() in response.lower() for keyword in HALLUCINATION_KEYWORDS)

# Log hallucination rate
hallucination_counter = Counter('llm_potential_hallucinations_total', 'Potential hallucinations', ['model'])
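A short sketch of how the counter might be wired in after each generation; record_response is a hypothetical hook name:
def record_response(response_text: str, model: str = 'GPT-5') -> None:
    """Increment the counter whenever a response trips the keyword heuristic."""
    if detect_potential_hallucination(response_text):
        hallucination_counter.labels(model=model).inc()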
User Feedback Integration
from fastapi import FastAPI
from sqlalchemy import insert, Table, Column, Integer, String, DateTime, MetaData
from datetime import datetime

metadata = MetaData()

feedback_table = Table('user_feedback', metadata,
    Column('id', Integer, primary_key=True),
    Column('query_id', String),
    Column('user_id', String),
    Column('rating', String),     # 'thumbs_up' or 'thumbs_down'
    Column('category', String),   # optional: 'inaccurate', 'irrelevant', etc.
    Column('created_at', DateTime)
)

@app.post("/api/v1/feedback")
async def submit_feedback(query_id: str, rating: str, category: str = None):
    # `db` is the application's async database handle (defined elsewhere)
    await db.execute(insert(feedback_table).values(
        query_id=query_id,
        rating=rating,
        category=category,
        created_at=datetime.utcnow()
    ))
    # Update satisfaction metric (a Prometheus metric defined with the others)
    satisfaction_gauge.labels(rating=rating).inc()
    return {"status": "recorded"}
Advanced Monitoring Techniques
Distributed Tracing with Jaeger
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Setup tracing
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(jaeger_exporter))
tracer = trace.get_tracer(__name__)

@app.post("/api/v1/query")
async def query_handler(request: QueryRequest):
    with tracer.start_as_current_span("llm_query"):
        with tracer.start_as_current_span("retrieval"):
            docs = await retrieve_documents(request.query)
        with tracer.start_as_current_span("llm_generation"):
            response = await generate_response(request.query, docs)
        return response
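If you would rather not wrap every endpoint by hand, OpenTelemetry's FastAPI instrumentation can open a request-level span automatically, and the manual retrieval/generation spans above nest under it. A minimal sketch, assuming the opentelemetry-instrumentation-fastapi package is installed:
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Opens a server span for every request; the custom spans above
# (retrieval, llm_generation) nest under it via the active context.
FastAPIInstrumentor.instrument_app(app)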
Synthetic Monitoring
import asyncio
from datetime import datetime
from fastapi_utils.tasks import repeat_every  # provides the @repeat_every scheduler decorator

SYNTHETIC_QUERIES = [
    "What is your return policy?",
    "How do I reset my password?",
    "Explain your pricing tiers",
    "Contact customer support"
]

async def run_synthetic_checks():
    """Run periodic synthetic queries to monitor system health"""
    results = []
    for query in SYNTHETIC_QUERIES:
        start = datetime.utcnow()
        try:
            response = await make_api_call(query)
            latency = (datetime.utcnow() - start).total_seconds()
            results.append({
                "query": query,
                "status": "success",
                "latency": latency,
                "has_response": bool(response.text)
            })
        except Exception as e:
            results.append({
                "query": query,
                "status": "error",
                "error": str(e)
            })
    # Log results (synthetic_check_gauge is a Prometheus Gauge defined with the other metrics)
    synthetic_check_gauge.set(
        sum(1 for r in results if r["status"] == "success") / len(results)
    )
    return results

# Run every 5 minutes, scheduled at application startup
@app.on_event("startup")
@repeat_every(seconds=300)
async def scheduled_synthetics():
    await run_synthetic_checks()
Monitoring Runbook
Daily Checks (Automated; a scripted sketch follows this list)
- [ ] Review cost dashboard: Any unusual spending spikes?
- [ ] Check error rates: Above 1% requires investigation
- [ ] Verify alerts firing: Are alerts actionable or noise?
- [ ] Review latency p95: Sudden increases indicate problems
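These daily checks can be scripted against Prometheus' HTTP query API so a summary lands in Slack each morning. A minimal sketch; PROMETHEUS_URL and the webhook URL are placeholders, and the PromQL reuses the queries defined earlier:
import requests

PROMETHEUS_URL = "http://prometheus:9090"                            # placeholder
SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # placeholder

DAILY_CHECKS = {
    "error_rate": 'sum(rate(llm_queries_total{status="error"}[1d])) / sum(rate(llm_queries_total[1d]))',
    "p95_latency_s": 'histogram_quantile(0.95, sum(rate(llm_query_latency_seconds_bucket[1d])) by (le))',
    "cache_hit_rate": 'sum(rate(cache_hits_total[1d])) / (sum(rate(cache_hits_total[1d])) + sum(rate(cache_misses_total[1d])))',
}

def run_daily_report() -> None:
    """Query Prometheus for each check and post a summary to Slack."""
    lines = []
    for name, promql in DAILY_CHECKS.items():
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        value = float(result[0]["value"][1]) if result else float("nan")
        lines.append(f"{name}: {value:.3f}")
    requests.post(SLACK_WEBHOOK, json={"text": "Daily AI monitoring report\n" + "\n".join(lines)}, timeout=10)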
Weekly Reviews
- [ ] Cost trend analysis: Week-over-week changes
- [ ] Cache hit rate optimization: Identify cache misses
- [ ] Quality metrics review: User satisfaction trends
- [ ] Alert tuning: Adjust thresholds based on feedback
Monthly Deep Dives
- [ ] Per-user cost analysis: Identify high-cost users
- [ ] Model performance comparison: A/B test results
- [ ] Infrastructure cost optimization: Right-sizing opportunities
- [ ] Update dashboards: Add new metrics as features evolve
Related Articles:
- Backend Architecture Checklist for AI Applications
- Deployment Guide: AWS/GCP/Azure Best Practices
- Cost Optimization Strategies for AI Projects