Monitoring Best Practices for AI Systems

A comprehensive guide to setting up monitoring, alerting, and observability for LLM-powered applications.

Effective monitoring is critical for AI systems where costs can spiral, latency affects UX, and quality degrades silently. This guide covers end-to-end monitoring strategy for production LLM applications.

The Three Pillars of AI Observability

1. Performance Monitoring

Track how fast and reliably your system responds

2. Cost Monitoring

Track every dollar spent on tokens, infrastructure, and API calls

3. Quality Monitoring

Track whether responses are accurate, helpful, and safe

Essential Metrics Dashboard

Performance Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Request Latency (p50) | Median response time | >3 seconds |
| Request Latency (p95) | 95th percentile response time | >8 seconds |
| Request Latency (p99) | 99th percentile response time | >15 seconds |
| Error Rate | Failed requests / total requests | >5% |
| Cache Hit Rate | Requests served from cache | <30% |
| Queue Depth | Pending async tasks | >1000 |

Prometheus Queries:

# p95 latency by model
histogram_quantile(0.95, rate(llm_query_latency_seconds_bucket{model="GPT-5"}[5m]))

# Error rate by endpoint
rate(llm_queries_total{status="error"}[5m]) / rate(llm_queries_total[5m])

# Cache hit rate
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
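
The queries above cover latency, errors, and caching; the Queue Depth row in the table needs its own gauge. A minimal sketch, assuming the pending work sits in an asyncio.Queue named task_queue (an assumption; adapt to whatever queue your workers consume) and using an illustrative metric name:

from prometheus_client import Gauge

# Gauge backing the "Queue Depth" row above; 'llm_pending_tasks' is an illustrative name.
queue_depth = Gauge('llm_pending_tasks', 'Pending async tasks awaiting processing')

def report_queue_depth(task_queue) -> None:
    """Refresh the gauge; call after enqueue/dequeue or on a periodic timer."""
    queue_depth.set(task_queue.qsize())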

Cost Metrics

| Metric | Description | Budget Threshold |
|---|---|---|
| Daily Token Spend | Total cost per day | >$500/day |
| Cost Per Query | Average cost per request | >$0.10/query |
| Embedding Costs | Vector search costs | >$50/day |
| Provider Spend Distribution | Spend by LLM provider | Any provider >80% |

Cost Tracking Query:

-- Daily cost breakdown from query logs
SELECT
  DATE(created_at) as date,
  model_used,
  COUNT(*) as query_count,
  SUM(prompt_tokens) as total_prompt_tokens,
  SUM(completion_tokens) as total_completion_tokens,
  SUM(prompt_tokens * 0.00001 + completion_tokens * 0.00003) as estimated_cost_usd
FROM query_logs
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at), model_used
ORDER BY date DESC, estimated_cost_usd DESC;
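
The query above hardcodes a single pair of per-token rates directly in SQL, which drifts easily once you call more than one model. A small per-model price map keeps application code and reporting consistent; a minimal sketch (rates are illustrative placeholders, not published pricing):

# Illustrative per-model pricing (USD per token); values are placeholders.
MODEL_PRICING = {
    "GPT-5": {"prompt": 0.00001, "completion": 0.00003},
    # One entry per model you actually call.
}

def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of a single query from its token counts."""
    rates = MODEL_PRICING[model]
    return prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]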

Quality Metrics

| Metric | Description | Target |
|---|---|---|
| User Satisfaction | Thumbs up / total responses | >85% |
| Retrieval Accuracy | Relevant docs retrieved / total | >80% |
| Response Relevance | Human evaluation score (1-5) | >4.0 |
| Hallucination Rate | Factually incorrect responses | <5% |
| Guardrail Triggers | Content filtered / total requests | Track baseline |

Implementation: Monitoring Stack

Architecture

[Application]
    ↓ (Metrics)
[Prometheus / Grafana]
    ↓ (Alerts)
[PagerDuty / Slack / Email]

Step 1: Application Metrics (Prometheus Client)

FastAPI with Prometheus:

from prometheus_client import Counter, Histogram, Gauge, make_asgi_app
from fastapi import FastAPI

app = FastAPI()

# Define metrics
query_counter = Counter(
    'llm_queries_total',
    'Total LLM queries',
    ['model', 'status', 'user_tier']
)

query_latency = Histogram(
    'llm_query_latency_seconds',
    'LLM query latency',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

token_usage = Counter(
    'llm_tokens_total',
    'Total tokens used',
    ['model', 'token_type']
)

cache_performance = Counter(
    'cache_operations_total',
    'Cache operations',
    ['operation', 'cache_layer']
)

active_connections = Gauge(
    'db_active_connections',
    'Active database connections'
)

# Expose metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Usage in endpoints
@app.post("/api/v1/query")
async def query(request: QueryRequest):
    with query_latency.labels(model='GPT-5').time():
        try:
            response = await call_llm(request)
            query_counter.labels(model='GPT-5', status='success', user_tier='pro').inc()
            token_usage.labels(model='GPT-5', token_type='prompt').inc(response.prompt_tokens)
            token_usage.labels(model='GPT-5', token_type='completion').inc(response.completion_tokens)
            return response
        except Exception as e:
            query_counter.labels(model='GPT-5', status='error', user_tier='pro').inc()
            raise
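
Hardcoding model='GPT-5' and user_tier='pro' at every call site invites label drift. A small helper that records all per-query metrics in one place keeps label sets consistent; a sketch built on the counters defined above (record_query_metrics is an illustrative name, not part of any library):

def record_query_metrics(model: str, status: str, user_tier: str,
                         prompt_tokens: int = 0, completion_tokens: int = 0) -> None:
    """Record the counters for a single query with a consistent label set."""
    query_counter.labels(model=model, status=status, user_tier=user_tier).inc()
    if prompt_tokens:
        token_usage.labels(model=model, token_type='prompt').inc(prompt_tokens)
    if completion_tokens:
        token_usage.labels(model=model, token_type='completion').inc(completion_tokens)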

Step 2: Grafana Dashboards

Panel Configuration:

{
  "dashboard": {
    "title": "AI Application Monitoring",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "targets": [
          {
            "expr": "rate(llm_queries_total[5m])",
            "legendFormat": "{{model}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "P95 Latency by Model",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(llm_query_latency_seconds_bucket[5m]))",
            "legendFormat": "{{model}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Daily Cost ($)",
        "targets": [
          {
            "expr": "sum(increase(llm_tokens_total{token_type=\"prompt\"}[1d])) * 0.00001 + sum(increase(llm_tokens_total{token_type=\"completion\"}[1d])) * 0.00003",
            "legendFormat": "Total Cost"
          }
        ],
        "type": "stat"
      }
    ]
  }
}
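
Dashboards are easier to keep in version control and deploy consistently if you push the JSON through Grafana's HTTP API rather than importing it by hand. A minimal sketch, assuming a service-account token in the GRAFANA_TOKEN environment variable and Grafana reachable at http://grafana:3000 (both assumptions):

import json
import os
import requests

def push_dashboard(path: str = "dashboards/ai_app.json") -> None:
    """Upload a dashboard definition via Grafana's /api/dashboards/db endpoint."""
    with open(path) as f:
        dashboard = json.load(f)["dashboard"]

    resp = requests.post(
        "http://grafana:3000/api/dashboards/db",
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
        json={"dashboard": dashboard, "overwrite": True},
    )
    resp.raise_for_status()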

Step 3: Alert Configuration

Prometheus AlertManager Rules:

groups:
  - name: ai_app_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(llm_queries_total{status="error"}[5m]) / rate(llm_queries_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Slow responses
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, rate(llm_query_latency_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 10 seconds"
          description: "{{ $value }}s P95 latency for the last 5 minutes"

      # Daily budget exceeded (0.00002 is a blended average of the prompt/completion rates)
      - alert: DailyBudgetExceeded
        expr: sum(increase(llm_tokens_total[1d])) * 0.00002 > 500
        labels:
          severity: critical
        annotations:
          summary: "Daily spend > $500"
          description: "Estimated daily cost is ${{ $value }}"

      # Low cache hit rate
      - alert: LowCacheHitRate
        expr: rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) < 0.3
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Cache hit rate below 30%"
          description: "Current hit rate is {{ $value | humanizePercentage }}"

Slack Alert Integration:

receivers:
  - name: "slack-alerts"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#ai-ops-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
        send_resolved: true

Logging Strategy

Structured Logging Best Practices

import structlog
from datetime import datetime

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),  # required for the stdlib level/filter processors
)

logger = structlog.get_logger()

# Use in application
@app.post("/api/v1/query")
async def query_handler(request: QueryRequest, user_id: str):
    logger.info(
        "query_started",
        user_id=user_id,
        query_length=len(request.query),
        model=request.model
    )

    start_time = datetime.utcnow()

    try:
        response = await process_query(request)

        logger.info(
            "query_completed",
            user_id=user_id,
            latency_ms=(datetime.utcnow() - start_time).total_seconds() * 1000,
            prompt_tokens=response.prompt_tokens,
            completion_tokens=response.completion_tokens,
            estimated_cost_usd=response.estimated_cost
        )

        return response

    except Exception as e:
        logger.error(
            "query_failed",
            user_id=user_id,
            error=str(e),
            error_type=type(e).__name__
        )
        raise

Log Sampling

High-traffic applications should sample logs:

import random

def should_log_sample(sample_rate: float = 0.1) -> bool:
    """Log roughly 10% of requests; tune the rate to your traffic."""
    return random.random() < sample_rate

# Sample only the high-volume success logs; always log errors in full.
if should_log_sample():
    logger.info(
        "query_completed",
        sampled=True,
        **log_context  # per-request fields from the structured-logging example above
    )

Quality Monitoring

Automated Quality Metrics

Response Relevance (Embedding-based):

from prometheus_client import Gauge

# Last observed relevance per model (use a Histogram if you need averages over time)
relevance_score = Gauge('llm_response_relevance', 'Query/response semantic similarity', ['model'])

async def measure_response_relevance(query: str, response: str) -> float:
    """Measure semantic similarity between the query and the generated response."""
    # get_embedding() and cosine_similarity() are application-provided helpers
    query_emb = await get_embedding(query)
    response_emb = await get_embedding(response)

    similarity = cosine_similarity(query_emb, response_emb)

    # Export for monitoring
    relevance_score.labels(model='GPT-5').set(similarity)

    return similarity

# Alert if average relevance drops below threshold, e.g.:
# avg(llm_response_relevance) < 0.7 for 10 minutes

Hallucination Proxy (Keyword-based): matching uncertainty phrases is a cheap heuristic that flags responses worth reviewing, not a true hallucination detector; pair it with the human evaluation score tracked above.

HALLUCINATION_KEYWORDS = [
    "I don't have information",
    "I cannot confirm",
    "As an AI language model",
    "I'm not sure about"
]

def detect_potential_hallucination(response: str) -> bool:
    """Detect responses that may indicate uncertainty"""
    return any(keyword.lower() in response.lower() for keyword in HALLUCINATION_KEYWORDS)

# Log hallucination rate
hallucination_counter = Counter('llm_potential_hallucinations_total', 'Potential hallucinations', ['model'])
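
The counter above only becomes useful once it is incremented in the request path. A short usage sketch, wired into the query handler from earlier (response_text stands in for whatever variable holds the generated text):

# Inside the query handler, after the response text is available:
if detect_potential_hallucination(response_text):
    hallucination_counter.labels(model='GPT-5').inc()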

User Feedback Integration

from datetime import datetime, timezone

from prometheus_client import Counter
from sqlalchemy import Column, DateTime, Integer, MetaData, String, Table, insert

metadata = MetaData()

feedback_table = Table('user_feedback', metadata,
    Column('id', Integer, primary_key=True),
    Column('query_id', String),
    Column('user_id', String),
    Column('rating', String),  # 'thumbs_up' or 'thumbs_down'
    Column('category', String),  # optional: 'inaccurate', 'irrelevant', etc.
    Column('created_at', DateTime)
)

# Feedback events by rating; satisfaction rate = thumbs_up / total
feedback_counter = Counter('user_feedback_total', 'User feedback events', ['rating'])

@app.post("/api/v1/feedback")
async def submit_feedback(query_id: str, rating: str, category: str | None = None):
    # `db` is the application's async database connection
    await db.execute(insert(feedback_table).values(
        query_id=query_id,
        rating=rating,
        category=category,
        created_at=datetime.now(timezone.utc)
    ))

    # Update satisfaction metric
    feedback_counter.labels(rating=rating).inc()

    return {"status": "recorded"}

Advanced Monitoring Techniques

Distributed Tracing with Jaeger

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Setup tracing
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(jaeger_exporter))

tracer = trace.get_tracer(__name__)

@app.post("/api/v1/query")
async def query_handler(request: QueryRequest):
    with tracer.start_as_current_span("llm_query"):
        with tracer.start_as_current_span("retrieval"):
            docs = await retrieve_documents(request.query)

        with tracer.start_as_current_span("llm_generation"):
            response = await generate_response(request.query, docs)

        return response
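
Spans become far more useful when they carry LLM-specific attributes, so traces can be filtered by model or cost. A sketch using the standard OpenTelemetry span API; the attribute names are illustrative, and request.model / response token fields are assumed from the earlier examples:

# Inside the "llm_generation" span, attach request/response details to the active span.
span = trace.get_current_span()
span.set_attribute("llm.model", request.model)
span.set_attribute("llm.prompt_tokens", response.prompt_tokens)
span.set_attribute("llm.completion_tokens", response.completion_tokens)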

Synthetic Monitoring

from datetime import datetime

from fastapi_utils.tasks import repeat_every  # provides the @repeat_every decorator used below
from prometheus_client import Gauge

# Fraction of synthetic checks that succeeded in the most recent run
synthetic_check_gauge = Gauge('synthetic_check_success_ratio', 'Synthetic check success ratio')

SYNTHETIC_QUERIES = [
    "What is your return policy?",
    "How do I reset my password?",
    "Explain your pricing tiers",
    "Contact customer support"
]

async def run_synthetic_checks():
    """Run periodic synthetic queries to monitor system health"""
    results = []

    for query in SYNTHETIC_QUERIES:
        start = datetime.utcnow()

        try:
            response = await make_api_call(query)
            latency = (datetime.utcnow() - start).total_seconds()

            results.append({
                "query": query,
                "status": "success",
                "latency": latency,
                "has_response": bool(response.text)
            })
        except Exception as e:
            results.append({
                "query": query,
                "status": "error",
                "error": str(e)
            })

    # Log results
    synthetic_check_gauge.set(
        sum(1 for r in results if r["status"] == "success") / len(results)
    )

    return results

# Run every 5 minutes; repeat_every must be registered at application startup
@app.on_event("startup")
@repeat_every(seconds=300)
async def scheduled_synthetics():
    await run_synthetic_checks()

Monitoring Runbook

Daily Checks (Automated)

  • [ ] Review cost dashboard: Any unusual spending spikes?
  • [ ] Check error rates: Above 1% requires investigation (see the automation sketch after this list)
  • [ ] Verify alerts firing: Are alerts actionable or noise?
  • [ ] Review latency p95: Sudden increases indicate problems
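
The error-rate check above can be automated against the Prometheus HTTP API (GET /api/v1/query). A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 (an assumption; use your own address):

import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address

def error_rate_last_5m() -> float:
    """Return the overall error rate over the last 5 minutes."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": 'sum(rate(llm_queries_total{status="error"}[5m])) '
                         '/ sum(rate(llm_queries_total[5m]))'},
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    rate = error_rate_last_5m()
    if rate > 0.01:  # the 1% investigation threshold from the checklist
        print(f"Investigate: error rate is {rate:.2%}")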

Weekly Reviews

  • [ ] Cost trend analysis: Week-over-week changes
  • [ ] Cache hit rate optimization: Identify cache misses
  • [ ] Quality metrics review: User satisfaction trends
  • [ ] Alert tuning: Adjust thresholds based on feedback

Monthly Deep Dives

  • [ ] Per-user cost analysis: Identify high-cost users
  • [ ] Model performance comparison: A/B test results
  • [ ] Infrastructure cost optimization: Right-sizing opportunities
  • [ ] Update dashboards: Add new metrics as features evolve

Related Articles:

  • Backend Architecture Checklist for AI Applications
  • Deployment Guide: AWS/GCP/Azure Best Practices
  • Cost Optimization Strategies for AI Projects