Backend Architecture for AI Applications
A comprehensive checklist for designing scalable, production-ready backend infrastructure for AI and LLM applications.
Building a robust backend for AI applications requires careful planning around scalability, latency, reliability, and cost. This guide provides an architecture checklist for production-grade AI systems.
Core Architecture Principles
1. API-First Design
Design your AI backend as a set of well-defined APIs:
Best Practices:
- REST APIs for simplicity (FastAPI, Express)
- GraphQL for complex data fetching needs
- gRPC for internal microservice communication
- OpenAPI/Swagger documentation for all endpoints
Example: FastAPI Service Structure
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Kurai AI API", version="1.0.0")

class QueryRequest(BaseModel):
    query: str
    context_window: int = 4000
    temperature: float = 0.7

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    latency_ms: int

@app.post("/api/v1/query", response_model=QueryResponse)
async def query_documents(request: QueryRequest):
    # Implementation here
    pass
2. Request Processing Pipeline
Implement a robust pipeline for LLM requests:
User Request → Rate Limiting → Validation → Caching Check →
LLM Provider → Response Formatting → Monitoring → Response
Key Components:
- Rate Limiting: Prevent abuse and control costs (see the token bucket sketch after this list)
  - Token bucket or leaky bucket algorithm
  - Per-user and global limits
  - Example: 100 requests/minute per user
- Request Validation: Sanitize inputs before LLM calls
  - Max length checks (context window limits)
  - Content filtering (PII, prohibited content)
  - Prompt injection detection
- Caching Layer: Reduce redundant LLM calls
  - Redis for semantic caching (embeddings similarity)
  - In-memory cache for exact matches
  - TTL: 1-24 hours depending on use case
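For a single API instance, the per-user limit can be enforced with an in-memory token bucket (a shared Redis counter works the same way across a fleet). A minimal sketch of the 100 requests/minute example above; the class and method names are illustrative:

import time
from collections import defaultdict

class TokenBucket:
    """Token bucket per user: refills at refill_rate tokens/second up to capacity."""
    def __init__(self, capacity: int = 100, refill_rate: float = 100 / 60):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = defaultdict(lambda: float(capacity))  # user_id -> remaining tokens
        self.last_refill = defaultdict(time.monotonic)      # user_id -> last refill time

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill[user_id]
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens[user_id] = min(self.capacity, self.tokens[user_id] + elapsed * self.refill_rate)
        self.last_refill[user_id] = now
        if self.tokens[user_id] >= 1:
            self.tokens[user_id] -= 1
            return True
        return False

In a FastAPI service this can run as a dependency that returns HTTP 429 whenever allow() is False.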
3. Scalability Patterns
Horizontal Scaling with Load Balancing:
Load Balancer (NGINX/ALB)
↓
API Gateway (Kong/Envoy)
↓
[API Service 1] [API Service 2] [API Service 3]
↓ ↓ ↓
[Redis Cluster] [PostgreSQL] [Vector DB]
Scaling Considerations:
- Stateless Services: API servers should be stateless
- Connection Pooling: Reuse DB and LLM provider connections
- Queue-Based Processing: For async, long-running tasks (see the Celery sketch below)
  - RabbitMQ, SQS, or Kafka for job queues
  - Celery or Bull for task processing
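For the queue-based processing item above, long-running work such as re-indexing a document can be handed to a worker instead of blocking the request path. A minimal Celery sketch; the broker URL and task body are placeholders:

from celery import Celery

# Placeholder broker URL; in production this comes from configuration
celery_app = Celery("ai_tasks", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, max_retries=3)
def reindex_document(self, document_id: str) -> int:
    """Re-chunk, re-embed, and upsert a document outside the request path."""
    try:
        # Placeholder for the real chunk + embed + upsert work
        chunks = [f"{document_id}-chunk-{i}" for i in range(3)]
        return len(chunks)
    except Exception as exc:
        # Exponential backoff between retries
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

# In the API layer, enqueue instead of awaiting the work:
# reindex_document.delay("doc-123")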
4. Error Handling & Resilience
Retry Logic for LLM API Calls:
import asyncio
import logging

import openai
from openai.error import APIError, RateLimitError
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def call_llm_with_retry(prompt: str):
    try:
        response = await openai.ChatCompletion.acreate(
            model="GPT-5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response
    except RateLimitError:
        # Back off and retry
        raise
    except APIError as e:
        # Log but don't retry on certain errors
        logger.error(f"API error: {e}")
        raise
Circuit Breaker Pattern:
- Open circuit after 5 consecutive failures
- Keep open for 60 seconds before attempting recovery
- Route to backup provider if available
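A minimal sketch of that policy (5 consecutive failures to open, 60-second cool-off); in production a library such as pybreaker covers the same ground:

import time

class CircuitBreaker:
    """Opens after failure_threshold consecutive failures; allows a probe after reset_timeout seconds."""
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let one request through after the cool-off period
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

Callers check allow_request() before each LLM call and route to the backup provider while it returns False.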
Fallback Strategies:
- Try GPT-5, fallback to Claude 3 on error
- Try production API, fallback to cached response
- Return gracefully degraded response (e.g., “Try again later”)
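A sketch of that chain, written generically over two provider callables (for example, wrappers around the GPT-5 and Claude 3 clients) so it stays provider-agnostic:

import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

async def answer_with_fallback(
    prompt: str,
    primary: Callable[[str], Awaitable[str]],
    fallback: Callable[[str], Awaitable[str]],
) -> str:
    """Try the primary provider, then the fallback, then degrade gracefully."""
    try:
        return await primary(prompt)
    except Exception as primary_error:
        logger.warning("Primary provider failed: %s", primary_error)
        try:
            return await fallback(prompt)
        except Exception as fallback_error:
            logger.error("Fallback provider failed: %s", fallback_error)
            return "The service is temporarily unavailable. Please try again later."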
Database Architecture
Vector Database for RAG
Schema Design for Pinecone:
Namespace: production_docs
Dimensions: 1536 (OpenAI embeddings)
Metric: cosine similarity
Metadata fields:
- document_id: string (partition key)
- chunk_id: string
- created_at: timestamp
- category: string (for filtering)
- access_level: string (public/internal/restricted)
Partitioning Strategy:
- Namespace per client or data source
- Metadata filtering for access control
- Separate indexes for different languages
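Example: a query against that schema with namespace and metadata filtering, assuming the pinecone-client v3+ SDK and placeholder names for the index, namespace, and embedding:

import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("production-docs")  # placeholder index name

query_embedding = [0.0] * 1536  # in practice, the embedding of the user's query

# Search one client's namespace, restricted to public documents
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="client_a",                        # namespace per client, as above
    filter={"access_level": {"$eq": "public"}},  # metadata filtering for access control
    include_metadata=True,
)
for match in results.matches:
    print(match.id, match.score, match.metadata.get("document_id"))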
Relational Database for Metadata
PostgreSQL Schema:
-- Documents table
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    title VARCHAR(500),
    source_url TEXT,
    uploaded_at TIMESTAMP DEFAULT NOW(),
    chunk_count INT,
    last_indexed_at TIMESTAMP
);

-- Query logs for analytics
CREATE TABLE query_logs (
    id UUID PRIMARY KEY,
    user_id UUID,
    query_text TEXT,
    model_used VARCHAR(50),
    prompt_tokens INT,
    completion_tokens INT,
    latency_ms INT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create indexes for performance
CREATE INDEX idx_query_logs_user_created ON query_logs(user_id, created_at DESC);
CREATE INDEX idx_documents_last_indexed ON documents(last_indexed_at);
Connection Pooling:
import os

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

DATABASE_URL = os.environ["DATABASE_URL"]  # e.g. postgresql://user:pass@host:5432/db

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,        # Persistent connections kept in the pool
    max_overflow=10,     # Additional connections during spikes
    pool_pre_ping=True,  # Verify connections before use
    pool_recycle=3600    # Recycle connections after 1 hour
)
Caching Strategy
Multi-Layer Caching
Layer 1: In-Memory Cache (LRU)
- Store: 1000 most recent queries
- TTL: 5 minutes
- Hit rate target: 30-40%
Layer 2: Redis Semantic Cache
- Store: embeddings-based similarity matches
- TTL: 1-24 hours (depending on data freshness)
- Hit rate target: 20-30%
Layer 3: CDN Cache (for static content)
- Cache API responses with immutable data
- TTL: 1-7 days
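Layer 1 can be a TTL-bounded LRU map held in each API process. A minimal sketch, assuming the cachetools package and the cache-key helper defined below:

from cachetools import TTLCache

# Layer 1: exact-match cache, 1000 entries, 5-minute TTL
l1_cache: TTLCache = TTLCache(maxsize=1000, ttl=300)

def get_cached_answer(cache_key: str) -> str | None:
    """Return a cached answer or None; TTLCache handles LRU eviction and expiry."""
    return l1_cache.get(cache_key)

def set_cached_answer(cache_key: str, answer: str) -> None:
    l1_cache[cache_key] = answer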
Cache Implementation
import hashlib
import json

import numpy as np
from redis.commands.search.query import Query

def generate_cache_key(query: str, context: dict) -> str:
    """Deterministic cache key generation"""
    payload = json.dumps({"query": query, "context": context}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Semantic caching with embeddings.
# Assumes an async redis client and a RediSearch index "idx:queries" with a
# FLOAT32 vector field "embedding" (COSINE metric) and a text field "response".
async def get_similar_cached_response(query_embedding: list[float]):
    """Return a cached response whose query embedding is close enough to the new query."""
    knn_query = (
        Query("*=>[KNN 1 @embedding $vec AS distance]")
        .sort_by("distance")
        .return_fields("response", "distance")
        .dialect(2)
    )
    result = await redis.ft("idx:queries").search(
        knn_query,
        query_params={"vec": np.array(query_embedding, dtype=np.float32).tobytes()},
    )
    # Cosine distance below 0.05 roughly corresponds to similarity above 0.95
    if result.docs and float(result.docs[0].distance) < 0.05:
        return result.docs[0].response
    return None
Monitoring & Observability
Key Metrics to Track
Performance Metrics:
- Request latency (p50, p95, p99)
- Token usage per request (input + output)
- Cache hit rates (by layer)
- Error rates by endpoint
- LLM provider response times
Business Metrics:
- Queries per user/day
- Cost per query
- User satisfaction scores
- Feature usage distribution
System Metrics:
- CPU/memory utilization by service
- Database connection pool usage
- Queue depth for async tasks
- Rate limit violations
Monitoring Stack
[Application Metrics]
↓ (Prometheus format)
[Prometheus Server]
↓
[Grafana Dashboards]
↓ (Alerts)
[PagerDuty / Slack]
Implementation with Prometheus:
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
query_counter = Counter(
    'llm_queries_total',
    'Total LLM queries',
    ['model', 'status']
)

query_latency = Histogram(
    'llm_query_latency_seconds',
    'LLM query latency',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

active_connections = Gauge(
    'db_active_connections',
    'Active database connections'
)

# Use in code
@query_latency.labels(model='GPT-5').time()
def process_query(query: str):
    query_counter.labels(model='GPT-5', status='success').inc()
    # ... processing logic
Security Architecture
API Security Checklist
- [ ] Authentication: JWT-based auth, OAuth2/OIDC
- [ ] Authorization: Role-based access control (RBAC)
- [ ] Rate Limiting: Per-user and per-API-key limits
- [ ] Input Validation: Max lengths, content filtering
- [ ] API Key Management: Rotation, encryption at rest
- [ ] Audit Logging: All requests logged with user ID
Example: Authentication Middleware
import os

from fastapi import Security, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from jose import jwt, JWTError

SECRET_KEY = os.environ["JWT_SECRET_KEY"]  # loaded from a secrets manager in production
ALGORITHM = "HS256"

security = HTTPBearer()

async def verify_token(
    credentials: HTTPAuthorizationCredentials = Security(security)
) -> dict:
    token = credentials.credentials
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
        return payload
    except JWTError:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid authentication credentials"
        )
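The checklist above also calls for role-based access control; a minimal sketch building on verify_token, assuming the JWT payload carries a "roles" claim:

from fastapi import Depends, HTTPException, status

def require_role(required_role: str):
    """Dependency factory that checks a role claim in the verified JWT payload."""
    async def checker(payload: dict = Depends(verify_token)) -> dict:
        roles = payload.get("roles", [])  # assumes the token includes a "roles" claim
        if required_role not in roles:
            raise HTTPException(
                status_code=status.HTTP_403_FORBIDDEN,
                detail="Insufficient permissions",
            )
        return payload
    return checker

# Usage on an admin-only endpoint:
# @app.delete("/api/v1/documents/{doc_id}")
# async def delete_document(doc_id: str, user: dict = Depends(require_role("admin"))):
#     ...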
Data Protection
Sensitive Data Handling:
- Never log prompts/responses containing PII
- Encrypt stored embeddings at rest
- Use TLS 1.3 for all API communications
- Implement data retention policies (auto-delete old logs)
Content Filtering:
import re

def sanitize_input(text: str) -> str:
    """Remove or redact sensitive patterns"""
    # Redact email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Redact SSN patterns
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # Redact credit card patterns
    text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)
    return text
Deployment Checklist
Pre-Production
- [ ] Load Testing: 10x expected traffic (see the Locust sketch after this list)
- [ ] Security Audit: Penetration testing
- [ ] Cost Projections: Token cost modeling
- [ ] Runbook Creation: Incident response procedures
- [ ] On-Call Setup: PagerDuty rotation
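For the load-testing item, a minimal Locust sketch against the /api/v1/query endpoint defined earlier; the host and bearer token are placeholders supplied at run time:

from locust import HttpUser, task, between

class QueryUser(HttpUser):
    """Simulated user; run with: locust -f loadtest.py --host=https://staging.example.com"""
    wait_time = between(1, 3)  # seconds of think time between requests

    @task
    def query_documents(self):
        self.client.post(
            "/api/v1/query",
            json={"query": "What is the refund policy?", "temperature": 0.2},
            headers={"Authorization": "Bearer <test-token>"},  # placeholder token
        )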
Production Rollout
- [ ] Canary Deployment: 5% → 25% → 100% traffic
- [ ] Monitoring Dashboards: All metrics visible
- [ ] Alert Configuration: Critical alerts tested
- [ ] Backup Plans: Rollback procedure documented
Infrastructure as Code
Terraform Example:
resource "aws_ecs_task_definition" "ai_api" {
family = "ai-api"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "2048"
memory = "4096"
container_definitions = jsonencode([
{
name = "ai-api"
image = "${var.ecr_repository}/ai-api:${var.image_tag}"
cpu = 2048
memory = 4096
essential = true
portMappings = [{
containerPort = 8000
protocol = "tcp"
}]
environment = [
{ name = "ENVIRONMENT", value = "production" },
{ name = "LOG_LEVEL", value = "info" }
]
secrets = [
{ name = "OPENAI_API_KEY", valueFrom = aws_secretsmanager_secret.openai_key.arn }
]
}
])
}
Related Articles:
- Getting Started with AI Integration
- Deployment Guide: AWS/GCP/Azure Best Practices
- Monitoring Best Practices for AI Systems