Backend Architecture for AI Applications
A comprehensive checklist for designing scalable, production-ready backend infrastructure for AI and LLM applications.
Building a robust backend for AI applications requires careful planning around scalability, latency, reliability, and cost. This guide provides an architecture checklist for production-grade AI systems.
Core Architecture Principles
1. API-First Design
Design your AI backend as a set of well-defined APIs:
Best Practices:
- REST APIs for simplicity (FastAPI, Express)
- GraphQL for complex data fetching needs
- gRPC for internal microservice communication
- OpenAPI/Swagger documentation for all endpoints
Example: FastAPI Service Structure
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Kurai AI API", version="1.0.0")

class QueryRequest(BaseModel):
    query: str
    context_window: int = 4000
    temperature: float = 0.7

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    latency_ms: int

@app.post("/api/v1/query", response_model=QueryResponse)
async def query_documents(request: QueryRequest):
    # Implementation here
    pass
2. Request Processing Pipeline
Implement a robust pipeline for LLM requests:
User Request → Rate Limiting → Validation → Caching Check →
LLM Provider → Response Formatting → Monitoring → Response
Key Components:
- Rate Limiting: Prevent abuse and control costs (see the token bucket sketch after this list)
  - Token bucket or leaky bucket algorithm
  - Per-user and global limits
  - Example: 100 requests/minute per user
- Request Validation: Sanitize inputs before LLM calls
  - Max length checks (context window limits)
  - Content filtering (PII, prohibited content)
  - Prompt injection detection
- Caching Layer: Reduce redundant LLM calls
  - Redis for semantic caching (embeddings similarity)
  - In-memory cache for exact matches
  - TTL: 1-24 hours depending on use case
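For a single API instance, the per-user limit can be enforced with an in-memory token bucket (a shared Redis counter works the same way across a fleet). A minimal sketch of the 100 requests/minute example above; the class and method names are illustrative:

import time
from collections import defaultdict

class TokenBucket:
    """Token bucket per user: refills at refill_rate tokens/second up to capacity."""
    def __init__(self, capacity: int = 100, refill_rate: float = 100 / 60):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = defaultdict(lambda: float(capacity))  # user_id -> remaining tokens
        self.last_refill = defaultdict(time.monotonic)      # user_id -> last refill time

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill[user_id]
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens[user_id] = min(self.capacity, self.tokens[user_id] + elapsed * self.refill_rate)
        self.last_refill[user_id] = now
        if self.tokens[user_id] >= 1:
            self.tokens[user_id] -= 1
            return True
        return False

In a FastAPI service this can run as a dependency that returns HTTP 429 whenever allow() is False.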
3. Scalability Patterns
Horizontal Scaling with Load Balancing:
Load Balancer (NGINX/ALB)
↓
API Gateway (Kong/Envoy)
↓
[API Service 1] [API Service 2] [API Service 3]
↓ ↓ ↓
[Redis Cluster] [PostgreSQL] [Vector DB]
Scaling Considerations:
- Stateless Services: API servers should be stateless
- Connection Pooling: Reuse DB and LLM provider connections
- Queue-Based Processing: For async, long-running tasks (see the Celery sketch below)
  - RabbitMQ, SQS, or Kafka for job queues
  - Celery or Bull for task processing
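For the queue-based processing item above, long-running work such as re-indexing a document can be handed to a worker instead of blocking the request path. A minimal Celery sketch; the broker URL and task body are placeholders:

from celery import Celery

# Placeholder broker URL; in production this comes from configuration
celery_app = Celery("ai_tasks", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, max_retries=3)
def reindex_document(self, document_id: str) -> int:
    """Re-chunk, re-embed, and upsert a document outside the request path."""
    try:
        # Placeholder for the real chunk + embed + upsert work
        chunks = [f"{document_id}-chunk-{i}" for i in range(3)]
        return len(chunks)
    except Exception as exc:
        # Exponential backoff between retries
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

# In the API layer, enqueue instead of awaiting the work:
# reindex_document.delay("doc-123")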
4. Error Handling & Resilience
Retry Logic for LLM API Calls:
import asyncio
import logging

import openai
from openai.error import APIError, RateLimitError
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def call_llm_with_retry(prompt: str):
    try:
        response = await openai.ChatCompletion.acreate(
            model="GPT-5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response
    except RateLimitError:
        # Back off and retry
        raise
    except APIError as e:
        # Log but don't retry on certain errors
        logger.error(f"API error: {e}")
        raise
Circuit Breaker Pattern:
- Open circuit after 5 consecutive failures
- Keep open for 60 seconds before attempting recovery
- Route to backup provider if available
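A minimal sketch of that policy (5 consecutive failures to open, 60-second cool-off); in production a library such as pybreaker covers the same ground:

import time

class CircuitBreaker:
    """Opens after failure_threshold consecutive failures; allows a probe after reset_timeout seconds."""
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let one request through after the cool-off period
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

Callers check allow_request() before each LLM call and route to the backup provider while it returns False.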
Fallback Strategies:
- Try GPT-5, fallback to Claude 3 on error
- Try production API, fallback to cached response
- Return gracefully degraded response (e.g., “Try again later”)
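A sketch of that chain, written generically over two provider callables (for example, wrappers around the GPT-5 and Claude 3 clients) so it stays provider-agnostic:

import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

async def answer_with_fallback(
    prompt: str,
    primary: Callable[[str], Awaitable[str]],
    fallback: Callable[[str], Awaitable[str]],
) -> str:
    """Try the primary provider, then the fallback, then degrade gracefully."""
    try:
        return await primary(prompt)
    except Exception as primary_error:
        logger.warning("Primary provider failed: %s", primary_error)
        try:
            return await fallback(prompt)
        except Exception as fallback_error:
            logger.error("Fallback provider failed: %s", fallback_error)
            return "The service is temporarily unavailable. Please try again later."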
Database Architecture
Vector Database for RAG
Schema Design for Pinecone:
Namespace: production_docs
Dimensions: 1536 (OpenAI embeddings)
Metric: cosine similarity
Metadata fields:
- document_id: string (partition key)
- chunk_id: string
- created_at: timestamp
- category: string (for filtering)
- access_level: string (public/internal/restricted)
Partitioning Strategy:
- Namespace per client or data source
- Metadata filtering for access control
- Separate indexes for different languages
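Example: a query against that schema with namespace and metadata filtering, assuming the pinecone-client v3+ SDK and placeholder names for the index, namespace, and embedding:

import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("production-docs")  # placeholder index name

query_embedding = [0.0] * 1536  # in practice, the embedding of the user's query

# Search one client's namespace, restricted to public documents
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="client_a",                        # namespace per client, as above
    filter={"access_level": {"$eq": "public"}},  # metadata filtering for access control
    include_metadata=True,
)
for match in results.matches:
    print(match.id, match.score, match.metadata.get("document_id"))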
Relational Database for Metadata
PostgreSQL Schema:
-- Documents table
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    title VARCHAR(500),
    source_url TEXT,
    uploaded_at TIMESTAMP DEFAULT NOW(),
    chunk_count INT,
    last_indexed_at TIMESTAMP
);

-- Query logs for analytics
CREATE TABLE query_logs (
    id UUID PRIMARY KEY,
    user_id UUID,
    query_text TEXT,
    model_used VARCHAR(50),
    prompt_tokens INT,
    completion_tokens INT,
    latency_ms INT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create indexes for performance
CREATE INDEX idx_query_logs_user_created ON query_logs(user_id, created_at DESC);
CREATE INDEX idx_documents_last_indexed ON documents(last_indexed_at);
Connection Pooling:
import os

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

DATABASE_URL = os.environ["DATABASE_URL"]  # e.g. postgresql://user:pass@host:5432/db

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,        # Persistent connections kept in the pool
    max_overflow=10,     # Additional connections during spikes
    pool_pre_ping=True,  # Verify connections before use
    pool_recycle=3600    # Recycle connections after 1 hour
)
Caching Strategy
Multi-Layer Caching
Layer 1: In-Memory Cache (LRU)
- Store: 1000 most recent queries
- TTL: 5 minutes
- Hit rate target: 30-40%
Layer 2: Redis Semantic Cache
- Store: embeddings-based similarity matches
- TTL: 1-24 hours (depending on data freshness)
- Hit rate target: 20-30%
Layer 3: CDN Cache (for static content)
- Cache API responses with immutable data
- TTL: 1-7 days
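Layer 1 can be a TTL-bounded LRU map held in each API process. A minimal sketch, assuming the cachetools package and the cache-key helper defined below:

from cachetools import TTLCache

# Layer 1: exact-match cache, 1000 entries, 5-minute TTL
l1_cache: TTLCache = TTLCache(maxsize=1000, ttl=300)

def get_cached_answer(cache_key: str) -> str | None:
    """Return a cached answer or None; TTLCache handles LRU eviction and expiry."""
    return l1_cache.get(cache_key)

def set_cached_answer(cache_key: str, answer: str) -> None:
    l1_cache[cache_key] = answer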
Cache Implementation
import hashlib
import json

import numpy as np
from redis.commands.search.query import Query

def generate_cache_key(query: str, context: dict) -> str:
    """Deterministic cache key generation"""
    payload = json.dumps({"query": query, "context": context}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Semantic caching with embeddings.
# Assumes an async redis client and a RediSearch index "idx:queries" with a
# FLOAT32 vector field "embedding" (COSINE metric) and a text field "response".
async def get_similar_cached_response(query_embedding: list[float]):
    """Return a cached response whose query embedding is close enough to the new query."""
    knn_query = (
        Query("*=>[KNN 1 @embedding $vec AS distance]")
        .sort_by("distance")
        .return_fields("response", "distance")
        .dialect(2)
    )
    result = await redis.ft("idx:queries").search(
        knn_query,
        query_params={"vec": np.array(query_embedding, dtype=np.float32).tobytes()},
    )
    # Cosine distance below 0.05 roughly corresponds to similarity above 0.95
    if result.docs and float(result.docs[0].distance) < 0.05:
        return result.docs[0].response
    return None
Monitoring & Observability
Key Metrics to Track
Performance Metrics:
- Request latency (p50, p95, p99)
- Token usage per request (input + output)
- Cache hit rates (by layer)
- Error rates by endpoint
- LLM provider response times
Business Metrics:
- Queries per user/day
- Cost per query
- User satisfaction scores
- Feature usage distribution
System Metrics:
- CPU/memory utilization by service
- Database connection pool usage
- Queue depth for async tasks
- Rate limit violations
Monitoring Stack
[Application Metrics]
↓ (Prometheus format)
[Prometheus Server]
↓
[Grafana Dashboards]
↓ (Alerts)
[PagerDuty / Slack]
Implementation with Prometheus:
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
query_counter = Counter(
    'llm_queries_total',
    'Total LLM queries',
    ['model', 'status']
)

query_latency = Histogram(
    'llm_query_latency_seconds',
    'LLM query latency',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

active_connections = Gauge(
    'db_active_connections',
    'Active database connections'
)

# Use in code
@query_latency.labels(model='GPT-5').time()
def process_query(query: str):
    query_counter.labels(model='GPT-5', status='success').inc()
    # ... processing logic
Security Architecture
API Security Checklist
- [ ] Authentication: JWT-based auth, OAuth2/OIDC
- [ ] Authorization: Role-based access control (RBAC)
- [ ] Rate Limiting: Per-user and per-API-key limits
- [ ] Input Validation: Max lengths, content filtering
- [ ] API Key Management: Rotation, encryption at rest
- [ ] Audit Logging: All requests logged with user ID
Example: Authentication Middleware
import os

from fastapi import Security, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from jose import jwt, JWTError

SECRET_KEY = os.environ["JWT_SECRET_KEY"]  # loaded from a secrets manager in production
ALGORITHM = "HS256"

security = HTTPBearer()

async def verify_token(
    credentials: HTTPAuthorizationCredentials = Security(security)
) -> dict:
    token = credentials.credentials
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
        return payload
    except JWTError:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid authentication credentials"
        )
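The checklist above also calls for role-based access control; a minimal sketch building on verify_token, assuming the JWT payload carries a "roles" claim:

from fastapi import Depends, HTTPException, status

def require_role(required_role: str):
    """Dependency factory that checks a role claim in the verified JWT payload."""
    async def checker(payload: dict = Depends(verify_token)) -> dict:
        roles = payload.get("roles", [])  # assumes the token includes a "roles" claim
        if required_role not in roles:
            raise HTTPException(
                status_code=status.HTTP_403_FORBIDDEN,
                detail="Insufficient permissions",
            )
        return payload
    return checker

# Usage on an admin-only endpoint:
# @app.delete("/api/v1/documents/{doc_id}")
# async def delete_document(doc_id: str, user: dict = Depends(require_role("admin"))):
#     ...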
Data Protection
Sensitive Data Handling:
- Never log prompts/responses containing PII
- Encrypt stored embeddings at rest
- Use TLS 1.3 for all API communications
- Implement data retention policies (auto-delete old logs)
Content Filtering:
import re

def sanitize_input(text: str) -> str:
    """Remove or redact sensitive patterns"""
    # Redact email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Redact SSN patterns
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # Redact credit card patterns
    text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)
    return text
Deployment Checklist
Pre-Production
- [ ] Load Testing: 10x expected traffic (see the Locust sketch after this list)
- [ ] Security Audit: Penetration testing
- [ ] Cost Projections: Token cost modeling
- [ ] Runbook Creation: Incident response procedures
- [ ] On-Call Setup: PagerDuty rotation
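For the load-testing item, a minimal Locust sketch against the /api/v1/query endpoint defined earlier; the host and bearer token are placeholders supplied at run time:

from locust import HttpUser, task, between

class QueryUser(HttpUser):
    """Simulated user; run with: locust -f loadtest.py --host=https://staging.example.com"""
    wait_time = between(1, 3)  # seconds of think time between requests

    @task
    def query_documents(self):
        self.client.post(
            "/api/v1/query",
            json={"query": "What is the refund policy?", "temperature": 0.2},
            headers={"Authorization": "Bearer <test-token>"},  # placeholder token
        )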
Production Rollout
- [ ] Canary Deployment: 5% → 25% → 100% traffic
- [ ] Monitoring Dashboards: All metrics visible
- [ ] Alert Configuration: Critical alerts tested
- [ ] Backup Plans: Rollback procedure documented
Infrastructure as Code
Terraform Example:
resource "aws_ecs_task_definition" "ai_api" {
family = "ai-api"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "2048"
memory = "4096"
container_definitions = jsonencode([
{
name = "ai-api"
image = "${var.ecr_repository}/ai-api:${var.image_tag}"
cpu = 2048
memory = 4096
essential = true
portMappings = [{
containerPort = 8000
protocol = "tcp"
}]
environment = [
{ name = "ENVIRONMENT", value = "production" },
{ name = "LOG_LEVEL", value = "info" }
]
secrets = [
{ name = "OPENAI_API_KEY", valueFrom = aws_secretsmanager_secret.openai_key.arn }
]
}
])
}
Related Articles:
- Getting Started with AI Integration
- Deployment Guide: AWS/GCP/Azure Best Practices
- Monitoring Best Practices for AI Systems