API Infrastructure at Scale: Building for Billions of Requests
The Problem
Unfil AI's business was exploding, but its infrastructure wasn't keeping up. The generative AI boom sent API requests skyrocketing from 50M to 500M per month, and the company's monolithic architecture was buckling under the load.
CTO Sarah Kim said: “We were growing 10x quarter-over-quarter, but our API latency was degrading just as fast. Customers were abandoning us for competitors with faster response times.”
The Pain Points:
- P95 latency at 3.2 seconds (unusable for real-time applications)
- 4.7% error rate during peak hours
- Frequent rate limiting angering developers
- Single region deployment causing 600ms round-trips for EU/Asia customers
- Database connection exhaustion during spikes
The Solution
Kurai completely reimagined Unfil AI’s infrastructure as a globally distributed, event-driven platform:
Architecture Transformation:
Before:

[Load Balancer] → [Monolithic API] → [Single DB]
                         ↓
                   [Redis Cache]

After:

[Edge Nodes] → [API Gateway] → [Microservices] → [Event Bus]
     ↓                               ↓                 ↓
[CDN Cache]                  [Read Replicas]    [Worker Queue]
                                     ↓
                             [Write Master]
Key Components:
1. Microservices Split:
- Authentication service
- Embedding generation service
- Vector search service
- Rate limiting service
- Analytics service
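The case study names a rate limiting service but doesn't describe its algorithm. A token bucket is a common choice for smoothing bursts like the ones that angered developers here; the sketch below is illustrative (class name and parameters are assumptions, not Unfil AI's implementation):

```python
import time

class TokenBucket:
    """Per-customer token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: sustain 100 req/sec per customer, allow bursts up to 200.
bucket = TokenBucket(rate=100.0, capacity=200.0)
```

In a real deployment the bucket state would live in a shared store (e.g. Redis) so all gateway instances see the same counts.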
2. Edge Computing:
- Cloudflare Workers in 300+ locations
- Cache 80% of read requests at edge
- DDoS protection and bot filtering
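Caching 80% of reads at the edge means most requests never reach the origin. The read path is a standard cache-aside pattern with a TTL; here it is sketched in-process with Python's standard library (the Workers runtime API is not shown, and `TTLCache`/`fetch` are hypothetical names):

```python
import time

class TTLCache:
    """Minimal TTL cache illustrating the edge read path."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # fresh hit: no origin round-trip
        return None

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

def fetch(key, cache, origin_fetch):
    """Cache-aside: serve from the edge if fresh, else hit origin and cache the result."""
    value = cache.get(key)
    if value is None:
        value = origin_fetch(key)
        cache.set(key, value)
    return value
```

With an 80% hit rate, the origin sees only one request in five; TTL choice trades freshness against origin load.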
3. Database Optimization:
- PostgreSQL read replicas (1 master, 5 replicas)
- PgBouncer connection pooling (10K connections)
- Partitioned tables by customer_id
- Optimized indexes reduced query time by 85%
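PgBouncer solves the connection-exhaustion pain point by multiplexing many clients over a bounded set of database connections. The mechanism can be sketched with a bounded pool (a toy stand-in, not PgBouncer itself; `connect` is a placeholder for opening a real connection):

```python
import queue

class ConnectionPool:
    """Bounded pool: at most `size` connections exist; extra callers wait or fail fast."""

    def __init__(self, size, connect):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())  # pre-open a fixed number of connections

    def acquire(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout
        # instead of exhausting the database with a new connection.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

The key property is back-pressure: during a spike, requests queue for a pooled connection rather than opening thousands of new ones against PostgreSQL.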
4. Event-Driven Architecture:
- Apache Kafka for async processing
- 50 partitions for parallelism
- Dead letter queues for failed events
- Exactly-once semantics
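The dead-letter-queue pattern above can be sketched without a live Kafka cluster: retry each event a few times, and route persistent failures to a DLQ instead of blocking the partition. The function below simulates the control flow with plain Python lists (in production the dead letters would be produced to a separate Kafka topic):

```python
def process_with_dlq(events, handler, max_retries=3):
    """Retry each event up to max_retries; collect persistent failures as dead letters."""
    dead_letters = []
    for event in events:
        for attempt in range(1, max_retries + 1):
            try:
                handler(event)
                break  # processed successfully, move to next event
            except Exception:
                if attempt == max_retries:
                    # Would be produced to the DLQ topic for later inspection/replay.
                    dead_letters.append(event)
    return dead_letters
```

This keeps consumers moving during bursts: one poison message costs a bounded number of retries rather than stalling the whole queue.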
5. Auto-Scaling:
- Kubernetes Horizontal Pod Autoscaler
- Scale based on requests per second (RPS)
- Scale up: 30 seconds
- Scale down: 5 minutes
- Min 20 pods, max 500 pods
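The HPA's core rule is simple arithmetic: scale the replica count by the ratio of observed to target metric, then clamp to the configured bounds. A sketch of that calculation (the RPS numbers in the example are illustrative, not from the case study):

```python
import math

def desired_replicas(current_replicas, current_rps_per_pod, target_rps_per_pod,
                     min_pods=20, max_pods=500):
    """HPA rule: desired = ceil(current * observed / target), clamped to [min, max]."""
    desired = math.ceil(current_replicas * current_rps_per_pod / target_rps_per_pod)
    return max(min_pods, min(max_pods, desired))

# At 150 RPS/pod against a 100 RPS/pod target, 20 pods scale out to 30.
desired_replicas(20, 150, 100)
```

The asymmetric timings above (30s up, 5min down) come from the HPA's stabilization behavior: scale out fast to absorb spikes, scale in slowly to avoid flapping.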
The Results
Performance Improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| P95 Latency | 3,200ms | 180ms | 94% faster |
| P99 Latency | 8,500ms | 420ms | 95% faster |
| Error Rate | 4.7% | 0.02% | 99.6% reduction |
| Throughput | 200 req/sec | 5,000 req/sec | 25x capacity |
| Uptime | 99.5% | 99.99% | 50x less downtime |
Cost Savings:
- Before: $120K/month (AWS over-provisioned)
- After: $42K/month (auto-scaling + spot instances)
- Savings: $78K/month (65% reduction)
Geographic Reach:
- Before: Single region (us-east-1)
- After: 5 regions + edge caching
- Global latency: 600ms → 80ms average
Developer Experience:
“Unfil AI’s API went from a bottleneck to our fastest dependency. Integration took 10 minutes, and we haven’t seen a single timeout in 3 months.”
— Alex Rivera, Lead Engineer at PromptCraft
What’s Next
Phase 2 initiatives:
- GraphQL API for flexible queries
- WebSocket support for real-time streaming
- SDKs for Python, JavaScript, Go, and Rust
- API analytics dashboard for customers
- Custom fine-tuning endpoints
Technology Stack
- Runtime: Kubernetes (EKS) + Docker
- API Gateway: Kong Enterprise
- Service Mesh: Istio
- Message Queue: Apache Kafka (Confluent Cloud)
- Database: PostgreSQL 15 (Amazon RDS)
- Cache: Redis Cluster (ElastiCache)
- Edge: Cloudflare Workers + KV
- Monitoring: Datadog + OpenTelemetry
- CI/CD: GitHub Actions + ArgoCD
Timeline
- Month 1: Microservices design and Kafka setup
- Month 2: Database migration and read replicas
- Month 3: Kubernetes deployment and auto-scaling
- Month 4: Edge caching and CDN integration
- Month 5: Load testing and optimization
- Month 6: Gradual traffic rollout (10% → 100%)
Lessons Learned
- Measure everything: We traced 100M requests to find 3 critical bottlenecks
- Cache aggressively: Edge caching reduced origin load by 80%
- Embrace async: Synchronous processing doesn’t scale; Kafka handles bursts gracefully
- Test at scale: Load testing with production traffic patterns caught 12 issues before launch
- Monitor costs: Spot instances saved 60% on compute with minimal impact
Unfil AI now processes 1B+ API requests monthly with sub-200ms latency globally. Their platform has become the go-to choice for developers building AI-powered applications.