API Infrastructure at Scale: Building for Billions of Requests
The Problem
Unfil AI's business was exploding, but its infrastructure wasn't keeping up. The generative AI boom sent API requests skyrocketing from 50M to 500M per month, and the company's monolithic architecture was buckling under the load.
CTO Sarah Kim said: “We were growing 10x quarter-over-quarter, but our API latency was degrading just as fast. Customers were abandoning us for competitors with faster response times.”
The Pain Points:
- P95 latency at 3.2 seconds (unusable for real-time applications)
- 4.7% error rate during peak hours
- Frequent rate limiting angering developers
- Single region deployment causing 600ms round-trips for EU/Asia customers
- Database connection exhaustion during spikes
The Solution
Kurai completely reimagined Unfil AI’s infrastructure as a globally distributed, event-driven platform:
Architecture Transformation:
Before:

[Load Balancer] → [Monolithic API] → [Single DB]
                         ↓
                   [Redis Cache]

After:

[Edge Nodes] → [API Gateway] → [Microservices] → [Event Bus]
     ↓                               ↓                 ↓
[CDN Cache]                  [Read Replicas]    [Worker Queue]
                                     ↓
                             [Write Master]
Key Components:
1. Microservices Split:
- Authentication service
- Embedding generation service
- Vector search service
- Rate limiting service
- Analytics service
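The case study names a rate limiting service but doesn't describe its algorithm. A token bucket is a common choice for smoothing bursts like the ones that angered developers here; the sketch below is illustrative (class name and parameters are assumptions, not Unfil AI's implementation):

```python
import time

class TokenBucket:
    """Per-customer token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: sustain 100 req/sec per customer, allow bursts up to 200.
bucket = TokenBucket(rate=100.0, capacity=200.0)
```

In a real deployment the bucket state would live in a shared store (e.g. Redis) so all gateway instances see the same counts.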
2. Edge Computing:
- Cloudflare Workers in 300+ locations
- Cache 80% of read requests at edge
- DDoS protection and bot filtering
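Caching 80% of reads at the edge means most requests never reach the origin. The read path is a standard cache-aside pattern with a TTL; here it is sketched in-process with Python's standard library (the Workers runtime API is not shown, and `TTLCache`/`fetch` are hypothetical names):

```python
import time

class TTLCache:
    """Minimal TTL cache illustrating the edge read path."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # fresh hit: no origin round-trip
        return None

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

def fetch(key, cache, origin_fetch):
    """Cache-aside: serve from the edge if fresh, else hit origin and cache the result."""
    value = cache.get(key)
    if value is None:
        value = origin_fetch(key)
        cache.set(key, value)
    return value
```

With an 80% hit rate, the origin sees only one request in five; TTL choice trades freshness against origin load.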
3. Database Optimization:
- PostgreSQL read replicas (1 master, 5 replicas)
- PgBouncer connection pooling (10K connections)
- Partitioned tables by customer_id
- Optimized indexes reduced query time by 85%
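PgBouncer solves the connection-exhaustion pain point by multiplexing many clients over a bounded set of database connections. The mechanism can be sketched with a bounded pool (a toy stand-in, not PgBouncer itself; `connect` is a placeholder for opening a real connection):

```python
import queue

class ConnectionPool:
    """Bounded pool: at most `size` connections exist; extra callers wait or fail fast."""

    def __init__(self, size, connect):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())  # pre-open a fixed number of connections

    def acquire(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout
        # instead of exhausting the database with a new connection.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

The key property is back-pressure: during a spike, requests queue for a pooled connection rather than opening thousands of new ones against PostgreSQL.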
4. Event-Driven Architecture:
- Apache Kafka for async processing
- 50 partitions for parallelism
- Dead letter queues for failed events
- Exactly-once semantics
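The dead-letter-queue pattern above can be sketched without a live Kafka cluster: retry each event a few times, and route persistent failures to a DLQ instead of blocking the partition. The function below simulates the control flow with plain Python lists (in production the dead letters would be produced to a separate Kafka topic):

```python
def process_with_dlq(events, handler, max_retries=3):
    """Retry each event up to max_retries; collect persistent failures as dead letters."""
    dead_letters = []
    for event in events:
        for attempt in range(1, max_retries + 1):
            try:
                handler(event)
                break  # processed successfully, move to next event
            except Exception:
                if attempt == max_retries:
                    # Would be produced to the DLQ topic for later inspection/replay.
                    dead_letters.append(event)
    return dead_letters
```

This keeps consumers moving during bursts: one poison message costs a bounded number of retries rather than stalling the whole queue.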
5. Auto-Scaling:
- Kubernetes Horizontal Pod Autoscaler
- Scale based on requests per second (RPS)
- Scale up: 30 seconds
- Scale down: 5 minutes
- Min 20 pods, max 500 pods
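The HPA's core rule is simple arithmetic: scale the replica count by the ratio of observed to target metric, then clamp to the configured bounds. A sketch of that calculation (the RPS numbers in the example are illustrative, not from the case study):

```python
import math

def desired_replicas(current_replicas, current_rps_per_pod, target_rps_per_pod,
                     min_pods=20, max_pods=500):
    """HPA rule: desired = ceil(current * observed / target), clamped to [min, max]."""
    desired = math.ceil(current_replicas * current_rps_per_pod / target_rps_per_pod)
    return max(min_pods, min(max_pods, desired))

# At 150 RPS/pod against a 100 RPS/pod target, 20 pods scale out to 30.
desired_replicas(20, 150, 100)
```

The asymmetric timings above (30s up, 5min down) come from the HPA's stabilization behavior: scale out fast to absorb spikes, scale in slowly to avoid flapping.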
The Results
Performance Improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| P95 Latency | 3,200ms | 180ms | 94% faster |
| P99 Latency | 8,500ms | 420ms | 95% faster |
| Error Rate | 4.7% | 0.02% | 99.6% reduction |
| Throughput | 200 req/sec | 5,000 req/sec | 25x capacity |
| Uptime | 99.5% | 99.99% | 50x less downtime |
Cost Savings:
- Before: $120K/month (AWS over-provisioned)
- After: $42K/month (auto-scaling + spot instances)
- Savings: $78K/month (65% reduction)
Geographic Reach:
- Before: Single region (us-east-1)
- After: 5 regions + edge caching
- Global latency: 600ms → 80ms average
Developer Experience:
“Unfil AI’s API went from a bottleneck to our fastest dependency. Integration took 10 minutes, and we haven’t seen a single timeout in 3 months.”
— Alex Rivera, Lead Engineer at PromptCraft
What’s Next
Phase 2 initiatives:
- GraphQL API for flexible queries
- WebSocket support for real-time streaming
- SDKs for Python, JavaScript, Go, and Rust
- API analytics dashboard for customers
- Custom fine-tuning endpoints
Technology Stack
- Runtime: Kubernetes (EKS) + Docker
- API Gateway: Kong Enterprise
- Service Mesh: Istio
- Message Queue: Apache Kafka (Confluent Cloud)
- Database: PostgreSQL 15 (Amazon RDS)
- Cache: Redis Cluster (ElastiCache)
- Edge: Cloudflare Workers + KV
- Monitoring: Datadog + OpenTelemetry
- CI/CD: GitHub Actions + ArgoCD
Timeline
- Month 1: Microservices design and Kafka setup
- Month 2: Database migration and read replicas
- Month 3: Kubernetes deployment and auto-scaling
- Month 4: Edge caching and CDN integration
- Month 5: Load testing and optimization
- Month 6: Gradual traffic rollout (10% → 100%)
Lessons Learned
- Measure everything: We traced 100M requests to find 3 critical bottlenecks
- Cache aggressively: Edge caching reduced origin load by 80%
- Embrace async: Synchronous processing doesn’t scale; Kafka handles bursts gracefully
- Test at scale: Load testing with production traffic patterns caught 12 issues before launch
- Monitor costs: Spot instances saved 60% on compute with minimal impact
Unfil AI now processes 1B+ API requests monthly with sub-200ms latency globally. Their platform has become the go-to choice for developers building AI-powered applications.