Monolith to Microservices: 40% Cost Reduction
The Problem
TeamFlow had grown fast—and their monolithic architecture was showing cracks:
Symptoms:
- Deployments took 45 minutes (entire app rebuilt each time)
- Database locks caused 5-10 minute outages weekly
- Scaling the app meant scaling everything (even rarely used reporting features)
- Developer productivity: Teams waited days for other teams to finish features
- AWS bill: $85K/month and climbing
CTO Mike Johnson: “Every deployment was a white-knuckle experience. One bug in reporting could take down the entire app. We couldn’t scale teams or infrastructure independently.”
The Solution
Kurai executed a 6-month microservices migration using the strangler fig pattern:
Phase 1: Identify Boundaries (Month 1)
```python
# Domain-driven design to identify service boundaries
services = {
    "user_management": {
        "responsibility": "Auth, profiles, permissions",
        "database": "PostgreSQL",
        "apis": "/api/v1/users/*, /api/v1/auth/*",
    },
    "tasks": {
        "responsibility": "Task CRUD, assignments, due dates",
        "database": "PostgreSQL",
        "apis": "/api/v1/tasks/*",
    },
    "notifications": {
        "responsibility": "Email, push, in-app notifications",
        "database": "MongoDB (document store)",
        "apis": "/api/v1/notifications/*",
    },
    "reporting": {
        "responsibility": "Analytics, dashboards, exports",
        "database": "TimescaleDB (time-series)",
        "apis": "/api/v1/reports/*",
    },
}
```
Phase 2: Extract Services Incrementally (Months 2-5)
Extraction Order (Low Risk → High Risk):
- Notifications (Month 2) - Non-critical, read-heavy
- Reporting (Month 3) - Separate database, isolated workload
- Time Tracking (Month 4) - Moderate complexity
- Tasks (Month 5) - Core feature, high complexity
Migration Strategy for Each Service:
1. Create API Gateway (Kong)
   - Route traffic to old monolith OR new service
   - Feature flags for gradual traffic shift
   - Observability from day one
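   A minimal sketch of that routing decision, assuming a hypothetical in-code `ROLLOUT` flag store (in practice the flags live in the feature-flag service and Kong upstream weights, not in code):

   ```python
   import hashlib

   # Hypothetical flag store; real values would come from the
   # feature-flag service, not from code.
   ROLLOUT = {"tasks": {"enabled": True, "percent": 5}}  # 5% to new service

   def route_request(service: str, user_id: str) -> str:
       """Pick the upstream for a request: new service or monolith.

       Hashing the user ID gives sticky, deterministic buckets, so a
       given user consistently hits the same backend during the shift.
       """
       flag = ROLLOUT.get(service, {})
       if not flag.get("enabled"):
           return "monolith"
       bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
       return f"{service}-service" if bucket < flag["percent"] else "monolith"

   print(route_request("tasks", "user-42"))  # -> "tasks-service" or "monolith"
   ```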
2. Implement Data Synchronization
   - Change Data Capture (CDC) with Debezium
   - Dual-write pattern during transition
   - Event bus (Kafka) for async communication
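   A sketch of the dual-write step, assuming kafka-python and hypothetical `monolith_db` / `service_db` clients; once the Debezium connector is live, CDC replaces the explicit second write:

   ```python
   import json

   from kafka import KafkaProducer  # pip install kafka-python

   producer = KafkaProducer(
       bootstrap_servers="kafka:9092",  # illustrative broker address
       value_serializer=lambda v: json.dumps(v).encode("utf-8"),
   )

   def create_task(monolith_db, service_db, task: dict) -> None:
       """Dual-write: the monolith DB stays the source of truth while the
       new service's DB is kept in sync during the transition."""
       monolith_db.insert("tasks", task)  # 1. old store (source of truth)
       service_db.insert("tasks", task)   # 2. new store (kept in sync)
       # 3. publish an event so notifications/reporting react asynchronously
       producer.send("tasks.created", value=task)
       producer.flush()
   ```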
3. Deploy Service (Kubernetes)

   ```yaml
   # Kubernetes Deployment for the extracted tasks service; image name illustrative
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: tasks-service
   spec:
     replicas: 3  # scale independently of the monolith
     selector:
       matchLabels:
         app: tasks-service
     template:
       metadata:
         labels:
           app: tasks-service
       spec:
         containers:
           - name: tasks-service
             image: teamflow/tasks-service:latest
             resources:
               requests:
                 cpu: "500m"
                 memory: "512Mi"
               limits:
                 cpu: "2000m"
                 memory: "2Gi"
   ```
4. Migrate Gradually
   - Week 1: Internal testing (1% traffic)
   - Week 2: Beta customers (5% traffic)
   - Weeks 3-4: 50% traffic split
   - Week 5: 100% to new service
   - Week 6: Remove old code from monolith
Phase 3: Database Decoupling (Months 5-6)
Approach: Database per Service + API Joins
- Each service owns its database
- Cross-service data via API calls or events (see the sketch after this list)
- No distributed transactions (eventual consistency)
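A sketch of such an "API join", assuming a hypothetical batch-lookup endpoint on the user service; the reporting service enriches its own rows over HTTP instead of joining across databases:

```python
import requests  # pip install requests

# Illustrative internal URL; the real user-service endpoint will differ.
USER_SERVICE = "http://user-service.internal/api/v1/users"

def enrich_report_rows(rows: list[dict]) -> list[dict]:
    """Replace a SQL join with an API call to the owning service.

    The returned names may be seconds stale (eventual consistency),
    an acceptable trade-off for reporting workloads.
    """
    user_ids = sorted({str(row["user_id"]) for row in rows})
    resp = requests.get(USER_SERVICE, params={"ids": ",".join(user_ids)}, timeout=2)
    resp.raise_for_status()
    names = {u["id"]: u["name"] for u in resp.json()}
    return [
        {**row, "user_name": names.get(row["user_id"], "unknown")}
        for row in rows
    ]
```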
Infrastructure:
- Orchestration: Kubernetes (EKS, 12 nodes, m5.2xlarge)
- Service Mesh: Istio for traffic management and observability
- Message Queue: Kafka (3 brokers, MSK)
- API Gateway: Kong with rate limiting per service
- Databases: 6x RDS PostgreSQL, 1x MongoDB Atlas, 1x TimescaleDB
- Monitoring: Prometheus + Grafana + Loki
- CI/CD: GitHub Actions + ArgoCD (GitOps)
The Results
Cost Impact:
- Previous: $85K/month (over-provisioned monolith)
- New: $51K/month (right-sized services)
- Savings: $408K/year (40% reduction)
Performance:
- Deployment time: 45 min → 8 min (82% faster)
- Uptime: 99.7% → 99.95% (4x fewer incidents)
- Database CPU: 70% → 35% (better isolation)
- P95 latency: 450ms → 180ms (60% faster)
Developer Productivity:
- Feature release cycle: 2 weeks → 3 days
- Team independence: 3x more parallel development
- Onboarding time: 4 weeks → 2 weeks
Scalability:
- Scale individual services independently:
  - Reporting service: 2 pods (rarely used)
  - Tasks service: 10 pods (heavy traffic)
  - Notifications service: 5 pods + auto-scaling
- Handle traffic spikes without over-provisioning
Service-Level Performance
| Service | P95 Latency | Throughput | Uptime |
|---|---|---|---|
| User Management | 45ms | 2.5K req/s | 99.98% |
| Tasks | 120ms | 5.8K req/s | 99.95% |
| Notifications | 180ms | 1.2K req/s | 99.97% |
| Reporting | 350ms | 0.8K req/s | 99.92% |
| Time Tracking | 95ms | 3.4K req/s | 99.96% |
Customer Feedback
“Our reporting features used to slow down the entire app. Now reporting runs on its own infrastructure and the core app is snappy even during heavy usage.”
— Sarah Lee, Engineering Lead at TeamFlow
What’s Next
Next on the roadmap:
- Multi-region deployment: US-East + EU-West to reduce latency for global users
- Service-level authorization: Fine-grained permissions per microservice
- Contract testing: Pact for consumer-driven contract testing
- Chaos engineering: Gremlin for failure injection testing
Technology Stack
- Orchestration: Kubernetes (EKS 1.27)
- Service Mesh: Istio 1.18
- API Gateway: Kong 3.3
- Message Broker: Kafka (MSK, 3 brokers)
- Databases: PostgreSQL 15, MongoDB 6, TimescaleDB 2.10
- CI/CD: GitHub Actions + ArgoCD 2.6
- Monitoring: Prometheus + Grafana + Tempo + Loki
- Languages: Python 3.11, Node.js 20
- Frameworks: FastAPI 0.100, Express 4.18
Key Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Monthly AWS cost | $85K | $51K | -40% |
| Deployment time | 45 min | 8 min | -82% |
| Uptime | 99.7% | 99.95% | 4x fewer incidents |
| P95 latency | 450ms | 180ms | -60% |
| Feature cycle | 2 weeks | 3 days | -79% |
Architecture Changes
Before (Monolith):
```
[Load Balancer] → [Monolithic App] → [Single Database]
                        ↓
               (All services coupled)
```
After (Microservices):
```
[API Gateway] → [Service Mesh] → [User Service]          → [PostgreSQL #1]
                               → [Tasks Service]         → [PostgreSQL #2]
                               → [Notifications Service] → [MongoDB]
                               → [Reporting Service]     → [TimescaleDB]
                               → [Kafka] ← [All Services]
```
Timeline
- Month 1: Service boundaries, API gateway setup
- Month 2: Notifications service extraction
- Month 3: Reporting service extraction
- Month 4: Time tracking service extraction
- Month 5: Tasks service extraction (most complex)
- Month 6: Optimization, monitoring, documentation
Lessons Learned
- Extract low-risk services first: Build confidence before touching core features
- Invest in observability early: Distributed tracing is non-negotiable
- Embrace eventual consistency: Distributed transactions will kill you
- Version APIs from day one: Breaking changes are inevitable (a minimal sketch follows this list)
- Service ownership matters: One team per service, no shared code
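A minimal FastAPI sketch of that versioning lesson (the stack above lists FastAPI 0.100); the route shapes are illustrative, not TeamFlow's actual contract:

```python
from fastapi import APIRouter, FastAPI

app = FastAPI()

# v1 keeps the original response shape for existing clients
v1 = APIRouter(prefix="/api/v1/tasks")

@v1.get("/{task_id}")
def get_task_v1(task_id: int) -> dict:
    return {"id": task_id, "assignee": "user-42"}

# v2 changes the contract without breaking v1 consumers
v2 = APIRouter(prefix="/api/v2/tasks")

@v2.get("/{task_id}")
def get_task_v2(task_id: int) -> dict:
    return {"id": task_id, "assignees": ["user-42"]}  # assignee is now a list

app.include_router(v1)
app.include_router(v2)
```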
TeamFlow now scales teams and infrastructure independently. They deploy 15x per day per service vs. once per week before, and their AWS bill is 40% lower despite 3x user growth.