Cost Optimization for AI Projects
Strategies and tactics for reducing LLM API costs and cloud infrastructure expenses while maintaining quality.

LLM API costs can quickly spiral out of control without proper optimization. This guide provides actionable strategies to reduce costs by 40-80% while maintaining (or improving) response quality.
Cost Optimization Quick Wins
1. Intelligent Caching (30-50% savings)
Semantic Caching: Cache responses for semantically similar queries, not just exact matches.
```python
import json
from sklearn.metrics.pairwise import cosine_similarity

async def get_semantic_cache_response(query: str, threshold: float = 0.95):
    """Check the cache for semantically similar queries"""
    query_embedding = await get_embedding(query)  # your embedding helper

    # Most recent cached entries: a sorted set whose members are JSON blobs
    # of {"query", "embedding", "response"} (see the write-side sketch below)
    cached_entries = await redis.zrange("query_embeddings", 0, 10)

    for raw in cached_entries:
        entry = json.loads(raw)
        similarity = cosine_similarity(
            [query_embedding],
            [entry["embedding"]]
        )[0][0]
        if similarity >= threshold:
            logger.info("cache_hit", similarity=similarity)
            return entry["response"]

    return None  # Cache miss
```
Implementation Tips:
- Set similarity threshold: 0.93-0.97 for most use cases
- Cache TTL: 1-24 hours based on data freshness needs
- Cache size: Store 10K-100K most recent queries
- Expected savings: 30-50% reduction in API calls
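The read path above only works if something populates the `query_embeddings` sorted set. A minimal write-side sketch, assuming the same `get_embedding` helper and async Redis client as above (the entry layout and key names are illustrative, not a prescribed schema):

```python
import json
import time

async def cache_response(query: str, response: str, max_entries: int = 10_000):
    """Store the query's embedding and response so future similar queries can hit the cache"""
    embedding = await get_embedding(query)
    entry = json.dumps({"query": query, "embedding": embedding, "response": response})

    # Score by negative timestamp so the newest entries sort first for zrange(0, 10)
    await redis.zadd("query_embeddings", {entry: -time.time()})

    # Trim to the most recent max_entries and refresh the TTL (1-24 hours per the tips above)
    await redis.zremrangebyrank("query_embeddings", max_entries, -1)
    await redis.expire("query_embeddings", 60 * 60 * 24)
```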
2. Model Routing (40-60% savings)
Route queries to the cheapest model that can handle them adequately.
```python
async def route_query(query: str, complexity: str):
    """Route to an appropriate model based on query analysis"""
    # Simple queries → Haiku ($0.00125/1K output tokens)
    if complexity == "simple":
        return await call_claude_haiku(query)
    # Medium complexity → Sonnet ($0.015/1K output tokens)
    elif complexity == "medium":
        return await call_claude_sonnet(query)
    # Complex reasoning → GPT-4 Turbo or Opus ($0.03-0.075/1K output tokens)
    else:
        return await call_gpt4_turbo(query)

# Classification logic
def classify_query_complexity(query: str) -> str:
    """Determine query complexity"""
    if len(query) < 100 and "explain" not in query.lower():
        return "simple"
    elif requires_reasoning(query):  # your own heuristic or classifier
        return "complex"
    else:
        return "medium"
```
Cost Comparison Example:
- 10K queries/day, all through GPT-4 Turbo: $600/month
- 60% Haiku, 30% Sonnet, 10% GPT-4 Turbo: $180/month (70% savings)
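To sanity-check a blend like this against your own traffic, a rough cost model takes a few lines of Python. The per-query token counts below (~50 prompt + ~50 completion tokens) are illustrative assumptions, and the result lands in the same ballpark as the figures above; real savings depend on your actual token usage:

```python
# Prices are per 1K tokens; assume ~50 prompt + ~50 completion tokens per query (illustrative)
PROMPT_TOK, COMPLETION_TOK = 50, 50

def per_query(prompt_price: float, completion_price: float) -> float:
    return PROMPT_TOK / 1000 * prompt_price + COMPLETION_TOK / 1000 * completion_price

COST_PER_QUERY = {
    "haiku": per_query(0.00025, 0.00125),      # ~$0.000075
    "sonnet": per_query(0.003, 0.015),         # ~$0.0009
    "gpt-4-turbo": per_query(0.01, 0.03),      # ~$0.002
}
MIX = {"haiku": 0.60, "sonnet": 0.30, "gpt-4-turbo": 0.10}

queries = 10_000 * 30  # 10K queries/day for a month
baseline = COST_PER_QUERY["gpt-4-turbo"] * queries                 # ~$600/month
routed = sum(MIX[m] * COST_PER_QUERY[m] for m in MIX) * queries    # ~$155/month
```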
3. Prompt Optimization (15-30% savings)
Reduce token usage through efficient prompting.
Before:

```
Please act as a helpful customer service assistant. I need you to
carefully read the following customer query and provide a detailed,
thoughtful response that addresses all their concerns. Make sure to
be friendly and professional in your tone. Here is the query: {query}
```

(48 tokens in prompt template)

After:

```
Helpful assistant. Query: {query}
```

(6 tokens in prompt template)
Savings: 42 tokens saved per query × 10K queries/day × $0.01/1K tokens = $4.20/day (~$126/month)
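To measure template sizes yourself before and after a change like this, the `tiktoken` library can count tokens directly. A quick sketch (exact counts vary slightly by model and tokenizer):

```python
import tiktoken

# cl100k_base is the tokenizer family used by GPT-4-class chat models
enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "Please act as a helpful customer service assistant. I need you to "
    "carefully read the following customer query and provide a detailed, "
    "thoughtful response that addresses all their concerns. Make sure to "
    "be friendly and professional in your tone. Here is the query: "
)
compact = "Helpful assistant. Query: "

print(len(enc.encode(verbose)), len(enc.encode(compact)))  # roughly 48 vs 6 tokens
```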
4. Response Streaming (5-10% savings + better UX)
Stream responses to:
- Reduce perceived latency (users see output immediately)
- Allow early termination if user is satisfied
- Enable token counting for accurate cost tracking
```python
from fastapi.responses import StreamingResponse

@app.post("/api/v1/query-stream")
async def query_stream(request: QueryRequest):
    async def generate():
        # stream=True returns an async iterator of chunks once awaited
        stream = await openai.ChatCompletion.acreate(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": request.query}],
            stream=True
        )
        async for chunk in stream:
            yield chunk.choices[0].delta.get("content", "")

    return StreamingResponse(generate(), media_type="text/plain")
```
Infrastructure Cost Optimization
Vector Database Optimization
Pinecone Cost Calculator:
| Pod Type | Monthly Cost | Vectors | Use Case |
|---|---|---|---|
| Starter | $70 | 100K | MVP testing |
| Basic | $140 | 1M | Small production |
| Standard | $440 | 5M | Medium production |
| Highly Available | $880 | 10M | Enterprise |
Optimization Strategies:
- Namespace Partitioning: Use one namespace per client/tenant within a single index instead of separate indexes
- Metadata Filtering: Reduce search space before vector search
- Batch Upserts: Upload in batches of 100-500 vectors
- Delete Unused Vectors: Remove old documents regularly
Expected Savings: 20-40% on vector DB costs
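As a rough illustration of the first three strategies using the v2-style pinecone-client API (the API key, index name, namespace, and metadata fields below are placeholders, not a prescribed schema):

```python
import pinecone

# v2-style client initialization; key and environment are assumed config
pinecone.init(api_key=PINECONE_API_KEY, environment="us-east-1-aws")
index = pinecone.Index("documents")

# Batch upserts: 100-500 vectors per call, one namespace per tenant instead of one index
vectors = [(doc.id, doc.embedding, {"source": doc.source}) for doc in docs[:200]]
index.upsert(vectors=vectors, namespace="tenant-acme")

# Metadata filtering narrows the candidate set before the vector comparison runs
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"source": {"$eq": "product-docs"}},
    namespace="tenant-acme",
)
```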
Database Connection Pooling
Poor connection pooling = wasted database resources.
```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# BEFORE: default settings (small pool, no pre-ping or recycling)
engine = create_engine(DATABASE_URL)

# AFTER: Optimized pooling
engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,        # 20 concurrent connections
    max_overflow=10,     # 10 additional during spikes
    pool_pre_ping=True,  # Validate connections before use
    pool_recycle=3600,   # Recycle after 1 hour
    pool_timeout=30      # Wait up to 30s for a connection
)
```
Expected Savings: 30-50% reduction in database costs
Container Rightsizing
Common Mistake: Over-provisioned containers
```yaml
# BEFORE: Over-provisioned
resources:
  requests:
    cpu: "4000m"     # 4 CPUs
    memory: "8Gi"
  limits:
    cpu: "8000m"
    memory: "16Gi"
```

```yaml
# AFTER: Rightsized based on actual usage
resources:
  requests:
    cpu: "500m"      # 0.5 CPU
    memory: "1Gi"
  limits:
    cpu: "2000m"     # 2 CPUs (spikes)
    memory: "4Gi"
```
How to Right-Size:
- Monitor actual CPU/memory usage for 7 days
- Set requests = p95 usage
- Set limits = 2× requests
- Use horizontal pod autoscaling (HPA) for spikes
Expected Savings: 50-70% reduction in compute costs
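Turning monitoring data into concrete request/limit values is a one-liner once you have the samples. A sketch assuming you have exported a week of per-pod CPU readings (in millicores) from your metrics system; `load_cpu_samples` is a hypothetical helper:

```python
import numpy as np

cpu_samples_millicores = load_cpu_samples(days=7)  # hypothetical export from your metrics store

cpu_request = int(np.percentile(cpu_samples_millicores, 95))  # requests = p95 usage
cpu_limit = cpu_request * 2                                   # limits = 2x requests

print(f"cpu request: {cpu_request}m, cpu limit: {cpu_limit}m")
```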
Cloud Provider Selection
Cost Comparison by Region (per 100M tokens):
| Provider/Region | GPT-4 Turbo | Claude 3 Sonnet | Llama 2 70B |
|---|---|---|---|
| AWS us-east-1 | Baseline | Baseline | N/A |
| AWS us-west-2 | Same | Same | N/A |
| GCP us-central1 | Same | Same | 15% cheaper |
| Azure eastus | Same | Same | Same |
Tip: Use spot instances for non-critical workloads (60-80% cheaper)
Advanced Optimization Techniques
1. Token Compression
Technique: Use abbreviations and structured formats in prompts.
```python
def compress_prompt(prompt: str) -> str:
    """Compress a prompt to reduce tokens"""
    replacements = {
        "customer": "cust",
        "information": "info",
        "please": "plz",
        "question": "q",
        "answer": "ans",
        "instruction": "instr"
    }
    for old, new in replacements.items():
        prompt = prompt.replace(old, new)
    return prompt

# Use a terse structured template instead of verbose natural language
# ("CSA" = customer service assistant)
COMPACT_PROMPT = """
Role: CSA
Task: {task}
Context: {context}
Format: JSON
"""
```
Savings: 10-20% reduction in prompt tokens
2. Batch Processing
Process multiple queries in a single API call when possible.
```python
async def batch_process(queries: list[str], batch_size: int = 10):
    """Process multiple queries efficiently"""
    results = []
    for i in range(0, len(queries), batch_size):
        batch = queries[i:i + batch_size]
        # Single API call for the whole batch
        response = await openai.ChatCompletion.acreate(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "Process these queries:"},
                {"role": "user", "content": "\n".join(batch)}
            ]
        )
        # Split the batched response back into individual results (your own parser)
        for result in parse_batch_response(response):
            results.append(result)
    return results
```
Best For:
- Content classification
- Sentiment analysis
- Content moderation
- Text extraction
Savings: 30-50% through reduced API overhead
3. Caching Embeddings
Problem: Re-embedding documents wastes money
Solution: Cache embeddings permanently
```python
import hashlib
import json

async def get_cached_embedding(text: str) -> list[float]:
    """Get embedding from cache, or compute and store it"""
    cache_key = hashlib.sha256(text.encode()).hexdigest()

    # Check cache
    cached = await redis.get(f"emb:{cache_key}")
    if cached:
        return json.loads(cached)

    # Compute and cache
    response = await openai.Embedding.acreate(
        input=text,
        model="text-embedding-3-small"
    )
    embedding = response["data"][0]["embedding"]

    await redis.setex(
        f"emb:{cache_key}",
        86400 * 30,  # 30 days
        json.dumps(embedding)
    )
    return embedding
```
Cost: Embeddings are $0.00002/1K tokens ($0.02 per 1M tokens), so caching saves roughly $0.20 per 1M cached lookups of short texts (~10 tokens each), and proportionally more for longer document chunks.
4. Request Coalescing
Combine multiple similar requests into one.
```python
# BEFORE: one request per user
for user_id in user_list:
    await get_user_recommendations(user_id)

# AFTER: one batch request
await get_batch_recommendations(user_list)
```
Implementation:
- Queue requests for 100-500ms
- Batch similar requests
- Distribute results to original requesters
Savings: 20-40% on high-volume endpoints
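A minimal coalescer sketch built on asyncio futures; `get_batch_recommendations` is the batch endpoint from the example above, and the window length and error handling are deliberately kept simple:

```python
import asyncio

class RequestCoalescer:
    """Hold individual requests for a short window, then serve them with one batch call."""

    def __init__(self, batch_fn, window_ms: int = 200):
        self.batch_fn = batch_fn      # async fn: list[request] -> list[result]
        self.window = window_ms / 1000
        self.pending = []             # (request, future) pairs
        self.flush_task = None

    async def submit(self, request):
        future = asyncio.get_running_loop().create_future()
        self.pending.append((request, future))
        if self.flush_task is None:   # first request in the window starts the timer
            self.flush_task = asyncio.create_task(self._flush_after_window())
        return await future

    async def _flush_after_window(self):
        await asyncio.sleep(self.window)          # let similar requests accumulate
        batch, self.pending = self.pending, []
        self.flush_task = None
        results = await self.batch_fn([req for req, _ in batch])  # one call instead of N
        for (_, future), result in zip(batch, results):
            future.set_result(result)             # hand each caller its own result

# Usage
coalescer = RequestCoalescer(get_batch_recommendations)
# recommendations = await coalescer.submit(user_id)
```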
Monitoring and Budget Controls
Real-Time Cost Tracking
```python
import time
from datetime import datetime

# Pricing per 1K tokens
PRICING = {
    "gpt-4-turbo": {"prompt": 0.01, "completion": 0.03},
    "claude-3-sonnet": {"prompt": 0.003, "completion": 0.015}
}

async def call_with_cost_tracking(model: str, prompt: str):
    """Track costs in real time"""
    start = time.time()
    response = await openai.ChatCompletion.acreate(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    latency = time.time() - start

    # Calculate cost
    prompt_cost = (response.usage.prompt_tokens / 1000) * PRICING[model]["prompt"]
    completion_cost = (response.usage.completion_tokens / 1000) * PRICING[model]["completion"]
    total_cost = prompt_cost + completion_cost

    # Log metrics
    logger.info(
        "llm_request_completed",
        model=model,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        latency_ms=latency * 1000,
        cost_usd=total_cost
    )

    # Update real-time budget tracker
    today = datetime.now().strftime("%Y-%m-%d")
    await redis.incrbyfloat(f"cost:daily:{today}", total_cost)

    # Check budget (DAILY_BUDGET is application config)
    daily_cost = await redis.get(f"cost:daily:{today}")
    if float(daily_cost) > DAILY_BUDGET:
        logger.warning("daily_budget_exceeded", cost=daily_cost)
        raise BudgetExceededError()

    return response
```
Budget Alerting
```python
# Set up alerts
BUDGET_THRESHOLDS = {
    "warning": 0.7,   # Alert at 70% of budget
    "critical": 0.9,  # Alert at 90% of budget
    "enforce": 1.0    # Block requests at 100%
}

async def check_budget_alerts(daily_cost: float):
    budget = await get_daily_budget()
    usage_ratio = daily_cost / budget

    if usage_ratio >= BUDGET_THRESHOLDS["enforce"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"🚨 BUDGET EXCEEDED: ${daily_cost:.2f} spent"
        )
        raise BudgetExceededError()
    elif usage_ratio >= BUDGET_THRESHOLDS["critical"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"⚠️ CRITICAL: {usage_ratio*100:.0f}% of daily budget used"
        )
    elif usage_ratio >= BUDGET_THRESHOLDS["warning"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"⚡ WARNING: {usage_ratio*100:.0f}% of daily budget used"
        )
```
ROI Calculation Framework
Cost Per User
Total Monthly Cost = (LLM API costs) + (Infrastructure) + (Tools/Licenses)
Cost Per User = Total Monthly Cost / Monthly Active Users
Example:
- LLM API: $3,000/month
- Infrastructure: $1,500/month
- Tools: $500/month
- Total: $5,000/month
- MAU: 10,000 users
- Cost per user: $0.50
Revenue Per User
Break-even Analysis:
Price per user > Cost per user / (1 − desired margin)
Example:
- Cost per user: $0.50
- Desired margin: 50%
- Minimum price: $1.00/user
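The same arithmetic as a small reusable snippet (margin is treated here as a fraction of the selling price, which is the reading that matches the $1.00 figure above):

```python
def cost_per_user(llm_api: float, infra: float, tools: float, mau: int) -> float:
    """Total monthly cost divided by monthly active users."""
    return (llm_api + infra + tools) / mau

def minimum_price(cost: float, margin: float = 0.5) -> float:
    """Lowest viable price per user where `margin` of the price is profit."""
    return cost / (1 - margin)

cpu = cost_per_user(llm_api=3000, infra=1500, tools=500, mau=10_000)  # $0.50
price = minimum_price(cpu, margin=0.5)                                # $1.00
```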
Cost Optimization Checklist
Immediate Actions (Week 1)
- [ ] Implement semantic caching → Target: 30% savings
- [ ] Add model routing → Target: 40% savings
- [ ] Set up cost tracking → Real-time visibility
- [ ] Configure budget alerts → Prevent overspending
Short-Term (Month 1)
- [ ] Optimize prompts → Target: 15% savings
- [ ] Implement streaming → Better UX + 5% savings
- [ ] Right-size containers → Target: 50% infra savings
- [ ] Add connection pooling → Target: 30% DB savings
Long-Term (Quarter 1)
- [ ] Evaluate self-hosted models (Llama 2) → 70% savings at scale
- [ ] Implement fine-tuning → Better quality + reduced prompt size
- [ ] Explore spot instances → 60% compute savings
- [ ] Multi-region deployment → Optimized routing
Expected Total Savings
By implementing all strategies:
- LLM API Costs: 50-70% reduction
- Infrastructure: 40-60% reduction
- Total Projected Savings: 45-65%
Example:
- Before optimization: $10,000/month
- After optimization: $3,500-5,500/month
- Annual savings: $54,000-$78,000
Related Articles:
- Choosing the Right LLM for Your Use Case
- Backend Architecture Checklist for AI Applications
- Monitoring Best Practices for AI Systems