Cost Optimization for AI Projects
Strategies and tactics for reducing LLM API costs and cloud infrastructure expenses while maintaining quality.

LLM API costs can quickly spiral out of control without proper optimization. This guide provides actionable strategies to reduce costs by 40-80% while maintaining (or improving) response quality.
Cost Optimization Quick Wins
1. Intelligent Caching (30-50% savings)
Semantic Caching: Cache responses for semantically similar queries, not just exact matches.
```python
import json
from sklearn.metrics.pairwise import cosine_similarity

async def get_semantic_cache_response(query: str, threshold: float = 0.95):
    """Check the cache for semantically similar queries"""
    query_embedding = await get_embedding(query)  # your embedding helper

    # Most recent cached entries: a sorted set whose members are JSON blobs
    # of {"query", "embedding", "response"} (see the write-side sketch below)
    cached_entries = await redis.zrange("query_embeddings", 0, 10)

    for raw in cached_entries:
        entry = json.loads(raw)
        similarity = cosine_similarity(
            [query_embedding],
            [entry["embedding"]]
        )[0][0]
        if similarity >= threshold:
            logger.info("cache_hit", similarity=similarity)
            return entry["response"]

    return None  # Cache miss
```
Implementation Tips:
- Set similarity threshold: 0.93-0.97 for most use cases
- Cache TTL: 1-24 hours based on data freshness needs
- Cache size: Store 10K-100K most recent queries
- Expected savings: 30-50% reduction in API calls
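The read path above only works if something populates the `query_embeddings` sorted set. A minimal write-side sketch, assuming the same `get_embedding` helper and async Redis client as above (the entry layout and key names are illustrative, not a prescribed schema):

```python
import json
import time

async def cache_response(query: str, response: str, max_entries: int = 10_000):
    """Store the query's embedding and response so future similar queries can hit the cache"""
    embedding = await get_embedding(query)
    entry = json.dumps({"query": query, "embedding": embedding, "response": response})

    # Score by negative timestamp so the newest entries sort first for zrange(0, 10)
    await redis.zadd("query_embeddings", {entry: -time.time()})

    # Trim to the most recent max_entries and refresh the TTL (1-24 hours per the tips above)
    await redis.zremrangebyrank("query_embeddings", max_entries, -1)
    await redis.expire("query_embeddings", 60 * 60 * 24)
```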
2. Model Routing (40-60% savings)
Route queries to the cheapest model that can handle them adequately.
```python
async def route_query(query: str, complexity: str):
    """Route to an appropriate model based on query analysis"""
    # Simple queries → Haiku ($0.00125/1K output tokens)
    if complexity == "simple":
        return await call_claude_haiku(query)
    # Medium complexity → Sonnet ($0.015/1K output tokens)
    elif complexity == "medium":
        return await call_claude_sonnet(query)
    # Complex reasoning → GPT-4 Turbo or Opus ($0.03-0.075/1K output tokens)
    else:
        return await call_gpt4_turbo(query)

# Classification logic
def classify_query_complexity(query: str) -> str:
    """Determine query complexity"""
    if len(query) < 100 and "explain" not in query.lower():
        return "simple"
    elif requires_reasoning(query):  # your own heuristic or classifier
        return "complex"
    else:
        return "medium"
```
Cost Comparison Example:
- 10K queries/day, all through GPT-4 Turbo: $600/month
- 60% Haiku, 30% Sonnet, 10% GPT-4 Turbo: $180/month (70% savings)
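To sanity-check a blend like this against your own traffic, a rough cost model takes a few lines of Python. The per-query token counts below (~50 prompt + ~50 completion tokens) are illustrative assumptions, and the result lands in the same ballpark as the figures above; real savings depend on your actual token usage:

```python
# Prices are per 1K tokens; assume ~50 prompt + ~50 completion tokens per query (illustrative)
PROMPT_TOK, COMPLETION_TOK = 50, 50

def per_query(prompt_price: float, completion_price: float) -> float:
    return PROMPT_TOK / 1000 * prompt_price + COMPLETION_TOK / 1000 * completion_price

COST_PER_QUERY = {
    "haiku": per_query(0.00025, 0.00125),      # ~$0.000075
    "sonnet": per_query(0.003, 0.015),         # ~$0.0009
    "gpt-4-turbo": per_query(0.01, 0.03),      # ~$0.002
}
MIX = {"haiku": 0.60, "sonnet": 0.30, "gpt-4-turbo": 0.10}

queries = 10_000 * 30  # 10K queries/day for a month
baseline = COST_PER_QUERY["gpt-4-turbo"] * queries                 # ~$600/month
routed = sum(MIX[m] * COST_PER_QUERY[m] for m in MIX) * queries    # ~$155/month
```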
3. Prompt Optimization (15-30% savings)
Reduce token usage through efficient prompting.
Before:

```
Please act as a helpful customer service assistant. I need you to
carefully read the following customer query and provide a detailed,
thoughtful response that addresses all their concerns. Make sure to
be friendly and professional in your tone. Here is the query: {query}
```

(48 tokens in prompt template)

After:

```
Helpful assistant. Query: {query}
```

(6 tokens in prompt template)
Savings: 42 tokens saved per query × 10K queries/day × $0.01/1K tokens = $4.20/day (~$126/month)
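To measure template sizes yourself before and after a change like this, the `tiktoken` library can count tokens directly. A quick sketch (exact counts vary slightly by model and tokenizer):

```python
import tiktoken

# cl100k_base is the tokenizer family used by GPT-4-class chat models
enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "Please act as a helpful customer service assistant. I need you to "
    "carefully read the following customer query and provide a detailed, "
    "thoughtful response that addresses all their concerns. Make sure to "
    "be friendly and professional in your tone. Here is the query: "
)
compact = "Helpful assistant. Query: "

print(len(enc.encode(verbose)), len(enc.encode(compact)))  # roughly 48 vs 6 tokens
```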
4. Response Streaming (5-10% savings + better UX)
Stream responses to:
- Reduce perceived latency (users see output immediately)
- Allow early termination if user is satisfied
- Enable token counting for accurate cost tracking
```python
from fastapi.responses import StreamingResponse

@app.post("/api/v1/query-stream")
async def query_stream(request: QueryRequest):
    async def generate():
        # stream=True returns an async iterator of chunks once awaited
        stream = await openai.ChatCompletion.acreate(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": request.query}],
            stream=True
        )
        async for chunk in stream:
            yield chunk.choices[0].delta.get("content", "")

    return StreamingResponse(generate(), media_type="text/plain")
```
Infrastructure Cost Optimization
Vector Database Optimization
Pinecone Cost Calculator:
| Pod Type | Monthly Cost | Vectors | Use Case |
|---|---|---|---|
| Starter | $70 | 100K | MVP testing |
| Basic | $140 | 1M | Small production |
| Standard | $440 | 5M | Medium production |
| Highly Available | $880 | 10M | Enterprise |
Optimization Strategies:
- Namespace Partitioning: Use one namespace per client/tenant within a single index instead of separate indexes
- Metadata Filtering: Reduce search space before vector search
- Batch Upserts: Upload in batches of 100-500 vectors
- Delete Unused Vectors: Remove old documents regularly
Expected Savings: 20-40% on vector DB costs
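As a rough illustration of the first three strategies using the v2-style pinecone-client API (the API key, index name, namespace, and metadata fields below are placeholders, not a prescribed schema):

```python
import pinecone

# v2-style client initialization; key and environment are assumed config
pinecone.init(api_key=PINECONE_API_KEY, environment="us-east-1-aws")
index = pinecone.Index("documents")

# Batch upserts: 100-500 vectors per call, one namespace per tenant instead of one index
vectors = [(doc.id, doc.embedding, {"source": doc.source}) for doc in docs[:200]]
index.upsert(vectors=vectors, namespace="tenant-acme")

# Metadata filtering narrows the candidate set before the vector comparison runs
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"source": {"$eq": "product-docs"}},
    namespace="tenant-acme",
)
```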
Database Connection Pooling
Poor connection pooling = wasted database resources.
```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# BEFORE: default settings (small pool, no pre-ping or recycling)
engine = create_engine(DATABASE_URL)

# AFTER: Optimized pooling
engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,        # 20 concurrent connections
    max_overflow=10,     # 10 additional during spikes
    pool_pre_ping=True,  # Validate connections before use
    pool_recycle=3600,   # Recycle after 1 hour
    pool_timeout=30      # Wait up to 30s for a connection
)
```
Expected Savings: 30-50% reduction in database costs
Container Rightsizing
Common Mistake: Over-provisioned containers
```yaml
# BEFORE: Over-provisioned
resources:
  requests:
    cpu: "4000m"     # 4 CPUs
    memory: "8Gi"
  limits:
    cpu: "8000m"
    memory: "16Gi"
```

```yaml
# AFTER: Rightsized based on actual usage
resources:
  requests:
    cpu: "500m"      # 0.5 CPU
    memory: "1Gi"
  limits:
    cpu: "2000m"     # 2 CPUs (spikes)
    memory: "4Gi"
```
How to Right-Size:
- Monitor actual CPU/memory usage for 7 days
- Set requests = p95 usage
- Set limits = 2× requests
- Use horizontal pod autoscaling (HPA) for spikes
Expected Savings: 50-70% reduction in compute costs
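Turning monitoring data into concrete request/limit values is a one-liner once you have the samples. A sketch assuming you have exported a week of per-pod CPU readings (in millicores) from your metrics system; `load_cpu_samples` is a hypothetical helper:

```python
import numpy as np

cpu_samples_millicores = load_cpu_samples(days=7)  # hypothetical export from your metrics store

cpu_request = int(np.percentile(cpu_samples_millicores, 95))  # requests = p95 usage
cpu_limit = cpu_request * 2                                   # limits = 2x requests

print(f"cpu request: {cpu_request}m, cpu limit: {cpu_limit}m")
```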
Cloud Provider Selection
Cost Comparison by Region (per 100M tokens):
| Provider/Region | GPT-4 Turbo | Claude 3 Sonnet | Llama 2 70B |
|---|---|---|---|
| AWS us-east-1 | Baseline | Baseline | N/A |
| AWS us-west-2 | Same | Same | N/A |
| GCP us-central1 | Same | Same | 15% cheaper |
| Azure eastus | Same | Same | Same |
Tip: Use spot instances for non-critical workloads (60-80% cheaper)
Advanced Optimization Techniques
1. Token Compression
Technique: Use abbreviations and structured formats in prompts.
```python
def compress_prompt(prompt: str) -> str:
    """Compress a prompt to reduce tokens"""
    replacements = {
        "customer": "cust",
        "information": "info",
        "please": "plz",
        "question": "q",
        "answer": "ans",
        "instruction": "instr"
    }
    for old, new in replacements.items():
        prompt = prompt.replace(old, new)
    return prompt

# Use a terse structured template instead of verbose natural language
# ("CSA" = customer service assistant)
COMPACT_PROMPT = """
Role: CSA
Task: {task}
Context: {context}
Format: JSON
"""
```
Savings: 10-20% reduction in prompt tokens
2. Batch Processing
Process multiple queries in a single API call when possible.
```python
async def batch_process(queries: list[str], batch_size: int = 10):
    """Process multiple queries efficiently"""
    results = []
    for i in range(0, len(queries), batch_size):
        batch = queries[i:i + batch_size]
        # Single API call for the whole batch
        response = await openai.ChatCompletion.acreate(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "Process these queries:"},
                {"role": "user", "content": "\n".join(batch)}
            ]
        )
        # Split the batched response back into individual results (your own parser)
        for result in parse_batch_response(response):
            results.append(result)
    return results
```
Best For:
- Content classification
- Sentiment analysis
- Content moderation
- Text extraction
Savings: 30-50% through reduced API overhead
3. Caching Embeddings
Problem: Re-embedding documents wastes money
Solution: Cache embeddings permanently
```python
import hashlib
import json

async def get_cached_embedding(text: str) -> list[float]:
    """Get embedding from cache, or compute and store it"""
    cache_key = hashlib.sha256(text.encode()).hexdigest()

    # Check cache
    cached = await redis.get(f"emb:{cache_key}")
    if cached:
        return json.loads(cached)

    # Compute and cache
    response = await openai.Embedding.acreate(
        input=text,
        model="text-embedding-3-small"
    )
    embedding = response["data"][0]["embedding"]

    await redis.setex(
        f"emb:{cache_key}",
        86400 * 30,  # 30 days
        json.dumps(embedding)
    )
    return embedding
```
Cost: Embeddings are $0.00002/1K tokens ($0.02 per 1M tokens), so caching saves roughly $0.20 per 1M cached lookups of short texts (~10 tokens each), and proportionally more for longer document chunks.
4. Request Coalescing
Combine multiple similar requests into one.
```python
# BEFORE: one request per user
for user_id in user_list:
    await get_user_recommendations(user_id)

# AFTER: one batch request
await get_batch_recommendations(user_list)
```
Implementation:
- Queue requests for 100-500ms
- Batch similar requests
- Distribute results to original requesters
Savings: 20-40% on high-volume endpoints
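A minimal coalescer sketch built on asyncio futures; `get_batch_recommendations` is the batch endpoint from the example above, and the window length and error handling are deliberately kept simple:

```python
import asyncio

class RequestCoalescer:
    """Hold individual requests for a short window, then serve them with one batch call."""

    def __init__(self, batch_fn, window_ms: int = 200):
        self.batch_fn = batch_fn      # async fn: list[request] -> list[result]
        self.window = window_ms / 1000
        self.pending = []             # (request, future) pairs
        self.flush_task = None

    async def submit(self, request):
        future = asyncio.get_running_loop().create_future()
        self.pending.append((request, future))
        if self.flush_task is None:   # first request in the window starts the timer
            self.flush_task = asyncio.create_task(self._flush_after_window())
        return await future

    async def _flush_after_window(self):
        await asyncio.sleep(self.window)          # let similar requests accumulate
        batch, self.pending = self.pending, []
        self.flush_task = None
        results = await self.batch_fn([req for req, _ in batch])  # one call instead of N
        for (_, future), result in zip(batch, results):
            future.set_result(result)             # hand each caller its own result

# Usage
coalescer = RequestCoalescer(get_batch_recommendations)
# recommendations = await coalescer.submit(user_id)
```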
Monitoring and Budget Controls
Real-Time Cost Tracking
```python
import time
from datetime import datetime

# Pricing per 1K tokens
PRICING = {
    "gpt-4-turbo": {"prompt": 0.01, "completion": 0.03},
    "claude-3-sonnet": {"prompt": 0.003, "completion": 0.015}
}

async def call_with_cost_tracking(model: str, prompt: str):
    """Track costs in real time"""
    start = time.time()
    response = await openai.ChatCompletion.acreate(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    latency = time.time() - start

    # Calculate cost
    prompt_cost = (response.usage.prompt_tokens / 1000) * PRICING[model]["prompt"]
    completion_cost = (response.usage.completion_tokens / 1000) * PRICING[model]["completion"]
    total_cost = prompt_cost + completion_cost

    # Log metrics
    logger.info(
        "llm_request_completed",
        model=model,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        latency_ms=latency * 1000,
        cost_usd=total_cost
    )

    # Update real-time budget tracker
    today = datetime.now().strftime("%Y-%m-%d")
    await redis.incrbyfloat(f"cost:daily:{today}", total_cost)

    # Check budget (DAILY_BUDGET is application config)
    daily_cost = await redis.get(f"cost:daily:{today}")
    if float(daily_cost) > DAILY_BUDGET:
        logger.warning("daily_budget_exceeded", cost=daily_cost)
        raise BudgetExceededError()

    return response
```
Budget Alerting
```python
# Set up alerts
BUDGET_THRESHOLDS = {
    "warning": 0.7,   # Alert at 70% of budget
    "critical": 0.9,  # Alert at 90% of budget
    "enforce": 1.0    # Block requests at 100%
}

async def check_budget_alerts(daily_cost: float):
    budget = await get_daily_budget()
    usage_ratio = daily_cost / budget

    if usage_ratio >= BUDGET_THRESHOLDS["enforce"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"🚨 BUDGET EXCEEDED: ${daily_cost:.2f} spent"
        )
        raise BudgetExceededError()
    elif usage_ratio >= BUDGET_THRESHOLDS["critical"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"⚠️ CRITICAL: {usage_ratio*100:.0f}% of daily budget used"
        )
    elif usage_ratio >= BUDGET_THRESHOLDS["warning"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"⚡ WARNING: {usage_ratio*100:.0f}% of daily budget used"
        )
```
ROI Calculation Framework
Cost Per User
Total Monthly Cost = (LLM API costs) + (Infrastructure) + (Tools/Licenses)
Cost Per User = Total Monthly Cost / Monthly Active Users
Example:
- LLM API: $3,000/month
- Infrastructure: $1,500/month
- Tools: $500/month
- Total: $5,000/month
- MAU: 10,000 users
- Cost per user: $0.50
Revenue Per User
Break-even Analysis:
Price per user > Cost per user / (1 − desired margin)
Example:
- Cost per user: $0.50
- Desired margin: 50%
- Minimum price: $1.00/user
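The same arithmetic as a small reusable snippet (margin is treated here as a fraction of the selling price, which is the reading that matches the $1.00 figure above):

```python
def cost_per_user(llm_api: float, infra: float, tools: float, mau: int) -> float:
    """Total monthly cost divided by monthly active users."""
    return (llm_api + infra + tools) / mau

def minimum_price(cost: float, margin: float = 0.5) -> float:
    """Lowest viable price per user where `margin` of the price is profit."""
    return cost / (1 - margin)

cpu = cost_per_user(llm_api=3000, infra=1500, tools=500, mau=10_000)  # $0.50
price = minimum_price(cpu, margin=0.5)                                # $1.00
```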
Cost Optimization Checklist
Immediate Actions (Week 1)
- [ ] Implement semantic caching → Target: 30% savings
- [ ] Add model routing → Target: 40% savings
- [ ] Set up cost tracking → Real-time visibility
- [ ] Configure budget alerts → Prevent overspending
Short-Term (Month 1)
- [ ] Optimize prompts → Target: 15% savings
- [ ] Implement streaming → Better UX + 5% savings
- [ ] Right-size containers → Target: 50% infra savings
- [ ] Add connection pooling → Target: 30% DB savings
Long-Term (Quarter 1)
- [ ] Evaluate self-hosted models (Llama 2) → 70% savings at scale
- [ ] Implement fine-tuning → Better quality + reduced prompt size
- [ ] Explore spot instances → 60% compute savings
- [ ] Multi-region deployment → Optimized routing
Expected Total Savings
By implementing all strategies:
- LLM API Costs: 50-70% reduction
- Infrastructure: 40-60% reduction
- Total Projected Savings: 45-65%
Example:
- Before optimization: $10,000/month
- After optimization: $3,500-5,500/month
- Annual savings: $54,000-$78,000
Related Articles:
- Choosing the Right LLM for Your Use Case
- Backend Architecture Checklist for AI Applications
- Monitoring Best Practices for AI Systems