Cost Optimization for AI Projects

Strategies and tactics for reducing LLM API costs and cloud infrastructure expenses while maintaining quality.

LLM API costs can quickly spiral out of control without proper optimization. This guide provides actionable strategies to reduce costs by 40-80% while maintaining (or improving) response quality.

Cost Optimization Quick Wins

1. Intelligent Caching (30-50% savings)

Semantic Caching: Cache responses for semantically similar queries, not just exact matches.

import json
from sklearn.metrics.pairwise import cosine_similarity

async def get_semantic_cache_response(query: str, threshold: float = 0.95):
    """Check cache for semantically similar queries.

    Assumes an async Redis client (`redis`, created with decode_responses=True),
    a `get_embedding` helper, and a structlog-style `logger`.
    """
    query_embedding = await get_embedding(query)

    # Scan the most recent cached entries; each entry stores its embedding
    # and response as a JSON blob under a key tracked in "semcache:keys"
    entry_keys = await redis.lrange("semcache:keys", 0, 99)

    for key in entry_keys:
        raw = await redis.get(key)
        if raw is None:  # entry expired
            continue
        entry = json.loads(raw)

        similarity = cosine_similarity(
            [query_embedding],
            [entry["embedding"]]
        )[0][0]

        if similarity >= threshold:
            logger.info("cache_hit", similarity=float(similarity))
            return entry["response"]

    return None  # Cache miss

Implementation Tips:

  • Set similarity threshold: 0.93-0.97 for most use cases
  • Cache TTL: 1-24 hours based on data freshness needs
  • Cache size: Store 10K-100K most recent queries
  • Expected savings: 30-50% reduction in API calls
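
For completeness, here is a minimal sketch of the corresponding write path, assuming the same (hypothetical) Redis layout as the lookup above: a "semcache:keys" list tracking recent entry keys plus one JSON blob per entry. `redis` and `get_embedding` are the same assumed helpers.

import json
import uuid

CACHE_TTL_SECONDS = 6 * 3600   # tune to data freshness needs (1-24 hours)
MAX_CACHE_ENTRIES = 50_000     # keep the 10K-100K most recent queries

async def store_semantic_cache_response(query: str, response: str):
    """Cache a query/response pair together with the query embedding."""
    embedding = await get_embedding(query)
    key = f"semcache:entry:{uuid.uuid4().hex}"

    # One JSON blob per entry, expiring with the cache TTL
    await redis.setex(
        key,
        CACHE_TTL_SECONDS,
        json.dumps({"embedding": embedding, "response": response})
    )

    # Track recency and cap the number of tracked entries
    await redis.lpush("semcache:keys", key)
    await redis.ltrim("semcache:keys", 0, MAX_CACHE_ENTRIES - 1)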

2. Model Routing (40-60% savings)

Route queries to the cheapest model that can handle them adequately.

async def route_query(query: str, complexity: str):
    """Route to appropriate model based on query analysis"""

    # Simple queries → Haiku ($0.00125/1K output)
    if complexity == "simple":
        return await call_claude_haiku(query)

    # Medium complexity → Sonnet ($0.015/1K output)
    elif complexity == "medium":
        return await call_claude_sonnet(query)

    # Complex reasoning → GPT-4 Turbo or Claude 3 Opus ($0.03-0.075/1K output)
    else:
        return await call_gpt4_turbo(query)

# Classification logic
def classify_query_complexity(query: str) -> str:
    """Determine query complexity"""
    if len(query) < 100 and "explain" not in query.lower():
        return "simple"
    elif requires_reasoning(query):
        return "complex"
    else:
        return "medium"

Cost Comparison Example:

  • 10K queries/day, all through GPT-4 Turbo: $600/month
  • 60% Haiku, 30% Sonnet, 10% GPT-4 Turbo: $180/month (70% savings)

3. Prompt Optimization (15-30% savings)

Reduce token usage through efficient prompting.

Before:

Please act as a helpful customer service assistant. I need you to
carefully read the following customer query and provide a detailed,
thoughtful response that addresses all their concerns. Make sure to
be friendly and professional in your tone. Here is the query: {query}

(48 tokens in prompt template)

After:

Helpful assistant. Query: {query}

(6 tokens in prompt template)

Savings: 42 tokens saved per query × 10K queries/day × $0.01/1K tokens = $4.20/day (~$126/month)
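
To check the token counts for your own templates, here is a quick sketch using tiktoken (assuming the cl100k_base tokenizer used by recent OpenAI chat models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

before = ("Please act as a helpful customer service assistant. I need you to "
          "carefully read the following customer query and provide a detailed, "
          "thoughtful response that addresses all their concerns. Make sure to "
          "be friendly and professional in your tone. Here is the query: ")
after = "Helpful assistant. Query: "

# Prints the template token counts (roughly the before/after figures above)
print(len(enc.encode(before)), len(enc.encode(after)))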

4. Response Streaming (5-10% savings + better UX)

Stream responses to:

  • Reduce perceived latency (users see output immediately)
  • Allow early termination if user is satisfied
  • Enable token counting for accurate cost tracking

from fastapi.responses import StreamingResponse

@app.post("/api/v1/query-stream")
async def query_stream(request: QueryRequest):
    async def generate():
        stream = await openai.ChatCompletion.acreate(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": request.query}],
            stream=True
        )
        async for chunk in stream:
            # Some chunks (e.g. the final one) carry no content delta
            content = chunk.choices[0].delta.get("content")
            if content:
                yield content

    return StreamingResponse(generate(), media_type="text/plain")

Infrastructure Cost Optimization

Vector Database Optimization

Pinecone Cost Calculator:

Pod Type         | Monthly Cost | Vectors | Use Case
Starter          | $70          | 100K    | MVP testing
Basic            | $140         | 1M      | Small production
Standard         | $440         | 5M      | Medium production
Highly Available | $880         | 10M     | Enterprise

Optimization Strategies:

  1. Namespace Partitioning: Separate namespaces per client/tenant within an index
  2. Metadata Filtering: Reduce search space before vector search
  3. Batch Upserts: Upload in batches of 100-500 vectors
  4. Delete Unused Vectors: Remove old documents regularly

Expected Savings: 20-40% on vector DB costs
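
A rough sketch of strategies 1-3 using the classic pinecone-client interface (verify the calls against your installed SDK version; the API key, index name, namespace, and metadata fields here are hypothetical):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-east-1-aws")  # placeholder values
index = pinecone.Index("documents")

def batched_upsert(vectors, namespace: str, batch_size: int = 200):
    """Upsert (id, embedding, metadata) tuples in batches of 100-500."""
    for i in range(0, len(vectors), batch_size):
        index.upsert(vectors=vectors[i:i + batch_size], namespace=namespace)

# Metadata filtering narrows the candidate set before the vector comparison runs
results = index.query(
    vector=query_embedding,            # embedding of the user query (assumed computed earlier)
    top_k=5,
    namespace="tenant-acme",           # one namespace per client/tenant
    filter={"doc_type": {"$eq": "faq"}, "year": {"$gte": 2023}}
)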

Database Connection Pooling

Poor connection pooling = wasted database resources.

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# BEFORE: Default engine (small 5-connection pool, no pre-ping, connections never recycled)
engine = create_engine(DATABASE_URL)

# AFTER: Optimized pooling
engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,  # 20 concurrent connections
    max_overflow=10,  # 10 additional during spikes
    pool_pre_ping=True,  # Validate connections
    pool_recycle=3600,  # Recycle after 1 hour
    pool_timeout=30  # Wait 30s for connection
)

Expected Savings: 30-50% reduction in database costs

Container Rightsizing

Common Mistake: Over-provisioned containers

# BEFORE: Over-provisioned
resources:
  requests:
    cpu: "4000m"  # 4 CPUs
    memory: "8Gi"
  limits:
    cpu: "8000m"
    memory: "16Gi"

# AFTER: Rightsized based on actual usage
resources:
  requests:
    cpu: "500m"  # 0.5 CPU
    memory: "1Gi"
  limits:
    cpu: "2000m"  # 2 CPU (spikes)
    memory: "4Gi"

How to Right-Size:

  1. Monitor actual CPU/memory usage for 7 days
  2. Set requests = p95 usage
  3. Set limits = 2× requests
  4. Use horizontal pod autoscaling (HPA) for spikes

Expected Savings: 50-70% reduction in compute costs
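
As a rough sketch of steps 2-3 above, here is one way to derive request/limit values from a week of observed per-pod CPU samples (the millicore samples below are made-up illustrations):

import math

def rightsize(samples_millicores: list[float]) -> dict:
    """Request = p95 of observed usage, limit = 2x the request."""
    ordered = sorted(samples_millicores)
    p95_index = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    p95 = ordered[p95_index]
    return {"request_millicores": round(p95), "limit_millicores": round(p95 * 2)}

# Example with hypothetical samples collected over 7 days
print(rightsize([180, 220, 250, 300, 410, 460, 520]))
# {'request_millicores': 520, 'limit_millicores': 1040}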

Cloud Provider Selection

Cost Comparison by Region (per 100M tokens):

Provider/Region  | GPT-4 Turbo | Claude 3 Sonnet | Llama 2 70B
AWS us-east-1    | Baseline    | Baseline        | N/A
AWS us-west-2    | Same        | Same            | N/A
GCP us-central1  | Same        | Same            | 15% cheaper
Azure eastus     | Same        | Same            | Same

Tip: Use spot instances for non-critical workloads (60-80% cheaper)

Advanced Optimization Techniques

1. Token Compression

Technique: Use abbreviations and structured formats in prompts.

import re

def compress_prompt(prompt: str) -> str:
    """Compress prompt to reduce tokens"""
    replacements = {
        "customer": "cust",
        "information": "info",
        "please": "plz",
        "question": "q",
        "answer": "ans",
        "instruction": "instr"
    }

    # Replace whole words only, so e.g. "informational" is not mangled
    for old, new in replacements.items():
        prompt = re.sub(rf"\b{old}\b", new, prompt)

    return prompt

# Use JSON instead of natural language
COMPACT_PROMPT = """
Role: CSA
Task: {task}
Context: {context}
Format: JSON
"""

Savings: 10-20% reduction in prompt tokens

2. Batch Processing

Process multiple queries in a single API call when possible.

async def batch_process(queries: list[str], batch_size: int = 10):
    """Process multiple queries efficiently"""

    results = []

    for i in range(0, len(queries), batch_size):
        batch = queries[i:i+batch_size]

        # Single API call for batch
        response = await openai.ChatCompletion.acreate(
            model="GPT-5-turbo",
            messages=[
                {"role": "system", "content": "Process these queries:"},
                {"role": "user", "content": "\n".join(batch)}
            ]
        )

        # Parse batch response
        for result in parse_batch_response(response):
            results.append(result)

    return results

Best For:

  • Content classification
  • Sentiment analysis
  • Content moderation
  • Text extraction

Savings: 30-50% through reduced API overhead

3. Caching Embeddings

Problem: Re-embedding documents wastes money

Solution: Cache embeddings permanently

import hashlib
import json

async def get_cached_embedding(text: str) -> list[float]:
    """Get embedding from cache or compute and store"""
    cache_key = hashlib.sha256(text.encode()).hexdigest()

    # Check cache
    cached = await redis.get(f"emb:{cache_key}")
    if cached:
        return json.loads(cached)

    # Compute and cache the raw vector (not the full API response)
    response = await openai.Embedding.acreate(
        input=text,
        model="text-embedding-3-small"
    )
    embedding = response["data"][0]["embedding"]

    await redis.setex(
        f"emb:{cache_key}",
        86400 * 30,  # 30 days
        json.dumps(embedding)
    )

    return embedding

Cost: Embeddings are $0.00002/1K tokens, so the direct savings are modest (roughly $0.20 per million re-embedded short chunks), but caching also removes the embedding call's latency from every repeat lookup.

4. Request Coalescing

Combine multiple similar requests into one.

# BEFORE: one API request per user
for user_id in user_list:
    await get_user_recommendations(user_id)

# AFTER: 1 batch request
await get_batch_recommendations(user_list)

Implementation:

  • Queue requests for 100-500ms
  • Batch similar requests
  • Distribute results to original requesters

Savings: 20-40% on high-volume endpoints
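
A minimal asyncio sketch of the queue-then-batch pattern described above; `batch_handler` is a hypothetical function that takes a list of items and returns results in the same order:

import asyncio

class RequestCoalescer:
    """Collect individual requests for a short window, then process them as one batch."""

    def __init__(self, batch_handler, window_ms: int = 200, max_batch: int = 50):
        self.batch_handler = batch_handler
        self.window = window_ms / 1000
        self.max_batch = max_batch
        self._pending = []       # list of (item, future) awaiting results
        self._flush_task = None

    async def submit(self, item):
        """Queue one item and wait for its result from the next batch."""
        future = asyncio.get_running_loop().create_future()
        self._pending.append((item, future))

        if len(self._pending) >= self.max_batch:
            await self._flush()                 # batch is full: flush immediately
        elif self._flush_task is None:
            self._flush_task = asyncio.create_task(self._delayed_flush())

        return await future

    async def _delayed_flush(self):
        await asyncio.sleep(self.window)        # wait out the coalescing window
        await self._flush()

    async def _flush(self):
        batch, self._pending = self._pending, []
        self._flush_task = None
        if not batch:
            return
        results = await self.batch_handler([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)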

Monitoring and Budget Controls

Real-Time Cost Tracking

import time
from datetime import datetime

# Pricing per 1K tokens
PRICING = {
    "gpt-4-turbo": {"prompt": 0.01, "completion": 0.03},
    "claude-3-sonnet": {"prompt": 0.003, "completion": 0.015}
}

async def call_with_cost_tracking(model: str, prompt: str):
    """Track costs in real-time"""

    start = time.time()

    response = await openai.ChatCompletion.acreate(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    latency = time.time() - start

    # Calculate cost
    prompt_cost = (response.usage.prompt_tokens / 1000) * PRICING[model]["prompt"]
    completion_cost = (response.usage.completion_tokens / 1000) * PRICING[model]["completion"]
    total_cost = prompt_cost + completion_cost

    # Log metrics
    logger.info(
        "llm_request_completed",
        model=model,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        latency_ms=latency * 1000,
        cost_usd=total_cost
    )

    # Update real-time budget tracker (INCRBYFLOAT returns the new running total)
    daily_cost = await redis.incrbyfloat(
        f"cost:daily:{datetime.now().strftime('%Y-%m-%d')}",
        total_cost
    )

    # Check budget
    if daily_cost > DAILY_BUDGET:
        logger.warning("daily_budget_exceeded", cost=daily_cost)
        raise BudgetExceededError()

    return response

Budget Alerting

# Set up alerts
BUDGET_THRESHOLDS = {
    "warning": 0.7,  # Alert at 70% of budget
    "critical": 0.9,  # Alert at 90% of budget
    "enforce": 1.0   # Block requests at 100%
}

async def check_budget_alerts(daily_cost: float):
    budget = await get_daily_budget()
    usage_ratio = daily_cost / budget

    if usage_ratio >= BUDGET_THRESHOLDS["enforce"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"🚨 BUDGET EXCEEDED: ${daily_cost:.2f} spent"
        )
        raise BudgetExceededError()

    elif usage_ratio >= BUDGET_THRESHOLDS["critical"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"⚠️ CRITICAL: {usage_ratio*100:.0f}% of daily budget used"
        )

    elif usage_ratio >= BUDGET_THRESHOLDS["warning"]:
        await send_slack_alert(
            channel="#alerts",
            message=f"⚡ WARNING: {usage_ratio*100:.0f}% of daily budget used"
        )

ROI Calculation Framework

Cost Per User

Total Monthly Cost = (LLM API costs) + (Infrastructure) + (Tools/Licenses)

Cost Per User = Total Monthly Cost / Monthly Active Users

Example:

  • LLM API: $3,000/month
  • Infrastructure: $1,500/month
  • Tools: $500/month
  • Total: $5,000/month
  • MAU: 10,000 users
  • Cost per user: $0.50

Revenue Per User

Break-even Analysis:

Minimum price per user = Cost per user / (1 - Desired margin)

Example:
- Cost per user: $0.50
- Desired margin: 50%
- Minimum price: $1.00/user
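
A small helper that reproduces the arithmetic above (the values are the example figures, not benchmarks):

def cost_per_user(llm_api: float, infrastructure: float, tools: float, mau: int) -> float:
    """Total monthly cost divided by monthly active users."""
    return (llm_api + infrastructure + tools) / mau

def break_even_price(unit_cost: float, margin: float) -> float:
    """Minimum price per user when margin is expressed as a fraction of price."""
    return unit_cost / (1 - margin)

print(cost_per_user(3000, 1500, 500, 10_000))  # 0.50
print(break_even_price(0.50, 0.50))            # 1.00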

Cost Optimization Checklist

Immediate Actions (Week 1)

  • [ ] Implement semantic caching → Target: 30% savings
  • [ ] Add model routing → Target: 40% savings
  • [ ] Set up cost tracking → Real-time visibility
  • [ ] Configure budget alerts → Prevent overspending

Short-Term (Month 1)

  • [ ] Optimize prompts → Target: 15% savings
  • [ ] Implement streaming → Better UX + 5% savings
  • [ ] Right-size containers → Target: 50% infra savings
  • [ ] Add connection pooling → Target: 30% DB savings

Long-Term (Quarter 1)

  • [ ] Evaluate self-hosted models (Llama 2) → 70% savings at scale
  • [ ] Implement fine-tuning → Better quality + reduced prompt size
  • [ ] Explore spot instances → 60% compute savings
  • [ ] Multi-region deployment → Optimized routing

Expected Total Savings

By implementing all strategies:

  • LLM API Costs: 50-70% reduction
  • Infrastructure: 40-60% reduction
  • Total Projected Savings: 45-65%

Example:

  • Before optimization: $10,000/month
  • After optimization: $3,500-5,500/month
  • Annual savings: $54,000-$78,000

Related Articles:

  • Choosing the Right LLM for Your Use Case
  • Backend Architecture Checklist for AI Applications
  • Monitoring Best Practices for AI Systems