How Can You Optimize LLM API Costs?

The top LLM cost optimizations: prompt caching (save 50-90% on repeated prefixes), model routing (save 40-60% by using cheap models for simple tasks), and response caching (save 30-50% on duplicate queries). Combined, these can reduce your API bill by 50-80%.

1. Prompt Caching

OpenAI caches repeated prompt prefixes automatically (for prompts over about 1,024 tokens); Anthropic requires you to mark cache breakpoints with `cache_control`, but the effect is the same. If your system prompt is 2,000 tokens and you make 1,000 calls:

| Approach | Input Tokens Billed | Cost (Claude Sonnet 4) |
|---|---|---|
| Without caching | 2,000,000 | $6.00 |
| With prompt caching | ~200,000 (90% cached) | $0.60 |
| Savings | | 90% |
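To make a prefix cacheable on Anthropic, you mark it with a `cache_control` breakpoint. The sketch below only builds the request payload (no API call); the model id and system prompt are illustrative — check Anthropic's prompt caching docs for the current schema.

```python
# Sketch: opting in to Anthropic prompt caching by marking the static
# system prompt as a cache breakpoint. Model id and prompt are illustrative.
SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."  # ~2,000 tokens in practice

def build_request(user_message):
    return {
        "model": "claude-sonnet-4",   # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Everything up to this breakpoint is cached; later calls
                # read it back at the much cheaper cache-read rate
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_request("How do I reset my password?")
```

Because the cached prefix must match exactly, keep the static parts of your prompt first and put anything per-request (user data, timestamps) after the breakpoint.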

2. Model Routing

Route simple tasks to cheap models, complex tasks to premium ones:

```python
# Simple router based on input length and keywords
def route_model(prompt):
    p = prompt.lower()
    # Check code-related keywords first, so a short "review this code"
    # request is not misrouted to the cheap tier
    if any(w in p for w in ['code', 'debug', 'review']):
        return "claude-sonnet-4"        # $3/$15 per 1M tokens
    if len(prompt) < 200 and not any(w in p for w in ['analyze', 'compare', 'explain']):
        return "gemini-2.0-flash"       # $0.075/$0.30 per 1M tokens
    return "gpt-4o-mini"                # $0.15/$0.60 per 1M tokens
```

3. Response Caching

Cache identical or semantically similar queries. Use embedding similarity with a threshold of 0.95+ for semantic matching.
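A minimal sketch of the exact-match case: key the cache on a normalized prompt hash, and fall through to the model only on a miss. `call_llm` is a placeholder for your real API call; a semantic cache would replace the hash lookup with an embedding nearest-neighbor search at the 0.95+ threshold.

```python
import hashlib

# Sketch of an exact-match response cache. call_llm is a placeholder
# for the real API call.
class ResponseCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, prompt):
        # Normalize whitespace and case so trivial variants share an entry
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_call(self, prompt, call_llm):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = call_llm(prompt)
        self._store[key] = response
        return response

cache = ResponseCache()
fake_llm = lambda p: f"answer to: {p}"
cache.get_or_call("What is RAG?", fake_llm)   # miss: calls the model
cache.get_or_call("what is  RAG?", fake_llm)  # hit: served from cache
```

Add a TTL if your answers can go stale — a cached response is only a saving if it is still correct.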

4. Shorter Prompts

Every token in your system prompt is billed on every request. Reducing a 2,000-token system prompt to 500 tokens saves 75% on input costs for that portion.

| System Prompt Size | Monthly Cost at 10K req/day (Sonnet 4) |
|---|---|
| 2,000 tokens | $1,800.00 |
| 500 tokens | $450.00 |
| Savings | $1,350.00/month |
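The table's numbers come from straightforward arithmetic — tokens per request, times requests, times the input rate:

```python
# Reproduce the table: system-prompt tokens billed per month at
# Claude Sonnet 4's $3/1M input rate, 10K requests/day.
PRICE_PER_M_INPUT = 3.00
REQUESTS_PER_DAY = 10_000

def monthly_prompt_cost(prompt_tokens, days=30):
    tokens = prompt_tokens * REQUESTS_PER_DAY * days
    return tokens / 1_000_000 * PRICE_PER_M_INPUT

monthly_prompt_cost(2000)  # 1800.0
monthly_prompt_cost(500)   # 450.0
```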

5. Batch API (50% Off)

Both OpenAI and Anthropic offer batch APIs at 50% discount for non-realtime processing. Use for document summarization, data extraction, and offline analysis.
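Batch jobs are submitted as a JSONL file with one request per line. The sketch below builds that payload in the shape OpenAI's Batch API uses — verify the exact schema against the current docs; the `custom_id` format and model are illustrative.

```python
import json

# Sketch: building a JSONL batch file for offline summarization.
# One JSON object per line; custom_id lets you match responses back.
def build_batch_lines(documents, model="gpt-4o-mini"):
    lines = []
    for i, doc in enumerate(documents):
        lines.append(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Summarize:\n{doc}"}],
                "max_tokens": 200,
            },
        }))
    return "\n".join(lines)

batch_jsonl = build_batch_lines(["Q3 revenue grew 12%...", "The incident began at 02:14..."])
```

Results come back asynchronously (typically within 24 hours), which is why this only suits non-realtime workloads.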

6. Set max_tokens

Always set a reasonable max_tokens limit. A runaway 4,000-token response when you only need 200 tokens costs 20x more on output.
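The 20x figure is just the token ratio applied to the output rate:

```python
# Output-cost comparison: a runaway 4,000-token response vs. a
# 200-token cap, at Sonnet 4's $15/1M output rate.
OUTPUT_PRICE_PER_M = 15.00

def output_cost(tokens):
    return tokens / 1_000_000 * OUTPUT_PRICE_PER_M

runaway = output_cost(4000)  # ~$0.06
capped = output_cost(200)    # ~$0.003, 20x cheaper
```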

7. Conversation Summarization

Instead of sending full conversation history, summarize previous turns. A 20-turn conversation might have 15,000 tokens of history — summarize to 500 tokens for 97% input savings.
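One common pattern: keep the last few turns verbatim and collapse everything older into a single summary message. In this sketch the summarizer is injected as a callable (in production it would be a call to a cheap model) so the trimming logic itself is testable offline.

```python
# Sketch: cap context by summarizing older turns. `summarize` is a
# placeholder for an LLM call to a cheap model.
def trim_history(messages, keep_last=4, summarize=None):
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary_text = summarize(old) if summarize else f"[summary of {len(old)} earlier messages]"
    summary = {"role": "system", "content": f"Conversation so far: {summary_text}"}
    return [summary] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
trimmed = trim_history(history)  # 1 summary message + last 4 turns
```

Re-summarize incrementally (fold the old summary into the new one) rather than re-reading the full history each time, or the summarization calls eat the savings.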

8. Use Structured Output

JSON mode and structured outputs produce shorter, more parseable responses. Typically 30-50% fewer output tokens than free-form text.
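A request using OpenAI-style structured outputs looks roughly like this — the `response_format` shape follows OpenAI's `json_schema` mode, but verify field names against current docs; the schema and model are illustrative.

```python
# Sketch: constraining output to a JSON schema so the model returns
# only the fields you need, with no free-form filler text.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "total"],
    "additionalProperties": False,
}

request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Extract the invoice fields: ..."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": INVOICE_SCHEMA},
    },
}
```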

9. Fine-Tuning (for high volume)

At 1M+ requests/month, fine-tuning a smaller model on your specific task can match larger model quality at 10-50x lower per-request cost.

10. Embedding Model Selection

Use text-embedding-3-small ($0.02/1M tokens) instead of text-embedding-3-large ($0.13/1M tokens) unless you need maximum retrieval quality. The small model is 85-90% as good at 85% less cost.

11. Rate Limit Awareness

Avoid retries caused by rate limits — they waste tokens and time. Implement proper exponential backoff and request queuing.
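A minimal backoff wrapper, with full jitter so concurrent clients don't retry in lockstep. The sleep function is injectable so the retry logic can be exercised without real delays; in production you would also catch only rate-limit errors rather than all exceptions.

```python
import random
import time

# Sketch: exponential backoff with full jitter around any flaky call.
def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Wait a random fraction of the exponentially growing cap
            sleep(random.uniform(0, base_delay * 2 ** attempt))

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429 rate limited")
    return "ok"

result = with_backoff(flaky, sleep=lambda s: None)  # succeeds on 3rd try
```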

12. Monitor and Alert

Use tools like Helicone, LangSmith, or custom dashboards to track cost per feature, per user, and per model. Set budget alerts to catch runaway costs early.
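If you roll your own, the core is just per-feature accumulation with a threshold check. A minimal sketch, with illustrative $/1M-token prices and `print` standing in for a real alerting hook:

```python
# Sketch: per-feature cost tracking with a budget alert.
# Prices are illustrative $/1M-token (input, output) rates.
PRICES = {"claude-sonnet-4": (3.00, 15.00), "gpt-4o-mini": (0.15, 0.60)}

class CostTracker:
    def __init__(self, monthly_budget, alert=print):
        self.budget = monthly_budget
        self.alert = alert            # wire to Slack/PagerDuty in production
        self.by_feature = {}

    def record(self, feature, model, input_tokens, output_tokens):
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.by_feature[feature] = self.by_feature.get(feature, 0.0) + cost
        if sum(self.by_feature.values()) > self.budget:
            self.alert(f"Budget exceeded: {self.by_feature}")
        return cost

tracker = CostTracker(monthly_budget=100.0)
tracker.record("chat", "claude-sonnet-4", 2000, 500)
```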

Combined Savings Example

| Optimization | Savings | Cumulative Bill |
|---|---|---|
| Baseline (Claude Sonnet 4, 10K req/day) | | $2,700/month |
| + Model routing (60% to Flash) | -45% | $1,485/month |
| + Prompt caching | -30% | $1,040/month |
| + Response caching (30% hit rate) | -30% | $728/month |
| + Shorter prompts | -15% | $619/month |
| Total savings | 77% | $619/month |

FAQ

What is the easiest way to reduce LLM API costs?

Model routing is the easiest win: send simple queries to Gemini Flash ($0.075/$0.30) and complex ones to a premium model. This alone saves 40-60% with minimal code changes.

Does prompt caching really save money?

Yes. OpenAI caches repeated prompt prefixes automatically, and Anthropic does once you mark the prefix with `cache_control`. If your system prompt is constant across requests, you save up to 90% on input tokens for that portion after the first request.

How much can I save on LLM API costs?

Combining model routing, prompt caching, response caching, and shorter prompts can reduce costs by 50-80%. A $2,700/month bill can drop to $600-$800/month with these optimizations.

Prices last verified: April 2026. Pricing may change — always check provider websites for current rates.

Calculate your LLM API costs with KickLLM — free, no sign-up required.