How LLM API Pricing Works
Every major LLM provider charges based on token consumption, not time, compute, or number of characters. Tokens are sub-word units that the model processes internally. English text averages roughly 1.3 tokens per word, meaning a 1,000-word prompt consumes approximately 1,300 tokens. Code, JSON, and non-Latin scripts typically consume more tokens per word because specialized vocabulary gets broken into smaller subword pieces.
LLM APIs charge separately for input tokens (your prompt, system instructions, and any context you send) and output tokens (the model's response). Output tokens are always more expensive, typically 3x to 5x the input price. This asymmetry exists because generating each output token requires a full forward pass through the model, while input tokens can be processed in parallel during the prefill stage.
Understanding this split is critical for cost optimization. A retrieval-augmented generation (RAG) system that sends 10,000 tokens of context but only generates a 200-token answer is heavily input-weighted. A creative writing tool that takes a 50-token prompt and generates 2,000 tokens of prose is output-weighted. The same model can cost dramatically different amounts depending on which side of the split dominates your workload.
Understanding the Calculator Inputs
This calculator asks for three values that define your workload profile. Getting accurate numbers here is the difference between a useful estimate and a meaningless one.
Average input tokens per request includes everything you send to the API: system prompts, user messages, retrieved context, conversation history, function definitions, and any other text in the request body. For a chatbot with a 500-token system prompt and 300 tokens of user message, that is 800 input tokens. For a RAG pipeline injecting 5 retrieved passages at 400 tokens each plus the query, that is 2,100 input tokens. Measure this from your actual API logs rather than guessing.
Average output tokens per request is the length of the model's response. Short classification outputs might be 10-50 tokens. Conversational replies are typically 100-500 tokens. Long-form content generation can reach 2,000-4,000 tokens per response. You can control this with the max_tokens parameter, but setting it too low may truncate useful responses.
Requests per day is your daily API call volume. For a B2B tool with 200 users making 5 requests each, that is 1,000 requests per day. For a consumer app with 50,000 daily active users and 3 interactions each, that is 150,000. Include retry requests and any background processing calls in this count.
Token Counting: Rules of Thumb
Precise token counting requires the provider's actual tokenizer, but these rules of thumb help with estimation:
- English prose: 1 word = ~1.3 tokens. A 500-word email is approximately 650 tokens.
- Code (Python, JavaScript): 1 line = ~8-15 tokens depending on complexity. A 100-line function is roughly 1,000-1,500 tokens.
- JSON/structured data: Higher token density due to brackets, keys, and punctuation. A 1KB JSON payload is typically 300-500 tokens.
- Non-English text: Chinese, Japanese, Korean, and Arabic text use 1.5-3x more tokens per word than English due to character-level tokenization.
- System prompts: Often 200-2,000 tokens. These are sent with every request, so they contribute significantly to input costs at scale.
For exact counts, use OpenAI's tiktoken library for GPT models or Anthropic's token counting API endpoint for Claude. Our LLM Token Counter tool provides quick estimates across multiple tokenizers.
Batch API Pricing: Cut Costs by 50%
If your workload does not require real-time responses, batch processing can halve your API spend. OpenAI's Batch API processes requests asynchronously and returns results within 24 hours at 50% off standard pricing. Anthropic's Message Batches API offers similar discounts for bulk workloads.
Batch pricing is ideal for:
- Content generation pipelines that produce articles, product descriptions, or marketing copy on a schedule
- Data extraction and classification running against large datasets overnight
- Evaluation and testing where you run thousands of test cases against model outputs
- Document summarization processing backlogs of reports, emails, or support tickets
Batch pricing is not reflected in the calculator above (which uses standard real-time rates), but you can mentally halve the displayed costs for any workload that qualifies. At scale, the savings are substantial: a $3,000/month workload drops to $1,500/month with batch processing.
Cost Optimization Strategies
Beyond choosing the cheapest model, several engineering techniques can reduce LLM API costs by 50-90% without sacrificing quality.
1. Prompt Caching
Both OpenAI and Anthropic offer prompt caching that reduces input token costs when you repeatedly send the same prefix (system prompt, instructions, or static context). Anthropic's prompt caching charges a small write fee on the first request but then discounts cached input tokens by 90% on subsequent requests. If your system prompt is 1,500 tokens and you make 10,000 requests per day, caching saves roughly $40/day on Claude Sonnet alone.
2. Model Routing
Not every request needs the most capable model. A classifier (running on a cheap model like GPT-4o-mini or Haiku) can examine incoming requests and route simple ones to a budget model while sending complex queries to a premium model. Many production systems report that 70-80% of requests can be handled by the cheapest tier, reducing blended cost by 60% or more. See our open source vs API comparison for more on tiered model architectures.
3. Response Caching
If users frequently ask similar questions, caching model responses eliminates redundant API calls entirely. A semantic similarity cache using embeddings can match new queries against previously generated answers. Even a simple exact-match cache on normalized inputs catches 10-30% of requests in most customer support and FAQ workloads.
4. Output Length Control
Since output tokens cost 3-5x more than input tokens, constraining response length has an outsized impact on cost. Use explicit instructions like "respond in 2-3 sentences" or "limit your answer to 100 words" in your system prompt. Set max_tokens to a reasonable ceiling. For structured outputs, use JSON mode or function calling to eliminate verbose prose.
5. Context Window Management
In multi-turn conversations, the context window grows with each exchange because you resend the full conversation history. Implement conversation summarization (compress earlier turns into a summary) or sliding window truncation (keep only the last N turns) to prevent input tokens from growing unboundedly. A 20-turn conversation without management can reach 15,000+ input tokens per request; with summarization, you keep it under 3,000.
How GPT-4o Pricing Compares to Claude and Gemini
GPT-4o sits in the mid-range of the pricing spectrum at $2.50 per million input tokens and $10.00 per million output tokens. It is substantially cheaper than Claude Opus 4.6 ($15/$75) but more expensive than Claude Sonnet 4.6 ($3/$15) on output and cheaper on input. For input-heavy workloads like RAG, GPT-4o has a slight price advantage over Sonnet. For output-heavy workloads like content generation, Sonnet and GPT-4o are comparable.
The budget tier tells a different story. GPT-4o-mini ($0.15/$0.60) is cheaper than Claude Haiku 4.5 ($0.80/$4.00) by a significant margin, but Gemini 2.0 Flash ($0.075/$0.30) undercuts both. DeepSeek V3 ($0.27/$1.10) slots between Flash and GPT-4o-mini. For high-volume, cost-sensitive applications, the choice often comes down to which budget model performs adequately for your specific use case.
Use the comparison mode in the calculator above to see exact dollar amounts for your specific workload. The model that is cheapest per token is not always cheapest in practice, because different models require different prompt lengths and produce different response lengths to achieve the same quality of output.
Real-World Cost Examples
Here are concrete cost projections for common application types, based on the 2026 pricing data in this calculator:
- Customer support chatbot (1,000 conversations/day, 5 turns each, 1,500 input + 400 output tokens per turn): Claude Sonnet costs ~$1,125/month. GPT-4o costs ~$1,225/month. Haiku costs ~$132/month.
- Code review assistant (200 pull requests/day, 3,000 input + 800 output tokens per review): Claude Sonnet costs ~$270/month. GPT-4o costs ~$195/month.
- Document summarization pipeline (500 documents/day, 8,000 input + 600 output tokens per document): Claude Sonnet costs ~$495/month. Gemini Flash costs ~$12/month.
- AI writing assistant (5,000 requests/day, 500 input + 1,500 output tokens per request): Claude Sonnet costs ~$3,600/month. DeepSeek V3 costs ~$290/month.
When to Consider Self-Hosting
API pricing makes sense when your monthly spend stays below $2,000-$5,000. Beyond that threshold, self-hosting open-source models like Llama 3 or Mistral on dedicated GPU infrastructure can deliver the same throughput at lower cost. A dual-A100 server running Llama 3 70B costs approximately $3,500/month in cloud GPU rental and handles 50-100 requests per second with no per-token charges.
The break-even depends on your quality requirements. If you need GPT-4o or Claude Opus-level reasoning, no open-source model matches them yet, and API access remains the only option. If your task is well-served by a 70B-parameter model, self-hosting pays for itself quickly at high volume. See our full open-source vs. API cost analysis for detailed break-even calculations.