LLM API Cost Calculator

Estimate per-request, daily, monthly, and yearly costs for GPT-4o, Claude, Gemini, and DeepSeek. Compare all models side by side with real 2026 pricing.

Model Selection

Select a model

$3.00 / 1M input · $15.00 / 1M output

Usage Parameters

Avg input tokens per request

~750 words of English text

Avg output tokens per request

~375 words of generated text

Requests per day

Claude Sonnet 4.6 — Cost Breakdown

Per Request

$0.0000

Daily

$0.00

Monthly

$0.00

Yearly

$0.00

2026 LLM API Pricing Reference

Model	Provider	Input (per 1M tokens)	Output (per 1M tokens)	Output / Input Ratio
GPT-4o	OpenAI	$2.50	$10.00	4.0x
GPT-4o-mini	OpenAI	$0.15	$0.60	4.0x
Claude Opus 4.6	Anthropic	$15.00	$75.00	5.0x
Claude Sonnet 4.6	Anthropic	$3.00	$15.00	5.0x
Claude Haiku 4.5	Anthropic	$0.80	$4.00	5.0x
Gemini 2.0 Flash	Google	$0.075	$0.30	4.0x
DeepSeek V3	DeepSeek	$0.27	$1.10	4.1x

How LLM API Pricing Works

Every major LLM provider charges based on token consumption, not time, compute, or number of characters. Tokens are sub-word units that the model processes internally. English text averages roughly 1.3 tokens per word, meaning a 1,000-word prompt consumes approximately 1,300 tokens. Code, JSON, and non-Latin scripts typically consume more tokens per word because specialized vocabulary gets broken into smaller subword pieces.

LLM APIs charge separately for input tokens (your prompt, system instructions, and any context you send) and output tokens (the model's response). Output tokens are always more expensive, typically 3x to 5x the input price. This asymmetry exists because generating each output token requires a full forward pass through the model, while input tokens can be processed in parallel during the prefill stage.

Understanding this split is critical for cost optimization. A retrieval-augmented generation (RAG) system that sends 10,000 tokens of context but only generates a 200-token answer is heavily input-weighted. A creative writing tool that takes a 50-token prompt and generates 2,000 tokens of prose is output-weighted. The same model can cost dramatically different amounts depending on which side of the split dominates your workload.

Understanding the Calculator Inputs

This calculator asks for three values that define your workload profile. Getting accurate numbers here is the difference between a useful estimate and a meaningless one.

Average input tokens per request includes everything you send to the API: system prompts, user messages, retrieved context, conversation history, function definitions, and any other text in the request body. For a chatbot with a 500-token system prompt and 300 tokens of user message, that is 800 input tokens. For a RAG pipeline injecting 5 retrieved passages at 400 tokens each plus the query, that is 2,100 input tokens. Measure this from your actual API logs rather than guessing.

Average output tokens per request is the length of the model's response. Short classification outputs might be 10-50 tokens. Conversational replies are typically 100-500 tokens. Long-form content generation can reach 2,000-4,000 tokens per response. You can control this with the max_tokens parameter, but setting it too low may truncate useful responses.

Requests per day is your daily API call volume. For a B2B tool with 200 users making 5 requests each, that is 1,000 requests per day. For a consumer app with 50,000 daily active users and 3 interactions each, that is 150,000. Include retry requests and any background processing calls in this count.

Token Counting: Rules of Thumb

Precise token counting requires the provider's actual tokenizer, but these rules of thumb help with estimation:

English prose: 1 word = ~1.3 tokens. A 500-word email is approximately 650 tokens.
Code (Python, JavaScript): 1 line = ~8-15 tokens depending on complexity. A 100-line function is roughly 1,000-1,500 tokens.
JSON/structured data: Higher token density due to brackets, keys, and punctuation. A 1KB JSON payload is typically 300-500 tokens.
Non-English text: Chinese, Japanese, Korean, and Arabic text use 1.5-3x more tokens per word than English due to character-level tokenization.
System prompts: Often 200-2,000 tokens. These are sent with every request, so they contribute significantly to input costs at scale.

For exact counts, use OpenAI's tiktoken library for GPT models or Anthropic's token counting API endpoint for Claude. Our LLM Token Counter tool provides quick estimates across multiple tokenizers.

Batch API Pricing: Cut Costs by 50%

If your workload does not require real-time responses, batch processing can halve your API spend. OpenAI's Batch API processes requests asynchronously and returns results within 24 hours at 50% off standard pricing. Anthropic's Message Batches API offers similar discounts for bulk workloads.

Batch pricing is ideal for:

Content generation pipelines that produce articles, product descriptions, or marketing copy on a schedule
Data extraction and classification running against large datasets overnight
Evaluation and testing where you run thousands of test cases against model outputs
Document summarization processing backlogs of reports, emails, or support tickets

Batch pricing is not reflected in the calculator above (which uses standard real-time rates), but you can mentally halve the displayed costs for any workload that qualifies. At scale, the savings are substantial: a $3,000/month workload drops to $1,500/month with batch processing.

Cost Optimization Strategies

Beyond choosing the cheapest model, several engineering techniques can reduce LLM API costs by 50-90% without sacrificing quality.

1. Prompt Caching

Both OpenAI and Anthropic offer prompt caching that reduces input token costs when you repeatedly send the same prefix (system prompt, instructions, or static context). Anthropic's prompt caching charges a small write fee on the first request but then discounts cached input tokens by 90% on subsequent requests. If your system prompt is 1,500 tokens and you make 10,000 requests per day, caching saves roughly $40/day on Claude Sonnet alone.

2. Model Routing

Not every request needs the most capable model. A classifier (running on a cheap model like GPT-4o-mini or Haiku) can examine incoming requests and route simple ones to a budget model while sending complex queries to a premium model. Many production systems report that 70-80% of requests can be handled by the cheapest tier, reducing blended cost by 60% or more. See our open source vs API comparison for more on tiered model architectures.

3. Response Caching

If users frequently ask similar questions, caching model responses eliminates redundant API calls entirely. A semantic similarity cache using embeddings can match new queries against previously generated answers. Even a simple exact-match cache on normalized inputs catches 10-30% of requests in most customer support and FAQ workloads.

4. Output Length Control

Since output tokens cost 3-5x more than input tokens, constraining response length has an outsized impact on cost. Use explicit instructions like "respond in 2-3 sentences" or "limit your answer to 100 words" in your system prompt. Set max_tokens to a reasonable ceiling. For structured outputs, use JSON mode or function calling to eliminate verbose prose.

5. Context Window Management

In multi-turn conversations, the context window grows with each exchange because you resend the full conversation history. Implement conversation summarization (compress earlier turns into a summary) or sliding window truncation (keep only the last N turns) to prevent input tokens from growing unboundedly. A 20-turn conversation without management can reach 15,000+ input tokens per request; with summarization, you keep it under 3,000.

How GPT-4o Pricing Compares to Claude and Gemini

GPT-4o sits in the mid-range of the pricing spectrum at $2.50 per million input tokens and $10.00 per million output tokens. It is substantially cheaper than Claude Opus 4.6 ($15/$75) but more expensive than Claude Sonnet 4.6 ($3/$15) on output and cheaper on input. For input-heavy workloads like RAG, GPT-4o has a slight price advantage over Sonnet. For output-heavy workloads like content generation, Sonnet and GPT-4o are comparable.

The budget tier tells a different story. GPT-4o-mini ($0.15/$0.60) is cheaper than Claude Haiku 4.5 ($0.80/$4.00) by a significant margin, but Gemini 2.0 Flash ($0.075/$0.30) undercuts both. DeepSeek V3 ($0.27/$1.10) slots between Flash and GPT-4o-mini. For high-volume, cost-sensitive applications, the choice often comes down to which budget model performs adequately for your specific use case.

Use the comparison mode in the calculator above to see exact dollar amounts for your specific workload. The model that is cheapest per token is not always cheapest in practice, because different models require different prompt lengths and produce different response lengths to achieve the same quality of output.

Real-World Cost Examples

Here are concrete cost projections for common application types, based on the 2026 pricing data in this calculator:

Customer support chatbot (1,000 conversations/day, 5 turns each, 1,500 input + 400 output tokens per turn): Claude Sonnet costs ~$1,125/month. GPT-4o costs ~$1,225/month. Haiku costs ~$132/month.
Code review assistant (200 pull requests/day, 3,000 input + 800 output tokens per review): Claude Sonnet costs ~$270/month. GPT-4o costs ~$195/month.
Document summarization pipeline (500 documents/day, 8,000 input + 600 output tokens per document): Claude Sonnet costs ~$495/month. Gemini Flash costs ~$12/month.
AI writing assistant (5,000 requests/day, 500 input + 1,500 output tokens per request): Claude Sonnet costs ~$3,600/month. DeepSeek V3 costs ~$290/month.

When to Consider Self-Hosting

API pricing makes sense when your monthly spend stays below $2,000-$5,000. Beyond that threshold, self-hosting open-source models like Llama 3 or Mistral on dedicated GPU infrastructure can deliver the same throughput at lower cost. A dual-A100 server running Llama 3 70B costs approximately $3,500/month in cloud GPU rental and handles 50-100 requests per second with no per-token charges.

The break-even depends on your quality requirements. If you need GPT-4o or Claude Opus-level reasoning, no open-source model matches them yet, and API access remains the only option. If your task is well-served by a 70B-parameter model, self-hosting pays for itself quickly at high volume. See our full open-source vs. API cost analysis for detailed break-even calculations.

Frequently Asked Questions

How much does GPT-4o cost per request?

GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens. A typical request with 1,000 input tokens and 500 output tokens costs approximately $0.0075 (three-quarters of a cent). Monthly cost depends on request volume: 1,000 requests per day would cost roughly $225 per month.

What is the cheapest LLM API in 2026?

Gemini 2.0 Flash is the cheapest major LLM API at $0.075 per million input tokens and $0.30 per million output tokens. DeepSeek V3 is the second cheapest at $0.27/$1.10 per million tokens. GPT-4o-mini at $0.15/$0.60 offers an excellent balance of cost and quality for lighter workloads.

How do I calculate LLM API costs?

Multiply your average input tokens per request by the model's input price per token, add your average output tokens multiplied by the output price per token. That gives you cost per request. Multiply by daily request count for daily cost, then by 30 for monthly cost. Remember that output tokens are typically 3-5x more expensive than input tokens.

How much does Claude Opus 4.6 cost compared to GPT-4o?

Claude Opus 4.6 is significantly more expensive: $15.00 per million input tokens and $75.00 per million output tokens, compared to GPT-4o at $2.50/$10.00. Opus costs roughly 6-7.5x more than GPT-4o. Opus is designed for complex reasoning tasks where maximum quality justifies the premium.

What are tokens and how do they relate to words?

A token is the smallest unit of text processed by an LLM. In English, one token is roughly 0.75 words, or equivalently, one word is about 1.3 tokens. A 500-word prompt uses approximately 650 tokens. Code and non-English text typically use more tokens per word. Each provider uses a different tokenizer, so exact counts vary by 10-15%.

Does batch API pricing reduce LLM costs?

Yes. OpenAI offers 50% off with Batch API for non-time-sensitive workloads (results within 24 hours). Anthropic offers a similar Message Batches API with reduced pricing. Batch processing is ideal for content generation, data extraction, and classification tasks where you do not need real-time responses.

Built by Michael Lip. Pricing data sourced from official provider pages and updated as of May 2026. Actual costs may vary based on volume discounts, committed use agreements, and regional pricing.

LLM API Cost Calculator

Model Selection

Usage Parameters

Claude Sonnet 4.6 — Cost Breakdown

Side-by-Side Cost Comparison

2026 LLM API Pricing Reference

How LLM API Pricing Works

Understanding the Calculator Inputs

Token Counting: Rules of Thumb

Batch API Pricing: Cut Costs by 50%

Cost Optimization Strategies

1. Prompt Caching

2. Model Routing

3. Response Caching

4. Output Length Control

5. Context Window Management

How GPT-4o Pricing Compares to Claude and Gemini

Real-World Cost Examples

When to Consider Self-Hosting

Frequently Asked Questions

Related Tools and Guides