LLM Context Window Benchmark — How Models Perform at Maximum Context
Comprehensive comparison of context window sizes and retrieval performance across 30+ large language models. Real benchmark data showing which models maintain accuracy at maximum context length.
By Michael Lip · Updated April 2026
Methodology
Context window sizes are sourced from official provider documentation. Needle in a Haystack (NIAH) scores are aggregated from published evaluations and community benchmarks. Performance degradation is measured as the drop in retrieval accuracy between a 4K-token context and the model's maximum context on standardized retrieval tasks. Prices are taken from official API pricing pages as of April 2026.
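As a concrete illustration of how the degradation column is computed (a minimal sketch; the actual evaluation harnesses vary by publisher, and the function name is ours):

```python
# Illustrative sketch of the degradation metric: the change in retrieval
# accuracy, in percentage points, between a 4K context and the max context.

def degradation(accuracy_4k: float, accuracy_max: float) -> float:
    """Negative values mean accuracy was lost at maximum context."""
    return accuracy_max - accuracy_4k

# E.g. a model scoring 99.5% at 4K context and 96.3% at maximum context:
print(degradation(99.5, 96.3))  # -3.2
```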
| Model | Provider | Context Window (tokens) | NIAH Score | Degradation (4K → max) | Input $ / 1M tokens | Cost at Max Context |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | Google | 1,000,000 | 99.2% | -4.1% | $1.25 | $1.25 |
| Gemini 2.5 Flash | Google | 1,000,000 | 97.8% | -5.3% | $0.15 | $0.15 |
| Gemini 1.5 Pro | Google | 2,000,000 | 98.1% | -6.2% | $1.25 | $2.50 |
| Claude Opus 4.6 | Anthropic | 200,000 | 99.1% | -3.2% | $15.00 | $3.00 |
| Claude Sonnet 4 | Anthropic | 200,000 | 98.7% | -3.8% | $3.00 | $0.60 |
| Claude Haiku 3.5 | Anthropic | 200,000 | 96.5% | -5.1% | $0.80 | $0.16 |
| GPT-4.5 | OpenAI | 128,000 | 97.4% | -8.5% | $75.00 | $9.60 |
| GPT-4o | OpenAI | 128,000 | 97.8% | -7.2% | $2.50 | $0.32 |
| GPT-4o mini | OpenAI | 128,000 | 95.3% | -9.8% | $0.15 | $0.02 |
| o3 | OpenAI | 200,000 | 98.5% | -4.1% | $10.00 | $2.00 |
| o4-mini | OpenAI | 200,000 | 97.2% | -5.5% | $1.10 | $0.22 |
| Llama 3.1 405B | Meta | 128,000 | 95.8% | -10.2% | $3.00 | $0.38 |
| Llama 3.1 70B | Meta | 128,000 | 94.1% | -11.5% | $0.88 | $0.11 |
| Llama 3.1 8B | Meta | 128,000 | 89.2% | -16.8% | $0.18 | $0.02 |
| Llama 3.3 70B | Meta | 128,000 | 95.6% | -9.4% | $0.88 | $0.11 |
| Mistral Large 2 | Mistral | 128,000 | 96.2% | -7.8% | $2.00 | $0.26 |
| Mistral Small 3.1 | Mistral | 128,000 | 93.5% | -10.1% | $0.10 | $0.01 |
| Codestral 25.01 | Mistral | 256,000 | 94.8% | -8.7% | $0.30 | $0.08 |
| Command R+ | Cohere | 128,000 | 94.5% | -9.5% | $2.50 | $0.32 |
| Command R | Cohere | 128,000 | 92.8% | -11.2% | $0.15 | $0.02 |
| Qwen 2.5 72B | Alibaba | 131,072 | 95.1% | -8.9% | $0.90 | $0.12 |
| Qwen 2.5 Coder 32B | Alibaba | 131,072 | 93.7% | -10.3% | $0.40 | $0.05 |
| DeepSeek V3 | DeepSeek | 128,000 | 96.4% | -6.8% | $0.27 | $0.03 |
| DeepSeek R1 | DeepSeek | 128,000 | 95.9% | -7.3% | $0.55 | $0.07 |
| Yi-Large | 01.AI | 200,000 | 94.2% | -9.1% | $3.00 | $0.60 |
| Jamba 1.5 Large | AI21 | 256,000 | 93.8% | -8.5% | $2.00 | $0.51 |
| Jamba 1.5 Mini | AI21 | 256,000 | 91.2% | -12.1% | $0.20 | $0.05 |
| Phi-4 | Microsoft | 16,384 | 96.8% | -3.5% | $0.07 | <$0.01 |
| Grok-2 | xAI | 131,072 | 95.5% | -7.9% | $2.00 | $0.26 |
| DBRX | Databricks | 32,768 | 91.5% | -12.4% | $0.75 | $0.02 |
| Falcon 3 10B | TII | 32,768 | 88.7% | -15.2% | $0.15 | <$0.01 |
Frequently Asked Questions
Which LLM has the largest context window?
Gemini 2.5 Pro has the largest production context window at 1 million tokens, and Gemini 1.5 Pro supports up to 2 million tokens in experimental mode. Claude Opus 4.6 supports 200K tokens, while GPT-4.5 and Llama 3.1 405B support 128K. However, a larger context window does not guarantee better performance: retrieval accuracy degrades significantly at maximum length for most models.
Does LLM performance degrade with longer context?
Yes, most LLMs show measurable performance degradation as context length increases. The "lost in the middle" phenomenon means models retrieve information from the beginning and end of the context window more reliably than from the middle. In the data above, GPT-4.5 degrades by roughly 8.5% at its full 128K context versus 4K, while Claude Opus 4.6 loses only about 3.2% at 200K, making it one of the most robust models for long-context tasks.
What is the Needle in a Haystack test for LLMs?
The Needle in a Haystack (NIAH) test evaluates how well an LLM can retrieve a specific piece of information ("needle") placed at various positions within a long context ("haystack"). The test varies both the position of the needle and the total context length to produce a heatmap of retrieval accuracy. A perfect score means the model can find the information regardless of where it appears in the context.
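A minimal NIAH harness might look like the sketch below. The needle, question, and filler text are illustrative (published benchmarks use their own corpora), `model` is any callable that takes a prompt string and returns the model's reply, and words stand in for tokens:

```python
# Minimal NIAH harness sketch: vary needle depth and context length,
# then record whether the model retrieved the planted fact.

NEEDLE = "The secret ingredient is dried bergamot peel."
QUESTION = "What is the secret ingredient?"

def build_haystack(filler: str, depth: float, n_words: int) -> str:
    """Repeat filler text to n_words words, placing the needle at a
    relative depth (0.0 = start of context, 1.0 = end)."""
    reps = n_words // max(len(filler.split()), 1) + 1
    words = ((filler + " ") * reps).split()[:n_words]
    insert_at = int(depth * len(words))
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])

def run_niah(model, filler: str,
             depths=(0.0, 0.25, 0.5, 0.75, 1.0),
             lengths=(1_000, 10_000, 100_000)) -> dict:
    """Return {(context length, needle depth): retrieved?}, the raw
    data behind the usual NIAH heatmap."""
    results = {}
    for length in lengths:
        for depth in depths:
            prompt = build_haystack(filler, depth, length) + "\n\n" + QUESTION
            reply = model(prompt)
            results[(length, depth)] = "bergamot" in reply.lower()
    return results
```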
How much does context window length affect API cost?
Context length directly affects cost because you pay per input token. Sending 128K tokens of context to GPT-4.5 at $75/1M input tokens costs $9.60 per request; the same 128K tokens sent to Claude Opus 4.6 at $15/1M costs $1.92, and filling Claude's full 200K window costs $3.00. For cost efficiency with long contexts, Gemini 2.5 Flash at $0.15/1M tokens costs just $0.15 even at its full 1M-token window. Use the KickLLM calculator to model exact costs.
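The arithmetic is linear in context length, as this quick sketch using the table's April 2026 prices shows:

```python
# Input cost scales linearly with context length: tokens / 1M * price per 1M.

def request_cost(context_tokens: int, price_per_million_usd: float) -> float:
    return context_tokens / 1_000_000 * price_per_million_usd

print(request_cost(128_000, 75.00))    # GPT-4.5 at 128K:          9.60
print(request_cost(200_000, 15.00))    # Claude Opus 4.6 at 200K:  3.00
print(request_cost(1_000_000, 0.15))   # Gemini 2.5 Flash at 1M:   0.15
```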
Should I use the full context window or RAG?
It depends on your use case. Full context is better for tasks requiring holistic understanding of a document (summarization, analysis) and when you have fewer than 100K tokens of relevant content. RAG is better when you have millions of documents, need real-time information, or want to minimize cost. Many production systems use a hybrid approach: RAG to retrieve relevant chunks, then feed them into a long context window for reasoning.
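A minimal sketch of that hybrid pattern; the `vector_store.search` and `llm.complete` interfaces here are stand-ins for whatever retrieval and completion clients you actually use:

```python
# Hybrid RAG + long context: retrieval narrows millions of documents down to
# the most relevant chunks, then a long-context model reasons over all of them.

def hybrid_answer(question: str, vector_store, llm, top_k: int = 20) -> str:
    chunks = vector_store.search(question, top_k=top_k)     # RAG step
    context = "\n\n".join(chunk.text for chunk in chunks)   # assemble context
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm.complete(prompt)                             # long-context reasoning
```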