LLM Speed Benchmark — Tokens Per Second Across 30 Models
Comprehensive speed comparison of 30+ model-provider configurations. Includes output tokens/sec, time to first token (TTFT), pricing, and speed tiers for standard APIs and fast-inference platforms like Groq and Cerebras.
By Michael Lip · Updated April 2026
Methodology
Speed measurements are aggregated from provider documentation, community benchmarks (ArtificialAnalysis.ai), and direct API testing. Output tokens/sec was measured during streaming with a standardized 500-token input prompt and a 200-token output. TTFT was measured as the median across 50 requests under normal load. Groq and Cerebras figures come from their published benchmarks. Speeds vary by time of day, region, and load. Data is current as of April 2026.
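For readers who want to reproduce the measurements, the sketch below shows the general approach: time a streaming chat completion against an OpenAI-compatible endpoint, take TTFT as the delay to the first content chunk, and derive output tokens/sec from the rest of the stream. The model name, prompt, and the chunk-per-token approximation are illustrative assumptions, not the exact harness behind this table.

```python
import time
from openai import OpenAI  # any OpenAI-compatible SDK and endpoint

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def measure_stream(model: str, prompt: str, max_tokens: int = 200):
    """Return (ttft_seconds, output_tokens_per_sec) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # approximation: one streamed content chunk ~ one token

    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    gen_time = end - (first_token_at or start)
    return ttft, (chunks / gen_time if gen_time > 0 else 0.0)


if __name__ == "__main__":
    ttft, tps = measure_stream("gpt-4o-mini", "Explain TTFT in two sentences.")
    print(f"TTFT: {ttft * 1000:.0f} ms, output speed: {tps:.0f} tok/s")
```

A real harness repeats this across many requests (50 here) and reports the median TTFT, since single measurements are noisy under variable load.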
| Model | Provider | Output tok/s | TTFT (ms) | Parameters | Input $/1M | Speed Tier |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | Groq (LPU) | 780 | 45 | 8B | $0.05 | Ultra-Fast |
| Llama 3.3 70B | Groq (LPU) | 330 | 120 | 70B | $0.59 | Ultra-Fast |
| Llama 3.1 8B | Cerebras | 720 | 50 | 8B | $0.10 | Ultra-Fast |
| Llama 3.3 70B | Cerebras | 290 | 140 | 70B | $0.60 | Ultra-Fast |
| Llama 3.1 8B | Fireworks AI | 280 | 90 | 8B | $0.10 | Fast |
| Llama 3.3 70B | Fireworks AI | 160 | 200 | 70B | $0.90 | Fast |
| Llama 3.1 405B | Fireworks AI | 65 | 450 | 405B | $3.00 | Standard |
| Mistral Small 3.1 | Mistral API | 200 | 110 | 24B | $0.10 | Fast |
| Mistral Large 2 | Mistral API | 85 | 280 | 123B | $2.00 | Standard |
| Codestral 25.01 | Mistral API | 150 | 150 | 22B | $0.30 | Fast |
| GPT-4o mini | OpenAI | 160 | 200 | ~8B* | $0.15 | Fast |
| GPT-4o | OpenAI | 95 | 350 | ~200B* | $2.50 | Standard |
| GPT-4.5 | OpenAI | 42 | 800 | ~1.8T* | $75.00 | Slow |
| o3 | OpenAI | 35 | 2,500 | ~200B* | $10.00 | Reasoning |
| o4-mini | OpenAI | 90 | 1,200 | ~8B* | $1.10 | Reasoning |
| Gemini 2.5 Flash | Google | 190 | 180 | ~50B* | $0.15 | Fast |
| Gemini 2.5 Pro | Google | 80 | 400 | ~300B* | $1.25 | Standard |
| Gemini 2.0 Flash | Google | 210 | 160 | ~50B* | $0.10 | Fast |
| Claude Opus 4.6 | Anthropic | 55 | 600 | ~300B* | $15.00 | Standard |
| Claude Sonnet 4 | Anthropic | 90 | 350 | ~70B* | $3.00 | Standard |
| Claude Haiku 3.5 | Anthropic | 140 | 220 | ~20B* | $0.80 | Fast |
| DeepSeek V3 | DeepSeek | 110 | 250 | 671B MoE | $0.27 | Standard |
| DeepSeek R1 | DeepSeek | 45 | 1,800 | 671B MoE | $0.55 | Reasoning |
| Qwen 2.5 72B | Together AI | 120 | 220 | 72B | $0.90 | Standard |
| Qwen 2.5 Coder 32B | Together AI | 170 | 150 | 32B | $0.40 | Fast |
| Phi-4 | Azure | 240 | 80 | 14B | $0.07 | Fast |
| Gemma 2 27B | Together AI | 140 | 180 | 27B | $0.30 | Fast |
| Grok-2 | xAI | 75 | 380 | ~300B* | $2.00 | Standard |
| Command R+ | Cohere | 80 | 320 | 104B | $2.50 | Standard |
| Jamba 1.5 Large | AI21 | 95 | 280 | 398B MoE | $2.00 | Standard |
| DBRX | Databricks | 110 | 250 | 132B MoE | $0.75 | Standard |
* Parameter counts for proprietary models are estimated based on public reporting and inference cost analysis. MoE = Mixture of Experts (active parameters are lower).
Frequently Asked Questions
Which LLM has the fastest inference speed?
For raw output speed, Groq's LPU hardware delivers Llama 3.3 70B at approximately 330 tokens/sec and Llama 3.1 8B at roughly 780 tokens/sec; Cerebras achieves similar speeds. Among proprietary APIs, Gemini 2.0 Flash leads at ~210 tok/sec, followed by Gemini 2.5 Flash at ~190 and GPT-4o mini at ~160. Frontier models like Claude Opus 4.6 and GPT-4.5 are slower (roughly 40-60 tok/sec) due to larger parameter counts.
What is TTFT and why does it matter?
TTFT (Time to First Token) measures the delay between sending a request and receiving the first token. As a rough guide, under 500ms feels instant, 500ms-1s is acceptable, and over 2s feels sluggish. TTFT is affected by prompt length, server load, and model size. Reasoning models (o3, R1) have much higher TTFT because they perform extended thinking before responding.
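To see why both numbers matter, a rough end-to-end estimate is TTFT plus output length divided by output speed. A minimal sketch using two rows from the table above:

```python
def estimated_latency_s(ttft_ms: float, output_tokens: int, tok_per_s: float) -> float:
    """Rough end-to-end latency: time to first token + generation time."""
    return ttft_ms / 1000 + output_tokens / tok_per_s


# GPT-4o (350 ms TTFT, ~95 tok/s) vs. o3 (2,500 ms TTFT, ~35 tok/s), 500-token answer
print(f"{estimated_latency_s(350, 500, 95):.1f} s")   # ~5.6 s
print(f"{estimated_latency_s(2500, 500, 35):.1f} s")  # ~16.8 s
```

For short answers, TTFT dominates perceived responsiveness; for long answers, output tokens/sec dominates.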
How does Groq achieve such fast inference?
Groq uses custom LPU (Language Processing Unit) hardware designed for sequential token generation. Unlike GPUs optimized for parallel batch processing, LPUs minimize memory bandwidth bottlenecks during autoregressive decoding, achieving 3-5x faster inference for the same model. The tradeoff: Groq currently supports only open-weight models and offers more limited context windows.
Does model speed affect output quality?
Running the same model weights faster does not reduce quality; Groq running Llama 3.3 70B produces effectively the same outputs as a GPU deployment. However, quantized models (INT4, INT8) trade small quality reductions for speed gains. The bigger concern is choosing a faster but less capable model: GPT-4o mini is roughly 1.7x faster than GPT-4o in this benchmark (160 vs. 95 tok/sec) but scores 5-10% lower on capability benchmarks.
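As a minimal illustration of the quantization tradeoff (not any provider's actual pipeline), the sketch below symmetrically quantizes a stand-in weight matrix to INT8 and reports the reconstruction error that the 4x memory saving is traded against:

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)  # stand-in weight matrix

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale

rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"INT8 is 4x smaller than FP32; relative reconstruction error ~{rel_err:.2%}")
```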
What is speculative decoding?
Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them in parallel with the large target model. If predictions match (70-90% for typical text), you get multiple tokens for one forward pass. This improves throughput 2-3x without quality loss. Google uses this in Gemini, and it is increasingly common in vLLM and TensorRT-LLM.
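A toy sketch of the draft-and-verify loop, using greedy decoding and stand-in model functions (both `target_next` and `draft_next` here are illustrative placeholders, not a real model pair):

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding: the draft proposes k tokens, the target
    verifies them, and tokens are accepted up to the first disagreement."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2. Target model checks each proposed position
        #    (in a real system this is a single batched forward pass).
        accepted, ctx = 0, list(out)
        for t in proposal:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            accepted += 1

        # 3. Keep the accepted prefix, then take one token from the target,
        #    so progress is guaranteed even when nothing is accepted.
        out.extend(proposal[:accepted])
        out.append(target_next(out))
    return out


def target_next(ctx):  # hypothetical rule standing in for the large target model
    return (len(ctx) * 7) % 50


def draft_next(ctx):   # cheap draft that agrees with the target most of the time
    return (len(ctx) * 7) % 50 if len(ctx) % 5 else 0


print(speculative_decode(target_next, draft_next, prompt=[1, 2, 3]))
```

When the draft's acceptance rate is high, each loop iteration emits up to k+1 tokens for roughly one expensive target pass, which is where the 2-3x throughput gain comes from.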