Original Research

LLM Speed Benchmark — Tokens Per Second Across 30+ Models

Comprehensive speed comparison of 30+ LLMs across providers. Includes output tokens/sec, time to first token (TTFT), and batch throughput for standard APIs and fast-inference platforms like Groq and Cerebras.

By Michael Lip · Updated April 2026

Methodology

Speed measurements are aggregated from provider documentation, community benchmarks (ArtificialAnalysis.ai), and direct API testing. Output tokens/sec was measured during streaming with a standardized 500-token input prompt and a 200-token output; TTFT was measured as the median across 50 requests under normal load. Groq and Cerebras figures are taken from their published benchmarks. Speeds vary with time of day, region, and load. Data is current as of April 2026.
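Both headline metrics can be derived from stream timestamps. A minimal sketch of that calculation (not the actual benchmark harness; the synthetic stream below stands in for a real streaming API response):

```python
def stream_metrics(token_stream, start_time):
    """Compute TTFT (ms) and output tokens/sec from a token stream.

    token_stream yields (arrival_time, token) pairs; start_time is
    when the request was sent. All times are in seconds.
    """
    first = last = None
    count = 0
    for arrival, _token in token_stream:
        if first is None:
            first = arrival  # first token defines TTFT
        last = arrival
        count += 1
    ttft_ms = (first - start_time) * 1000
    # Generation rate over the window from first token to last token.
    tok_per_sec = (count - 1) / (last - first)
    return ttft_ms, tok_per_sec

# Synthetic stream: 200 tokens, 200 ms TTFT, one token every 5 ms.
stream = [(0.200 + 0.005 * i, f"tok{i}") for i in range(200)]
ttft_ms, tps = stream_metrics(iter(stream), start_time=0.0)
```

A real run repeats this per request and reports the median of the 50 TTFT values, as described above.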

| Model | Provider | Output tok/s | TTFT (ms) | Parameters | Input $/1M | Speed Tier |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | Groq (LPU) | 780 | 45 | 8B | $0.05 | Ultra-Fast |
| Llama 3.3 70B | Groq (LPU) | 330 | 120 | 70B | $0.59 | Ultra-Fast |
| Llama 3.1 8B | Cerebras | 720 | 50 | 8B | $0.10 | Ultra-Fast |
| Llama 3.3 70B | Cerebras | 290 | 140 | 70B | $0.60 | Ultra-Fast |
| Llama 3.1 8B | Fireworks AI | 280 | 90 | 8B | $0.10 | Fast |
| Llama 3.3 70B | Fireworks AI | 160 | 200 | 70B | $0.90 | Fast |
| Llama 3.1 405B | Fireworks AI | 65 | 450 | 405B | $3.00 | Standard |
| Mistral Small 3.1 | Mistral API | 200 | 110 | 24B | $0.10 | Fast |
| Mistral Large 2 | Mistral API | 85 | 280 | 123B | $2.00 | Standard |
| Codestral 25.01 | Mistral API | 150 | 150 | 22B | $0.30 | Fast |
| GPT-4o mini | OpenAI | 160 | 200 | ~8B* | $0.15 | Fast |
| GPT-4o | OpenAI | 95 | 350 | ~200B* | $2.50 | Standard |
| GPT-4.5 | OpenAI | 42 | 800 | ~1.8T* | $75.00 | Slow |
| o3 | OpenAI | 35 | 2,500 | ~200B* | $10.00 | Reasoning |
| o4-mini | OpenAI | 90 | 1,200 | ~8B* | $1.10 | Reasoning |
| Gemini 2.5 Flash | Google | 190 | 180 | ~50B* | $0.15 | Fast |
| Gemini 2.5 Pro | Google | 80 | 400 | ~300B* | $1.25 | Standard |
| Gemini 2.0 Flash | Google | 210 | 160 | ~50B* | $0.10 | Fast |
| Claude Opus 4.6 | Anthropic | 55 | 600 | ~300B* | $15.00 | Standard |
| Claude Sonnet 4 | Anthropic | 90 | 350 | ~70B* | $3.00 | Standard |
| Claude Haiku 3.5 | Anthropic | 140 | 220 | ~20B* | $0.80 | Fast |
| DeepSeek V3 | DeepSeek | 110 | 250 | 671B MoE | $0.27 | Standard |
| DeepSeek R1 | DeepSeek | 45 | 1,800 | 671B MoE | $0.55 | Reasoning |
| Qwen 2.5 72B | Together AI | 120 | 220 | 72B | $0.90 | Standard |
| Qwen 2.5 Coder 32B | Together AI | 170 | 150 | 32B | $0.40 | Fast |
| Phi-4 | Azure | 240 | 80 | 14B | $0.07 | Fast |
| Gemma 2 27B | Together AI | 140 | 180 | 27B | $0.30 | Fast |
| Grok-2 | xAI | 75 | 380 | ~300B* | $2.00 | Standard |
| Command R+ | Cohere | 80 | 320 | 104B | $2.50 | Standard |
| Jamba 1.5 Large | AI21 | 95 | 280 | 398B MoE | $2.00 | Standard |
| DBRX | Databricks | 110 | 250 | 132B MoE | $0.75 | Standard |

* Parameter counts for proprietary models are estimated based on public reporting and inference cost analysis. MoE = Mixture of Experts (active parameters are lower).
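Tok/s and TTFT combine into the figure that matters in practice: how long a full reply takes. A rough estimate using numbers from the table above (assumes a constant generation rate, which real streams only approximate):

```python
def total_latency_s(ttft_ms, tok_per_sec, output_tokens):
    """Wall-clock estimate for a full reply: wait for the first
    token, then generate the rest at a constant rate (simplified)."""
    return ttft_ms / 1000 + output_tokens / tok_per_sec

# Table figures, 200-token reply (the benchmark's output length).
groq_llama70b = total_latency_s(120, 330, 200)  # roughly 0.73 s
gpt_4_5 = total_latency_s(800, 42, 200)         # roughly 5.6 s
```

The gap compounds with longer outputs: at 2,000 tokens the same two rows work out to about 6 s versus about 48 s.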

Frequently Asked Questions

Which LLM has the fastest inference speed?

For raw output speed, Groq's LPU hardware delivers Llama 3.3 70B at approximately 330 tokens/sec and Llama 3.1 8B at roughly 780 tokens/sec. Cerebras achieves similar speeds. Among proprietary APIs, Gemini 2.0 Flash leads at ~210 tok/sec, with Gemini 2.5 Flash at ~190 and GPT-4o mini at ~160. Frontier models like Claude Opus 4.6 and GPT-4.5 are slower (40-60 tok/sec) due to larger parameter counts.

What is TTFT and why does it matter?

TTFT (Time to First Token) measures the delay between sending a request and receiving the first token. Under 500ms feels instant, 500ms-1s is acceptable, over 2s feels sluggish. TTFT is affected by prompt length, server load, and model size. Reasoning models (o3, R1) have much higher TTFT because they perform extended thinking before responding.
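Those thresholds map directly to a lookup. A small sketch; the "borderline" label for the 1–2 s band is my addition, since the text above leaves it unnamed:

```python
def ttft_feel(ttft_ms):
    """Bucket a TTFT measurement using the thresholds above."""
    if ttft_ms < 500:
        return "instant"
    if ttft_ms <= 1000:
        return "acceptable"
    if ttft_ms <= 2000:
        return "borderline"  # band not named in the article
    return "sluggish"

print(ttft_feel(45))    # Groq Llama 3.1 8B
print(ttft_feel(2500))  # o3
```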

How does Groq achieve such fast inference?

Groq uses custom LPU (Language Processing Unit) hardware designed for sequential token generation. Unlike GPUs, which are optimized for parallel batch processing, LPUs minimize memory bandwidth bottlenecks during autoregressive decoding, achieving 3-5x faster inference for the same model. The tradeoff: Groq currently supports only a limited set of open-source models, often with smaller context windows than other providers offer.

Does model speed affect output quality?

Running the same model weights faster does not reduce quality: Groq running Llama 70B produces effectively identical outputs to a GPU deployment (numerical differences aside). However, quantized models (INT4, INT8) trade small quality reductions for speed gains. The bigger concern is choosing a faster but less capable model: GPT-4o mini is 3x faster than GPT-4o but scores 5-10% lower on benchmarks.

What is speculative decoding?

Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them in parallel with the large target model. If predictions match (70-90% for typical text), you get multiple tokens for one forward pass. This improves throughput 2-3x without quality loss. Google uses this in Gemini, and it is increasingly common in vLLM and TensorRT-LLM.
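The accept/verify loop can be shown with a toy version, where deterministic integer "models" stand in for real LLMs (all names here are illustrative, and the looped verification stands in for what is really one batched forward pass):

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy speculative decoding with greedy (deterministic) models.

    `target` and `draft` are functions: sequence -> next token.
    The draft proposes k tokens; the target checks every position
    (one batched pass in a real system), keeps the longest matching
    prefix, and on mismatch substitutes its own token. Output is
    identical to decoding with `target` alone.
    """
    seq = list(prompt)
    target_passes = 0  # forward passes of the expensive model
    while len(seq) - len(prompt) < n_tokens:
        # Draft k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            ctx.append(draft(ctx))
            proposal.append(ctx[-1])
        # Verify: one target pass covers all k positions.
        target_passes += 1
        for t in proposal:
            if t == target(seq):
                seq.append(t)          # draft token accepted
            else:
                seq.append(target(seq))  # target's correction
                break
        else:
            seq.append(target(seq))    # bonus token: all k accepted
    return seq[len(prompt):len(prompt) + n_tokens], target_passes

# Target counts 0,1,...,9,0,...; the draft mis-predicts after a 6.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: (s[-1] + 1) % 10 if s[-1] != 6 else 9
out, passes = speculative_decode(target, draft, [0], 20)
```

Here the target is consulted far fewer than 20 times for 20 tokens, and `out` matches plain greedy decoding exactly, which is the whole point: speedup without any change in output.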