Original Research

LLM Speed Benchmark — Tokens Per Second Across 30+ Models

Comprehensive speed comparison of 30+ LLMs across providers. Includes output tokens/sec, time to first token (TTFT), and batch throughput for standard APIs and fast-inference platforms like Groq and Cerebras.

By Michael Lip · Updated April 2026

Methodology

Speed measurements are aggregated from provider documentation, community benchmarks (ArtificialAnalysis.ai), and direct API testing. Output tokens/sec was measured during streaming with a standardized 500-token input prompt and a 200-token output; TTFT was measured as the median across 50 requests under normal load. Groq and Cerebras figures are taken from their published benchmarks. Speeds vary with time of day, region, and load. Data is current as of April 2026.
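Both headline metrics can be derived from stream timestamps. A minimal sketch of that calculation (not the actual benchmark harness; the synthetic stream below stands in for a real streaming API response):

```python
def stream_metrics(token_stream, start_time):
    """Compute TTFT (ms) and output tokens/sec from a token stream.

    token_stream yields (arrival_time, token) pairs; start_time is
    when the request was sent. All times are in seconds.
    """
    first = last = None
    count = 0
    for arrival, _token in token_stream:
        if first is None:
            first = arrival  # first token defines TTFT
        last = arrival
        count += 1
    ttft_ms = (first - start_time) * 1000
    # Generation rate over the window from first token to last token.
    tok_per_sec = (count - 1) / (last - first)
    return ttft_ms, tok_per_sec

# Synthetic stream: 200 tokens, 200 ms TTFT, one token every 5 ms.
stream = [(0.200 + 0.005 * i, f"tok{i}") for i in range(200)]
ttft_ms, tps = stream_metrics(iter(stream), start_time=0.0)
```

A real run repeats this per request and reports the median of the 50 TTFT values, as described above.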

| Model | Provider | Output tok/s | TTFT (ms) | Parameters | Input $/1M | Speed Tier |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | Groq (LPU) | 780 | 45 | 8B | $0.05 | Ultra-Fast |
| Llama 3.3 70B | Groq (LPU) | 330 | 120 | 70B | $0.59 | Ultra-Fast |
| Llama 3.1 8B | Cerebras | 720 | 50 | 8B | $0.10 | Ultra-Fast |
| Llama 3.3 70B | Cerebras | 290 | 140 | 70B | $0.60 | Ultra-Fast |
| Llama 3.1 8B | Fireworks AI | 280 | 90 | 8B | $0.10 | Fast |
| Llama 3.3 70B | Fireworks AI | 160 | 200 | 70B | $0.90 | Fast |
| Llama 3.1 405B | Fireworks AI | 65 | 450 | 405B | $3.00 | Standard |
| Mistral Small 3.1 | Mistral API | 200 | 110 | 24B | $0.10 | Fast |
| Mistral Large 2 | Mistral API | 85 | 280 | 123B | $2.00 | Standard |
| Codestral 25.01 | Mistral API | 150 | 150 | 22B | $0.30 | Fast |
| GPT-4o mini | OpenAI | 160 | 200 | ~8B* | $0.15 | Fast |
| GPT-4o | OpenAI | 95 | 350 | ~200B* | $2.50 | Standard |
| GPT-4.5 | OpenAI | 42 | 800 | ~1.8T* | $75.00 | Slow |
| o3 | OpenAI | 35 | 2,500 | ~200B* | $10.00 | Reasoning |
| o4-mini | OpenAI | 90 | 1,200 | ~8B* | $1.10 | Reasoning |
| Gemini 2.5 Flash | Google | 190 | 180 | ~50B* | $0.15 | Fast |
| Gemini 2.5 Pro | Google | 80 | 400 | ~300B* | $1.25 | Standard |
| Gemini 2.0 Flash | Google | 210 | 160 | ~50B* | $0.10 | Fast |
| Claude Opus 4.6 | Anthropic | 55 | 600 | ~300B* | $15.00 | Standard |
| Claude Sonnet 4 | Anthropic | 90 | 350 | ~70B* | $3.00 | Standard |
| Claude Haiku 3.5 | Anthropic | 140 | 220 | ~20B* | $0.80 | Fast |
| DeepSeek V3 | DeepSeek | 110 | 250 | 671B MoE | $0.27 | Standard |
| DeepSeek R1 | DeepSeek | 45 | 1,800 | 671B MoE | $0.55 | Reasoning |
| Qwen 2.5 72B | Together AI | 120 | 220 | 72B | $0.90 | Standard |
| Qwen 2.5 Coder 32B | Together AI | 170 | 150 | 32B | $0.40 | Fast |
| Phi-4 | Azure | 240 | 80 | 14B | $0.07 | Fast |
| Gemma 2 27B | Together AI | 140 | 180 | 27B | $0.30 | Fast |
| Grok-2 | xAI | 75 | 380 | ~300B* | $2.00 | Standard |
| Command R+ | Cohere | 80 | 320 | 104B | $2.50 | Standard |
| Jamba 1.5 Large | AI21 | 95 | 280 | 398B MoE | $2.00 | Standard |
| DBRX | Databricks | 110 | 250 | 132B MoE | $0.75 | Standard |

* Parameter counts for proprietary models are estimated based on public reporting and inference cost analysis. MoE = Mixture of Experts (active parameters are lower).
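Tok/s and TTFT combine into the figure that matters in practice: how long a full reply takes. A rough estimate using numbers from the table above (assumes a constant generation rate, which real streams only approximate):

```python
def total_latency_s(ttft_ms, tok_per_sec, output_tokens):
    """Wall-clock estimate for a full reply: wait for the first
    token, then generate the rest at a constant rate (simplified)."""
    return ttft_ms / 1000 + output_tokens / tok_per_sec

# Table figures, 200-token reply (the benchmark's output length).
groq_llama70b = total_latency_s(120, 330, 200)  # roughly 0.73 s
gpt_4_5 = total_latency_s(800, 42, 200)         # roughly 5.6 s
```

The gap compounds with longer outputs: at 2,000 tokens the same two rows work out to about 6 s versus about 48 s.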

Frequently Asked Questions

Which LLM has the fastest inference speed?

For raw output speed, Groq's LPU hardware delivers Llama 3.3 70B at approximately 330 tokens/sec and Llama 3.1 8B at roughly 780 tokens/sec. Cerebras achieves similar speeds. Among proprietary APIs, Gemini 2.0 Flash leads at ~210 tok/sec, with Gemini 2.5 Flash at ~190 and GPT-4o mini at ~160. Frontier models like Claude Opus 4.6 and GPT-4.5 are slower (40-60 tok/sec) due to larger parameter counts.

What is TTFT and why does it matter?

TTFT (Time to First Token) measures the delay between sending a request and receiving the first token. Under 500ms feels instant, 500ms-1s is acceptable, over 2s feels sluggish. TTFT is affected by prompt length, server load, and model size. Reasoning models (o3, R1) have much higher TTFT because they perform extended thinking before responding.
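Those thresholds map directly to a lookup. A small sketch; the "borderline" label for the 1–2 s band is my addition, since the text above leaves it unnamed:

```python
def ttft_feel(ttft_ms):
    """Bucket a TTFT measurement using the thresholds above."""
    if ttft_ms < 500:
        return "instant"
    if ttft_ms <= 1000:
        return "acceptable"
    if ttft_ms <= 2000:
        return "borderline"  # band not named in the article
    return "sluggish"

print(ttft_feel(45))    # Groq Llama 3.1 8B
print(ttft_feel(2500))  # o3
```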

How does Groq achieve such fast inference?

Groq uses custom LPU (Language Processing Unit) hardware designed for sequential token generation. Unlike GPUs, which are optimized for parallel batch processing, LPUs minimize memory bandwidth bottlenecks during autoregressive decoding, achieving 3-5x faster inference for the same model. The tradeoff: Groq currently supports only a limited set of open-source models, often with smaller context windows than other providers offer.

Does model speed affect output quality?

Running the same model weights faster does not reduce quality: Groq running Llama 70B produces effectively identical outputs to a GPU deployment (numerical differences aside). However, quantized models (INT4, INT8) trade small quality reductions for speed gains. The bigger concern is choosing a faster but less capable model: GPT-4o mini is 3x faster than GPT-4o but scores 5-10% lower on benchmarks.

What is speculative decoding?

Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them in parallel with the large target model. If predictions match (70-90% for typical text), you get multiple tokens for one forward pass. This improves throughput 2-3x without quality loss. Google uses this in Gemini, and it is increasingly common in vLLM and TensorRT-LLM.
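The accept/verify loop can be shown with a toy version, where deterministic integer "models" stand in for real LLMs (all names here are illustrative, and the looped verification stands in for what is really one batched forward pass):

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy speculative decoding with greedy (deterministic) models.

    `target` and `draft` are functions: sequence -> next token.
    The draft proposes k tokens; the target checks every position
    (one batched pass in a real system), keeps the longest matching
    prefix, and on mismatch substitutes its own token. Output is
    identical to decoding with `target` alone.
    """
    seq = list(prompt)
    target_passes = 0  # forward passes of the expensive model
    while len(seq) - len(prompt) < n_tokens:
        # Draft k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            ctx.append(draft(ctx))
            proposal.append(ctx[-1])
        # Verify: one target pass covers all k positions.
        target_passes += 1
        for t in proposal:
            if t == target(seq):
                seq.append(t)          # draft token accepted
            else:
                seq.append(target(seq))  # target's correction
                break
        else:
            seq.append(target(seq))    # bonus token: all k accepted
    return seq[len(prompt):len(prompt) + n_tokens], target_passes

# Target counts 0,1,...,9,0,...; the draft mis-predicts after a 6.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: (s[-1] + 1) % 10 if s[-1] != 6 else 9
out, passes = speculative_decode(target, draft, [0], 20)
```

Here the target is consulted far fewer than 20 times for 20 tokens, and `out` matches plain greedy decoding exactly, which is the whole point: speedup without any change in output.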