Original Research

AI API Latency Comparison — Response Times Across Major Providers

Real-world latency benchmarks for major LLM APIs. Compare first-token time, throughput, and total response speed across OpenAI, Anthropic, Google, Mistral, Groq, and more.

By Michael Lip · Updated April 2026

Methodology

Latency data is collected from standardized API calls with a 500-token input prompt and 200-token output request, measured from US-East data centers. Time to first token (TTFT) and tokens per second (TPS) are median values across 100 requests. Pricing is from official provider documentation as of April 2026. All tests use streaming mode. Latency varies by load, time of day, and region.
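The TTFT and tokens-per-second figures described above can be collected with a small timing harness wrapped around any streaming client. Below is a minimal sketch; the `fake_stream` generator is a hypothetical stand-in for a provider's streaming response iterator, not a real SDK call:

```python
import time

def measure_stream(stream):
    """Consume a token iterator and return (ttft_seconds, tokens_per_second).

    TTFT is measured from the moment we start consuming to the first yielded
    token; throughput is the tokens after the first, divided by the
    remaining wall time.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now
        count += 1
    end = time.perf_counter()
    ttft = first - start
    tps = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return ttft, tps

def fake_stream(n_tokens=20, ttft=0.05, per_token=0.01):
    """Hypothetical stand-in for a provider's streaming iterator."""
    time.sleep(ttft)
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(per_token)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
```

In a real benchmark, the stream would come from a provider SDK's streaming mode, and the run would be repeated (the methodology above uses 100 requests) with the median TTFT and TPS reported.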

| Model | Provider | TTFT (ms) | Tokens/sec | 200-tok Time | Input $/1M | Output $/1M |
|---|---|---|---|---|---|---|
| GPT-4.5 | OpenAI | 1,200 | 45 | 5.6s | $75.00 | $150.00 |
| GPT-4o | OpenAI | 380 | 92 | 2.5s | $2.50 | $10.00 |
| GPT-4o mini | OpenAI | 250 | 115 | 2.0s | $0.15 | $0.60 |
| o3 | OpenAI | 2,500 | 38 | 7.8s | $10.00 | $40.00 |
| o4-mini | OpenAI | 1,800 | 55 | 5.4s | $1.10 | $4.40 |
| Claude Opus 4.6 | Anthropic | 900 | 55 | 4.5s | $15.00 | $75.00 |
| Claude Sonnet 4 | Anthropic | 420 | 88 | 2.7s | $3.00 | $15.00 |
| Claude Haiku 3.5 | Anthropic | 280 | 130 | 1.8s | $0.80 | $4.00 |
| Gemini 2.5 Pro | Google | 450 | 82 | 2.9s | $1.25 | $10.00 |
| Gemini 2.5 Flash | Google | 220 | 145 | 1.6s | $0.15 | $0.60 |
| Gemini 2.0 Flash | Google | 200 | 155 | 1.5s | $0.10 | $0.40 |
| Mistral Large 2 | Mistral | 480 | 72 | 3.3s | $2.00 | $6.00 |
| Mistral Small 3.1 | Mistral | 310 | 105 | 2.2s | $0.10 | $0.30 |
| Codestral 25.01 | Mistral | 350 | 95 | 2.4s | $0.30 | $0.90 |
| Llama 3.1 70B (Groq) | Groq | 180 | 330 | 0.8s | $0.59 | $0.79 |
| Llama 3.1 8B (Groq) | Groq | 120 | 520 | 0.5s | $0.05 | $0.08 |
| Llama 3.3 70B (Together) | Together | 350 | 95 | 2.4s | $0.88 | $0.88 |
| DeepSeek V3 | DeepSeek | 520 | 68 | 3.4s | $0.27 | $1.10 |
| DeepSeek R1 | DeepSeek | 1,800 | 42 | 6.6s | $0.55 | $2.19 |
| Command R+ | Cohere | 550 | 65 | 3.6s | $2.50 | $10.00 |
| Command R | Cohere | 380 | 90 | 2.6s | $0.15 | $0.60 |
| Qwen 2.5 72B (Together) | Together | 420 | 78 | 2.9s | $0.90 | $0.90 |
| Grok-2 | xAI | 480 | 75 | 3.1s | $2.00 | $10.00 |
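The 200-token times in the table follow a simple model: total time ≈ TTFT + output_tokens / TPS. A minimal sketch of that estimate (the function name is illustrative; table values differ slightly due to rounding):

```python
def estimated_total_seconds(ttft_ms, tokens_per_sec, output_tokens=200):
    """Estimate streamed-completion wall time: time to first token plus
    generation time for the remaining output tokens."""
    return ttft_ms / 1000.0 + output_tokens / tokens_per_sec

# Roughly reproduces the table's 200-tok Time column, e.g. the GPT-4o row
# (380 ms TTFT, 92 tok/s) comes out near 2.5s, and the Llama 3.1 70B on
# Groq row (180 ms TTFT, 330 tok/s) near 0.8s.
gpt4o_time = estimated_total_seconds(380, 92)
groq_70b_time = estimated_total_seconds(180, 330)
```

This also shows why TTFT dominates for short outputs while tokens-per-second dominates for long ones: at 200 output tokens, o3's 2.5-second TTFT accounts for roughly a third of its total time.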

Frequently Asked Questions

Which AI API has the lowest latency?

Groq offers the lowest latency for open-source models, with Llama 3.1 70B achieving 180ms time to first token and 330 tokens per second. Among proprietary APIs, GPT-4o mini and Claude Haiku 3.5 have the lowest first-token latency at roughly 250-280ms. Gemini 2.5 Flash also achieves sub-250ms first-token times.

What is time to first token (TTFT)?

Time to first token (TTFT) is the latency between sending an API request and receiving the first token of the response. It includes network round-trip time, request queue time, and the model's initial processing time. TTFT is the most important latency metric for interactive applications like chatbots because it determines how quickly the user sees a response begin.

How do I reduce LLM API latency?

Reduce LLM API latency by:

1. Using streaming mode to display tokens as they arrive.
2. Choosing a faster model tier (e.g., Haiku over Opus).
3. Minimizing input token count by trimming context.
4. Using a region closest to your server.
5. Implementing prompt caching for repeated prefixes.
6. Using batch mode for non-interactive workloads.
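As a simple illustration of the caching idea, a client can skip the API call entirely for repeated identical prompts. Note this is only a client-side sketch: provider-side prompt caching reuses attention state for shared prompt prefixes, which this does not replicate, and `fake_api` is a hypothetical stand-in for a real API call:

```python
import hashlib

class ResponseCache:
    """Minimal client-side cache keyed on (model, prompt)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model, prompt, call_api):
        # Hash the model and prompt together to form a stable cache key.
        key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_api(model, prompt)
        self._store[key] = result
        return result

cache = ResponseCache()
fake_api = lambda model, prompt: f"echo:{prompt}"  # hypothetical stand-in
cache.get_or_call("gpt-4o", "hello", fake_api)     # miss: calls the API
cache.get_or_call("gpt-4o", "hello", fake_api)     # hit: returns instantly
```

A cache hit turns a multi-second API round trip into a local lookup, but is only safe for deterministic, repeatable requests; for everything else, rely on the provider's own prompt caching.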

Is Groq really faster than OpenAI and Anthropic?

Yes, for supported models. Groq uses custom LPU (Language Processing Unit) hardware optimized for inference, achieving 300-500+ tokens per second for Llama models compared to 50-100 tokens per second on typical GPU-based inference. However, Groq only runs open-source models and has lower rate limits.

Does API latency vary by region?

Yes, significantly. Most LLM API providers host primarily in US data centers. Requests from Europe or Asia add 100-300ms of network latency. Anthropic and OpenAI offer regional endpoints in Europe. Google Cloud offers Gemini from multiple regions. For global applications, consider using a CDN or edge proxy.