Open Source vs API: Break-Even Analysis for Common LLM Workloads
The decision between self-hosting an open-weight LLM and using a managed API comes down to volume, quality requirements, and operational capacity. This analysis provides concrete break-even calculations using real GPU rental prices from Lambda Labs, RunPod, and AWS as of April 2026.
TL;DR
- Below $2,000/month in API costs: Use APIs. Self-hosting overhead erases any savings.
- $2,000-$5,000/month: Gray zone. Self-hosting may save money if you have ML infrastructure expertise.
- Above $5,000/month: Self-hosting saves 40-70% on compute costs. But add $500-$2,000/month for engineering overhead.
- Data privacy requirement: Self-host regardless of volume. No data leaves your infrastructure.
GPU Infrastructure Costs (April 2026)
Self-hosting an LLM requires renting (or buying) GPU compute. Here are current monthly costs for the most common GPU configurations used for LLM inference:
| GPU | VRAM | Lambda Labs | RunPod | AWS | Best For |
|---|---|---|---|---|---|
| H100 80GB | 80 GB | $2.79/hr ($2,009/mo) | $2.49/hr ($1,793/mo) | $3.67/hr ($2,642/mo) | 70B-400B models |
| A100 80GB | 80 GB | $2.21/hr ($1,591/mo) | $1.89/hr ($1,361/mo) | $3.07/hr ($2,210/mo) | 13B-70B models |
| A10G 24GB | 24 GB | $0.60/hr ($432/mo) | $0.50/hr ($360/mo) | $1.01/hr ($727/mo) | 7B-13B models |
| L40S 48GB | 48 GB | $1.10/hr ($792/mo) | $0.99/hr ($713/mo) | $1.58/hr ($1,138/mo) | 13B-34B models |
| 2x H100 80GB | 160 GB | $5.58/hr ($4,018/mo) | $4.98/hr ($3,586/mo) | $7.34/hr ($5,285/mo) | 400B+ models, high throughput |
| 4x A100 80GB | 320 GB | $8.84/hr ($6,365/mo) | $7.56/hr ($5,443/mo) | $12.28/hr ($8,842/mo) | 405B models, max throughput |
Monthly costs assume 24/7 operation (720 hours). Prices from official pricing pages, April 2026. Spot/interruptible pricing can reduce costs by 30-60% but is unsuitable for production inference.
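The monthly figures in the table can be reproduced directly from the hourly rates; a minimal sketch using the 720-hour month and the 30-60% spot discount range quoted above:

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, matching the 24/7 assumption above

def monthly_cost(hourly_rate: float, spot_discount: float = 0.0) -> float:
    """Monthly GPU rental cost; spot_discount is a fraction (e.g. 0.4 = 40% off)."""
    return hourly_rate * HOURS_PER_MONTH * (1 - spot_discount)

# RunPod on-demand rates from the table above
print(round(monthly_cost(2.49)))       # H100 80GB -> 1793
print(round(monthly_cost(0.50)))       # A10G 24GB -> 360
print(round(monthly_cost(2.49, 0.4)))  # H100 at a 40% spot discount -> 1076
```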
Model Hardware Requirements
Each model has specific GPU memory requirements depending on precision (FP16, INT8, INT4). Here is what you need to run the most popular open-weight models:
| Model | Parameters | FP16 VRAM | INT8 VRAM | Min GPU Config | Throughput (vLLM) |
|---|---|---|---|---|---|
| Llama 4 Maverick | 400B MoE (17B active) | ~160 GB | ~80 GB | 2x H100 80GB (FP16) or 1x H100 (INT8) | ~65 tok/s |
| Llama 4 Scout | 109B MoE (17B active) | ~55 GB | ~30 GB | 1x H100 80GB (FP16) or 1x L40S (INT8) | ~95 tok/s |
| Llama 3.3 70B | 70B dense | ~140 GB | ~70 GB | 2x A100 80GB (FP16) or 1x H100 (INT8) | ~55 tok/s |
| Llama 3.1 405B | 405B dense | ~810 GB | ~405 GB | 4x A100 80GB (INT4) or 8x H100 | ~18 tok/s |
| Llama 3.1 8B | 8B dense | ~16 GB | ~8 GB | 1x A10G 24GB | ~180 tok/s |
| Mistral Small 3.1 | 24B dense | ~48 GB | ~24 GB | 1x L40S 48GB (FP16) or 1x A10G (INT8) | ~120 tok/s |
| DeepSeek V3 | 671B MoE (37B active) | ~320 GB | ~160 GB | 2x H100 80GB (INT8) or 4x A100 | ~40 tok/s |
| Qwen 2.5 72B | 72B dense | ~144 GB | ~72 GB | 2x A100 80GB (FP16) or 1x H100 (INT8) | ~50 tok/s |
Throughput measured with vLLM at batch size 1 and 2K context. Production throughput with batching can be 2-5x higher. MoE models activate only a fraction of their parameters per token, so their effective VRAM footprint is lower than the total parameter count suggests.
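For the dense-model rows, the FP16 and INT8 columns follow a weights-only rule of thumb: parameter count times bytes per parameter. A minimal sketch; it excludes KV cache, activations, and runtime overhead (often another 10-20%), and does not apply directly to the MoE rows:

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only VRAM estimate in GB: parameters x bytes per parameter.
    Excludes KV cache, activations, and framework overhead."""
    return params_billion * bits_per_param / 8

print(weight_vram_gb(70, 16))  # Llama 3.3 70B in FP16 -> 140.0 GB
print(weight_vram_gb(70, 8))   # Llama 3.3 70B in INT8 -> 70.0 GB
print(weight_vram_gb(8, 16))   # Llama 3.1 8B in FP16 -> 16.0 GB
```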
Break-Even Scenarios
Scenario 1: Llama 4 Maverick Self-Hosted vs GPT-4o API
Break-even point: ~724M tokens/month ($3,986 / $5.50 per 1M tokens). The $3,986 fixed cost is 2x H100 on RunPod ($3,586/mo) plus roughly $400/mo for storage, networking, and monitoring.
At 724M tokens/month, GPT-4o API costs match the fixed self-hosting cost. Above that volume, every additional token is effectively free on self-hosted infrastructure (until you hit throughput limits at around 2-3B tokens/month per 2x H100 with vLLM batching).
| Monthly Volume | GPT-4o API | Self-Host Llama 4 Maverick | Savings |
|---|---|---|---|
| 100M tokens | $550 | $3,986 | -$3,436 (API cheaper) |
| 500M tokens | $2,750 | $3,986 | -$1,236 (API cheaper) |
| 724M tokens | $3,982 | $3,986 | ~Break even |
| 1B tokens | $5,500 | $3,986 | +$1,514 (self-host cheaper) |
| 2B tokens | $11,000 | $3,986 | +$7,014 (self-host 64% cheaper) |
| 5B tokens | $27,500 | $3,986 | +$23,514 (self-host 86% cheaper) |
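The numbers in this table reduce to two one-line formulas; a sketch using the Scenario 1 inputs:

```python
def break_even_tokens_m(fixed_monthly_cost: float, api_price_per_m: float) -> float:
    """Monthly token volume (millions) at which self-hosting matches API spend."""
    return fixed_monthly_cost / api_price_per_m

def savings(tokens_m: float, api_price_per_m: float, fixed_monthly_cost: float) -> float:
    """Monthly savings from self-hosting at a given volume; positive = self-host cheaper."""
    return tokens_m * api_price_per_m - fixed_monthly_cost

# Scenario 1: $3,986/mo fixed cost vs GPT-4o at a blended $5.50 per 1M tokens
print(round(break_even_tokens_m(3986, 5.50)))  # -> 725 (the text rounds to ~724M)
print(round(savings(2000, 5.50, 3986)))        # at 2B tokens -> 7014
```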
Scenario 2: Llama 3.1 8B Self-Hosted vs Claude Haiku API
Break-even point: ~221M tokens/month ($460 / $2.08 per 1M tokens). The $460 fixed cost is 1x A10G on RunPod ($360/mo) plus roughly $100/mo in overhead.
Claude Haiku 3.5 is significantly more capable than Llama 3.1 8B (MMLU 83.5% vs 68.4%). This comparison only makes sense if the 8B model is sufficient for your task. For simple classification, extraction, or routing tasks, the 8B model often works fine.
| Monthly Volume | Claude Haiku 3.5 API | Self-Host Llama 3.1 8B | Savings |
|---|---|---|---|
| 50M tokens | $104 | $460 | -$356 (API cheaper) |
| 221M tokens | $460 | $460 | ~Break even |
| 500M tokens | $1,040 | $460 | +$580 (self-host 56% cheaper) |
| 1B tokens | $2,080 | $460 | +$1,620 (self-host 78% cheaper) |
| 5B tokens | $10,400 | $460 | +$9,940 (self-host 96% cheaper) |
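The blended per-1M prices used throughout these scenarios combine input and output rates at a 60% input / 40% output mix. A minimal sketch; the per-direction rates shown ($0.80/$4.00 for Claude Haiku 3.5, $2.50/$10.00 for GPT-4o) are the published rates these blended figures imply, not values stated in the tables above:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_frac: float = 0.6) -> float:
    """Blended $/1M tokens for a given input/output token mix."""
    return input_frac * input_per_m + (1 - input_frac) * output_per_m

# Assumed per-1M rates, not stated in the tables above
print(round(blended_price(0.80, 4.00), 2))   # Claude Haiku 3.5 -> 2.08
print(round(blended_price(2.50, 10.00), 2))  # GPT-4o -> 5.5
```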
Scenario 3: DeepSeek V3 Self-Hosted vs DeepSeek V3 API
Break-even point: ~6.8B tokens/month ($4,086 / $0.60 per 1M tokens). The $4,086 fixed cost is 2x H100 on RunPod ($3,586/mo) plus roughly $500/mo in overhead.
Because DeepSeek's API pricing is already extremely low, self-hosting DeepSeek V3 only makes sense at very high volumes (6.8B+ tokens/month) or when data privacy requires on-premise deployment. At 2x H100 throughput limits of around 2-3B tokens/month, you would need to scale to 4+ GPUs to reach the break-even volume, pushing the calculation even further into API-favored territory.
Verdict: For DeepSeek V3, use the API unless you have strict data residency requirements.
Scenario 4: Mistral Small 3.1 Self-Hosted vs Gemini 2.5 Flash API
Break-even point: ~2.6B tokens/month ($863 / $0.33 per 1M tokens). The $863 fixed cost is 1x L40S on RunPod ($713/mo) plus roughly $150/mo in overhead.
Gemini 2.5 Flash is so aggressively priced that self-hosting a comparable open model is only cheaper at very high volume. Additionally, Flash scores higher on MMLU (86.5% vs 80.6%) and offers a 1M-token context window. In most cases, Gemini Flash API is the better choice.
Visual Break-Even Comparison
Monthly cost at 1 billion tokens/month (60% input / 40% output):
[Chart: API vs Self-Hosting Cost at 1B Tokens/Month]
Hidden Costs of Self-Hosting
The GPU rental is only part of the equation. Factor in these additional costs when planning self-hosted infrastructure:
| Cost Category | Typical Range (Monthly) | Notes |
|---|---|---|
| GPU compute | $360 - $8,842 | Fixed cost, 24/7 operation |
| Storage (model weights + logs) | $50 - $200 | NVMe for fast model loading |
| Networking / bandwidth | $50 - $300 | Egress costs for high-traffic deployments |
| Monitoring / alerting | $50 - $200 | Grafana, Prometheus, PagerDuty |
| Engineering time (10-20 hrs/mo) | $1,000 - $3,000 | Updates, debugging, scaling, on-call |
| Redundancy (2nd GPU for failover) | $360 - $3,586 | Required for production SLAs |
| Load balancing | $20 - $100 | nginx, Traefik, or cloud LB |
Engineering time is the most commonly underestimated cost. At $150/hour for a senior ML engineer, 15 hours/month of maintenance equals $2,250, which shifts the break-even point significantly higher.
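Folding engineering time into the fixed cost shows how far the break-even moves; a sketch reusing the Scenario 1 inputs with 15 hours/month at $150/hour:

```python
def break_even_with_overhead(gpu_cost: float, other_fixed: float,
                             eng_hours: float, eng_rate: float,
                             api_price_per_m: float) -> float:
    """Break-even volume (millions of tokens/mo) including engineering time."""
    total_fixed = gpu_cost + other_fixed + eng_hours * eng_rate
    return total_fixed / api_price_per_m

# Scenario 1 revisited: 2x H100 on RunPod, $400/mo other fixed costs,
# 15 hrs/mo of a $150/hr engineer, GPT-4o blended $5.50 per 1M tokens
print(round(break_even_with_overhead(3586, 400, 15, 150, 5.50)))  # -> 1134
```

Counting engineering time, the Scenario 1 break-even moves from roughly 725M to roughly 1.13B tokens/month.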
When to Self-Host: Decision Framework
- Volume > break-even point: If your monthly token volume consistently exceeds the break-even threshold, self-hosting saves money at scale.
- Data privacy is required: Healthcare (HIPAA), finance, government, or any workload where data cannot leave your infrastructure. No volume calculation needed.
- Latency sensitivity: Self-hosting eliminates network round-trips. For sub-100ms first-token latency, local inference on an H100 is typically the fastest option.
- Fine-tuned models: If you need to serve a custom fine-tuned model that no API provider hosts, self-hosting is your only option.
- Cost predictability: Fixed GPU rental costs are predictable. API costs scale linearly and can spike unexpectedly with traffic.
When to Use APIs Instead
- Volume below break-even: At low-to-moderate volume, APIs are cheaper and require zero infrastructure management.
- Need frontier quality: Claude Opus 4.6, GPT-4.5, and Gemini 2.5 Pro have no open-weight equivalents. If you need top-tier quality, API is the only option.
- No ML infrastructure team: Self-hosting requires GPU management, vLLM/TGI deployment, monitoring, and on-call engineering. If your team lacks this expertise, the hidden costs will dwarf any savings.
- Rapid scaling needs: APIs scale instantly. Self-hosted infrastructure requires provisioning new GPUs, which can take hours to days.
- Multi-model strategy: If you use different models for different tasks, managing multiple self-hosted deployments compounds complexity. APIs let you switch models with a single parameter change.
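The two lists above can be condensed into a rough triage sketch; the dollar thresholds are the ones from the TL;DR, and the function and argument names are illustrative:

```python
def deployment_choice(monthly_api_spend: float, needs_data_privacy: bool,
                      has_ml_ops_team: bool) -> str:
    """Rough triage following the decision framework above (spend in USD/month)."""
    if needs_data_privacy:
        return "self-host"  # no volume calculation needed
    if monthly_api_spend < 2000:
        return "api"        # self-hosting overhead erases any savings
    if monthly_api_spend <= 5000:
        # gray zone: only worthwhile with existing ML infrastructure expertise
        return "self-host" if has_ml_ops_team else "api"
    return "self-host"      # 40-70% compute savings at scale

print(deployment_choice(800, False, False))    # -> api
print(deployment_choice(3500, False, True))    # -> self-host
print(deployment_choice(12000, False, False))  # -> self-host
```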
Recommended Stack for Self-Hosting
If you decide to self-host, here is a proven production stack as of 2026:
- Inference engine: vLLM (best throughput with continuous batching and PagedAttention)
- Model format: GPTQ or AWQ quantization for INT4/INT8 (via AutoGPTQ or AutoAWQ)
- GPU provider: RunPod for cost, Lambda Labs for reliability, AWS for enterprise compliance
- Orchestration: Docker + Kubernetes (or RunPod Serverless for auto-scaling)
- Monitoring: Prometheus + Grafana for latency, throughput, GPU utilization
- Load balancing: nginx or Traefik with health checks
Frequently Asked Questions
When should I self-host an LLM instead of using an API?
Self-hosting becomes cheaper than API access when your monthly API spend exceeds $2,000-$5,000 for comparable open-source models. For Llama 4 Maverick on 2x H100 GPUs, the break-even point against GPT-4o API is approximately 724M tokens/month. Below that volume, API access is cheaper when you factor in infrastructure overhead.
How much does it cost to self-host Llama 4?
Llama 4 Maverick (400B MoE) requires 2x H100 80GB GPUs. Monthly GPU rental costs: Lambda Labs $4,018/mo, RunPod $3,586/mo, AWS $5,285/mo. Add $400-$500/mo for storage, networking, and monitoring. Total: $3,986-$5,785/month fixed cost regardless of usage volume.
How many GPUs do I need for Llama 4?
Llama 4 Scout (109B MoE) fits on 1x H100 80GB in FP16. Llama 4 Maverick (400B MoE) requires 2x H100 80GB in FP16, or can fit on 1x H100 with INT8 quantization at some quality cost. For maximum throughput, 4x H100 allows batched inference serving hundreds of concurrent requests.
Is self-hosting LLMs worth it for a startup?
For most startups, no. API access costs less below $2,000-5,000/month in token spend, requires zero ML infrastructure expertise, and scales instantly. Self-hosting only makes sense for startups with: (1) very high volume (>500M tokens/month), (2) strict data residency requirements, (3) latency-sensitive applications, or (4) a team with existing ML operations expertise.
What GPU should I rent for LLM inference?
H100 80GB ($2.49/hr on RunPod) is the best price/performance for large models (70B+). A100 80GB ($1.89/hr on RunPod) works for models up to 70B in FP16. For small models (7-13B), an A10G 24GB ($0.50/hr) is sufficient and much cheaper. Always benchmark your specific model before committing.