Open Source vs API: Break-Even Analysis for Common LLM Workloads
The decision between self-hosting an open-weight LLM and using a managed API comes down to volume, quality requirements, and operational capacity. This analysis provides concrete break-even calculations using real GPU rental prices from Lambda Labs, RunPod, and AWS as of April 2026.
TL;DR
- Below $2,000/month in API costs: Use APIs. Self-hosting overhead erases any savings.
- $2,000-$5,000/month: Gray zone. Self-hosting may save money if you have ML infrastructure expertise.
- Above $5,000/month: Self-hosting saves 40-70% on compute costs. But add $500-$2,000/month for engineering overhead.
- Data privacy requirement: Self-host regardless of volume. No data leaves your infrastructure.
GPU Infrastructure Costs (April 2026)
Self-hosting an LLM requires renting (or buying) GPU compute. Here are current monthly costs for the most common GPU configurations used for LLM inference:
| GPU | VRAM | Lambda Labs | RunPod | AWS | Best For |
|---|---|---|---|---|---|
| H100 80GB | 80 GB | $2.79/hr ($2,009/mo) | $2.49/hr ($1,793/mo) | $3.67/hr ($2,642/mo) | 70B-400B models |
| A100 80GB | 80 GB | $2.21/hr ($1,591/mo) | $1.89/hr ($1,361/mo) | $3.07/hr ($2,210/mo) | 13B-70B models |
| A10G 24GB | 24 GB | $0.60/hr ($432/mo) | $0.50/hr ($360/mo) | $1.01/hr ($727/mo) | 7B-13B models |
| L40S 48GB | 48 GB | $1.10/hr ($792/mo) | $0.99/hr ($713/mo) | $1.58/hr ($1,138/mo) | 13B-34B models |
| 2x H100 80GB | 160 GB | $5.58/hr ($4,018/mo) | $4.98/hr ($3,586/mo) | $7.34/hr ($5,285/mo) | 400B+ models, high throughput |
| 4x A100 80GB | 320 GB | $8.84/hr ($6,365/mo) | $7.56/hr ($5,443/mo) | $12.28/hr ($8,842/mo) | 405B models, max throughput |
Monthly costs assume 24/7 operation (720 hours). Prices from official pricing pages, April 2026. Spot/interruptible pricing can reduce costs by 30-60% but is unsuitable for production inference.
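The monthly figures in the table can be reproduced directly from the hourly rates; a minimal sketch using the 720-hour month and the 30-60% spot discount range quoted above:

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, matching the 24/7 assumption above

def monthly_cost(hourly_rate: float, spot_discount: float = 0.0) -> float:
    """Monthly GPU rental cost; spot_discount is a fraction (e.g. 0.4 = 40% off)."""
    return hourly_rate * HOURS_PER_MONTH * (1 - spot_discount)

# RunPod on-demand rates from the table above
print(round(monthly_cost(2.49)))       # H100 80GB -> 1793
print(round(monthly_cost(0.50)))       # A10G 24GB -> 360
print(round(monthly_cost(2.49, 0.4)))  # H100 at a 40% spot discount -> 1076
```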
Model Hardware Requirements
Each model has specific GPU memory requirements depending on precision (FP16, INT8, INT4). Here is what you need to run the most popular open-weight models:
| Model | Parameters | FP16 VRAM | INT8 VRAM | Min GPU Config | Throughput (vLLM) |
|---|---|---|---|---|---|
| Llama 4 Maverick | 400B MoE (17B active) | ~160 GB | ~80 GB | 2x H100 80GB (FP16) or 1x H100 (INT8) | ~65 tok/s |
| Llama 4 Scout | 109B MoE (17B active) | ~55 GB | ~30 GB | 1x H100 80GB (FP16) or 1x L40S (INT8) | ~95 tok/s |
| Llama 3.3 70B | 70B dense | ~140 GB | ~70 GB | 2x A100 80GB (FP16) or 1x H100 (INT8) | ~55 tok/s |
| Llama 3.1 405B | 405B dense | ~810 GB | ~405 GB | 4x A100 80GB (INT4) or 8x H100 | ~18 tok/s |
| Llama 3.1 8B | 8B dense | ~16 GB | ~8 GB | 1x A10G 24GB | ~180 tok/s |
| Mistral Small 3.1 | 24B dense | ~48 GB | ~24 GB | 1x L40S 48GB (FP16) or 1x A10G (INT8) | ~120 tok/s |
| DeepSeek V3 | 671B MoE (37B active) | ~320 GB | ~160 GB | 2x H100 80GB (INT8) or 4x A100 | ~40 tok/s |
| Qwen 2.5 72B | 72B dense | ~144 GB | ~72 GB | 2x A100 80GB (FP16) or 1x H100 (INT8) | ~50 tok/s |
Throughput measured with vLLM at batch size 1 and 2K context. Production throughput with batching can be 2-5x higher. MoE models activate only a fraction of their parameters per token, so their effective VRAM footprint is lower than the total parameter count suggests.
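For the dense-model rows, the FP16 and INT8 columns follow a weights-only rule of thumb: parameter count times bytes per parameter. A minimal sketch; it excludes KV cache, activations, and runtime overhead (often another 10-20%), and does not apply directly to the MoE rows:

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only VRAM estimate in GB: parameters x bytes per parameter.
    Excludes KV cache, activations, and framework overhead."""
    return params_billion * bits_per_param / 8

print(weight_vram_gb(70, 16))  # Llama 3.3 70B in FP16 -> 140.0 GB
print(weight_vram_gb(70, 8))   # Llama 3.3 70B in INT8 -> 70.0 GB
print(weight_vram_gb(8, 16))   # Llama 3.1 8B in FP16 -> 16.0 GB
```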
Break-Even Scenarios
Scenario 1: Llama 4 Maverick Self-Hosted vs GPT-4o API
Break-even point: ~724M tokens/month ($3,986 / $5.50 per 1M tokens). The $3,986 fixed cost is 2x H100 on RunPod ($3,586/mo) plus roughly $400/mo for storage, networking, and monitoring.
At 724M tokens/month, GPT-4o API costs match the fixed self-hosting cost. Above that volume, every additional token is effectively free on self-hosted infrastructure (until you hit throughput limits at around 2-3B tokens/month per 2x H100 with vLLM batching).
| Monthly Volume | GPT-4o API | Self-Host Llama 4 Maverick | Savings |
|---|---|---|---|
| 100M tokens | $550 | $3,986 | -$3,436 (API cheaper) |
| 500M tokens | $2,750 | $3,986 | -$1,236 (API cheaper) |
| 724M tokens | $3,982 | $3,986 | ~Break even |
| 1B tokens | $5,500 | $3,986 | +$1,514 (self-host cheaper) |
| 2B tokens | $11,000 | $3,986 | +$7,014 (self-host 64% cheaper) |
| 5B tokens | $27,500 | $3,986 | +$23,514 (self-host 86% cheaper) |
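The numbers in this table reduce to two one-line formulas; a sketch using the Scenario 1 inputs:

```python
def break_even_tokens_m(fixed_monthly_cost: float, api_price_per_m: float) -> float:
    """Monthly token volume (millions) at which self-hosting matches API spend."""
    return fixed_monthly_cost / api_price_per_m

def savings(tokens_m: float, api_price_per_m: float, fixed_monthly_cost: float) -> float:
    """Monthly savings from self-hosting at a given volume; positive = self-host cheaper."""
    return tokens_m * api_price_per_m - fixed_monthly_cost

# Scenario 1: $3,986/mo fixed cost vs GPT-4o at a blended $5.50 per 1M tokens
print(round(break_even_tokens_m(3986, 5.50)))  # -> 725 (the text rounds to ~724M)
print(round(savings(2000, 5.50, 3986)))        # at 2B tokens -> 7014
```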
Scenario 2: Llama 3.1 8B Self-Hosted vs Claude Haiku API
Break-even point: ~221M tokens/month ($460 / $2.08 per 1M tokens). The $460 fixed cost is 1x A10G on RunPod ($360/mo) plus roughly $100/mo in overhead.
Claude Haiku 3.5 is significantly more capable than Llama 3.1 8B (MMLU 83.5% vs 68.4%). This comparison only makes sense if the 8B model is sufficient for your task. For simple classification, extraction, or routing tasks, the 8B model often works fine.
| Monthly Volume | Claude Haiku 3.5 API | Self-Host Llama 3.1 8B | Savings |
|---|---|---|---|
| 50M tokens | $104 | $460 | -$356 (API cheaper) |
| 221M tokens | $460 | $460 | ~Break even |
| 500M tokens | $1,040 | $460 | +$580 (self-host 56% cheaper) |
| 1B tokens | $2,080 | $460 | +$1,620 (self-host 78% cheaper) |
| 5B tokens | $10,400 | $460 | +$9,940 (self-host 96% cheaper) |
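The blended per-1M prices used throughout these scenarios combine input and output rates at a 60% input / 40% output mix. A minimal sketch; the per-direction rates shown ($0.80/$4.00 for Claude Haiku 3.5, $2.50/$10.00 for GPT-4o) are the published rates these blended figures imply, not values stated in the tables above:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_frac: float = 0.6) -> float:
    """Blended $/1M tokens for a given input/output token mix."""
    return input_frac * input_per_m + (1 - input_frac) * output_per_m

# Assumed per-1M rates, not stated in the tables above
print(round(blended_price(0.80, 4.00), 2))   # Claude Haiku 3.5 -> 2.08
print(round(blended_price(2.50, 10.00), 2))  # GPT-4o -> 5.5
```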
Scenario 3: DeepSeek V3 Self-Hosted vs DeepSeek V3 API
Break-even point: ~6.8B tokens/month ($4,086 / $0.60 per 1M tokens). The $4,086 fixed cost is 2x H100 on RunPod ($3,586/mo) plus roughly $500/mo in overhead.
Because DeepSeek's API pricing is already extremely low, self-hosting DeepSeek V3 only makes sense at very high volumes (6.8B+ tokens/month) or when data privacy requires on-premise deployment. At 2x H100 throughput limits of around 2-3B tokens/month, you would need to scale to 4+ GPUs to reach the break-even volume, pushing the calculation even further into API-favored territory.
Verdict: For DeepSeek V3, use the API unless you have strict data residency requirements.
Scenario 4: Mistral Small 3.1 Self-Hosted vs Gemini 2.5 Flash API
Break-even point: ~2.6B tokens/month ($863 / $0.33 per 1M tokens). The $863 fixed cost is 1x L40S on RunPod ($713/mo) plus roughly $150/mo in overhead.
Gemini 2.5 Flash is so aggressively priced that self-hosting a comparable open model is only cheaper at very high volume. Additionally, Flash scores higher on MMLU (86.5% vs 80.6%) and offers a 1M-token context window. In most cases, Gemini Flash API is the better choice.
Visual Break-Even Comparison
Monthly cost at 1 billion tokens/month (60% input / 40% output):
[Chart: API vs Self-Hosting Cost at 1B Tokens/Month]
Hidden Costs of Self-Hosting
The GPU rental is only part of the equation. Factor in these additional costs when planning self-hosted infrastructure:
| Cost Category | Typical Range (Monthly) | Notes |
|---|---|---|
| GPU compute | $360 - $8,842 | Fixed cost, 24/7 operation |
| Storage (model weights + logs) | $50 - $200 | NVMe for fast model loading |
| Networking / bandwidth | $50 - $300 | Egress costs for high-traffic deployments |
| Monitoring / alerting | $50 - $200 | Grafana, Prometheus, PagerDuty |
| Engineering time (10-20 hrs/mo) | $1,000 - $3,000 | Updates, debugging, scaling, on-call |
| Redundancy (2nd GPU for failover) | $360 - $3,586 | Required for production SLAs |
| Load balancing | $20 - $100 | nginx, Traefik, or cloud LB |
Engineering time is the most commonly underestimated cost. At $150/hour for a senior ML engineer, 15 hours/month of maintenance equals $2,250, which shifts the break-even point significantly higher.
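Folding engineering time into the fixed cost shows how far the break-even moves; a sketch reusing the Scenario 1 inputs with 15 hours/month at $150/hour:

```python
def break_even_with_overhead(gpu_cost: float, other_fixed: float,
                             eng_hours: float, eng_rate: float,
                             api_price_per_m: float) -> float:
    """Break-even volume (millions of tokens/mo) including engineering time."""
    total_fixed = gpu_cost + other_fixed + eng_hours * eng_rate
    return total_fixed / api_price_per_m

# Scenario 1 revisited: 2x H100 on RunPod, $400/mo other fixed costs,
# 15 hrs/mo of a $150/hr engineer, GPT-4o blended $5.50 per 1M tokens
print(round(break_even_with_overhead(3586, 400, 15, 150, 5.50)))  # -> 1134
```

Counting engineering time, the Scenario 1 break-even moves from roughly 725M to roughly 1.13B tokens/month.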
When to Self-Host: Decision Framework
- Volume > break-even point: If your monthly token volume consistently exceeds the break-even threshold, self-hosting saves money at scale.
- Data privacy is required: Healthcare (HIPAA), finance, government, or any workload where data cannot leave your infrastructure. No volume calculation needed.
- Latency sensitivity: Self-hosting eliminates network round-trips. For sub-100ms first-token latency, local inference on an H100 is typically the fastest option.
- Fine-tuned models: If you need to serve a custom fine-tuned model that no API provider hosts, self-hosting is your only option.
- Cost predictability: Fixed GPU rental costs are predictable. API costs scale linearly and can spike unexpectedly with traffic.
When to Use APIs Instead
- Volume below break-even: At low-to-moderate volume, APIs are cheaper and require zero infrastructure management.
- Need frontier quality: Claude Opus 4.6, GPT-4.5, and Gemini 2.5 Pro have no open-weight equivalents. If you need top-tier quality, API is the only option.
- No ML infrastructure team: Self-hosting requires GPU management, vLLM/TGI deployment, monitoring, and on-call engineering. If your team lacks this expertise, the hidden costs will dwarf any savings.
- Rapid scaling needs: APIs scale instantly. Self-hosted infrastructure requires provisioning new GPUs, which can take hours to days.
- Multi-model strategy: If you use different models for different tasks, managing multiple self-hosted deployments compounds complexity. APIs let you switch models with a single parameter change.
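The two lists above can be condensed into a rough triage sketch; the dollar thresholds are the ones from the TL;DR, and the function and argument names are illustrative:

```python
def deployment_choice(monthly_api_spend: float, needs_data_privacy: bool,
                      has_ml_ops_team: bool) -> str:
    """Rough triage following the decision framework above (spend in USD/month)."""
    if needs_data_privacy:
        return "self-host"  # no volume calculation needed
    if monthly_api_spend < 2000:
        return "api"        # self-hosting overhead erases any savings
    if monthly_api_spend <= 5000:
        # gray zone: only worthwhile with existing ML infrastructure expertise
        return "self-host" if has_ml_ops_team else "api"
    return "self-host"      # 40-70% compute savings at scale

print(deployment_choice(800, False, False))    # -> api
print(deployment_choice(3500, False, True))    # -> self-host
print(deployment_choice(12000, False, False))  # -> self-host
```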
Recommended Stack for Self-Hosting
If you decide to self-host, here is a proven production stack as of 2026:
- Inference engine: vLLM (best throughput with continuous batching and PagedAttention)
- Model format: GPTQ or AWQ quantization for INT4/INT8 (via AutoGPTQ or AutoAWQ)
- GPU provider: RunPod for cost, Lambda Labs for reliability, AWS for enterprise compliance
- Orchestration: Docker + Kubernetes (or RunPod Serverless for auto-scaling)
- Monitoring: Prometheus + Grafana for latency, throughput, GPU utilization
- Load balancing: nginx or Traefik with health checks
Frequently Asked Questions
When should I self-host an LLM instead of using an API?
Self-hosting becomes cheaper than API access when your monthly API spend exceeds $2,000-$5,000 for comparable open-source models. For Llama 4 Maverick on 2x H100 GPUs, the break-even point against GPT-4o API is approximately 724M tokens/month. Below that volume, API access is cheaper when you factor in infrastructure overhead.
How much does it cost to self-host Llama 4?
Llama 4 Maverick (400B MoE) requires 2x H100 80GB GPUs. Monthly GPU rental costs: Lambda Labs $4,018/mo, RunPod $3,586/mo, AWS $5,285/mo. Add $400-$500/mo for storage, networking, and monitoring. Total: $3,986-$5,785/month fixed cost regardless of usage volume.
How many GPUs do I need for Llama 4?
Llama 4 Scout (109B MoE) fits on 1x H100 80GB in FP16. Llama 4 Maverick (400B MoE) requires 2x H100 80GB in FP16, or can fit on 1x H100 with INT8 quantization at some quality cost. For maximum throughput, 4x H100 allows batched inference serving hundreds of concurrent requests.
Is self-hosting LLMs worth it for a startup?
For most startups, no. API access costs less below $2,000-5,000/month in token spend, requires zero ML infrastructure expertise, and scales instantly. Self-hosting only makes sense for startups with: (1) very high volume (>500M tokens/month), (2) strict data residency requirements, (3) latency-sensitive applications, or (4) a team with existing ML operations expertise.
What GPU should I rent for LLM inference?
H100 80GB ($2.49/hr on RunPod) is the best price/performance for large models (70B+). A100 80GB ($1.89/hr on RunPod) works for models up to 70B in FP16. For small models (7-13B), an A10G 24GB ($0.50/hr) is sufficient and much cheaper. Always benchmark your specific model before committing.