Original Research

Open Source vs API: Break-Even Analysis for Common LLM Workloads

Published April 7, 2026 · Author: Michael Lip · 14 min read · Updated monthly

The decision between self-hosting an open-weight LLM and using a managed API comes down to volume, quality requirements, and operational capacity. This analysis provides concrete break-even calculations using real GPU rental prices from Lambda Labs, RunPod, and AWS as of April 2026.

TL;DR

At April 2026 prices, self-hosting beats API pricing only above a scenario-specific break-even volume: roughly 724M tokens/month for Llama 4 Maverick vs GPT-4o, ~221M tokens/month for Llama 3.1 8B vs Claude Haiku 3.5, and several billion tokens/month against aggressively priced APIs like DeepSeek V3 and Gemini 2.5 Flash. Below those volumes, or without an ML operations team to absorb the hidden costs, the API wins.

GPU Infrastructure Costs (April 2026)

Self-hosting an LLM requires renting (or buying) GPU compute. Here are current monthly costs for the most common GPU configurations used for LLM inference:

GPU | VRAM | Lambda Labs | RunPod | AWS | Best For
H100 80GB | 80 GB | $2.79/hr ($2,009/mo) | $2.49/hr ($1,793/mo) | $3.67/hr ($2,642/mo) | 70B-400B models
A100 80GB | 80 GB | $2.21/hr ($1,591/mo) | $1.89/hr ($1,361/mo) | $3.07/hr ($2,210/mo) | 13B-70B models
A10G 24GB | 24 GB | $0.60/hr ($432/mo) | $0.50/hr ($360/mo) | $1.01/hr ($727/mo) | 7B-13B models
L40S 48GB | 48 GB | $1.10/hr ($792/mo) | $0.99/hr ($713/mo) | $1.58/hr ($1,138/mo) | 13B-34B models
2x H100 80GB | 160 GB | $5.58/hr ($4,018/mo) | $4.98/hr ($3,586/mo) | $7.34/hr ($5,285/mo) | 400B+ models, high throughput
4x A100 80GB | 320 GB | $8.84/hr ($6,365/mo) | $7.56/hr ($5,443/mo) | $12.28/hr ($8,842/mo) | 405B models, max throughput

Monthly costs assume 24/7 operation (720 hours). Prices from official pricing pages, April 2026. Spot/interruptible pricing can reduce costs by 30-60% but is unsuitable for production inference.
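The monthly figures in the table are just the hourly rate times 720 hours; a minimal sketch of the conversion:

```python
# Hourly-to-monthly conversion used throughout the table above:
# monthly cost = hourly rate x 720 hours (24/7 operation).
def monthly_cost(hourly_rate: float, hours: int = 720) -> float:
    """Monthly GPU rental cost assuming continuous operation."""
    return hourly_rate * hours

# Spot-check against the table:
print(round(monthly_cost(2.49)))  # 1793 -- RunPod H100
print(round(monthly_cost(4.98)))  # 3586 -- RunPod 2x H100
```

The same function with a 30-60% discount factor approximates spot pricing, though interruptions make that unsuitable for production inference, as noted above.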

Model Hardware Requirements

Each model has specific GPU memory requirements depending on precision (FP16, INT8, INT4). Here is what you need to run the most popular open-weight models:

Model | Parameters | FP16 VRAM | INT8 VRAM | Min GPU Config | Throughput (vLLM)
Llama 4 Maverick | 400B MoE (17B active) | ~160 GB | ~80 GB | 2x H100 80GB (FP16) or 1x H100 (INT8) | ~65 tok/s
Llama 4 Scout | 109B MoE (17B active) | ~55 GB | ~30 GB | 1x H100 80GB (FP16) or 1x L40S (INT8) | ~95 tok/s
Llama 3.3 70B | 70B dense | ~140 GB | ~70 GB | 2x A100 80GB (FP16) or 1x H100 (INT8) | ~55 tok/s
Llama 3.1 405B | 405B dense | ~810 GB | ~405 GB | 4x A100 80GB (INT4) or 8x H100 | ~18 tok/s
Llama 3.1 8B | 8B dense | ~16 GB | ~8 GB | 1x A10G 24GB | ~180 tok/s
Mistral Small 3.1 | 24B dense | ~48 GB | ~24 GB | 1x L40S 48GB (FP16) or 1x A10G (INT8) | ~120 tok/s
DeepSeek V3 | 671B MoE (37B active) | ~320 GB | ~160 GB | 2x H100 80GB (INT8) or 4x A100 | ~40 tok/s
Qwen 2.5 72B | 72B dense | ~144 GB | ~72 GB | 2x A100 80GB (FP16) or 1x H100 (INT8) | ~50 tok/s

Throughput measured with vLLM on batch size 1 with 2K context. Production throughput with batching can be 2-5x higher. MoE models use less active VRAM per forward pass than their total parameter count suggests.
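The VRAM columns follow a simple bytes-per-parameter rule of thumb; this sketch ignores KV-cache and activation overhead, which add more on top:

```python
# Approximate weight memory: parameter count x bytes per parameter.
# FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB, excluding KV-cache and activations."""
    return params_billion * BYTES_PER_PARAM[precision]

print(weight_vram_gb(70, "fp16"))   # 140.0 -- Llama 3.3 70B row
print(weight_vram_gb(405, "int4"))  # 202.5 -- fits 4x A100 (320 GB)
```

For MoE models the rule applies to total parameters (all experts must sit in VRAM), which is why DeepSeek V3 at 671B needs ~320 GB in FP16 despite only 37B active parameters per forward pass.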

Break-Even Scenarios

Scenario 1: Llama 4 Maverick Self-Hosted vs GPT-4o API

Model: Llama 4 Maverick (400B MoE) vs GPT-4o
GPU: 2x H100 on RunPod = $3,586/mo
Overhead (monitoring, storage, networking): $400/mo
Total fixed: $3,986/mo
GPT-4o blended rate (60/40 split): $5.50/1M tokens

Break-even point: ~724M tokens/month ($3,986 / $5.50 per 1M tokens)

At 724M tokens/month, GPT-4o API costs match the fixed self-hosting cost. Above that volume, every additional token is effectively free on self-hosted infrastructure (until you hit throughput limits at around 2-3B tokens/month per 2x H100 with vLLM batching).
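The break-even arithmetic in every scenario below is one division; a minimal sketch using the Scenario 1 numbers:

```python
def break_even_millions(fixed_monthly: float, api_rate_per_m: float) -> float:
    """Monthly volume (millions of tokens) at which API spend
    equals the fixed self-hosting cost."""
    return fixed_monthly / api_rate_per_m

# Scenario 1: $3,986/mo fixed vs GPT-4o blended $5.50 per 1M tokens
print(break_even_millions(3986, 5.50))  # ~724.7, the ~724M figure above
```

Above that volume each extra token rides on capacity you have already paid for, which is why savings grow so quickly in the table that follows.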

Monthly Volume | GPT-4o API | Self-Host Llama 4 Maverick | Savings
100M tokens | $550 | $3,986 | -$3,436 (API cheaper)
500M tokens | $2,750 | $3,986 | -$1,236 (API cheaper)
724M tokens | $3,982 | $3,986 | ~Break even
1B tokens | $5,500 | $3,986 | +$1,514 (self-host cheaper)
2B tokens | $11,000 | $3,986 | +$7,014 (self-host 64% cheaper)
5B tokens | $27,500 | $3,986 | +$23,514 (self-host 86% cheaper)

Scenario 2: Llama 3.1 8B Self-Hosted vs Claude Haiku API

Model: Llama 3.1 8B vs Claude Haiku 3.5
GPU: 1x A10G on RunPod = $360/mo
Overhead: $100/mo
Total fixed: $460/mo
Haiku 3.5 blended rate (60/40 split): $2.08/1M tokens

Break-even point: ~221M tokens/month ($460 / $2.08 per 1M tokens)

Claude Haiku 3.5 is significantly more capable than Llama 3.1 8B (MMLU 83.5% vs 68.4%). This comparison only makes sense if the 8B model is sufficient for your task. For simple classification, extraction, or routing tasks, the 8B model often works fine.

Monthly Volume | Claude Haiku 3.5 API | Self-Host Llama 3.1 8B | Savings
50M tokens | $104 | $460 | -$356 (API cheaper)
221M tokens | $460 | $460 | ~Break even
500M tokens | $1,040 | $460 | +$580 (self-host 56% cheaper)
1B tokens | $2,080 | $460 | +$1,620 (self-host 78% cheaper)
5B tokens | $10,400 | $460 | +$9,940 (self-host 96% cheaper)
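Every row of a savings table like this one is volume times the blended rate minus the fixed cost; a quick sketch reproducing the Scenario 2 rows:

```python
# Reproduce the Scenario 2 savings column.
FIXED = 460.0   # $/mo: 1x A10G on RunPod + overhead
RATE = 2.08     # $ per 1M tokens: Haiku 3.5 blended (60/40 split)

def savings(volume_millions: float) -> float:
    """Positive = self-hosting is cheaper at this monthly volume."""
    return volume_millions * RATE - FIXED

for v in (50, 221, 500, 1000, 5000):
    print(f"{v}M tokens: {savings(v):+,.0f}")
# 50M -> -356 (API cheaper); 500M -> +580; 5,000M -> +9,940
```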

Scenario 3: DeepSeek V3 Self-Hosted vs DeepSeek V3 API

Model: DeepSeek V3 self-hosted vs DeepSeek V3 API
GPU: 2x H100 on RunPod (INT8) = $3,586/mo
Overhead: $500/mo
Total fixed: $4,086/mo
DeepSeek V3 API blended rate (60/40 split): $0.60/1M tokens

Break-even point: ~6.8B tokens/month ($4,086 / $0.60 per 1M tokens)

Because DeepSeek's API pricing is already extremely low, self-hosting DeepSeek V3 only makes sense at very high volumes (6.8B+ tokens/month) or when data privacy requires on-premise deployment. At 2x H100 throughput limits of around 2-3B tokens/month, you would need to scale to 4+ GPUs to reach the break-even volume, pushing the calculation even further into API-favored territory.

Verdict: For DeepSeek V3, use the API unless you have strict data residency requirements.

Scenario 4: Mistral Small 3.1 Self-Hosted vs Gemini 2.5 Flash API

Model: Mistral Small 3.1 (24B) vs Gemini 2.5 Flash
GPU: 1x L40S on RunPod = $713/mo
Overhead: $150/mo
Total fixed: $863/mo
Gemini 2.5 Flash blended rate (60/40 split): $0.33/1M tokens

Break-even point: ~2.6B tokens/month ($863 / $0.33 per 1M tokens)

Gemini 2.5 Flash is so aggressively priced that self-hosting a comparable open model is only cheaper at very high volume. Flash also scores higher on MMLU (86.5% vs 80.6%) and offers a 1M-token context window. In most cases, the Gemini Flash API is the better choice.

Visual Break-Even Comparison

Monthly cost at 1 billion tokens/month (60% input / 40% output):

[Bar chart: API cost vs self-hosting cost for each scenario at 1B tokens/month.]

Hidden Costs of Self-Hosting

The GPU rental is only part of the equation. Factor in these additional costs when planning self-hosted infrastructure:

Cost Category | Typical Range (Monthly) | Notes
GPU compute | $360 - $8,842 | Fixed cost, 24/7 operation
Storage (model weights + logs) | $50 - $200 | NVMe for fast model loading
Networking / bandwidth | $50 - $300 | Egress costs for high-traffic deployments
Monitoring / alerting | $50 - $200 | Grafana, Prometheus, PagerDuty
Engineering time (10-20 hrs/mo) | $1,000 - $3,000 | Updates, debugging, scaling, on-call
Redundancy (2nd GPU for failover) | $360 - $3,586 | Required for production SLAs
Load balancing | $20 - $100 | nginx, Traefik, or cloud LB

Engineering time is the most commonly underestimated cost. At $150/hour for a senior ML engineer, 15 hours/month of maintenance equals $2,250 -- which shifts the break-even point significantly higher.
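To quantify how much that shifts things, fold engineering time into the Scenario 1 fixed cost (a sketch using the $150/hr and 15 hrs/mo figures from the paragraph above):

```python
# Scenario 1 break-even, with and without engineering time.
RATE = 5.50              # GPT-4o blended rate, $ per 1M tokens
FIXED = 3986.0           # GPU rental + overhead, $/mo
ENGINEERING = 150 * 15   # $150/hr x 15 hrs/mo = $2,250/mo

base = FIXED / RATE                     # ~724.7M tokens/month
loaded = (FIXED + ENGINEERING) / RATE   # ~1,133.8M tokens/month
print(f"{base:.0f}M -> {loaded:.0f}M")  # break-even rises ~56%
```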

When to Self-Host: Decision Framework

  1. Volume > break-even point: If your monthly token volume consistently exceeds the break-even threshold, self-hosting saves money at scale.
  2. Data privacy is required: Healthcare (HIPAA), finance, government, or any workload where data cannot leave your infrastructure. No volume calculation needed.
  3. Latency sensitivity: Self-hosting eliminates network round-trips. For sub-100ms first-token latency, local inference on H100 is fastest.
  4. Fine-tuned models: If you need to serve a custom fine-tuned model that no API provider hosts, self-hosting is your only option.
  5. Cost predictability: Fixed GPU rental costs are predictable. API costs scale linearly and can spike unexpectedly with traffic.

When to Use APIs Instead

  1. Volume below break-even: At low-to-moderate volume, APIs are cheaper and require zero infrastructure management.
  2. Need frontier quality: Claude Opus 4.6, GPT-4.5, and Gemini 2.5 Pro have no open-weight equivalents. If you need top-tier quality, API is the only option.
  3. No ML infrastructure team: Self-hosting requires GPU management, vLLM/TGI deployment, monitoring, and on-call engineering. If your team lacks this expertise, the hidden costs will dwarf any savings.
  4. Rapid scaling needs: APIs scale instantly. Self-hosted infrastructure requires provisioning new GPUs, which can take hours to days.
  5. Multi-model strategy: If you use different models for different tasks, managing multiple self-hosted deployments compounds complexity. APIs let you switch models with a single parameter change.
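The two checklists above can be collapsed into a toy routing function. This is purely illustrative: the flags and ordering encode this article's criteria, not a general rule, and the function name is hypothetical.

```python
def choose_deployment(monthly_tokens_m: float,
                      break_even_m: float,
                      on_prem_required: bool = False,
                      custom_finetune: bool = False,
                      has_ml_ops_team: bool = False) -> str:
    """Toy encoding of the self-host vs API decision framework."""
    if on_prem_required or custom_finetune:
        return "self-host"   # no volume calculation needed
    if not has_ml_ops_team:
        return "api"         # hidden costs dwarf any savings
    return "self-host" if monthly_tokens_m > break_even_m else "api"

print(choose_deployment(1000, 725, has_ml_ops_team=True))  # self-host
print(choose_deployment(100, 725, has_ml_ops_team=True))   # api
```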

Recommended Stack for Self-Hosting

If you decide to self-host, a proven production stack as of 2026 (drawing on the tools referenced throughout this analysis) looks like this:

  1. Inference server: vLLM with continuous batching (the throughput figures above were measured with it); TGI is the main alternative.
  2. Monitoring and alerting: Prometheus and Grafana for metrics, PagerDuty for on-call.
  3. Load balancing: nginx or Traefik in front of replicas, or your cloud provider's load balancer.
  4. Storage: NVMe volumes for fast model loading.

Frequently Asked Questions

When should I self-host an LLM instead of using an API?

Self-hosting becomes cheaper than API access when your monthly API spend exceeds $2,000-$5,000 for comparable open-source models. For Llama 4 Maverick on 2x H100 GPUs, the break-even point against GPT-4o API is approximately 724M tokens/month. Below that volume, API access is cheaper when you factor in infrastructure overhead.

How much does it cost to self-host Llama 4?

Llama 4 Maverick (400B MoE) requires 2x H100 80GB GPUs. Monthly GPU rental costs: Lambda Labs $4,018/mo, RunPod $3,586/mo, AWS $5,285/mo. Add $400-$500/mo for storage, networking, and monitoring. Total: $3,986-$5,785/month fixed cost regardless of usage volume.

How many GPUs do I need for Llama 4?

Llama 4 Scout (109B MoE) fits on 1x H100 80GB in FP16. Llama 4 Maverick (400B MoE) requires 2x H100 80GB in FP16, or can fit on 1x H100 with INT8 quantization at some quality cost. For maximum throughput, 4x H100 allows batched inference serving hundreds of concurrent requests.

Is self-hosting LLMs worth it for a startup?

For most startups, no. API access costs less below $2,000-5,000/month in token spend, requires zero ML infrastructure expertise, and scales instantly. Self-hosting only makes sense for startups with: (1) very high volume (>500M tokens/month), (2) strict data residency requirements, (3) latency-sensitive applications, or (4) a team with existing ML operations expertise.

What GPU should I rent for LLM inference?

H100 80GB ($2.49/hr on RunPod) is the best price/performance for large models (70B+). A100 80GB ($1.89/hr on RunPod) works for models up to 70B in FP16. For small models (7-13B), an A10G 24GB ($0.50/hr) is sufficient and much cheaper. Always benchmark your specific model before committing.

Download Raw Data

Free under CC BY 4.0. Cite this page when sharing.