Self-Hosted vs API: When Each Makes Financial Sense
The self-hosting vs API debate has real financial stakes. Choosing wrong can mean overpaying by 5-10x. This article provides concrete cost analysis for both approaches with 2025 hardware and API pricing.
The API Cost Baseline
API pricing has dropped significantly over the past year. Claude 3.5 Sonnet costs $3/$15 per million input/output tokens. GPT-4o costs $5/$15. Open-model inference providers like Groq offer Llama 3 70B at $0.59/$0.79 per million tokens — roughly 5-20x cheaper than frontier closed models, depending on the input/output mix.
For a workload of 100 million tokens per month with a 60/40 input/output split:
Claude 3.5 Sonnet: 60M * $3/1M + 40M * $15/1M = $780/month
GPT-4o: 60M * $5/1M + 40M * $15/1M = $900/month
Llama 3 70B (Groq): 60M * $0.59/1M + 40M * $0.79/1M = $67/month
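These figures can be reproduced with a small helper. A minimal sketch, assuming the 60/40 split above; the function name is hypothetical and the per-million prices are the 2025 rates quoted earlier, which change frequently:

```python
def monthly_api_cost(total_tokens_m, input_price, output_price, input_share=0.6):
    """Monthly API cost ($) for a volume in millions of tokens at per-1M prices."""
    input_m = total_tokens_m * input_share          # 60M input at the 60/40 split
    output_m = total_tokens_m * (1 - input_share)   # 40M output
    return input_m * input_price + output_m * output_price

print(monthly_api_cost(100, 3.00, 15.00))  # Claude 3.5 Sonnet: $780
print(monthly_api_cost(100, 5.00, 15.00))  # GPT-4o: $900
print(monthly_api_cost(100, 0.59, 0.79))   # Llama 3 70B via Groq: ~$67
```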
The Self-Hosting Cost Reality
Self-hosting an open model requires GPU infrastructure. Here are the real numbers for running Llama 3.1 70B with vLLM:
Cloud GPU Pricing (2025)
- A100 80GB (on-demand): $2.50-3.50/hr depending on cloud provider
- A100 80GB (reserved 1yr): $1.50-2.00/hr
- H100 80GB (on-demand): $4.00-5.50/hr
- H100 80GB (reserved 1yr): $2.50-3.50/hr
- L40S 48GB (on-demand): $1.20-1.80/hr
Running Llama 3.1 70B
Llama 3.1 70B in FP16 requires approximately 140GB of VRAM. That means 2x A100-80GB GPUs minimum. With vLLM optimizations and quantization (AWQ 4-bit), you can fit it on a single A100-80GB with reduced quality.
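The 140GB figure follows from a standard rule of thumb: bytes per parameter times parameter count, before KV cache and activation overhead. A quick sketch (the helper name is illustrative):

```python
def weight_vram_gb(n_params_billions, bytes_per_param):
    """Approximate VRAM (GB) for model weights only; KV cache adds more on top."""
    return n_params_billions * bytes_per_param

print(weight_vram_gb(70, 2.0))   # FP16: 140 GB, hence 2x A100-80GB
print(weight_vram_gb(70, 0.5))   # AWQ 4-bit: ~35 GB of weights, fits one A100-80GB
```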
# 2x A100-80GB on-demand
GPU cost: 2 * $3.00/hr * 730 hours/month = $4,380/month
# vLLM throughput on 2x A100-80GB for Llama 3.1 70B
# ~50-80 tokens/second per request at low concurrency; batching trades
# per-request speed for aggregate throughput
# Realistic sustained capacity: ~500M-1B tokens/month
# Effective cost per million tokens at 750M tokens/month:
$4,380 / 750M = ~$5.84 per million tokens (mixed input/output)
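The effective rate is just fixed monthly cost divided by tokens actually served, which makes the utilization sensitivity concrete. A sketch using the on-demand figures above (the helper name is hypothetical):

```python
def self_host_cost_per_million(gpu_hourly, n_gpus, tokens_served_m, hours=730):
    """Effective $/1M tokens for a fixed GPU deployment at a given monthly volume."""
    fixed_monthly = gpu_hourly * n_gpus * hours   # e.g. 2 * $3.00/hr * 730h = $4,380
    return fixed_monthly / tokens_served_m

print(self_host_cost_per_million(3.00, 2, 750))  # ~$5.84/1M near full utilization
print(self_host_cost_per_million(3.00, 2, 200))  # ~$21.90/1M at low utilization
```

Serving a quarter of the volume roughly quadruples the effective rate, which is why idle GPUs erase the savings.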
Compare that to Groq's $0.59/$0.79 per million tokens for the same model. At low volumes, API access through inference providers is substantially cheaper than self-hosting.
The Crossover Point
Self-hosting becomes cheaper when you can maintain high GPU utilization. Against frontier-model API pricing, the math shifts at around 500 million to 1 billion tokens per month for 70B-class models; against commodity open-model APIs like Groq, the break-even volume is far higher.
# Crossover analysis for Llama 3.1 70B
# Self-hosted (2x A100 reserved @ $1.50/hr): 2 * $1.50 * 730 hrs = $2,190/month fixed
# Capacity: ~750M tokens/month at 80% utilization
# Break-even vs Groq (blended 60/40 rate, ~$0.67/1M):
# $2,190 / $0.67 = ~3.3B tokens/month — over 4x one cluster's capacity
# For Claude 3.5 Sonnet comparison (no self-host option; blended ~$7.80/1M):
# $2,190/month buys you: $2,190 / $7.80 = ~281M tokens/month
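The break-even volume is simply the fixed monthly cost divided by the blended API rate. A sketch, using rates blended at the 60/40 split from earlier (the helper name is hypothetical):

```python
def break_even_tokens_m(fixed_monthly, api_blended_price):
    """Monthly volume (millions of tokens) where fixed self-host cost = API spend."""
    return fixed_monthly / api_blended_price

print(break_even_tokens_m(2190, 0.67))  # vs Groq: ~3,269M (~3.3B) tokens/month
print(break_even_tokens_m(2190, 7.80))  # vs Claude 3.5 Sonnet: ~281M tokens/month
```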
The key insight: self-hosting only saves money if you can keep GPUs running at 60%+ utilization. Idle GPUs are burning money.
Costs Beyond Compute
GPU cost is not the full picture for self-hosting. Add these to your budget:
- MLOps engineering: Someone needs to manage the infrastructure. Expect 0.25-0.5 FTE of an ML engineer ($3,000-8,000/month in allocated salary)
- Inference framework: vLLM, TensorRT-LLM, or TGI need setup, tuning, and maintenance
- Monitoring: GPU utilization, model latency, error rates ($200-500/month)
- Model updates: Swapping in new model versions requires testing, validation, and deployment
- Redundancy: Production systems need at least 2x capacity for failover
A realistic self-hosting budget for a 70B model in production is $8,000-15,000/month once these costs are included.
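Summing the line items above lands in roughly that range. A sketch, assuming the reserved A100 rates quoted earlier for both primary and failover capacity:

```python
# Monthly TCO ranges (low, high) for a production 70B deployment, from the text
line_items = {
    "gpus_2x_a100_reserved": (2190, 2920),  # $1.50-2.00/hr * 2 GPUs * 730h
    "failover_capacity":     (2190, 2920),  # 2x capacity for redundancy
    "mlops_engineering":     (3000, 8000),  # 0.25-0.5 FTE allocated salary
    "monitoring":            (200, 500),    # GPU utilization, latency, error rates
}
low = sum(lo for lo, hi in line_items.values())
high = sum(hi for lo, hi in line_items.values())
print(f"${low:,}-{high:,}/month")  # → $7,580-14,340/month
```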
When API Makes Sense
Use API access when:
- Volume is under 500M tokens/month for open models, or any volume for closed models (Claude, GPT-4)
- You need frontier quality: Claude 3.5 Sonnet and GPT-4o cannot be self-hosted
- Traffic is bursty: APIs handle spikes without idle GPU costs
- Team is small: No MLOps engineer available for infrastructure
- Speed to market matters: APIs require zero infrastructure setup
When Self-Hosting Makes Sense
Self-host when:
- Volume exceeds 1B tokens/month with consistent utilization
- Data privacy is non-negotiable: Regulated industries, healthcare, finance
- You need custom models: Fine-tuned models or domain-specific architectures
- Latency requirements are extreme: Co-located inference eliminates network hops
- You have MLOps capacity: Existing infrastructure team can absorb the work
The Hybrid Approach
Many production systems use both. A common pattern:
- Self-host a fast, small model (Llama 3.1 8B or Phi-3) for simple classification, routing, and extraction tasks
- Use API access to Claude 3.5 Sonnet or GPT-4o for complex reasoning and generation
- Route requests based on complexity: 70-80% go to the cheap self-hosted model, 20-30% go to the API
async def route_request(query, complexity_score):
    if complexity_score < 0.3:
        # Simple query — use self-hosted Llama 3.1 8B
        return await local_model.generate(query)
    elif complexity_score < 0.7:
        # Medium complexity — use Groq API (Llama 3 70B)
        return await groq_client.generate(query)
    else:
        # Complex reasoning — use Claude 3.5 Sonnet
        return await anthropic_client.generate(query)
This pattern reduces API costs by 50-70% while maintaining quality on difficult queries.
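The savings figure follows directly from the traffic mix. A sketch, assuming a near-zero marginal cost of ~$0.10/1M for the self-hosted 8B model (a placeholder, not a measured figure) and the 60/40-blended Claude 3.5 Sonnet rate of ~$7.80/1M:

```python
def blended_cost_per_million(mix):
    """Weighted $/1M tokens for a traffic mix of (share, price_per_1M) tiers."""
    return sum(share * price for share, price in mix)

all_api = blended_cost_per_million([(1.0, 7.80)])              # everything to Claude
hybrid = blended_cost_per_million([(0.7, 0.10), (0.3, 7.80)])  # 70% routed locally
print(f"{1 - hybrid / all_api:.0%}")  # ~69% lower blended cost at this mix
```

Shifting the split toward the local model (or routing a middle tier to Groq) moves the savings within the 50-70%+ band.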
Making the Decision
Start with API access. Measure your actual token volumes, latency requirements, and quality needs for 2-3 months. Then run the self-hosting math with real numbers instead of estimates.
Model your current API spend with a simple cost calculator, and evaluate which open-source alternatives can meet your quality bar. The right answer is usually not "all API" or "all self-hosted" — it is a thoughtful combination based on your specific workload profile.