Self-Hosted vs API: When Each Makes Financial Sense
The self-hosting vs API debate has real financial stakes. Choosing wrong can mean overpaying by 5-10x. This article provides concrete cost analysis for both approaches with 2025 hardware and API pricing.
The API Cost Baseline
API pricing has dropped significantly over the past year. Claude 3.5 Sonnet costs $3/$15 per million input/output tokens. GPT-4o costs $5/$15. Open-model inference providers like Groq offer Llama 3 70B at $0.59/$0.79 per million tokens — roughly 5-20x cheaper than frontier closed models, depending on the input/output mix.
For a workload of 100 million tokens per month with a 60/40 input/output split:
Claude 3.5 Sonnet: 60M * $3/1M + 40M * $15/1M = $780/month
GPT-4o: 60M * $5/1M + 40M * $15/1M = $900/month
Llama 3 70B (Groq): 60M * $0.59/1M + 40M * $0.79/1M = $67/month
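These figures can be reproduced with a small helper. A minimal sketch, assuming the 60/40 split above; the function name is hypothetical and the per-million prices are the 2025 rates quoted earlier, which change frequently:

```python
def monthly_api_cost(total_tokens_m, input_price, output_price, input_share=0.6):
    """Monthly API cost ($) for a volume in millions of tokens at per-1M prices."""
    input_m = total_tokens_m * input_share          # 60M input at the 60/40 split
    output_m = total_tokens_m * (1 - input_share)   # 40M output
    return input_m * input_price + output_m * output_price

print(monthly_api_cost(100, 3.00, 15.00))  # Claude 3.5 Sonnet: $780
print(monthly_api_cost(100, 5.00, 15.00))  # GPT-4o: $900
print(monthly_api_cost(100, 0.59, 0.79))   # Llama 3 70B via Groq: ~$67
```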
The Self-Hosting Cost Reality
Self-hosting an open model requires GPU infrastructure. Here are the real numbers for running Llama 3.1 70B with vLLM:
Cloud GPU Pricing (2025)
- A100 80GB (on-demand): $2.50-3.50/hr depending on cloud provider
- A100 80GB (reserved 1yr): $1.50-2.00/hr
- H100 80GB (on-demand): $4.00-5.50/hr
- H100 80GB (reserved 1yr): $2.50-3.50/hr
- L40S 48GB (on-demand): $1.20-1.80/hr
Running Llama 3.1 70B
Llama 3.1 70B in FP16 requires approximately 140GB of VRAM. That means 2x A100-80GB GPUs minimum. With vLLM optimizations and quantization (AWQ 4-bit), you can fit it on a single A100-80GB with reduced quality.
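The 140GB figure follows from a standard rule of thumb: bytes per parameter times parameter count, before KV cache and activation overhead. A quick sketch (the helper name is illustrative):

```python
def weight_vram_gb(n_params_billions, bytes_per_param):
    """Approximate VRAM (GB) for model weights only; KV cache adds more on top."""
    return n_params_billions * bytes_per_param

print(weight_vram_gb(70, 2.0))   # FP16: 140 GB, hence 2x A100-80GB
print(weight_vram_gb(70, 0.5))   # AWQ 4-bit: ~35 GB of weights, fits one A100-80GB
```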
# 2x A100-80GB on-demand
GPU cost: 2 * $3.00/hr * 730 hours/month = $4,380/month
# vLLM throughput on 2x A100-80GB for Llama 3.1 70B
# ~50-80 tokens/second per request at low concurrency; batching trades
# per-request speed for aggregate throughput
# Realistic sustained capacity: ~500M-1B tokens/month
# Effective cost per million tokens at 750M tokens/month:
$4,380 / 750M = ~$5.84 per million tokens (mixed input/output)
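The effective rate is just fixed monthly cost divided by tokens actually served, which makes the utilization sensitivity concrete. A sketch using the on-demand figures above (the helper name is hypothetical):

```python
def self_host_cost_per_million(gpu_hourly, n_gpus, tokens_served_m, hours=730):
    """Effective $/1M tokens for a fixed GPU deployment at a given monthly volume."""
    fixed_monthly = gpu_hourly * n_gpus * hours   # e.g. 2 * $3.00/hr * 730h = $4,380
    return fixed_monthly / tokens_served_m

print(self_host_cost_per_million(3.00, 2, 750))  # ~$5.84/1M near full utilization
print(self_host_cost_per_million(3.00, 2, 200))  # ~$21.90/1M at low utilization
```

Serving a quarter of the volume roughly quadruples the effective rate, which is why idle GPUs erase the savings.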
Compare that to Groq's $0.59/$0.79 per million tokens for the same model. At low volumes, API access through inference providers is substantially cheaper than self-hosting.
The Crossover Point
Self-hosting becomes cheaper when you can maintain high GPU utilization. Against frontier-model API pricing, the math shifts at around 500 million to 1 billion tokens per month for 70B-class models; against commodity open-model APIs like Groq, the break-even volume is far higher.
# Crossover analysis for Llama 3.1 70B
# Self-hosted (2x A100 reserved @ $1.50/hr): 2 * $1.50 * 730 hrs = $2,190/month fixed
# Capacity: ~750M tokens/month at 80% utilization
# Break-even vs Groq (blended 60/40 rate, ~$0.67/1M):
# $2,190 / $0.67 = ~3.3B tokens/month — over 4x one cluster's capacity
# For Claude 3.5 Sonnet comparison (no self-host option; blended ~$7.80/1M):
# $2,190/month buys you: $2,190 / $7.80 = ~281M tokens/month
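The break-even volume is simply the fixed monthly cost divided by the blended API rate. A sketch, using rates blended at the 60/40 split from earlier (the helper name is hypothetical):

```python
def break_even_tokens_m(fixed_monthly, api_blended_price):
    """Monthly volume (millions of tokens) where fixed self-host cost = API spend."""
    return fixed_monthly / api_blended_price

print(break_even_tokens_m(2190, 0.67))  # vs Groq: ~3,269M (~3.3B) tokens/month
print(break_even_tokens_m(2190, 7.80))  # vs Claude 3.5 Sonnet: ~281M tokens/month
```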
The key insight: self-hosting only saves money if you can keep GPUs running at 60%+ utilization. Idle GPUs are burning money.
Costs Beyond Compute
GPU cost is not the full picture for self-hosting. Add these to your budget:
- MLOps engineering: Someone needs to manage the infrastructure. Expect 0.25-0.5 FTE of an ML engineer ($3,000-8,000/month in allocated salary)
- Inference framework: vLLM, TensorRT-LLM, or TGI need setup, tuning, and maintenance
- Monitoring: GPU utilization, model latency, error rates ($200-500/month)
- Model updates: Swapping in new model versions requires testing, validation, and deployment
- Redundancy: Production systems need at least 2x capacity for failover
A realistic self-hosting budget for a 70B model in production is $8,000-15,000/month once these costs are included.
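Summing the line items above lands in roughly that range. A sketch, assuming the reserved A100 rates quoted earlier for both primary and failover capacity:

```python
# Monthly TCO ranges (low, high) for a production 70B deployment, from the text
line_items = {
    "gpus_2x_a100_reserved": (2190, 2920),  # $1.50-2.00/hr * 2 GPUs * 730h
    "failover_capacity":     (2190, 2920),  # 2x capacity for redundancy
    "mlops_engineering":     (3000, 8000),  # 0.25-0.5 FTE allocated salary
    "monitoring":            (200, 500),    # GPU utilization, latency, error rates
}
low = sum(lo for lo, hi in line_items.values())
high = sum(hi for lo, hi in line_items.values())
print(f"${low:,}-{high:,}/month")  # → $7,580-14,340/month
```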
When API Makes Sense
Use API access when:
- Volume is under 500M tokens/month for open models, or any volume for closed models (Claude, GPT-4)
- You need frontier quality: Claude 3.5 Sonnet and GPT-4o cannot be self-hosted
- Traffic is bursty: APIs handle spikes without idle GPU costs
- Team is small: No MLOps engineer available for infrastructure
- Speed to market matters: APIs require zero infrastructure setup
When Self-Hosting Makes Sense
Self-host when:
- Volume exceeds 1B tokens/month with consistent utilization
- Data privacy is non-negotiable: Regulated industries, healthcare, finance
- You need custom models: Fine-tuned models or domain-specific architectures
- Latency requirements are extreme: Co-located inference eliminates network hops
- You have MLOps capacity: Existing infrastructure team can absorb the work
The Hybrid Approach
Many production systems use both. A common pattern:
- Self-host a fast, small model (Llama 3.1 8B or Phi-3) for simple classification, routing, and extraction tasks
- Use API access to Claude 3.5 Sonnet or GPT-4o for complex reasoning and generation
- Route requests based on complexity: 70-80% go to the cheap self-hosted model, 20-30% go to the API
async def route_request(query, complexity_score):
    if complexity_score < 0.3:
        # Simple query — use self-hosted Llama 3.1 8B
        return await local_model.generate(query)
    elif complexity_score < 0.7:
        # Medium complexity — use Groq API (Llama 3 70B)
        return await groq_client.generate(query)
    else:
        # Complex reasoning — use Claude 3.5 Sonnet
        return await anthropic_client.generate(query)
This pattern reduces API costs by 50-70% while maintaining quality on difficult queries.
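The savings figure follows directly from the traffic mix. A sketch, assuming a near-zero marginal cost of ~$0.10/1M for the self-hosted 8B model (a placeholder, not a measured figure) and the 60/40-blended Claude 3.5 Sonnet rate of ~$7.80/1M:

```python
def blended_cost_per_million(mix):
    """Weighted $/1M tokens for a traffic mix of (share, price_per_1M) tiers."""
    return sum(share * price for share, price in mix)

all_api = blended_cost_per_million([(1.0, 7.80)])              # everything to Claude
hybrid = blended_cost_per_million([(0.7, 0.10), (0.3, 7.80)])  # 70% routed locally
print(f"{1 - hybrid / all_api:.0%}")  # ~69% lower blended cost at this mix
```

Shifting the split toward the local model (or routing a middle tier to Groq) moves the savings within the 50-70%+ band.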
Making the Decision
Start with API access. Measure your actual token volumes, latency requirements, and quality needs for 2-3 months. Then run the self-hosting math with real numbers instead of estimates.
Model your current API spend with a simple cost calculator, and evaluate which open-source alternatives can meet your quality bar. The right answer is usually not "all API" or "all self-hosted" — it is a thoughtful combination based on your specific workload profile.