Self-Hosted vs API: When Each Makes Financial Sense

Published January 2025 · 13 min read

The self-hosting vs API debate has real financial stakes. Choosing wrong can mean overpaying by 5-10x. This article provides concrete cost analysis for both approaches with 2025 hardware and API pricing.

The API Cost Baseline

API pricing has dropped significantly over the past year. Claude 3.5 Sonnet costs $3/$15 per million input/output tokens. GPT-4o costs $5/$15. Open-model inference providers like Groq offer Llama 3 70B at $0.59/$0.79 per million tokens — roughly 5-20x cheaper than frontier closed models, depending on the input/output mix.

For a workload of 100 million tokens per month with a 60/40 input/output split:

Claude 3.5 Sonnet:  60M * $3/1M + 40M * $15/1M = $780/month
GPT-4o:             60M * $5/1M + 40M * $15/1M = $900/month
Llama 3 70B (Groq): 60M * $0.59/1M + 40M * $0.79/1M = $67/month
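The arithmetic above is a simple linear blend. A small helper (a sketch using the per-million prices quoted here) makes it easy to rerun with your own token split:

```python
def monthly_api_cost(input_m, output_m, in_price, out_price):
    """Monthly API cost for input_m/output_m million tokens at
    the given per-million-token prices."""
    return input_m * in_price + output_m * out_price

# 100M tokens/month, 60/40 input/output split
print(monthly_api_cost(60, 40, 3.00, 15.00))  # Claude 3.5 Sonnet: 780.0
print(monthly_api_cost(60, 40, 5.00, 15.00))  # GPT-4o: 900.0
print(monthly_api_cost(60, 40, 0.59, 0.79))   # Llama 3 70B via Groq: ~67
```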

The Self-Hosting Cost Reality

Self-hosting an open model requires GPU infrastructure. Here are the real numbers for running Llama 3.1 70B with vLLM:

Cloud GPU Pricing (2025)

The rates used throughout this article: an A100-80GB runs about $3.00/hr on-demand and roughly $1.50/hr with a reserved commitment. Prices vary by provider, so plug in your own quotes.

Running Llama 3.1 70B

Llama 3.1 70B in FP16 requires approximately 140GB of VRAM. That means 2x A100-80GB GPUs minimum. With vLLM optimizations and quantization (AWQ 4-bit), you can fit it on a single A100-80GB with reduced quality.
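The VRAM estimate follows from parameter count times bytes per weight. The sketch below adds a rough 20% overhead for KV cache and activations; that overhead factor is an assumption, and real headroom needs vary with batch size and context length:

```python
def vram_gb(params_b, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate in GB: weights plus ~20% for KV cache
    and activations (the overhead factor is a coarse assumption)."""
    return params_b * bits_per_weight / 8 * overhead

print(vram_gb(70, 16))  # FP16: ~168 GB total (140 GB of weights alone)
print(vram_gb(70, 4))   # AWQ 4-bit: ~42 GB, fits a single A100-80GB
```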

# 2x A100-80GB on-demand
GPU cost: 2 * $3.00/hr * 730 hours/month = $4,380/month

# vLLM throughput on 2x A100-80GB for Llama 3.1 70B
# ~50-80 tokens/second per request, ~200-400 concurrent requests
# Approximate capacity: 500M-1B tokens/month at full utilization

# Effective cost per million tokens (using ~750M/month,
# the midpoint of the capacity range above):
$4,380 / 750 = ~$5.84 per million tokens (blended)

Compare that to Groq's $0.59/$0.79 per million tokens for the same model. Even a fully utilized on-demand cluster costs roughly 8x more per token than a specialized inference provider, so at low to moderate volumes API access is substantially cheaper than self-hosting.
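Utilization is the variable that dominates this comparison. A quick sketch, reusing the $4,380/month and ~750M tokens/month figures from above:

```python
def cost_per_million(monthly_gpu_cost, capacity_m_tokens, utilization):
    """Effective $/1M tokens for a self-hosted deployment."""
    return monthly_gpu_cost / (capacity_m_tokens * utilization)

# 2x A100-80GB on-demand, ~750M tokens/month at full utilization
for u in (1.0, 0.5, 0.2):
    print(f"{u:.0%} utilization: ${cost_per_million(4380, 750, u):.2f}/1M")
# 100% -> $5.84, 50% -> $11.68, 20% -> $29.20
```

Halving utilization doubles the effective per-token price, which is why idle capacity erases the case for self-hosting so quickly.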

The Crossover Point

Self-hosting becomes cheaper when you can maintain high GPU utilization, but the crossover point depends on which API you are replacing. Against frontier closed-model pricing, the math shifts at roughly 250 million tokens per month for a 70B-class model. Against a cheap open-model provider like Groq, the break-even volume is larger than a two-GPU deployment can even serve.

# Crossover analysis for Llama 3.1 70B
# Self-hosted (2x A100, reserved): $2,190/month fixed
# Capacity: ~750M tokens/month at 80% utilization

# Break-even with Groq API ($0.69 average/1M):
#   $2,190 / $0.69 = ~3.2B tokens/month
#   That is ~4x the cluster's capacity, so this deployment
#   never undercuts Groq at these prices

# Break-even with frontier pricing ($9 average/1M, e.g. Claude
# 3.5 Sonnet, which cannot be self-hosted):
#   $2,190 / $9 = ~243M tokens/month, well within capacity

The key insight: self-hosting only saves money if you can keep GPUs running at 60%+ utilization. Idle GPUs are burning money.
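The break-even logic above reduces to dividing fixed monthly cost by the API price, then checking the result against what the hardware can actually serve. A sketch using the reserved-pricing figures from this section:

```python
def break_even_m_tokens(monthly_fixed_cost, api_price_per_m):
    """Monthly volume (millions of tokens) at which self-hosting
    matches a given API price."""
    return monthly_fixed_cost / api_price_per_m

CAPACITY_M = 750  # ~2x A100 at 80% utilization, from the figures above

for name, price in [("Groq Llama 3 70B", 0.69), ("frontier blend", 9.00)]:
    be = break_even_m_tokens(2190, price)
    print(f"{name}: ~{be:.0f}M tokens/month, "
          f"within capacity: {be <= CAPACITY_M}")
```

The Groq break-even (~3.2B tokens/month) exceeds the 750M capacity, while the frontier break-even (~243M) sits comfortably inside it.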

Costs Beyond Compute

GPU cost is not the full picture for self-hosting. Add these to your budget:

  - Engineering time for deployment, tuning, and upgrades (often the largest line item)
  - On-call coverage and incident response
  - Monitoring, logging, and evaluation infrastructure
  - Redundancy, since a single GPU failure takes down a minimum-size deployment
  - Storage, networking, and egress

A realistic self-hosting budget for a 70B model in production is $8,000-15,000/month when you include everything. Track these costs carefully using infrastructure tools like KappaKit for monitoring and zovo.one for optimization guidance.
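As a sketch of how the non-GPU line items add up, here is an illustrative budget. Every figure is an assumption chosen to land inside the range above, not a vendor quote:

```python
# Illustrative monthly budget for a self-hosted 70B deployment
# (all figures are assumptions, not quotes)
budget = {
    "gpu_compute": 4380,       # two reserved 2x A100 pairs for redundancy
    "engineering": 4000,       # ~0.25 FTE for deployment, tuning, upgrades
    "monitoring_logging": 300,
    "storage_egress": 400,
    "on_call": 500,
}
print(f"${sum(budget.values()):,}/month")  # $9,580/month
```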

When API Makes Sense

Use API access when:

  - Your volume is under a few hundred million tokens per month
  - Traffic is spiky or unpredictable, so GPUs would sit idle
  - You need frontier-model quality that open models do not yet match
  - You do not have (or want) a team to operate GPU infrastructure

When Self-Hosting Makes Sense

Self-host when:

  - You sustain high volume at 60%+ GPU utilization
  - Data privacy or compliance requirements keep prompts in-house
  - You run fine-tuned or customized model weights
  - You need tight control over latency, throughput, and availability

The Hybrid Approach

Many production systems use both. A common pattern:

  1. Self-host a fast, small model (Llama 3.1 8B or Phi-3) for simple classification, routing, and extraction tasks
  2. Use API access to Claude 3.5 Sonnet or GPT-4o for complex reasoning and generation
  3. Route requests based on complexity: 70-80% go to the cheap self-hosted model, and the remaining 20-30% go to APIs, optionally split between a cheap open-model API and a frontier model

# Sketch of a complexity-based router; local_model, groq_client, and
# anthropic_client are placeholder async clients
async def route_request(query, complexity_score):
    if complexity_score < 0.3:
        # Simple query: use self-hosted Llama 3.1 8B
        return await local_model.generate(query)
    elif complexity_score < 0.7:
        # Medium complexity: use Groq API (Llama 3 70B)
        return await groq_client.generate(query)
    else:
        # Complex reasoning: use Claude 3.5 Sonnet
        return await anthropic_client.generate(query)

This pattern reduces API costs by 50-70% while maintaining quality on difficult queries.
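The blended cost of a routing mix is just a weighted average. In the sketch below, the $0.30/1M amortized cost for the self-hosted 8B tier and the 75/25 traffic split are assumptions; the raw savings figure ignores the fixed cost of running the self-hosted tier, which is why realistic savings land in the 50-70% range rather than at the naive number:

```python
def hybrid_cost_per_m(mix):
    """Blended $/1M tokens for a list of (traffic_share, price) tiers."""
    return sum(share * price for share, price in mix)

ALL_API = 9.00  # frontier blended $/1M at a 60/40 split, from above
blended = hybrid_cost_per_m([(0.75, 0.30),      # self-hosted 8B (assumed)
                             (0.25, ALL_API)])  # frontier API
print(round(blended, 3), f"{1 - blended / ALL_API:.1%} raw savings")
```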

Making the Decision

Start with API access. Measure your actual token volumes, latency requirements, and quality needs for 2-3 months. Then run the self-hosting math with real numbers instead of estimates.

Use our cost calculator to model your current API spend, and compare models on LockML to identify which open-source alternatives might work for your use case. The right answer is usually not "all API" or "all self-hosted" — it is a thoughtful combination based on your specific workload profile.