What Does Fine-Tuning an LLM Actually Cost?
Fine-tuning costs break into three parts: training compute, data preparation, and ongoing inference. This calculator focuses on training compute, which is the most variable and hardest to estimate without tooling.
Managed APIs like OpenAI charge per training token. OpenAI's GPT-4o mini fine-tuning costs $0.008 per 1,000 training tokens, meaning a 10,000-example dataset with 512 tokens per example at 3 epochs costs roughly $123. GPT-3.5 Turbo fine-tuning is cheaper at $0.008/1K as well, though the base model quality is lower.
For self-hosted models like Llama 3, Mistral, or Gemma, you pay GPU compute rather than a per-token fee. A single A100-80GB GPU rents for approximately $2.50–$3.50/hour on major cloud providers. An 8B model fine-tuned with QLoRA can process 1–3 million tokens per hour, while a 70B model processes 200K–400K tokens per hour on the same hardware. This calculator uses representative throughput estimates to convert total training tokens to wall-clock hours and cost.
Understanding Training Tokens
Every training example contributes tokens across all epochs. If your dataset has 10,000 examples at 512 tokens each, that is 5.12 million tokens per epoch. At 3 epochs, total training tokens = 15.36 million. This is what you are charged for with managed APIs and what drives GPU-hours for self-hosted training.
The tokens per example field should include both the prompt (instruction) and the completion (desired output). For instruction-following fine-tunes, a typical example might be a 200-token system message, 100-token user query, and 200-token ideal response — totaling 500 tokens. For code generation tasks, examples tend to be longer (1,000–2,000 tokens) because code is verbose.
One important nuance: with some fine-tuning frameworks, only completion tokens are used in the loss calculation (not prompt tokens). OpenAI's API charges for all tokens in the training file regardless. Check provider documentation carefully, as this can double the effective cost if your examples are prompt-heavy.
Managed API Fine-Tuning: GPT-4o mini and GPT-3.5
OpenAI currently supports fine-tuning on GPT-4o mini and GPT-3.5 Turbo (and GPT-4o for select partners). The workflow is straightforward: upload a JSONL file of examples, kick off a training job via the API, and OpenAI handles all infrastructure. Jobs typically complete in 30 minutes to a few hours depending on dataset size.
Pricing is flat per training token regardless of dataset complexity. Inference on a fine-tuned GPT-4o mini model is priced higher than the base model: $0.30 per million input tokens and $1.20 per million output tokens (vs $0.15/$0.60 for base). This premium is important for the break-even calculation — fine-tuning improves quality but costs more per inference call.
Anthropic does not currently offer a public fine-tuning API for Claude models, though this is expected to change. Models requiring Claude-level capability with task specialization typically use prompt engineering, retrieval-augmented generation, or few-shot examples instead.
Self-Hosted Fine-Tuning: Llama 3, Mistral, Gemma
Open-source models can be fine-tuned on your own infrastructure with full control over data, training configuration, and the resulting weights. The standard approach in 2026 is QLoRA (Quantized Low-Rank Adaptation), which reduces VRAM requirements dramatically: an 8B model fine-tunes in 12–16GB of VRAM (a single consumer GPU), and a 70B model requires two A100-80GB GPUs or one H100.
Common frameworks include HuggingFace's transformers + trl, Axolotl, and LLaMA-Factory. Each supports JSONL training data in the same Alpaca or ShareGPT format. Training a Llama 3 8B model on 10,000 examples at 3 epochs takes approximately 2–4 hours on a single A100. At $2.50/hour for a spot instance, that is $5–$10 total — dramatically cheaper than managed APIs at scale.
The tradeoff is operational complexity. You are responsible for cloud instance provisioning, monitoring training loss, saving checkpoints, and serving the fine-tuned weights. For teams with ML infrastructure experience, this is straightforward. For product teams without a dedicated ML engineer, managed APIs often win despite higher per-token costs.
When Does Fine-Tuning Beat Prompt Engineering?
Fine-tuning is not always the right answer. It makes sense when:
- You need consistent output format across thousands of requests. Prompt engineering is fragile; fine-tuning bakes in the format.
- Your task requires specialized knowledge not present in the base model. Medical terminology, legal language, and domain-specific jargon improve significantly with domain-specific training data.
- Latency matters. A smaller fine-tuned model (e.g., Llama 3 8B) can outperform a larger base model on your specific task while being 5–10x faster and cheaper to serve.
- System prompt overhead is large. If your system prompt is 3,000 tokens per request, fine-tuning that behavior into the model eliminates the overhead entirely.
Fine-tuning is the wrong approach when you have fewer than 500–1,000 high-quality training examples, when your task changes frequently (making retraining expensive), or when you need the model to generalize to many different task types within one deployment.
The Break-Even Calculation Explained
The break-even point is the number of inference requests after which the upfront training cost is paid back by savings on inference. This requires comparing two scenarios:
- Fine-tune route: Pay training cost upfront, then serve inference on the fine-tuned model. For managed fine-tunes (GPT-4o mini), inference costs slightly more than base. For self-hosted open-source models, inference costs are just GPU compute with no per-token fees.
- Base API route: No upfront cost. Pay standard API rates for every request. The base model may need longer prompts (system instructions, few-shot examples) to match fine-tuned quality — increasing per-request token counts.
The break-even formula is: Training Cost / (Monthly API Spend - Monthly Fine-Tune Spend) = Months to Break Even. If the fine-tuned model costs the same or more per inference call (as with OpenAI's fine-tuned endpoint pricing), the break-even only comes from reduced prompt length (fewer tokens per request). If the fine-tuned model is cheaper per call (as with self-hosted), break-even can be very fast at high volume.