LLM Fine-Tuning Cost Calculator

Estimate training cost, training time, and total tokens processed for any dataset. Compare across providers and calculate break-even versus ongoing API inference costs.

How this works: Enter your dataset parameters and select a base model. The calculator estimates training cost using each provider's fine-tuning pricing (charged per training token), training time based on throughput benchmarks, and break-even point versus calling the base API model directly.

Base Model

$0.008 / 1K training tokens · OpenAI managed
Typical range: 2–5 epochs for fine-tuning

Dataset Parameters

Each example = one prompt/completion pair
~394 words (prompt + completion combined)

Break-Even: API Inference Comparison

How many requests/day will use the fine-tuned model?
Input + output tokens per production request

GPT-4o mini — Training Cost Estimate

Training Cost
$0.00
one-time
Training Time
0h 0m
estimated
Total Tokens
0
training tokens processed
Cost / 1K Ex.
$0.00
per 1,000 examples

Provider Comparison — Same Dataset

Model / Provider Training Cost Cost / 1K Tokens Est. Time Hosting

Break-Even Analysis: Fine-Tune vs Pure API

Fine-Tuned Model Costs

$0.00
One-time training cost
$0.00/mo
Monthly inference on fine-tuned model
$0.00
Total cost at 3 months

Base API Model Costs

$0.00
Upfront cost (none)
$0.00/mo
Monthly API inference cost
$0.00
Total cost at 3 months
Run the calculator to see the break-even analysis.

What Does Fine-Tuning an LLM Actually Cost?

Fine-tuning costs break into three parts: training compute, data preparation, and ongoing inference. This calculator focuses on training compute, which is the most variable and hardest to estimate without tooling.

Managed APIs like OpenAI charge per training token. OpenAI's GPT-4o mini fine-tuning costs $0.008 per 1,000 training tokens, meaning a 10,000-example dataset with 512 tokens per example at 3 epochs costs roughly $123. GPT-3.5 Turbo fine-tuning is cheaper at $0.008/1K as well, though the base model quality is lower.

For self-hosted models like Llama 3, Mistral, or Gemma, you pay GPU compute rather than a per-token fee. A single A100-80GB GPU rents for approximately $2.50–$3.50/hour on major cloud providers. An 8B model fine-tuned with QLoRA can process 1–3 million tokens per hour, while a 70B model processes 200K–400K tokens per hour on the same hardware. This calculator uses representative throughput estimates to convert total training tokens to wall-clock hours and cost.

Understanding Training Tokens

Every training example contributes tokens across all epochs. If your dataset has 10,000 examples at 512 tokens each, that is 5.12 million tokens per epoch. At 3 epochs, total training tokens = 15.36 million. This is what you are charged for with managed APIs and what drives GPU-hours for self-hosted training.

The tokens per example field should include both the prompt (instruction) and the completion (desired output). For instruction-following fine-tunes, a typical example might be a 200-token system message, 100-token user query, and 200-token ideal response — totaling 500 tokens. For code generation tasks, examples tend to be longer (1,000–2,000 tokens) because code is verbose.

One important nuance: with some fine-tuning frameworks, only completion tokens are used in the loss calculation (not prompt tokens). OpenAI's API charges for all tokens in the training file regardless. Check provider documentation carefully, as this can double the effective cost if your examples are prompt-heavy.

Managed API Fine-Tuning: GPT-4o mini and GPT-3.5

OpenAI currently supports fine-tuning on GPT-4o mini and GPT-3.5 Turbo (and GPT-4o for select partners). The workflow is straightforward: upload a JSONL file of examples, kick off a training job via the API, and OpenAI handles all infrastructure. Jobs typically complete in 30 minutes to a few hours depending on dataset size.

Pricing is flat per training token regardless of dataset complexity. Inference on a fine-tuned GPT-4o mini model is priced higher than the base model: $0.30 per million input tokens and $1.20 per million output tokens (vs $0.15/$0.60 for base). This premium is important for the break-even calculation — fine-tuning improves quality but costs more per inference call.

Anthropic does not currently offer a public fine-tuning API for Claude models, though this is expected to change. Models requiring Claude-level capability with task specialization typically use prompt engineering, retrieval-augmented generation, or few-shot examples instead.

Self-Hosted Fine-Tuning: Llama 3, Mistral, Gemma

Open-source models can be fine-tuned on your own infrastructure with full control over data, training configuration, and the resulting weights. The standard approach in 2026 is QLoRA (Quantized Low-Rank Adaptation), which reduces VRAM requirements dramatically: an 8B model fine-tunes in 12–16GB of VRAM (a single consumer GPU), and a 70B model requires two A100-80GB GPUs or one H100.

Common frameworks include HuggingFace's transformers + trl, Axolotl, and LLaMA-Factory. Each supports JSONL training data in the same Alpaca or ShareGPT format. Training a Llama 3 8B model on 10,000 examples at 3 epochs takes approximately 2–4 hours on a single A100. At $2.50/hour for a spot instance, that is $5–$10 total — dramatically cheaper than managed APIs at scale.

The tradeoff is operational complexity. You are responsible for cloud instance provisioning, monitoring training loss, saving checkpoints, and serving the fine-tuned weights. For teams with ML infrastructure experience, this is straightforward. For product teams without a dedicated ML engineer, managed APIs often win despite higher per-token costs.

When Does Fine-Tuning Beat Prompt Engineering?

Fine-tuning is not always the right answer. It makes sense when:

Fine-tuning is the wrong approach when you have fewer than 500–1,000 high-quality training examples, when your task changes frequently (making retraining expensive), or when you need the model to generalize to many different task types within one deployment.

The Break-Even Calculation Explained

The break-even point is the number of inference requests after which the upfront training cost is paid back by savings on inference. This requires comparing two scenarios:

  1. Fine-tune route: Pay training cost upfront, then serve inference on the fine-tuned model. For managed fine-tunes (GPT-4o mini), inference costs slightly more than base. For self-hosted open-source models, inference costs are just GPU compute with no per-token fees.
  2. Base API route: No upfront cost. Pay standard API rates for every request. The base model may need longer prompts (system instructions, few-shot examples) to match fine-tuned quality — increasing per-request token counts.

The break-even formula is: Training Cost / (Monthly API Spend - Monthly Fine-Tune Spend) = Months to Break Even. If the fine-tuned model costs the same or more per inference call (as with OpenAI's fine-tuned endpoint pricing), the break-even only comes from reduced prompt length (fewer tokens per request). If the fine-tuned model is cheaper per call (as with self-hosted), break-even can be very fast at high volume.

Frequently Asked Questions

How much does it cost to fine-tune GPT-4o mini?

OpenAI charges $0.008 per 1,000 training tokens for GPT-4o mini fine-tuning. A dataset of 10,000 examples at 512 tokens per example trained for 3 epochs totals 15.36 million training tokens, costing approximately $122.88. Inference on the resulting model costs $0.30 per million input tokens and $1.20 per million output tokens — slightly higher than the base model rates.

Can you fine-tune Claude models?

Anthropic does not currently offer a public fine-tuning API for Claude models as of 2026. Enterprise customers with specific needs can contact Anthropic about model customization programs. Most teams achieving specialized behavior from Claude use detailed system prompts, retrieval-augmented generation, or few-shot examples in the prompt.

What is QLoRA and why does it matter for fine-tuning costs?

QLoRA (Quantized Low-Rank Adaptation) is a fine-tuning technique that quantizes the base model to 4-bit precision and trains only small adapter layers (LoRA adapters) rather than the full model weights. This reduces VRAM requirements by 8–10x, enabling a 70B model to be fine-tuned on a single H100 or two A100s instead of requiring a multi-GPU cluster. QLoRA makes self-hosted fine-tuning practical for most teams, dramatically reducing cloud GPU costs.

How long does fine-tuning take?

Training time depends on model size, dataset size, epochs, and hardware. As a rough guide: Llama 3 8B with QLoRA on an A100 processes roughly 1.5 million tokens per hour. A 10,000-example dataset at 512 tokens per example across 3 epochs (15.36M total tokens) takes approximately 10 hours. Llama 3 70B on 2x A100s processes around 300K tokens per hour, making the same dataset take roughly 51 hours. Managed OpenAI jobs typically complete in 30 minutes to 3 hours.

How many training examples do I need for effective fine-tuning?

The minimum effective dataset size depends on the task. For format/style fine-tuning, 500–1,000 high-quality examples can produce noticeable improvement. For domain adaptation (specialized vocabulary, technical topics), 2,000–10,000 examples is typical. For complex reasoning tasks or significant capability changes, 10,000–100,000 examples may be needed. Quality matters far more than quantity — 500 carefully curated examples beat 5,000 mediocre ones.

Is it cheaper to fine-tune or use prompt engineering?

At low request volumes (under ~1,000 requests/day), prompt engineering is almost always cheaper because there is no upfront training cost. At high volumes, a well-fine-tuned smaller model can be significantly cheaper per request than a large base model with a long system prompt. The break-even calculator above shows the exact crossover point for your specific workload.

Built by Michael Lip. Cost estimates based on 2026 provider pricing and GPU benchmark data. Self-hosted training time estimates assume QLoRA on A100-80GB. Actual results vary with hardware, batch size, and sequence length.