LLM Deployment Checklist

40 steps across 6 phases, from model selection to production. Check items off, expand best-practice details, and export your progress. Your state is saved locally, so no login is required.

Why a Deployment Checklist Matters for LLMs

Deploying an LLM to production is categorically different from deploying a traditional API. The model is a probabilistic system that can fail in non-deterministic ways: it may refuse valid requests, fabricate facts, generate unsafe content, or regress in quality after a seemingly unrelated infrastructure change. A structured checklist catches the gaps that exploratory development misses.

The 40 items in this checklist represent patterns from teams shipping production LLM applications. The phases follow the natural workflow: choosing a model, fine-tuning if needed, validating quality rigorously, standing up infrastructure, monitoring in production, and meeting legal and compliance requirements.

Phase 1: Model Selection

The model selection phase is often the most consequential and the most under-researched. Teams frequently default to the largest available model without measuring whether a smaller model is sufficient for their task, and without calculating whether the cost is sustainable at projected scale.

The key questions are: What quality level does your task actually require? Does the task need the reasoning depth of Claude Opus or GPT-4o, or is a cheaper model (Haiku, GPT-4o-mini, Gemini Flash) adequate? Can the model handle your expected context length? Does it support structured output formats your application requires (JSON mode, function calling, etc.)? Use our LLM comparison tool to evaluate these dimensions side by side.
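The cost question can be made concrete with a few lines of arithmetic. The sketch below compares monthly spend for two models at the same traffic level. The model names, prices, and traffic figures are placeholder assumptions, not real quotes; substitute your provider's current per-token rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Per-million-token prices in USD (placeholder values, not real quotes)."""
    name: str
    input_per_mtok: float
    output_per_mtok: float


def monthly_cost(pricing: ModelPricing, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Estimate monthly spend from average token counts per request."""
    per_request = (input_tokens * pricing.input_per_mtok
                   + output_tokens * pricing.output_per_mtok) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical price points; replace with your provider's current rates.
large = ModelPricing("large-model", input_per_mtok=10.0, output_per_mtok=30.0)
small = ModelPricing("small-model", input_per_mtok=0.5, output_per_mtok=1.5)

for model in (large, small):
    cost = monthly_cost(model, requests_per_day=50_000,
                        input_tokens=1_200, output_tokens=300)
    print(f"{model.name}: ${cost:,.2f}/month")
```

Running the same calculation for each candidate model, alongside a quality measurement on your own task, shows whether the larger model's cost is justified at projected scale.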

Phase 2: Fine-Tuning

Fine-tuning is optional but can unlock significant gains in format consistency, domain accuracy, and inference cost efficiency. The decision to fine-tune should be based on concrete evidence — a baseline evaluation showing that prompt engineering alone is insufficient — rather than assumption.

If you do fine-tune, data quality is the primary determinant of outcome. In practice, 500 carefully curated examples will often outperform 5,000 mediocre ones. Each training example should represent the exact input/output behavior you want the model to exhibit, including edge cases and refusals. Use our fine-tuning cost calculator to estimate training cost and break-even before committing.
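A basic hygiene pass over the training set is one inexpensive way to act on the data-quality point. The sketch below drops examples with missing fields and exact duplicates. The "input" and "output" field names are assumptions for illustration; real fine-tuning formats differ by provider, and this pass does not replace manual curation.

```python
def clean_examples(examples: list[dict]) -> list[dict]:
    """Keep examples with non-empty "input" and "output" text, dropping duplicates."""
    seen: set[tuple[str, str]] = set()
    kept = []
    for example in examples:
        # Treat missing or None fields as empty strings.
        prompt = str(example.get("input") or "").strip()
        completion = str(example.get("output") or "").strip()
        if not prompt or not completion:
            continue
        key = (prompt.lower(), completion.lower())  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append({"input": prompt, "output": completion})
    return kept


raw = [
    {"input": "Summarize: the cat sat.", "output": "A cat sat."},
    {"input": "summarize: the cat sat. ", "output": "A cat sat."},  # duplicate
    {"input": "Translate: hola", "output": ""},                     # empty output
    {"input": "Delete all user data", "output": "I can't help with that."},
]
print(len(clean_examples(raw)), "of", len(raw), "examples kept")
```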

Phase 3: Evaluation

Evaluation is where most LLM deployments are weakest. "Looked good in testing" is not sufficient. A proper evaluation harness includes: a held-out test set with ground truth answers, automated metrics (ROUGE, BLEU, semantic similarity, or task-specific scoring), human rating for subjective quality, adversarial test cases targeting known failure modes, and regression tests that run on every deployment.

For safety-critical applications, red-teaming — systematically attempting to elicit harmful or incorrect outputs — should be documented before launch. Track evaluation results across model versions so regressions are immediately visible.
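The held-out test set and per-deployment regression check described above can be sketched as follows. Exact match is used here only because it is the simplest metric to show; `fake_model`, the test items, and the 0.02 tolerance are illustrative assumptions, and a real harness would call the deployed model and use a task-appropriate score.

```python
from typing import Callable


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(model_fn: Callable[[str], str],
             test_set: list[tuple[str, str]]) -> float:
    """Return the mean exact-match score over (prompt, reference) pairs."""
    if not test_set:
        raise ValueError("test set is empty")
    scores = [exact_match(model_fn(prompt), ref) for prompt, ref in test_set]
    return sum(scores) / len(scores)


def passes_regression_gate(candidate: float, baseline: float,
                           tolerance: float = 0.02) -> bool:
    """Allow a deployment only if the score dropped by at most `tolerance`."""
    return candidate >= baseline - tolerance


# Stand-in for a real model call; a deployed harness would call the provider here.
def fake_model(prompt: str) -> str:
    answers = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


held_out = [
    ("Capital of France?", "paris"),
    ("2 + 2?", "4"),
    ("Largest planet?", "Jupiter"),
]
score = evaluate(fake_model, held_out)
deploy = passes_regression_gate(score, baseline=0.70)
print(f"exact match: {score:.2f}, deploy allowed: {deploy}")
```

Storing the score for each model version next to the baseline makes a regression visible as soon as the gate fails.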

Phase 4: Infrastructure

Infrastructure for LLM applications has unique requirements. Latency is typically 1–30 seconds per request (vs milliseconds for traditional APIs), requiring timeout handling, streaming support, and user-facing loading states. Rate limits must be managed both at the provider level (per-minute and per-day caps) and at the application level (per-user quotas to prevent runaway costs). Retry logic with exponential backoff is essential for the 429 and 503 errors that occur under load.
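A minimal sketch of the retry logic, assuming the client surfaces failures as an exception carrying the HTTP status. `ProviderError` is a hypothetical type introduced for illustration, not part of any real SDK. The sketch uses exponential backoff with full jitter, so concurrent clients do not retry in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = {429, 503}


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying 429/503 failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ProviderError as err:
            is_last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or is_last_attempt:
                raise
            # The delay ceiling doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable: the loop always returns or raises")
```

The `sleep` parameter exists so tests can inject a recorder instead of waiting. If the provider sends a Retry-After header, honoring it is usually preferable to a computed delay.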

Prompt and response logging requires careful design. Logs are essential for debugging and evaluation, but may contain sensitive user data. Decide on a data retention and access policy before launch, not after an incident.
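One way to reduce the exposure is to redact obvious identifiers before a record is written. The sketch below is deliberately simple and rests on loud assumptions: the two regular expressions catch only email addresses and long digit runs, and the record layout is hypothetical. It illustrates where redaction sits in the logging path; it is not a complete PII filter.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose pattern for 10 to 19 digit runs that look like phone or card numbers.
LONG_NUMBER = re.compile(r"\b(?:\d[ -]?){9,18}\d\b")


def redact(text: str) -> str:
    """Replace email addresses and long digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


def build_log_record(request_id: str, prompt: str, response: str) -> dict:
    """Return a log record whose free-text fields have been redacted."""
    return {
        "request_id": request_id,
        "prompt": redact(prompt),
        "response": redact(response),
    }


record = build_log_record(
    "req-001",
    "Contact jane@example.com or 555-123-4567 about my order.",
    "I have noted the contact details.",
)
print(record["prompt"])
```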

Phase 5: Monitoring

Production LLM systems degrade in ways that are invisible without active monitoring. Model providers update underlying models (sometimes without notice), prompt caching behavior changes, latency profiles shift, and user behavior evolves in ways that surface new edge cases. Monitoring should track: response latency (p50, p95, p99), error rates by error type, token consumption and cost per day, output quality via automated scoring, and user-reported feedback signals.

Set alerting thresholds before launch. A latency spike or cost anomaly discovered the next morning in a dashboard is much worse than one that triggers an alert at 3am.
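The latency percentiles and alert thresholds can be computed with the Python standard library. In this sketch the threshold values and sample data are assumptions chosen for illustration; production systems typically get these figures from a metrics backend rather than raw sample lists.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 from latency samples (at least two required)."""
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def breached(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """List the metric names whose value exceeds its configured threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if metrics.get(name, 0.0) > limit)


# Synthetic latencies of 0..100 ms; real samples would come from request logs.
samples = [float(ms) for ms in range(101)]
metrics = latency_percentiles(samples)

# Hypothetical thresholds, agreed on before launch.
alerts = breached(metrics, {"p95": 90.0, "p99": 120.0})
print(metrics, "alerts:", alerts)
```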

Phase 6: Compliance

Compliance requirements for LLM applications vary by use case and jurisdiction. Consumer-facing applications have stricter obligations than internal tools. Applications handling medical, legal, or financial information face additional regulatory scrutiny. The minimum requirements for most applications include: a reviewed terms of service covering AI-generated content, a data handling policy covering what user inputs are stored and for how long, compliance with provider usage policies (each provider has restrictions on use cases), and content filtering to prevent generation of illegal or harmful material.

Many teams defer compliance to "later" and find themselves unable to launch or forced into expensive retroactive changes. Completing the compliance phase before launch avoids both scenarios.

Frequently Asked Questions

How long does it take to deploy an LLM to production?

For a straightforward API integration (no fine-tuning, existing infrastructure), a focused team can complete all 40 checklist items in 2–4 weeks. Projects requiring fine-tuning add 1–2 weeks for data preparation and training. Applications in regulated industries (healthcare, finance, legal) typically add 2–4 weeks for compliance review and legal sign-off. The most common time sink is evaluation — building a rigorous test harness and running systematic red-teaming takes longer than most teams anticipate.

Do I need to fine-tune my model before deploying to production?

No. Most production LLM applications do not require fine-tuning. Prompt engineering — carefully designed system prompts, few-shot examples, and output constraints — is sufficient for the majority of use cases. Fine-tuning makes sense when you need consistent output formatting that prompt engineering cannot reliably achieve, when your domain contains specialized terminology not well-represented in the base model, when you need to reduce per-request token counts significantly for cost reasons, or when you need a smaller model to match a larger model's performance on your specific task.

What is the most common LLM deployment failure mode?

Lack of output validation. Teams build applications on the assumption that the model will always return the expected format. Those applications break when the model returns an explanation instead of JSON, answers in a different language, or refuses a borderline request. Defensive parsing, schema validation, and graceful fallback handling should be implemented before launch. The second most common failure is inadequate rate limit handling: applications that hit provider rate limits pass 429 errors straight through to users because they have no retry logic.
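Defensive parsing with a fallback can be sketched with the standard library alone. The `label` and `confidence` fields are a hypothetical schema chosen for illustration, and the type check is intentionally minimal; a schema library would be a reasonable substitute in a real application.

```python
import json
import re
from typing import Optional

# Hypothetical schema: field name mapped to the accepted Python type(s).
REQUIRED_FIELDS = {"label": str, "confidence": (int, float)}
FALLBACK = {"label": "unknown", "confidence": 0.0}


def extract_json(raw: str) -> Optional[dict]:
    """Parse a JSON object from model output, tolerating surrounding prose."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span when the model added extra text.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            return None
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return parsed if isinstance(parsed, dict) else None


def is_valid(payload: dict) -> bool:
    """Check that every required field is present with an accepted type."""
    return all(isinstance(payload.get(name), kind)
               for name, kind in REQUIRED_FIELDS.items())


def parse_or_fallback(raw: str) -> dict:
    """Return the validated payload, or a safe default when validation fails."""
    payload = extract_json(raw)
    if payload is None or not is_valid(payload):
        return dict(FALLBACK)
    return payload


print(parse_or_fallback('Sure! {"label": "spam", "confidence": 0.9}'))
print(parse_or_fallback("I cannot help with that request."))
```

Returning a fixed fallback keeps downstream code on a single code path. Counting how often the fallback fires gives an early signal that output formats have drifted.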

How do I monitor LLM output quality in production?

The most practical approach is a combination of automated scoring and sampling-based human review. For structured outputs (JSON, function calls, classification), automated validation catches format failures immediately. For free-text quality, use an LLM-as-judge approach: send a sample of production outputs to a secondary model for quality scoring against a rubric. Track the score distribution over time and alert on significant drops. Have human reviewers check a random 1–2% sample of requests each week to catch systematic issues the automated scorer misses.
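The sampling and score-tracking steps can be sketched as below. Hashing the request ID gives a stable sample without storing state, which is one reasonable design among several. The 2% rate, the 0.5-point drop threshold, and the score scale are assumptions for illustration.

```python
import hashlib
import statistics


def selected_for_review(request_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically select roughly `sample_rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2 ** 64


def score_dropped(baseline: list[float], current: list[float],
                  max_drop: float = 0.5) -> bool:
    """Flag when the mean judge score fell by more than `max_drop` points."""
    return statistics.mean(baseline) - statistics.mean(current) > max_drop


sampled = sum(selected_for_review(f"req-{i}") for i in range(10_000))
print(f"{sampled} of 10000 requests selected for human review")

# Hypothetical judge scores on a 1-5 rubric for two consecutive weeks.
print("alert:", score_dropped([4.0, 4.5, 4.2], [3.1, 3.4, 3.0]))
```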

What provider usage policies must I comply with?

All major providers prohibit certain use cases. OpenAI's usage policy prohibits generating content that exploits minors, creating malware, generating disinformation at scale, and several other categories. Anthropic's usage policy has similar prohibitions plus restrictions on fully automated decision-making in high-stakes contexts. Google's Gemini API Terms restrict use in regulated healthcare and financial advice contexts without appropriate disclosures. Review the current usage policy for each provider you use — violations can result in API access termination. Store a copy of the policy version you reviewed for your compliance records.

Built by Michael Lip. Checklist items reflect best practices as of May 2026. Checklist state is stored in your browser's localStorage — nothing is sent to a server.