LLM Token Counter
Understand tokenization and estimate token counts for GPT, Claude, Llama, and other models.
How LLM Tokenization Works
Every LLM API charges based on tokens, not words or characters. Understanding tokenization is essential for accurate cost estimation and optimizing your API spend. A token is the fundamental unit of text that a language model processes. Modern LLMs use subword tokenization algorithms like Byte Pair Encoding (BPE) that split text into variable-length chunks based on frequency patterns learned from training data.
Common English words like "the," "is," and "for" are typically single tokens. Less common words get split into pieces: "tokenization" might become ["token", "ization"], consuming 2 tokens instead of 1. Very rare words, technical terms, and non-English text often break down further. Whitespace and punctuation are also tokenized, though many tokenizers attach leading spaces to the following token rather than treating them separately.
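A toy sketch of this splitting behavior, using a greedy longest-match over a hand-picked vocabulary. Real BPE applies learned merge rules rather than longest-match, so this is an illustration of subword splitting, not the actual algorithm:

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right.

    Falls back to single characters when no vocabulary piece matches,
    mimicking how rare words shatter into many small tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# A common word survives whole; a longer word splits into subwords
vocab = {"the", "token", "ization"}
greedy_subword_tokenize("tokenization", vocab)  # → ["token", "ization"]
```

Note how "tokenization" costs 2 tokens here while "the" would cost 1, matching the frequency-driven behavior described above.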
The practical implication is that there is no single fixed ratio between tokens and words or characters: the ratio shifts with vocabulary, language, and content type. However, useful rules of thumb exist for estimation purposes.
Token Estimation Rules
For quick estimates without running text through a tokenizer, these approximations work well across most modern LLMs.
English text: 1 token is approximately 4 characters or 0.75 words. Equivalently, 1,000 words of English prose produces roughly 1,300 to 1,500 tokens. This ratio holds for well-written content with standard vocabulary. Academic papers and technical documentation trend toward the higher end due to specialized terminology.
Code: Programming languages use 20-30% more tokens per character than English text. This is because variable names, operators, brackets, and indentation all consume tokens. Python is relatively token-efficient due to its minimal syntax, while languages like Java and TypeScript tend to be more token-heavy. A 100-line Python function typically uses 800-1,200 tokens, while the equivalent Java code might use 1,200-1,800 tokens.
JSON and structured data: JSON is token-expensive because of quotation marks, colons, commas, and brackets. A JSON object with 10 key-value pairs uses roughly 80-120 tokens, compared to 30-50 tokens for the same information in plain text. If your API responses include large JSON payloads, consider requesting condensed formats or using shorter key names.
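Character count is a rough proxy for token count, so trimming characters from a JSON payload trims tokens too. A small sketch using only the standard library, showing two of the condensing tactics mentioned above (compact separators and shorter key names; the example keys are made up for illustration):

```python
import json

data = {"temperature_celsius": 21.5, "humidity_percent": 40}

pretty = json.dumps(data, indent=2)                # whitespace-heavy
compact = json.dumps(data, separators=(",", ":"))  # drop spaces after , and :
short = json.dumps({"t": 21.5, "h": 40},           # abbreviate key names
                   separators=(",", ":"))

print(len(pretty), len(compact), len(short))  # each step shrinks the payload
```

The savings compound at scale: a schema change that removes 30 characters per object saves roughly 7-8 tokens per object across every request.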
Non-English languages: Languages with non-Latin scripts (Chinese, Japanese, Korean, Arabic) typically use several times more tokens per character than English, because tokenizer vocabularies were trained primarily on English text. A 500-character Chinese passage may use 700-1,000 tokens (roughly 1.5-2 tokens per character), compared to 125-150 tokens for 500 English characters.
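The rules of thumb above can be folded into a single estimator. The per-type rates below are the approximate ratios from this section, not tokenizer-exact values:

```python
def estimate_tokens(text: str, kind: str = "english") -> int:
    """Rough token estimate from character count, per the rules of thumb above."""
    # Tokens per character, by content type (approximate, not tokenizer-exact)
    rates = {
        "english": 1 / 4,   # ~4 characters per token
        "code": 1.25 / 4,   # ~20-30% more tokens than English prose
        "json": 2.5 / 4,    # quotes, colons, commas, and brackets add up
        "cjk": 1.75,        # ~1.5-2 tokens per character for non-Latin scripts
    }
    return max(1, round(len(text) * rates[kind]))

estimate_tokens("x" * 1000)          # → 250
estimate_tokens("x" * 1000, "code")  # → 312
```

Treat the output as a budgeting figure only; for billing-accurate counts, use the provider's own tokenizer as described below.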
Differences Between Tokenizers
Different LLM providers use different tokenizers, which means the same text produces different token counts across providers. OpenAI's GPT-4 models use the cl100k_base tokenizer (part of the tiktoken library) with a vocabulary of approximately 100,000 tokens. Claude uses a proprietary tokenizer that Anthropic has not publicly documented in the same detail, though it produces comparable token counts for English text.
Llama 2 models use a SentencePiece-based tokenizer with a 32,000-token vocabulary, considerably smaller than GPT-4's (Llama 3 moved to a larger BPE vocabulary of roughly 128,000 tokens). A smaller vocabulary means Llama 2 tends to produce slightly more tokens for the same text, especially for rare words and technical terms. In practice, the difference between tokenizers is usually within 10-15% for English text, but can be larger for multilingual content or specialized domains.
For accurate cost estimation, always use the specific provider's tokenizer. OpenAI provides tiktoken as an open-source Python library. Anthropic offers a token counting endpoint in their API. For Llama, the tokenizer is included in the model weights download. When precision is not critical, using the 1 token per 4 characters rule provides a reasonable estimate across all providers.
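A minimal counting helper along these lines: it uses tiktoken's cl100k_base encoding for an exact GPT-4-family count when the library is installed, and falls back to the 4-characters-per-token heuristic otherwise:

```python
def count_tokens(text: str) -> int:
    """Exact GPT-4-family token count via tiktoken when available,
    otherwise the ~4 characters per token heuristic for English text."""
    try:
        import tiktoken  # OpenAI's open-source tokenizer library
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return max(1, round(len(text) / 4))
```

Remember that this counts with OpenAI's tokenizer: for Claude or Llama the true figure may differ by the 10-15% margin noted above, so use the respective provider's tooling when billing accuracy matters.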
From Token Counts to API Costs
Once you know your token count, calculating cost is straightforward. Multiply your input tokens by the provider's input rate and your output tokens by the output rate. For example, a request with 2,000 input tokens and 500 output tokens on Claude Sonnet ($3 per million input tokens, $15 per million output tokens) costs: (2,000 / 1,000,000 * $3) + (500 / 1,000,000 * $15) = $0.006 + $0.0075 = $0.0135 per request. At 10,000 requests per day, that totals $135/day, or roughly $4,050 over a 30-day month. Use the KickLLM calculator to model these scenarios interactively with real pricing data from all providers.
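The arithmetic above as a reusable function, with rates expressed in dollars per million tokens as providers quote them:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost of one request in dollars; rates are dollars per million tokens."""
    return (input_tokens / 1_000_000 * input_rate
            + output_tokens / 1_000_000 * output_rate)

# The Claude Sonnet example from the text: $3/M input, $15/M output
cost = request_cost(2_000, 500, 3.0, 15.0)   # ≈ $0.0135 per request
daily = cost * 10_000                        # ≈ $135 at 10,000 requests/day
```

Because output tokens are usually several times more expensive than input tokens, capping response length (e.g. via a max-tokens parameter) often cuts cost more than trimming prompts.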
Frequently Asked Questions
How many tokens is 1,000 words?
In English, 1,000 words is approximately 1,300 to 1,500 tokens depending on the tokenizer and vocabulary complexity. Technical text with code snippets or specialized terms may produce more tokens. A rough rule of thumb is 1 word equals 1.3 tokens for English text.
What is a token in LLM APIs?
A token is a chunk of text that the model processes as a single unit. Tokens can be whole words, parts of words, or individual characters. Common English words like "the" or "and" are single tokens. Longer or rarer words get split into multiple tokens. Spaces and punctuation are also tokenized.
Do GPT and Claude use the same tokenizer?
No. GPT models use OpenAI's tiktoken (cl100k_base for GPT-4), while Claude uses a proprietary tokenizer. The same text may produce different token counts across providers, typically within a 10-15% range. Always check the specific provider's tokenizer for accurate cost estimates.
How do I count tokens before making an API call?
For OpenAI, use the tiktoken Python library or the online tokenizer tool. For Claude, use Anthropic's token counting API endpoint. For quick estimates, divide your character count by 4 for English text. Code typically uses 20-30% more tokens than plain English, and JSON can use two to three times more because of its syntax characters.
Why does code use more tokens than English text?
Code uses more tokens because variable names, syntax characters, and indentation each consume tokens. A single line like "console.log(result)" uses 6-8 tokens. Camel case and snake case identifiers get split into multiple tokens. Minified code uses fewer tokens than formatted code.
Built by Michael Lip. Pricing data updated regularly from official provider pages.