What is token counting?

6 min read


Token counting is the practice of measuring how many tokens your text consumes when processed by a language model. Since providers charge per token and models have fixed context window limits, understanding token counts is essential for managing costs, staying within limits, and building reliable AI applications.

Why Counting Tokens Matters

────────────────────────────────────────

Every API call to a language model has a cost, and that cost is measured in tokens. If you are building an application that makes thousands of API calls per day, the difference between a 500-token prompt and a 2,000-token prompt directly affects your bill.

Beyond cost, tokens determine whether your request fits within the model's context window. If you are building a RAG system that retrieves documents and includes them in the prompt, you need to know exactly how many tokens those documents consume to avoid exceeding the limit.

Token counting also affects latency. More input tokens take longer to process. More output tokens take longer to generate. If you are optimizing for response speed, reducing token count is one of the most effective levers.

How Tokenization Works

────────────────────────────────────────

Tokenization is the process of breaking text into tokens. Modern language models do not process text character by character or word by word. They use subword tokenization algorithms that break text into pieces that balance vocabulary size with representation efficiency.

[Byte Pair Encoding (BPE)] is the most common approach, used by OpenAI's models and many others. BPE starts with individual characters and iteratively merges the most frequent pairs. Common words like "the" become a single token. Rare words get split into subword pieces. "Tokenization" might become "Token" + "ization" or "Tok" + "en" + "ization" depending on the tokenizer.
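The merge loop at the heart of BPE can be sketched in a few lines. This is a toy trainer on a four-word corpus, not a production tokenizer: real tokenizers train on enormous corpora and typically operate on bytes rather than characters.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: learn merge rules from a tiny corpus."""
    corpus = [list(w) for w in words]  # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged_symbol = best[0] + best[1]
        # Replace every occurrence of the best pair with one symbol.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["the", "then", "there", "this"], num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')] -- "the" becomes a single token
```

After just two merges, the frequent word "the" collapses into one symbol while rarer words like "this" remain split into pieces, which is exactly the behavior described above.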

[SentencePiece] is another popular algorithm, used by Google and Meta models. It works directly on raw text without pre-tokenization, making it language-agnostic. It supports both BPE and unigram model approaches.

[WordPiece] is used by some models including older BERT variants. It is similar to BPE but uses a slightly different merging criterion.

The key thing to understand is that different models use different tokenizers. The same text produces different token counts depending on which model you are using. "Hello, how are you?" might be 6 tokens in one tokenizer and 7 in another.

Different Tokenizers for Different Models

────────────────────────────────────────

[OpenAI] uses a tokenizer called cl100k_base for GPT-4 and GPT-3.5 Turbo, and o200k_base for GPT-4o and newer models. These tokenizers have vocabularies of roughly 100,000 and 200,000 tokens respectively.

[Anthropic] uses their own tokenizer for Claude models. Anthropic does not publish their tokenizer publicly in the same way OpenAI does, but their API reports exact token counts in responses and offers a token-counting endpoint for checking counts before a call.

[Google] uses SentencePiece-based tokenizers for Gemini models. Their tokenizer handles multilingual text efficiently.

[Meta] used a SentencePiece tokenizer with a vocabulary of 32,000 tokens for Llama 2; Llama 3 switched to a BPE tokenizer (built on tiktoken) with a vocabulary of 128,000 tokens.

[Mistral] uses a SentencePiece tokenizer similar to Llama's approach.

The practical implication: if you are working with multiple providers, you cannot use a single token counter for all of them. You need the right tokenizer for each model.

Tools for Counting Tokens

────────────────────────────────────────

[tiktoken] is OpenAI's open source tokenizer library for Python. It is fast, accurate, and the standard tool for counting tokens for OpenAI models. You pass in text and get back the exact token count or the list of token IDs.

[Provider APIs]: Most providers return token counts in their API responses. The response metadata includes prompt_tokens, completion_tokens, and total_tokens. This is the most accurate count, but it only works after you have already made the call.

[Provider tokenizer tools]: OpenAI has a web-based tokenizer at platform.openai.com/tokenizer. Google provides a count_tokens method in their Gemini SDK. These let you check token counts before making API calls.

[Hugging Face tokenizers]: The transformers library includes tokenizers for thousands of models. If you are using open source models, this is how you count tokens accurately.

[Approximation]: When exact counting is not critical, the rule of thumb is roughly 1 token per 4 characters in English, or about 0.75 words per token. This is imprecise but useful for quick estimates.
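Both rules of thumb are easy to encode. These helpers are only heuristics for English text; when the count matters for billing or hard context-window limits, use the model's real tokenizer.

```python
def estimate_tokens_by_chars(text: str) -> int:
    """Rough estimate: ~4 characters per token in English."""
    return max(1, round(len(text) / 4))

def estimate_tokens_by_words(text: str) -> int:
    """Rough estimate: ~0.75 words per token."""
    return max(1, round(len(text.split()) / 0.75))

sample = "Token counting is the practice of measuring usage."
print(estimate_tokens_by_chars(sample))
print(estimate_tokens_by_words(sample))
```

The two estimates will often disagree by a token or two, which is a useful reminder of how imprecise the approximation is.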

Estimating Costs Before Making API Calls

────────────────────────────────────────

Smart applications estimate costs before committing to an API call. Here is the process:

  1. [Count input tokens]: Use the appropriate tokenizer to count the tokens in your prompt, including system messages, context, and the user's input.
  2. [Estimate output tokens]: This is harder since you do not know how long the response will be. Use your max_tokens setting as the upper bound, and historical averages for a more realistic estimate.
  3. [Apply pricing]: Multiply input tokens by the input price and estimated output tokens by the output price.

For example, with a model that charges $3 per million input tokens and $15 per million output tokens, a 2,000-token input with an estimated 500-token output costs $0.006 + $0.0075 = $0.0135 per request. At 10,000 requests per day, that is $135 per day.
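The three-step process reduces to a one-line calculation. Here it is as a small helper, using the pricing from the worked example above:

```python
def estimate_request_cost(input_tokens, est_output_tokens,
                          input_price_per_m, output_price_per_m):
    """Estimate the dollar cost of one API call.

    Prices are in dollars per million tokens, matching how most
    providers publish their rates.
    """
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = est_output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost

# The worked example: $3/M input, $15/M output.
per_request = estimate_request_cost(2_000, 500, 3.0, 15.0)
print(f"${per_request:.4f} per request")       # $0.0135
print(f"${per_request * 10_000:.2f} per day")  # $135.00 at 10k requests/day
```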

Input vs Output Token Pricing

────────────────────────────────────────

Almost all providers charge different rates for input and output tokens, with output tokens being significantly more expensive, often 3-5 times more.

This pricing difference reflects the actual computation required. Processing input tokens is relatively cheap because it can be parallelized efficiently. Generating output tokens is sequential, with each token depending on all previous tokens, making it more computationally expensive.

This has practical implications for application design:

  • [Long prompts with short answers] (like classification) are relatively cheap
  • [Short prompts with long answers] (like content generation) are relatively expensive
  • [Caching input] is valuable because it avoids re-processing the same tokens repeatedly. Some providers offer prompt caching features at reduced rates.
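The asymmetry is easy to see with numbers. This sketch compares the two workload shapes above under hypothetical pricing ($3/M input, $15/M output); the token counts are illustrative, not measured:

```python
PRICE_IN, PRICE_OUT = 3.0, 15.0  # dollars per million tokens (hypothetical)

def cost(input_tokens, output_tokens):
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1_000_000

# Classification: long prompt, near-trivial answer.
classification = cost(3_000, 5)
# Content generation: short prompt, long answer.
generation = cost(200, 3_000)

print(f"classification: ${classification:.6f}")
print(f"generation:     ${generation:.6f}")
```

Although both requests involve roughly 3,000 tokens in total, the generation request costs about five times more, because almost all of its tokens are billed at the output rate.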

Strategies for Reducing Token Usage

────────────────────────────────────────

[Write concise prompts]: Remove unnecessary words, redundant instructions, and verbose examples. A prompt that says "Please classify the following text" uses more tokens than "Classify this text" for the same result.

[Use system prompts wisely]: System prompts are sent with every request. A bloated system prompt adds up across thousands of calls.

[Cache and reuse]: If many requests share the same context or instructions, take advantage of prompt caching features offered by providers like Anthropic and OpenAI.

[Choose the right model]: Smaller models are cheaper per token and often sufficient for simpler tasks. Use GPT-4 or Claude Opus for complex reasoning and cheaper models for classification or extraction.

[Limit output length]: Set appropriate max_tokens values to prevent unnecessarily long responses.

[Compress context]: In RAG applications, summarize retrieved documents before including them in the prompt. Use the most relevant excerpts rather than entire documents.

[Batch similar requests]: Some providers offer batch processing at discounted rates for non-time-sensitive workloads.

Special Tokens and Their Role

────────────────────────────────────────

Beyond the tokens that represent your text, models use special tokens for internal bookkeeping:

[Beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens] mark the start and end of text. These help the model understand boundaries.

[Chat formatting tokens] separate different roles in a conversation. Tokens like <|user|>, <|assistant|>, and <|system|> tell the model who said what. Different models use different formatting tokens.

[Tool-use tokens] in some models signal function calls and results.

These special tokens count toward your total token usage and context window. Chat formatting overhead can be meaningful. A conversation with 50 turns includes formatting tokens for each turn, which adds up.
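A rough sketch of how that overhead accumulates. The per-message figure below (~4 tokens covering role markers and separators, plus ~3 tokens priming the reply) is a published heuristic for OpenAI chat models and is an assumption here; the real overhead varies by model and provider.

```python
def estimate_chat_overhead(num_messages, tokens_per_message=4,
                           reply_priming=3):
    """Estimate formatting-token overhead for one chat request.

    tokens_per_message (~4) and reply_priming (~3) are heuristics
    for OpenAI chat models; other providers differ.
    """
    return num_messages * tokens_per_message + reply_priming

# A 50-turn conversation = 100 messages (one user + one assistant per turn).
overhead = estimate_chat_overhead(100)
print(overhead)  # 403 tokens of pure formatting, before any actual content
```

Four hundred tokens is small for one request, but since the full history is resent on every turn, the cumulative formatting cost across a long conversation is far larger.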

Understanding token counting might seem like a minor implementation detail, but it directly affects the cost, reliability, and performance of everything you build with language models. The developers who track their token usage closely are the ones who build sustainable, cost-effective AI applications.
