LLM token counting is the primary mechanism by which API providers meter usage and charge you, and understanding it is the single biggest lever you have for controlling costs.

Let’s see this in action. Imagine you’re sending a prompt to an LLM.

{
  "model": "gpt-4-turbo-preview",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that summarizes text."
    },
    {
      "role": "user",
      "content": "Summarize the following article:\n\n[Article text goes here, let's say it's 1000 words long]"
    }
  ]
}

The API provider doesn’t just look at the raw character count. They use a tokenizer, a specific algorithm, to break down your text into "tokens." A token can be a word, part of a word, or punctuation. For English text, a rough rule of thumb is that 100 tokens is about 75 words.
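
The rule of thumb above can be turned into a quick estimator for moments when no real tokenizer is at hand. This is only a rough sketch: the function name and the 0.75 words-per-token default are illustrative assumptions taken from the heuristic, and real counts depend on the tokenizer and the text.

```python
def estimate_tokens_from_words(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count (heuristic, not exact)."""
    word_count = len(text.split())
    # 100 tokens is about 75 words, so tokens is about words / 0.75
    return round(word_count / words_per_token)


article = "word " * 1000  # stand-in for a 1000-word article
print(estimate_tokens_from_words(article))  # 1333
```

An estimate like this is good enough for back-of-the-envelope budgeting, but not for enforcing hard limits.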

Here’s how the token count breaks down for that example:

  • System message: "You are a helpful assistant that summarizes text." (This is a short string, let’s estimate 12 tokens).
  • User message:
    • The instruction "Summarize the following article:\n\n" (e.g., 8 tokens).
    • The article text itself. If 1000 words is roughly 1333 tokens (1000 * 1.333), this is the bulk.
  • Total Prompt Tokens: 12 (system) + 8 (instruction) + 1333 (article) = ~1353 tokens.

Now, the LLM generates a response. Let’s say it generates a summary that’s 200 words long. That’s roughly 267 tokens.

  • Total Completion Tokens: ~267 tokens.
  • Total Tokens for this API Call: 1353 (prompt) + 267 (completion) = ~1620 tokens.

If the API charges $0.01 per 1000 tokens for prompts and $0.03 per 1000 tokens for completions, this single call would cost:

  • Prompt cost: (1353 / 1000) * $0.01 = $0.01353
  • Completion cost: (267 / 1000) * $0.03 = $0.00801
  • Total cost: $0.02154
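
The arithmetic above can be wrapped in a small helper. This is a minimal sketch: the function name is hypothetical, and the prices are the illustrative ones from this example, not a real price list.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_1k: float,
    completion_price_per_1k: float,
) -> float:
    """Estimated cost in dollars for one call, given per-1000-token prices."""
    prompt_cost = prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost


# Token counts and prices are the example values used in the text.
cost = estimate_cost(1353, 267, prompt_price_per_1k=0.01, completion_price_per_1k=0.03)
print(f"${cost:.5f}")  # $0.02154
```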

The problem is that most developers don’t have visibility into these token counts before they make the API call. They might be sending excessively long prompts, or getting overly verbose completions, without realizing the impact on their bill.

The solution is to integrate token counting before the API call. Most LLM providers offer their own tokenization libraries. For OpenAI models, you’d use the tiktoken library.

Here’s how you’d use tiktoken to count tokens for a prompt:

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string using a specific encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    # encode() returns a list of integer token IDs; its length is the token count
    tokens = encoding.encode(string)
    return len(tokens)

# Example usage for a common GPT-4 encoding
encoding = "cl100k_base" # This is the encoding for gpt-4, gpt-3.5-turbo, text-embedding-ada-002
prompt_text = "This is a sample prompt to count tokens."
token_count = num_tokens_from_string(prompt_text, encoding)
print(f"The prompt has {token_count} tokens.")
# Output: The prompt has 9 tokens.

To count tokens for a structured message list (like you send to chat models):

import tiktoken

def num_tokens_from_messages(messages: list[dict], model: str = "gpt-4-turbo-preview") -> int:
    """Returns the number of tokens in a list of messages for a specific model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")

    if model == "gpt-3.5-turbo-0301":
        # this older snapshot uses a different per-message overhead
        # see https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model in ("gpt-3.5-turbo", "gpt-4", "gpt-4-0314", "gpt-4-turbo-preview"):
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        # Unknown model: fall back to the most common per-message overhead
        print(f"Warning: Unknown model: {model}. Using default token counting.")
        tokens_per_message = 3
        tokens_per_name = 1

    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            # str() avoids a TypeError if a value is not a string
            num_tokens += len(encoding.encode(str(value)))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

# Example usage
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Count these tokens!"}
]
token_count = num_tokens_from_messages(messages, model="gpt-4-turbo-preview")
print(f"The messages have {token_count} tokens.")
# Prints the estimated prompt token count for these two messages,
# including the per-message and reply-priming overhead.

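The tiktoken library is crucial. For plain text, it applies the same encoding the API uses. For chat messages, the per-message overhead shown above is an approximation that can change between model versions, so treat the result as a close estimate rather than an exact figure. That is still enough to predict costs before sending requests. You can use this to: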

  1. Truncate long inputs: If a user uploads a massive document, count its tokens. If it exceeds a certain threshold (e.g., 4000 tokens for a model with an 8k context window), truncate the document before sending it to the LLM.
  2. Summarize context: Instead of sending the full conversation history, summarize older turns to reduce prompt length.
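  3. Constrain output length: For models that allow it, specify a max_tokens parameter for the completion. This caps both the output length and its cost. The cap cuts the response off rather than making it more concise, so pair it with an instruction to be brief.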
  4. Choose cheaper models: If a less powerful, cheaper model (e.g., gpt-3.5-turbo) can achieve the desired quality, use tiktoken to estimate token usage for both models and compare total costs.
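
The first item in the list above, truncating long inputs, might look like the following sketch. The function name is hypothetical, and the toy word-level tokenizer exists only so the example runs without extra dependencies. In practice you would pass the encode and decode methods of a real tokenizer, such as a tiktoken encoding.

```python
from typing import Callable, List


def truncate_to_token_budget(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> str:
    """Return text cut down to at most max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Toy stand-in tokenizer for demonstration only: one token per whitespace-separated word.
_vocab: List[str] = []


def toy_encode(text: str) -> List[int]:
    ids = []
    for word in text.split():
        if word not in _vocab:
            _vocab.append(word)
        ids.append(_vocab.index(word))
    return ids


def toy_decode(ids: List[int]) -> str:
    return " ".join(_vocab[i] for i in ids)


document = "one two three four five six"
print(truncate_to_token_budget(document, 4, toy_encode, toy_decode))  # one two three four
```

Cutting at a token boundary can end the text mid-sentence, so you may prefer to truncate at the last paragraph or sentence break that still fits the budget.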

A common pitfall is assuming the token count for a given number of characters is constant. It’s not. Common English words usually map to a single token each, while rare words, acronyms, and much non-English text are split into several pieces. A short string with an unusual acronym, such as "I love LLMs!", can therefore use about as many tokens as a much longer string of everyday words, such as "I love large language models!". This means that even if you’re using a fixed character limit for your input, the actual token count can fluctuate, so run the tokenizer rather than relying on character counts.

The most surprising thing about token counting is that it’s not just about the text you send, but also the specific model you choose. Different models use different tokenization encodings, and even the way messages are formatted for chat models (like adding role tags) contributes to the token count.

If you’re building an application that relies heavily on LLM APIs, you should absolutely build a token_counter function into your core logic. This function should take your input (text, messages, etc.) and the target model name, and return the estimated token count. You can then use this estimate to enforce limits, choose models, and provide feedback to users about potential costs.
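
One way to use such an estimate to enforce limits is sketched below. All names here are hypothetical. The counter is passed in as a function, so you could supply a tokenizer-based counter like num_tokens_from_messages; the rough_count placeholder only keeps the example dependency-free. Note that trim_history drops the oldest turns outright, which is a simpler alternative to summarizing them.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Counter = Callable[[List[Message]], int]


def fits_in_context(
    messages: List[Message],
    count_tokens: Counter,
    context_window: int,
    max_completion_tokens: int,
) -> bool:
    """True if the prompt plus the reserved completion budget fits the window."""
    return count_tokens(messages) + max_completion_tokens <= context_window


def trim_history(
    messages: List[Message], count_tokens: Counter, max_prompt_tokens: int
) -> List[Message]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    trimmed = list(messages)
    while count_tokens(trimmed) > max_prompt_tokens:
        for i, message in enumerate(trimmed):
            if message["role"] != "system":
                del trimmed[i]
                break
        else:
            break  # only system messages remain; nothing more to drop
    return trimmed


def rough_count(messages: List[Message]) -> int:
    """Placeholder counter: words / 0.75 plus 3 tokens of overhead per message."""
    words = sum(len(m["content"].split()) for m in messages)
    return round(words / 0.75) + 3 * len(messages)


history = [
    {"role": "system", "content": "You summarize text."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question"},
]
print(fits_in_context(history, rough_count, context_window=100, max_completion_tokens=50))  # True
print(len(trim_history(history, rough_count, max_prompt_tokens=20)))  # 3
```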

The next logical step after mastering token counting is understanding prompt engineering techniques that specifically aim to reduce token usage while maintaining or improving output quality.
