Counting LLM Tokens

In the world of LLMs, someone eventually pays for “tokens.” But tokens are not necessarily equivalent to words. Understanding the relationship between words and tokens is critical to grasping how language models like GPT-4 process text.

While a simple word like “cat” may be a single token, a more complex word like “unbelievable” might be broken down into multiple tokens such as “un,” “bel,” and “ievable.” By converting text into these smaller units, language models can better understand and generate natural language, making them more effective at tasks like translation, summarization, and conversation.

Tokens matter because you pay by the token. If you are building an enterprise application that leverages a cloud-based LLM service like OpenAI, you pay for the number of tokens you consume in a billing period.

Since there is no 1:1 relationship between words and tokens, forecasting your token consumption can be tricky when your project starts. During your project’s testing phase, you can build an estimate by comparing the prompts you use in testing against the token counts they generate. Libraries like “tiktoken” let you feed in a word or phrase and get back a token count.

Let’s look at some examples. The examples below use the tiktoken Python library with the encoding for the gpt-4o model.


Prompt: “dog” <<< just the word “dog” was analyzed

Number of tokens: 1

Number of words: 1

Token to word ratio: 1.00

“dog” is a common word, so we get one token.


Prompt: “unbelievable”

Number of tokens: 3

Number of words: 1

Token to word ratio: 3.00

Even though “unbelievable” is a single common word, it is broken down into:

Tokens: [‘un’, ‘bel’, ‘ievable’] (not the same as the word syllables)


Looking Forward:

A basic understanding of tokens vs words is critical if you consume a cloud-based LLM AI service like OpenAI, Azure, Google, etc. If you build a Chatbot service, token consumption over time drives your billing and indicates your service consumption and adoption. If you are indexing documents to feed your LLM system, then statistics like “average tokens per document” might be helpful in forecasting costs. If you keep prompt history, then “average tokens per prompt” would be another metric allowing you to forecast cost and adoption growth over time.

The link below points to a sample Python script for calculating token counts in words or phrases. Try replacing “gpt-4o” with an older model like “text-davinci-003” and comparing the results.

Link to code: