Tokenization

The process of splitting text into smaller units (tokens) that an LLM can process, typically subword pieces averaging about 4 characters per token.

Tokenization is the first step in any LLM pipeline: converting human-readable text into a sequence of integer IDs that the model can process. Modern tokenizers, such as Byte Pair Encoding (BPE), split text into subword pieces, balancing vocabulary size against coverage. Common words like "the" get their own token, while rare words are split into pieces: "tokenization" might become "token" + "ization."
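To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting over a hand-picked toy vocabulary. This is a simplification: real BPE learns its merges from corpus statistics, and `TOY_VOCAB` and `split_subwords` are illustrative names, not any library's API.

```python
# Toy vocabulary: common words and subword pieces plus single characters
# as a fallback. Real tokenizers learn tens of thousands of such pieces.
TOY_VOCAB = {"the", "token", "ization", "iz", "ation",
             "t", "o", "k", "e", "n", "a", "i", "z"}

def split_subwords(word, vocab=TOY_VOCAB):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining prefix that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Character not covered by the vocabulary: emit it on its own.
            pieces.append(word[i])
            i += 1
    return pieces

print(split_subwords("the"))           # ['the']
print(split_subwords("tokenization"))  # ['token', 'ization']
```

The common word "the" survives intact, while the rarer "tokenization" falls apart into two learned pieces, matching the behavior described above.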

Understanding tokenization matters for practical reasons. LLM pricing is per token, not per word, and a token averages roughly 3/4 of an English word. Context window limits are in tokens, so a 128K-token window holds roughly 96K words. Non-English languages and code often tokenize less efficiently (more tokens per word), meaning they cost more and fill context windows faster.
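The arithmetic behind those estimates can be written down directly. The constants below encode the heuristics from this entry (about 4 characters per token and 3/4 of a word per token for English); the function names are illustrative, and the ratios are rough averages, not guarantees.

```python
# Heuristic ratios for English text; code and other languages differ.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75  # a token is roughly 3/4 of an English word

def estimate_tokens(text: str) -> int:
    """Rough token count from raw character length."""
    return round(len(text) / CHARS_PER_TOKEN)

def words_that_fit(context_tokens: int) -> int:
    """Approximate English words that fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)

print(words_that_fit(128_000))  # 96000 -- the ~96K words quoted above
```

For billing or hard limits, always count with the model's actual tokenizer; these ratios are only for quick back-of-envelope sizing.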

For prompt engineering, token awareness helps you optimize costs: shorter prompts with the same meaning save money at scale. For RAG systems, chunk sizes should account for token limits, not just character or word counts. And for evaluation, token-level analysis helps you understand why a model produced unexpected output; sometimes the cause is the tokenizer splitting a word in an unexpected way.
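A token-budgeted chunker for RAG might look like the sketch below. It uses the ~4 characters/token heuristic as a stand-in for a real tokenizer, and `chunk_by_token_budget` and `max_tokens` are assumed names for illustration; production code should count tokens with the target model's own tokenizer.

```python
CHARS_PER_TOKEN = 4  # heuristic; replace with a real tokenizer's count

def chunk_by_token_budget(text, max_tokens=512):
    """Greedily pack whole words into chunks under an estimated token budget."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > max_chars and current:
            # Adding this word would blow the budget: start a new chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Sizing chunks by estimated tokens rather than raw characters keeps them aligned with what actually fills the context window, which matters most for content (like code) where the characters-per-token ratio drifts from the English average.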
