Tokenization

The process of splitting text into smaller units (tokens) that an LLM can process, typically subword pieces averaging about 4 characters per token.

Tokenization is the first step in any LLM pipeline: converting human-readable text into a sequence of integer IDs that the model can process. Modern tokenizers, such as Byte Pair Encoding (BPE), split text into subword pieces, balancing vocabulary size against coverage. Common words like "the" get their own token, while rare words are split into pieces: "tokenization" might become "token" + "ization."
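To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting over a hand-picked toy vocabulary. This is a simplification: real BPE learns its merges from corpus statistics, and `TOY_VOCAB` and `split_subwords` are illustrative names, not any library's API.

```python
# Toy vocabulary: common words and subword pieces plus single characters
# as a fallback. Real tokenizers learn tens of thousands of such pieces.
TOY_VOCAB = {"the", "token", "ization", "iz", "ation",
             "t", "o", "k", "e", "n", "a", "i", "z"}

def split_subwords(word, vocab=TOY_VOCAB):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining prefix that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Character not covered by the vocabulary: emit it on its own.
            pieces.append(word[i])
            i += 1
    return pieces

print(split_subwords("the"))           # ['the']
print(split_subwords("tokenization"))  # ['token', 'ization']
```

The common word "the" survives intact, while the rarer "tokenization" falls apart into two learned pieces, matching the behavior described above.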

Understanding tokenization matters for practical reasons. LLM pricing is per token, not per word, and a token averages roughly 3/4 of an English word. Context window limits are in tokens, so a 128K-token window holds roughly 96K words. Non-English languages and code often tokenize less efficiently (more tokens per word), meaning they cost more and fill context windows faster.
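The arithmetic behind those estimates can be written down directly. The constants below encode the heuristics from this entry (about 4 characters per token and 3/4 of a word per token for English); the function names are illustrative, and the ratios are rough averages, not guarantees.

```python
# Heuristic ratios for English text; code and other languages differ.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75  # a token is roughly 3/4 of an English word

def estimate_tokens(text: str) -> int:
    """Rough token count from raw character length."""
    return round(len(text) / CHARS_PER_TOKEN)

def words_that_fit(context_tokens: int) -> int:
    """Approximate English words that fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)

print(words_that_fit(128_000))  # 96000 -- the ~96K words quoted above
```

For billing or hard limits, always count with the model's actual tokenizer; these ratios are only for quick back-of-envelope sizing.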

For prompt engineering, token awareness helps you optimize costs: shorter prompts with the same meaning save money at scale. For RAG systems, chunk sizes should account for token limits, not just character or word counts. And for evaluation, token-level analysis helps you understand why a model produced unexpected output; sometimes the cause is the tokenizer splitting a word in an unexpected way.
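A token-budgeted chunker for RAG might look like the sketch below. It uses the ~4 characters/token heuristic as a stand-in for a real tokenizer, and `chunk_by_token_budget` and `max_tokens` are assumed names for illustration; production code should count tokens with the target model's own tokenizer.

```python
CHARS_PER_TOKEN = 4  # heuristic; replace with a real tokenizer's count

def chunk_by_token_budget(text, max_tokens=512):
    """Greedily pack whole words into chunks under an estimated token budget."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > max_chars and current:
            # Adding this word would blow the budget: start a new chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Sizing chunks by estimated tokens rather than raw characters keeps them aligned with what actually fills the context window, which matters most for content (like code) where the characters-per-token ratio drifts from the English average.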
