Perplexity

A metric of how well a language model predicts a sequence of text; lower perplexity means the model assigns higher probability to the actual text and is therefore a better model of the language.

Perplexity quantifies a language model's surprise when encountering text. If a model perfectly predicted every next word, its perplexity would be 1. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely options at each step. Modern LLMs achieve single-digit perplexity on common English text benchmarks.
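The intuition above can be sketched directly: perplexity is the exponential of the average negative log-probability the model assigned to each actual token. The probability lists below are invented for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the actual tokens."""
    # Average negative log-probability per token, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Perfect prediction: every actual token had probability 1.0 -> perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))

# Uniform uncertainty among 20 options at every step -> perplexity 20.
print(perplexity([0.05] * 10))
```

This makes the "choosing among N equally likely options" reading concrete: assigning probability 1/N to every actual token yields perplexity exactly N.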

Mathematically, perplexity is the exponential of the cross-entropy loss. It is computed by running the model over a test dataset and measuring how much probability mass it assigns to the actual next tokens. Models with lower perplexity are better at modeling the statistical patterns of language, which generally correlates with better performance on downstream tasks.
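A minimal sketch of that computation, assuming a toy "model" that emits one logit vector per position; the vocabulary size, logits, and target ids below are invented for illustration:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one logit vector.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def perplexity(logits_per_step, target_ids):
    # Cross-entropy: average negative log-probability of each actual next token.
    nll = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        nll -= log_softmax(logits)[target]
    cross_entropy = nll / len(target_ids)
    # Perplexity is the exponential of the cross-entropy loss.
    return math.exp(cross_entropy)

# Three steps over a 4-token vocabulary; targets are the actual next tokens.
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.2, 0.1], [1.0, 1.0, 1.0, 1.0]]
targets = [0, 1, 2]
print(perplexity(logits, targets))
```

In practice the logits come from running the model over a held-out test set, but the arithmetic is the same.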

For practical AI engineering, perplexity is most useful for comparing models on the same test set, evaluating the impact of training changes, and detecting domain shift (perplexity spikes when the model encounters text very different from its training distribution). It is less useful as an absolute quality measure for user-facing applications, since low perplexity does not guarantee helpfulness, safety, or factual accuracy. Production evaluation should complement perplexity with task-specific metrics.

Related Terms