Perplexity
A metric that measures how well a language model predicts a sequence of text; lower perplexity means the model assigns higher probability to the actual text and is therefore a better model of the language.
Perplexity quantifies a language model's surprise when encountering text. If a model perfectly predicted every next word, its perplexity would be 1. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely options at each step. Modern LLMs achieve single-digit perplexity on common English text benchmarks.
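The intuition above can be checked directly: a model that assigns probability 1/20 to every actual token has perplexity exactly 20, and a perfect predictor has perplexity 1. A minimal sketch (the helper name and token counts are illustrative):

```python
import math

def perplexity(probs):
    """Perplexity = exp of the average negative log-probability
    the model assigned to the actual tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# A model uniformly uncertain among 20 options assigns
# probability 1/20 to each actual token.
uniform = [1 / 20] * 100
print(perplexity(uniform))  # ≈ 20.0

# A model that predicts every token perfectly has perplexity 1.
print(perplexity([1.0] * 100))  # 1.0
```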
Mathematically, perplexity is the exponential of the cross-entropy loss: the model is run over a test dataset, the negative log-probabilities it assigns to the actual next tokens are averaged, and the result is exponentiated. Models with lower perplexity capture the statistical patterns of language better, which generally correlates with better performance on downstream tasks.
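The computation can be sketched end to end with a toy model in place of an LLM. This example scores a held-out sentence with an add-one-smoothed bigram model; the corpus, sentence, and smoothing choice are illustrative assumptions, not from the text, but the perplexity calculation itself is the standard one:

```python
import math
from collections import Counter

# Toy "training" corpus and held-out test sentence (illustrative).
train = "the cat sat on the mat the cat ate".split()
test = "the cat sat".split()

# Bigram counts with Laplace (add-one) smoothing over the vocabulary.
vocab = set(train)
bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train[:-1])

def prob(prev, word):
    # P(word | prev) with add-one smoothing
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

# Cross-entropy: average negative log-probability of the actual
# next tokens; perplexity is its exponential.
nll = [-math.log(prob(p, w)) for p, w in zip(test, test[1:])]
ppl = math.exp(sum(nll) / len(nll))
print(f"perplexity: {ppl:.2f}")  # → perplexity: 3.46
```

The same recipe applies to a real LLM: replace `prob` with the model's per-token probabilities and average the negative log-likelihoods over the whole test set.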
For practical AI engineering, perplexity is most useful for comparing models on the same test set (note that per-token perplexities are only directly comparable between models that share a tokenizer), evaluating the impact of training changes, and detecting domain shift (perplexity spikes when the model encounters text very different from its training distribution). It is less useful as an absolute quality measure for user-facing applications, since low perplexity does not guarantee helpfulness, safety, or factual accuracy. Production evaluation should complement perplexity with task-specific metrics.
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.