Top-p Sampling (Nucleus Sampling)
A decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p, dynamically adjusting the candidate pool based on the model's confidence.
Top-p sampling provides more nuanced control over randomness than temperature alone. Instead of considering all tokens or a fixed number of top tokens, it dynamically selects the smallest set of tokens whose probabilities sum to at least p. When the model is confident (say, one token holds 95% of the probability), a top-p of 0.9 keeps only that single token. When the model is uncertain (many tokens with similar probabilities), it keeps far more options.
This adaptive behavior is the key advantage over top-k sampling, which always considers the same number of candidates regardless of the probability distribution. Top-p naturally narrows the pool when the model is confident and widens it when many options are viable, producing more contextually appropriate randomness.
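The adaptive nucleus described above can be sketched in a few lines. This is an illustrative implementation, not any provider's actual decoding code; the function names (`top_p_filter`, `sample_top_p`) are invented for this example:

```python
import random

def top_p_filter(probs, p):
    """Return the smallest set of token indices whose cumulative
    probability reaches at least p (the 'nucleus')."""
    # Rank token indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return nucleus

def sample_top_p(probs, p, rng=random):
    """Sample a token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    total = sum(probs[i] for i in nucleus)
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Confident model: one dominant token -> nucleus of size 1.
confident = [0.95, 0.02, 0.02, 0.01]
print(len(top_p_filter(confident, 0.9)))   # -> 1

# Uncertain model: flat distribution -> the nucleus widens to all 4 tokens.
uncertain = [0.25, 0.25, 0.25, 0.25]
print(len(top_p_filter(uncertain, 0.9)))   # -> 4
```

Note how the pool size varies with the distribution, whereas a top-k filter would return exactly k candidates in both cases.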
In practice, top-p is often used alongside or instead of temperature. A common production configuration is temperature 0.7 with top-p 0.9, which provides moderate creativity while filtering out very unlikely tokens. For structured output tasks like JSON generation, top-p 0.1-0.3 helps ensure valid syntax. For open-ended generation, top-p 0.9-0.95 balances variety with coherence. Most API providers support both parameters, and experimentation is the best way to find optimal settings for your specific task.
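As a concrete illustration of the production configuration above, here is a hypothetical request body in the style of a chat-completion API; the model name is a placeholder and the exact field names vary by provider:

```python
# Sketch of a request payload combining temperature and top-p,
# following common provider conventions (field names may differ).
request = {
    "model": "example-model",  # placeholder, not a real model name
    "messages": [{"role": "user", "content": "Write a haiku about rain."}],
    "temperature": 0.7,  # moderate creativity
    "top_p": 0.9,        # drop the low-probability tail
}
print(request["temperature"], request["top_p"])
```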
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.