Context Window
The maximum number of tokens an LLM can process in a single inference call, encompassing both the input prompt and the generated output, typically ranging from 4K to 1M tokens.
The context window is one of the most important practical constraints when building with LLMs. It determines how much information you can feed the model at once: a 128K-token window holds roughly 96K words, or about 300 pages of text. Everything the model needs for a given request must fit within this window, including system prompts, few-shot examples, retrieved context, user input, and the generated response.
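Because all of those components share one window, a common pattern is to check the combined token budget before sending a request. The sketch below uses a rough chars-per-token heuristic (an assumption for illustration; production code should use the model provider's actual tokenizer) and reserves space for the output:

```python
# Sketch: checking whether a request fits a context window.
# The 4-characters-per-token ratio is a rough heuristic for English text,
# not a real tokenizer -- use the provider's tokenizer in practice.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_window(parts: dict[str, str], window: int, reserve_for_output: int) -> bool:
    """True if all prompt parts plus a reserved output budget fit in the window."""
    input_tokens = sum(estimate_tokens(p) for p in parts.values())
    return input_tokens + reserve_for_output <= window

# Hypothetical request composed of the pieces named above.
request = {
    "system": "You are a helpful assistant.",
    "retrieved_context": "relevant document text " * 500,
    "user_input": "Summarize the attached report.",
}
print(fits_in_window(request, window=128_000, reserve_for_output=4_000))
```

Reserving output tokens up front matters: the generated response counts against the same window as the input.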
Context window sizes have grown dramatically: from 4K tokens in early GPT-3.5, to 128K in GPT-4 Turbo, to 200K in Claude, and now 1M+ in some Gemini models. However, larger windows come with higher costs (pricing is per token) and potential quality degradation on tasks requiring precise attention to specific details buried in long contexts (the "lost in the middle" phenomenon).
For product architects, context window management is a core design consideration. RAG systems must balance retrieval quantity against window space. Conversational applications need conversation history management strategies (summarization, selective inclusion). Multi-step agent workflows must budget tokens across planning, tool calls, and responses. The practical pattern is to use the smallest effective context: include only what the model needs, not everything available.
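The "selective inclusion" strategy for conversation history can be sketched as follows: always keep the system prompt, then add the most recent turns that fit a token budget. Token counts again use a chars/4 heuristic, which is an assumption, not a real tokenizer:

```python
# Sketch of selective-inclusion history management: keep the system prompt
# plus the newest conversation turns that fit within a token budget.

def rough_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)

def trim_history(system: str, turns: list[str], budget: int) -> list[str]:
    """Return the system prompt plus the newest turns fitting in `budget` tokens."""
    remaining = budget - rough_tokens(system)
    kept: list[str] = []
    for turn in reversed(turns):  # walk newest-first
        cost = rough_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [system] + list(reversed(kept))  # restore chronological order
```

A summarization strategy would instead replace the dropped older turns with a model-generated summary; the budgeting logic stays the same.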
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
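The similarity search that embeddings enable typically reduces to cosine similarity between vectors. A minimal sketch, using toy 3-dimensional vectors (real embeddings come from a model and have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in similar directions (similar "meaning") score near 1.0.
print(cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))
```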
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings, enabling fast approximate nearest-neighbor similarity search at scale.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.