Context Window

The maximum number of tokens an LLM can process in a single inference call, encompassing both the input prompt and the generated output, typically ranging from 4K to 1M tokens.

The context window is one of the most important practical constraints when building with LLMs. It determines how much information you can feed the model at once: a 128K token window holds roughly 96K words or about 300 pages of text. Everything the model needs to know for a given request must fit within this window, including system prompts, few-shot examples, retrieved context, user input, and the generated response.
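The budget arithmetic above can be sketched in a few lines. This is a rough illustration, not a production tokenizer: the helper names are hypothetical, and the 4-characters-per-token heuristic is only an approximation for English text (a real tokenizer library gives exact counts).

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real tokenizer would give exact, model-specific counts.
    return max(1, len(text) // 4)

def fits_in_window(prompt_parts: list[str], max_output_tokens: int,
                   window: int = 128_000) -> bool:
    """Check that all prompt parts plus the reserved output budget fit.

    prompt_parts covers everything sent in: system prompt, few-shot
    examples, retrieved context, and user input. The generated response
    also consumes window space, so its budget must be reserved up front.
    """
    used = sum(estimate_tokens(part) for part in prompt_parts)
    return used + max_output_tokens <= window
```

The key point the sketch captures: input and output share one budget, so reserving too little for the response can truncate generation even when the prompt itself fits.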

Context window sizes have grown dramatically: from 4K tokens in early GPT-3.5, to 128K in GPT-4 Turbo, to 200K in Claude, and now 1M+ in some Gemini models. However, larger windows come with higher costs (pricing is per token) and potential quality degradation on tasks requiring precise attention to specific details buried in long contexts (the "lost in the middle" phenomenon).

For product architects, context window management is a core design consideration. RAG systems must balance retrieval quantity against window space. Conversational applications need conversation history management strategies (summarization, selective inclusion). Multi-step agent workflows must budget tokens across planning, tool calls, and responses. The practical pattern is to use the smallest effective context: include only what the model needs, not everything available.
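One common history-management strategy mentioned above, selective inclusion, can be sketched as keeping the most recent messages that fit a token budget. This is a minimal illustration with hypothetical names, again using a rough 4-characters-per-token estimate in place of a real tokenizer.

```python
def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined estimated token
    count fits within budget, preserving chronological order.

    Each message is {"role": ..., "content": ...}. Older messages are
    dropped first; a fuller implementation might summarize them instead.
    """
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        cost = max(1, len(msg["content"]) // 4)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore oldest-first order
```

A summarization variant would compress the dropped prefix into a single short message rather than discarding it, trading some fidelity for continuity.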
