Context Window

The maximum number of tokens an LLM can process in a single inference call, encompassing both the input prompt and the generated output, typically ranging from 4K to 1M tokens.

The context window is one of the most important practical constraints when building with LLMs. It determines how much information you can feed the model at once: a 128K token window holds roughly 96K words or about 300 pages of text. Everything the model needs to know for a given request must fit within this window, including system prompts, few-shot examples, retrieved context, user input, and the generated response.
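The budget arithmetic above can be sketched in a few lines. This is a rough illustration, not a production tokenizer: the helper names are hypothetical, and the 4-characters-per-token heuristic is only an approximation for English text (a real tokenizer library gives exact counts).

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real tokenizer would give exact, model-specific counts.
    return max(1, len(text) // 4)

def fits_in_window(prompt_parts: list[str], max_output_tokens: int,
                   window: int = 128_000) -> bool:
    """Check that all prompt parts plus the reserved output budget fit.

    prompt_parts covers everything sent in: system prompt, few-shot
    examples, retrieved context, and user input. The generated response
    also consumes window space, so its budget must be reserved up front.
    """
    used = sum(estimate_tokens(part) for part in prompt_parts)
    return used + max_output_tokens <= window
```

The key point the sketch captures: input and output share one budget, so reserving too little for the response can truncate generation even when the prompt itself fits.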

Context window sizes have grown dramatically: from 4K tokens in early GPT-3.5, to 128K in GPT-4 Turbo, to 200K in Claude, and now 1M+ in some Gemini models. However, larger windows come with higher costs (pricing is per token) and potential quality degradation on tasks requiring precise attention to specific details buried in long contexts (the "lost in the middle" phenomenon).

For product architects, context window management is a core design consideration. RAG systems must balance retrieval quantity against window space. Conversational applications need conversation history management strategies (summarization, selective inclusion). Multi-step agent workflows must budget tokens across planning, tool calls, and responses. The practical pattern is to use the smallest effective context: include only what the model needs, not everything available.
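One common history-management strategy mentioned above, selective inclusion, can be sketched as keeping the most recent messages that fit a token budget. This is a minimal illustration with hypothetical names, again using a rough 4-characters-per-token estimate in place of a real tokenizer.

```python
def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined estimated token
    count fits within budget, preserving chronological order.

    Each message is {"role": ..., "content": ...}. Older messages are
    dropped first; a fuller implementation might summarize them instead.
    """
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        cost = max(1, len(msg["content"]) // 4)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore oldest-first order
```

A summarization variant would compress the dropped prefix into a single short message rather than discarding it, trading some fidelity for continuity.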
