
Inference

The process of running a trained AI model on new inputs to generate predictions or outputs, as opposed to training, where the model learns from data. Inference is what happens every time a user interacts with an AI feature.

Inference is the production phase of AI: the model receives an input (a user query, an image, a data point), processes it through its learned weights, and produces an output (a response, a classification, a recommendation). While training happens once or periodically, inference happens millions of times per day in production systems.
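At its core, inference is a forward pass through fixed, learned weights. The toy sketch below illustrates this with a hand-written linear classifier; the weight values are hypothetical stand-ins for parameters a real training process would have produced.

```python
def predict(features, weights, bias):
    """Run inference: combine new inputs with learned weights to produce an output."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 if score > 0 else 0  # binary classification output

# These values were (hypothetically) learned during training;
# at inference time they are constants, applied to each new input.
learned_weights = [0.8, -0.3, 0.5]
learned_bias = -0.2

print(predict([1.0, 2.0, 0.5], learned_weights, learned_bias))  # → 1
```

The same shape holds at LLM scale: the weights are frozen, and each user request is just another input run through them.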

The economics of inference dominate AI product costs. Training a model is a one-time (or periodic) expense, but inference costs scale linearly with usage. For LLMs, inference costs include compute for processing input tokens, generating output tokens, and the memory required to hold model weights. Optimizing inference through caching, batching, quantization, model routing, and smaller models is critical for sustainable unit economics.
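The linear scaling of inference cost is easy to model. The sketch below estimates per-request and daily cost from token counts; the prices are assumed placeholders, since real per-token rates vary by provider and model.

```python
# Hypothetical per-token prices (USD per 1,000 tokens); real rates vary by provider.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def inference_cost(input_tokens, output_tokens):
    """Estimate the cost of one LLM inference call from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Costs scale linearly with usage: the same request shape, a million times a day.
per_request = inference_cost(input_tokens=500, output_tokens=300)
daily = per_request * 1_000_000
print(f"per request: ${per_request:.6f}, daily at 1M requests: ${daily:.2f}")
# → per request: $0.000700, daily at 1M requests: $700.00
```

Even a small per-request saving (e.g. from caching or routing easy queries to a smaller model) multiplies across every call, which is why these optimizations dominate unit economics.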

For growth teams, inference is where AI meets the user. Inference latency directly impacts user experience (users expect sub-second responses), inference costs determine your margin per AI interaction, and inference reliability determines your uptime. The key production concerns are latency (how fast), throughput (how many concurrent requests), cost (price per prediction), and availability (what happens when the model or API is down).
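Two of these concerns, latency and availability, can be sketched as a budget check plus a fallback path. Everything below is illustrative: `model_call`, the one-second budget, and the fallback cache are assumptions standing in for a real serving stack.

```python
import time

LATENCY_BUDGET_S = 1.0  # users expect sub-second responses
# Hypothetical fallback cache for when the model or API is down.
FALLBACK_CACHE = {"What is inference?": "Running a trained model on new inputs."}

def model_call(prompt):
    # Stand-in for a real model/API call; in production this may be slow or fail.
    return f"Model answer to: {prompt}"

def answer(prompt):
    start = time.monotonic()
    try:
        result = model_call(prompt)
        if time.monotonic() - start > LATENCY_BUDGET_S:
            # Over the latency budget: log it, and consider routing this
            # class of request to a smaller, faster model.
            pass
        return result
    except Exception:
        # Availability: degrade gracefully instead of surfacing an error.
        return FALLBACK_CACHE.get(prompt, "Sorry, please try again shortly.")
```

The design choice here, serving a cached or canned response rather than an error, trades answer freshness for uptime, which is usually the right call for user-facing features.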
