Temperature
A parameter that controls the randomness of LLM outputs by scaling the probability distribution over possible next tokens, where lower values produce more deterministic responses and higher values increase creativity.
Temperature is the most commonly adjusted inference parameter. At temperature 0, the model always picks the highest-probability token, producing deterministic but potentially repetitive output. At temperature 1.0, tokens are sampled according to the model's learned probabilities, unmodified. Above 1.0, the distribution flattens, making unlikely tokens more probable and producing more creative but less reliable output.
Mathematically, temperature divides the logits (raw model outputs) before the softmax function. Lower temperature sharpens the probability distribution, concentrating mass on the top tokens. Higher temperature flattens it, spreading probability more evenly. This single parameter has an outsized impact on output quality, and the right setting varies by use case.
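The logits-divided-by-temperature mechanism can be sketched in a few lines of plain Python. The logit values below are made up for illustration; real models produce one logit per vocabulary token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature 0 means greedy decoding: take argmax instead")
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exp for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens
logits = [4.0, 2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.5)  # sharpened: mass concentrates on the top token
base = softmax_with_temperature(logits, 1.0)  # the model's learned distribution
hot  = softmax_with_temperature(logits, 2.0)  # flattened: mass spreads to unlikely tokens
```

Comparing the three distributions shows the effect directly: the top token's probability shrinks as temperature rises, while the tail tokens gain mass.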
The practical guideline for production systems: use temperature 0-0.3 for factual tasks like classification, extraction, and Q&A where consistency matters. Use 0.5-0.8 for balanced tasks like summarization and content generation where you want some variation but not hallucination. Use 0.8-1.2 for creative tasks like brainstorming and fiction where diversity is valued. Always test your specific use case, as the optimal temperature depends on the model, prompt, and quality criteria.
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.