Reinforcement Learning from Human Feedback (RLHF)
A training method that aligns LLM outputs with human preferences by using human ratings of model responses to train a reward model, which then guides the LLM via reinforcement learning.
RLHF is the technique that transformed raw language models into helpful, harmless assistants like ChatGPT and Claude. A base model trained on internet text can generate fluent prose but has no built-in notion of helpfulness, safety, or user intent. RLHF bridges this gap by incorporating human judgment into the training loop.
The process has three stages. First, human annotators rank multiple model outputs for the same prompt, expressing which response they prefer. Second, these rankings train a reward model that predicts human preference scores for any given response. Third, the LLM is fine-tuned using reinforcement learning (typically PPO) to maximize the reward model's score while staying close to its original behavior via a KL divergence penalty.
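The third stage's objective can be sketched in a few lines: the policy's reward is the reward model's score minus a KL penalty that discourages drifting from the original model. This is a minimal illustration, not a full PPO implementation; the function name `rlhf_reward` and the toy log-probabilities are invented for the example, and the KL term is approximated by the difference of token log-probabilities under the two models.

```python
def rlhf_reward(reward_model_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Shaped reward for one response: reward model score minus a
    KL penalty that keeps the policy close to the reference model."""
    # Approximate the KL divergence over the generated tokens as the
    # sum of log-prob differences (policy minus frozen reference).
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_model_score - beta * kl

# Toy numbers: the policy assigns slightly higher probability to its
# own tokens than the frozen base model does, so the KL term is positive.
policy_lp = [-0.5, -1.2, -0.8]   # log-probs under the policy being trained
ref_lp = [-0.7, -1.5, -0.9]      # log-probs under the frozen reference model
shaped = rlhf_reward(reward_model_score=2.0,
                     policy_logprobs=policy_lp,
                     ref_logprobs=ref_lp,
                     beta=0.1)
```

The coefficient `beta` controls the trade-off: a larger value keeps outputs closer to the base model, while a smaller one lets the policy chase the reward model's score more aggressively.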
RLHF is why modern chatbots refuse harmful requests, follow instructions accurately, and produce helpful responses. However, it introduces challenges: reward hacking (the model finds ways to score high without actually being helpful), mode collapse (losing diversity in outputs), and the cost of collecting high-quality human feedback. Alternatives such as Direct Preference Optimization (DPO) and Constitutional AI aim to achieve similar alignment with simpler or more scalable methods.
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
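The retrieve-then-inject pattern can be sketched in a few lines. This is a toy illustration: it ranks documents by keyword overlap as a stand-in for the embedding-based similarity search a real RAG pipeline would use, and the function names and sample documents are invented for the example.

```python
def retrieve(query, documents, top_k=2):
    """Rank documents by word overlap with the query (a stand-in for
    embedding similarity search) and return the top matches."""
    q_terms = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_terms & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query, documents):
    """Inject the retrieved passages into the prompt as grounding context."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The warranty covers parts and labor for two years.",
    "Our office is open Monday through Friday.",
    "Warranty claims require the original receipt.",
]
prompt = build_prompt("what does the warranty cover", docs)
```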
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
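The similarity search these vectors enable typically uses cosine similarity. A minimal sketch, with made-up 4-dimensional vectors standing in for real embeddings (which usually have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: semantically related items end up close together.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
car = [0.1, 0.9, 0.7, 0.0]

# "cat" scores higher against "kitten" than against "car".
```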
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings, typically using approximate nearest-neighbor indexes for low-latency similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.