LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small, low-rank matrices alongside frozen model weights, enabling task-specific adaptation with a fraction of the memory and compute of full fine-tuning.
LoRA revolutionized fine-tuning by making it accessible without massive GPU clusters. Instead of updating all model parameters during fine-tuning, LoRA freezes the original weights and injects small trainable matrices (adapters) into selected weight matrices, typically the attention projections. These adapters capture task-specific adjustments while the base model's general knowledge remains unchanged.
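The frozen-base-plus-adapter idea can be sketched in a few lines of NumPy. This is a minimal illustration, not a real training setup: the dimensions, scaling factor `alpha`, and zero-initialization of `B` follow common LoRA conventions, but all concrete values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 4096, 16    # hidden size and LoRA rank (illustrative values)
alpha = 32         # LoRA scaling hyperparameter (hypothetical choice)

W = rng.standard_normal((d, d)) * 0.01  # frozen base weight; never updated
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank adapter path: h = x W^T + (alpha/r) * x A^T B^T
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
h = lora_forward(x)

# With B zero-initialized, the adapter contributes nothing before training,
# so the output matches the frozen base layer exactly.
assert np.allclose(h, x @ W.T)
```

Zero-initializing `B` is the standard trick that makes the adapted model start out identical to the base model, so training only gradually introduces task-specific behavior.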
The math behind LoRA is elegant: it approximates weight updates as a product of two low-rank matrices. A layer with a 4096x4096 weight matrix would normally need about 16.8 million parameters to update. With LoRA rank 16, you only train two matrices (4096x16 and 16x4096) totaling about 131K parameters, a 128x reduction. This dramatically cuts memory requirements and training time.
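The parameter count is simple arithmetic, which a few lines of Python make concrete:

```python
d, r = 4096, 16                 # weight matrix dimension and LoRA rank
full_params = d * d             # parameters in a full-rank update
lora_params = r * d + d * r     # A (r x d) plus B (d x r)

print(full_params)              # 16777216  (~16.8 million)
print(lora_params)              # 131072    (~131K)
print(full_params // lora_params)  # 128
```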
For production teams, LoRA's biggest advantage is multi-tenant model serving. You can maintain one base model and swap in different LoRA adapters for different tasks, customers, or domains. This is far more efficient than hosting separate fine-tuned models. The adapter files are tiny (often under 100MB), so you can store dozens of specialized adapters and load them dynamically based on the request context. Combined with quantization, LoRA makes custom fine-tuning practical on a single consumer GPU.
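The multi-tenant pattern can be sketched as one shared frozen weight plus a per-tenant adapter lookup. This is a toy illustration of the routing idea only; the tenant names, dimensions, and scaling here are hypothetical, and real serving stacks handle batching, caching, and adapter loading far more carefully.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 64, 4, 8  # small illustrative sizes

W_base = rng.standard_normal((d, d))  # one frozen base weight, shared by all tenants

# Hypothetical per-tenant adapters: each is just an (A, B) pair,
# tiny compared with the base weight.
adapters = {
    "support-bot": (rng.standard_normal((r, d)), rng.standard_normal((d, r))),
    "legal-summarizer": (rng.standard_normal((r, d)), rng.standard_normal((d, r))),
}

def forward(x, tenant):
    # Pick the adapter for this request; only A and B differ per tenant.
    A, B = adapters[tenant]
    return x @ W_base.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
y1 = forward(x, "support-bot")
y2 = forward(x, "legal-summarizer")

assert y1.shape == y2.shape == (1, d)
assert not np.allclose(y1, y2)  # same base model, different behavior per adapter
```

The key point is that switching tenants touches only the small `(A, B)` pair, never the shared base weight, which is what makes per-request adapter swapping cheap.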
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings, typically using approximate nearest-neighbor search for fast similarity lookups.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.