Weight Initialization
The strategy for setting initial parameter values before training a neural network, which critically affects training dynamics, convergence speed, and whether the network can learn at all.
Weight initialization determines the starting point in the optimization landscape. Poor initialization can cause vanishing gradients (weights too small), exploding gradients (weights too large), or symmetry problems (identically initialized neurons receive identical gradients and never learn distinct features). Good initialization provides a starting point where gradients flow healthily through the network.
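The vanishing and exploding regimes are easy to see numerically. This minimal NumPy sketch (the depth, width, and scale values are illustrative choices, not from any reference implementation) pushes a signal through 50 purely linear layers at three weight scales and records the standard deviation of the final activations:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 256
x = rng.standard_normal((1, fan_in))

# Push a signal through 50 purely linear layers at three weight
# scales and record the std of the final activations. Each layer
# multiplies the activation std by roughly scale * sqrt(fan_in).
stds = {}
for scale in (0.01, 1 / np.sqrt(fan_in), 0.2):
    h = x
    for _ in range(50):
        W = rng.standard_normal((fan_in, fan_in)) * scale
        h = h @ W
    stds[scale] = h.std()
    print(f"scale={scale:.4f}: output std = {stds[scale]:.3e}")
```

The small scale drives the signal toward zero (vanishing), the large scale blows it up (exploding), and the 1/sqrt(fan_in) scale keeps it near its original magnitude, which is exactly what variance-preserving initialization schemes aim for.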
The most common strategies are Xavier/Glorot initialization, which draws weights with standard deviation sqrt(2 / (fan_in + fan_out)) and works well with sigmoid and tanh activations, and He/Kaiming initialization, which uses sqrt(2 / fan_in) and is designed for ReLU activations (the extra factor of 2 compensates for ReLU zeroing out half the activations). These methods ensure that the variance of activations and gradients remains roughly constant across layers, preventing the signal from either vanishing or exploding as it propagates through the network.
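The two schemes can be sketched directly from their formulas. This is a minimal NumPy version (the function names and layer sizes are illustrative, not a framework API):

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_normal(fan_in, fan_out):
    # Glorot & Bengio (2010): std = sqrt(2 / (fan_in + fan_out)),
    # balances forward activation and backward gradient variance
    # for tanh/sigmoid networks.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He et al. (2015): std = sqrt(2 / fan_in), compensates for
    # ReLU zeroing out roughly half the activations.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_normal(512, 256)
print(W.std())  # ≈ sqrt(2/512) ≈ 0.0625
```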
For practitioners, weight initialization is usually handled by framework defaults (PyTorch and TensorFlow use sensible defaults for standard layers). The cases where it matters most are custom architectures, very deep networks, and training instability debugging. If your model is not learning or training is unstable, checking that initialization matches your activation function is one of the first diagnostic steps. Pre-trained weights (via transfer learning) are the ultimate initialization strategy, providing a starting point that already encodes useful features.
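When overriding the defaults does matter, frameworks expose the standard schemes directly. A short sketch assuming PyTorch, with a hypothetical three-layer MLP chosen purely for illustration:

```python
import torch
from torch import nn

# Hypothetical ReLU MLP; re-initialize its linear layers with
# He/Kaiming init so the scheme matches the activation function,
# rather than relying on the framework default.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

for module in model.modules():
    if isinstance(module, nn.Linear):
        # fan_in mode preserves forward activation variance.
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)
```

The same loop is a convenient place to apply any custom scheme, and inspecting the per-layer weight standard deviations it produces is a quick diagnostic when training is unstable.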
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.