Overfitting
A failure mode in which a model memorizes its training data too closely, including noise and outliers, yielding excellent training performance but poor generalization to new, unseen data.
Overfitting is one of the most common failure modes in machine learning. The model becomes so attuned to the specific examples in its training set that it fails to extract the general patterns needed for real-world performance. A churn prediction model that achieves 99% accuracy on training data but only 60% on new customers is overfitting.
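A toy version of that gap can be produced with plain NumPy (the data, seed, and polynomial degrees below are illustrative choices, not from the source): fit both a simple and an overly flexible model to noisy linear data and compare training error against error on a held-out grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a genuinely linear relationship: y = 2x + noise.
x_train = np.linspace(0.0, 1.0, 15)
y_train = 2 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = 2 * x_test                      # noise-free ground truth

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)     # matches the true pattern
flexible = np.polyfit(x_train, y_train, deg=10)  # capacity to fit the noise

train_simple = mse(simple, x_train, y_train)
train_flexible = mse(flexible, x_train, y_train)
test_simple = mse(simple, x_test, y_test)
test_flexible = mse(flexible, x_test, y_test)
# The flexible model wins on the training set but loses on held-out data.
```

With ten-plus parameters against fifteen noisy points, the flexible fit chases the noise: its training error drops below the straight line's while its held-out error rises above it, which is the churn-model pattern in miniature.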
The telltale sign is a growing gap between training performance and validation performance as training progresses. The training loss keeps decreasing while the validation loss starts increasing, meaning the model is fitting noise rather than signal. This happens more easily with complex models (more parameters than needed), small datasets, and noisy labels.
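That divergence can be checked mechanically during training. A minimal sketch (the function name and window size are illustrative, not from any particular library):

```python
def diverging(train_losses, val_losses, window=3):
    """Return True if, over the last `window` epochs, training loss
    kept falling while validation loss kept rising -- the classic
    signature of overfitting."""
    t, v = train_losses[-window:], val_losses[-window:]
    return (all(a > b for a, b in zip(t, t[1:]))
            and all(a < b for a, b in zip(v, v[1:])))

# Training loss falls steadily, but validation loss turns upward:
diverging([1.0, 0.8, 0.6, 0.5], [0.9, 0.7, 0.72, 0.80])  # True
```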
Standard mitigation techniques include regularization (L1, L2, dropout), which penalizes model complexity; early stopping, which halts training when validation performance peaks; data augmentation, which increases the effective size of the dataset; cross-validation, which gives a more robust estimate of generalization; and ensemble methods, which average out the idiosyncratic errors of individual models. For practical applications, the simplest model that works is often the best starting point: resist the urge to reach for a massive neural network when logistic regression handles your task adequately.
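Of these techniques, early stopping is the easiest to sketch. A minimal patience-based check (the function name and default patience are illustrative assumptions, not from any particular framework):

```python
def should_stop(val_losses, patience=3):
    """Stop training once the best validation loss is `patience`
    or more epochs old, i.e. validation performance has peaked
    and further epochs are likely fitting noise."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience

# Validation loss bottomed out at epoch 2; three epochs later we halt.
should_stop([1.0, 0.8, 0.7, 0.72, 0.75, 0.80])  # True
```

In practice the model weights from the best epoch (not the last one) are kept, which is what framework callbacks such as Keras's EarlyStopping with `restore_best_weights` do.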
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings, enabling fast (often approximate) nearest-neighbor similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.