Data Drift
A change in the statistical properties of model input data over time compared to the training data distribution, potentially degrading model performance if left undetected and unaddressed.
Data drift occurs when the real-world data your model encounters diverges from the data it was trained on. A product recommendation model trained on pre-pandemic shopping behavior will perform poorly when consumer preferences shift. A fraud detection model trained on historical patterns will miss new fraud techniques. The model itself has not changed, but the world around it has.
There are several types of drift. In covariate drift, the input feature distributions change (for example, average order value increases). In concept drift, the relationship between features and the target changes (what constitutes a "good" recommendation evolves). In prior probability drift, the class distribution itself shifts (fraud becomes more or less common).
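A toy sketch can make the effect of covariate drift concrete: a fixed decision rule learned on one distribution loses accuracy when the feature distributions shift, even though the rule itself never changed. The data, classifier, and numbers below are entirely synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Training" data: class 0 centered at 0, class 1 centered at 4
x0 = rng.normal(0, 1, 5_000)
x1 = rng.normal(4, 1, 5_000)

# A simple learned rule: classify by the midpoint between class means (~2)
threshold = (x0.mean() + x1.mean()) / 2

def accuracy(neg, pos, thr):
    """Balanced accuracy of the threshold rule on one negative and one positive sample."""
    return ((neg < thr).mean() + (pos >= thr).mean()) / 2

# In production, both classes drift upward (covariate drift); the rule is unchanged
d0 = rng.normal(2, 1, 5_000)
d1 = rng.normal(6, 1, 5_000)

print(f"training-like accuracy: {accuracy(x0, x1, threshold):.3f}")  # high
print(f"post-drift accuracy:    {accuracy(d0, d1, threshold):.3f}")  # notably lower
```

The model "itself has not changed", as the text puts it: only the data crossing the old threshold has.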
Monitoring for drift involves statistical tests and distance measures (the Kolmogorov-Smirnov test, Population Stability Index, Jensen-Shannon divergence) that compare recent data distributions against the training baseline. When significant drift is detected, teams can retrain the model on recent data, adjust feature engineering, or trigger alerts for manual investigation. Tools like Evidently, Arize, and WhyLabs provide drift monitoring as part of their ML observability platforms.
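Of the measures above, PSI is simple enough to sketch by hand: bin a baseline sample by its quantiles, then compare the proportion of recent data landing in each bin. The `psi` helper, the synthetic data, and the commonly cited 0.25 alarm threshold are illustrative assumptions, not any particular tool's API.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index: how far a recent sample ('actual')
    has shifted from a baseline sample ('expected')."""
    # Bin edges from the baseline's quantiles, so each bin holds ~1/bins of the
    # training data; np.digitize only needs the interior edges.
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_pct = np.bincount(np.digitize(expected, inner_edges), minlength=bins) / len(expected)
    a_pct = np.bincount(np.digitize(actual, inner_edges), minlength=bins) / len(actual)
    # Clip away zeros so an empty bin doesn't produce log(0)
    e_pct, a_pct = np.clip(e_pct, eps, None), np.clip(a_pct, eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 10_000)  # e.g. order values at training time
stable = rng.normal(50, 10, 10_000)    # production data, same distribution
drifted = rng.normal(65, 10, 10_000)   # production data, mean shifted upward

print(f"stable PSI:  {psi(baseline, stable):.3f}")   # near 0: no drift
print(f"drifted PSI: {psi(baseline, drifted):.3f}")  # well above 0.25: investigate
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift; in practice a monitoring platform would run this comparison on a schedule and alert when the threshold is crossed.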
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite directions) to 1 (same direction), commonly used to compare embeddings.
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one-at-a-time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.