Data Normalization
The process of organizing data to reduce redundancy and improve integrity through a series of normal forms, or the statistical process of scaling numeric features to a standard range for machine learning.
Data normalization has two meanings depending on context. In database design, normalization reduces redundancy by decomposing tables into smaller, related tables. First Normal Form eliminates repeating groups. Second Normal Form removes partial dependencies. Third Normal Form removes transitive dependencies. The goal is a schema where each fact is stored once, preventing update anomalies.
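The decomposition described above can be sketched in plain Python. The table and column names here are illustrative, not from any real schema: a flat orders table repeats each customer's name on every row, and splitting it into a customers table and an orders table stores that fact once.

```python
# Denormalized rows: customer_name repeats for every order, so renaming a
# customer would require updating many rows (an update anomaly).
orders_flat = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Acme",   "total": 250},
    {"order_id": 2, "customer_id": 10, "customer_name": "Acme",   "total": 100},
    {"order_id": 3, "customer_id": 11, "customer_name": "Globex", "total": 75},
]

# Decompose into two relations so each fact is stored exactly once.
customers = {r["customer_id"]: r["customer_name"] for r in orders_flat}
orders = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"], "total": r["total"]}
    for r in orders_flat
]
```

After decomposition, the customer name lives only in `customers`, keyed by `customer_id`, and each order row references it by that key.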
In machine learning, normalization scales numeric features to comparable ranges. Min-max normalization scales values to [0, 1]. Z-score normalization (standardization) transforms features to have mean 0 and standard deviation 1. This prevents features with large numeric ranges (like salary in thousands) from dominating features with small ranges (like age in tens) in distance-based algorithms.
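Both scaling methods are short enough to sketch with the standard library alone. This is a minimal illustration, not a production implementation (libraries such as scikit-learn provide equivalent scalers that also handle edge cases like constant features):

```python
import statistics

def min_max(values):
    """Scale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

salaries = [40_000, 60_000, 80_000, 100_000]
scaled = min_max(salaries)       # [0.0, 0.33..., 0.66..., 1.0]
standardized = z_score(salaries) # mean 0, stdev 1
```

After either transform, a feature measured in thousands and one measured in tens contribute on comparable scales to distance computations.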
Both meanings matter for AI teams. Database normalization in operational systems keeps data clean and consistent at the source, while feature normalization in ML pipelines ensures models weight all features fairly. Choosing the right technique depends on the algorithm (gradient-based models such as neural networks benefit from scaled inputs, while tree-based models are largely scale-invariant) and on the data distribution (min-max is sensitive to outliers, since a single extreme value compresses every other value into a narrow band, whereas z-score is less distorted by them).
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one-at-a-time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.