Data Deduplication
The process of identifying and removing duplicate records from a dataset using exact matching, fuzzy matching, or probabilistic techniques to ensure each real-world entity is represented exactly once.
Duplicate data degrades analytics and ML models. The same customer appearing three times inflates customer counts, skews segmentation analysis, and creates inconsistent model features. Deduplication identifies these duplicates and consolidates them into single, canonical records.
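A minimal sketch of the exact-match case: collapse records that share a unique key. The `customers` data and the `email` field used as the key are hypothetical, and normalization (trimming, lowercasing) is assumed to happen before comparison.

```python
def dedupe_exact(records, key="email"):
    """Keep the first record seen for each normalized key value."""
    seen = set()
    unique = []
    for rec in records:
        k = rec[key].strip().lower()  # normalize before comparing
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

customers = [
    {"name": "John Smith", "email": "john@example.com"},
    {"name": "J. Smith", "email": "John@Example.com "},  # same person, different casing
    {"name": "Jane Doe", "email": "jane@example.com"},
]
print(len(dedupe_exact(customers)))  # 2 canonical records
```

Real pipelines typically also merge the surviving records' fields rather than keeping only the first occurrence, but the key-based grouping is the core idea.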
Exact deduplication matches on unique identifiers like email addresses or user IDs. Fuzzy deduplication handles variations: "John Smith" vs. "Jon Smith," "123 Main St" vs. "123 Main Street." Techniques include string similarity metrics (Levenshtein distance, Jaro-Winkler), phonetic matching (Soundex), and ML-based entity resolution that learns to identify duplicates from labeled examples.
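To make the string-similarity idea concrete, here is a sketch of Levenshtein distance (the standard dynamic-programming formulation) and a normalized similarity score derived from it. The `similarity` helper and its 0-to-1 scale are illustrative conventions, not a fixed standard.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0 (different) to 1 (identical) score."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("John Smith", "Jon Smith"))          # 1 edit apart
print(similarity("123 Main St", "123 Main Street"))    # high, but not 1.0
```

In practice a pipeline would declare two records a match when the score clears a tuned threshold, often combining several field-level scores.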
For AI teams, deduplication is a critical data quality step. Duplicate training samples bias the model toward over-represented records. Duplicate users in a recommendation system produce inconsistent behavior signals. Deduplication during data pipeline ingestion prevents these issues from propagating. At scale, efficient deduplication requires blocking strategies (comparing records only within likely-match groups) to avoid the quadratic cost of comparing every record pair.
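The blocking idea can be sketched as follows: group records by a cheap key so that pairwise comparison only happens within each block. The blocking key here (first letter of surname plus zip code) and the record fields are hypothetical choices for illustration.

```python
from collections import defaultdict
from itertools import combinations

def block_key(rec):
    # Hypothetical blocking key: surname initial + zip code.
    return (rec["surname"][:1].upper(), rec["zip"])

def candidate_pairs(records):
    """Yield record pairs that share a block; only these get compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

records = [
    {"surname": "Smith", "zip": "10001"},
    {"surname": "Smyth", "zip": "10001"},   # likely match, lands in the same block
    {"surname": "Doe",   "zip": "94105"},
]
print(sum(1 for _ in candidate_pairs(records)))  # 1 candidate pair instead of 3
```

The trade-off is recall: a true duplicate whose blocking key differs (a misspelled surname initial, a moved customer) is never compared, so production systems often run several blocking passes with different keys.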
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
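The definition above translates directly into code: the dot product of the two vectors divided by the product of their magnitudes. A minimal sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: -1 (opposite) to 1 (identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors -> 0.0
print(cosine_similarity([1, 2], [2, 4]))   # same direction -> 1.0
```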
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.