Schema Evolution
The process of modifying a data schema over time to accommodate changing requirements while maintaining backward and forward compatibility with existing data and consumers.
Schemas evolve as business requirements change: new fields are added, fields are renamed, types change, and deprecated fields are removed. Schema evolution manages these changes without breaking existing data pipelines, applications, or stored data. The challenge is coordinating changes across producers, storage, and consumers that may update at different times.
Schema registries (like Confluent Schema Registry) enforce compatibility rules when a new schema version is registered. Backward-compatible changes (deleting fields, or adding fields with defaults) ensure that consumers using the new schema can still read data written with the old one. Forward-compatible changes (adding fields, or deleting fields that have defaults) ensure that consumers still on the old schema can read data written with the new one. Full compatibility requires both, which limits changes to adding or removing optional fields.
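The backward-compatibility rule above can be sketched in a few lines. This is a deliberately simplified model (schemas as plain dicts, field and version names invented for illustration); real registries apply the full Avro schema-resolution rules, including type promotion and aliases.

```python
# Minimal sketch of a backward-compatibility check between two schema
# versions, modeled as {field_name: {"type": ..., "default": ...}} dicts.
# Illustrative only; a real registry (e.g. Confluent) does far more.

def is_backward_compatible(old_schema, new_schema):
    """Can a consumer using new_schema read data written with old_schema?"""
    for name, spec in new_schema.items():
        if name not in old_schema and "default" not in spec:
            # A new required field without a default breaks reads of old
            # records: there is no value to fill in.
            return False
    # Fields deleted in new_schema are fine: old data's extra fields
    # are simply ignored by the new reader.
    return True

v1 = {"user_id": {"type": "long"}}
v2_ok = {"user_id": {"type": "long"},
         "email": {"type": "string", "default": ""}}   # added with default
v2_bad = {"user_id": {"type": "long"},
          "email": {"type": "string"}}                 # added, no default

print(is_backward_compatible(v1, v2_ok))   # True
print(is_backward_compatible(v1, v2_bad))  # False
```

The asymmetry is worth noting: the same check with the arguments swapped approximates forward compatibility, which is why full compatibility ends up restricted to optional-field changes.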
For AI data pipelines, schema evolution is critical because training data accumulates over time. When a feature schema changes, historical data must remain usable or be migrated. Formats like Parquet, Avro, and Delta Lake support schema evolution natively: columns can be added or removed (and, via Avro aliases or Delta Lake column mapping, renamed) while historical files remain readable. This lets model training use the full historical dataset even as the schema evolves.
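The way these formats serve old files under a newer schema can be illustrated with a small sketch: a record written before a column existed is projected onto the current schema, with the missing column filled with nulls. The schema and field names here are hypothetical, and real columnar readers resolve this at the file/column level rather than per record.

```python
# Sketch of schema-on-read resolution: records written under older
# schema versions are projected onto the current schema, with absent
# columns filled with None. This mirrors what Parquet or Delta Lake
# readers do when a column was added after older files were written.
# Schema and field names are illustrative.

CURRENT_SCHEMA = ["user_id", "email", "country"]  # "country" added in v2

def project(record, schema=CURRENT_SCHEMA):
    """Project a record from any older schema version onto the current
    schema, filling columns the record predates with None."""
    return {col: record.get(col) for col in schema}

old_row = {"user_id": 1, "email": "a@example.com"}   # written before v2
new_row = {"user_id": 2, "email": "b@example.com", "country": "DE"}

print(project(old_row))
# {'user_id': 1, 'email': 'a@example.com', 'country': None}
print(project(new_row))
# {'user_id': 2, 'email': 'b@example.com', 'country': 'DE'}
```

Because nulls are filled at read time, no rewrite of historical files is needed when a column is added, which is what keeps the full training history usable.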
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
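The definition above corresponds to the dot product of the two vectors divided by the product of their magnitudes. A minimal sketch, with illustrative 2-D vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot product divided
    by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
```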
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.