ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
ETL is the traditional approach to data integration. The extract phase pulls data from operational databases, APIs, files, and other sources. The transform phase cleans, validates, deduplicates, and reshapes the data into a schema optimized for analytics. The load phase writes the transformed data into the destination system, typically a data warehouse.
Transformations happen before loading, meaning the data warehouse receives clean, structured data ready for querying. This approach works well when transformation logic is well-understood, compute resources at the transformation layer are cheaper than at the warehouse, and analysts need consistently structured data.
ETL tools like Informatica, Talend, and Apache NiFi have been the backbone of enterprise data integration for decades. For AI teams, ETL pipelines prepare training datasets by extracting raw data, applying feature engineering transformations, handling missing values, encoding categorical variables, and loading the results into feature stores or training data repositories. The key challenge is maintaining transformation logic as data sources and model requirements evolve.
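The three phases described above can be sketched end to end in a few lines. This is a minimal illustration using only the Python standard library; the inline sample records and the SQLite destination are stand-ins for a real source system and data warehouse.

```python
import sqlite3

# --- Extract: raw records as pulled from a hypothetical source system ---
raw_rows = [
    {"user_id": "1", "country": "us", "age": "34"},
    {"user_id": "2", "country": "US", "age": ""},    # missing age
    {"user_id": "1", "country": "us", "age": "34"},  # duplicate record
]

# --- Transform: deduplicate, normalize, and handle missing values ---
seen, clean_rows = set(), []
for row in raw_rows:
    if row["user_id"] in seen:
        continue                 # drop duplicates before loading
    seen.add(row["user_id"])
    clean_rows.append({
        "user_id": int(row["user_id"]),
        "country": row["country"].upper(),               # normalize casing
        "age": int(row["age"]) if row["age"] else None,  # NULL for missing
    })

# --- Load: write the cleaned rows into the target (here, in-memory SQLite) ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, country TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (:user_id, :country, :age)", clean_rows)
```

Because the transform step runs before the load, the warehouse only ever sees deduplicated, consistently typed rows, which is the defining property of ETL as opposed to ELT.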
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
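The measure above is the dot product of the two vectors divided by the product of their magnitudes. A minimal stdlib-only sketch:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 0], [0, 1])    # 0.0: orthogonal vectors
cosine_similarity([1, 2], [2, 4])    # ~1.0: same direction
cosine_similarity([1, 0], [-1, 0])   # -1.0: opposite directions
```

Note that the result depends only on direction, not magnitude, which is why it suits embedding comparison: two documents with similar content but different lengths can still score near 1.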
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ELT (Extract, Load, Transform)
A modern data integration pattern that loads raw data directly into a target system first and then transforms it in place, leveraging the processing power of cloud data warehouses.