Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
Data pipelines orchestrate the movement and transformation of data from where it is generated to where it is needed. A pipeline might ingest raw clickstream events from a web application, clean and validate the data, join it with user profile information, compute aggregate metrics, and load the results into a data warehouse for analysis.
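The steps above can be sketched as a chain of small, single-purpose functions. This is a minimal illustration, not any specific framework's API; the function names, event fields, and data shapes are all assumptions made for the example.

```python
# Minimal sketch of the clickstream pipeline described above.
# Step names, event fields, and data shapes are illustrative.

def ingest():
    # Raw clickstream events as they might arrive from a web app.
    return [
        {"user_id": 1, "page": "/home", "ms_on_page": 1200},
        {"user_id": 1, "page": "/pricing", "ms_on_page": None},  # bad record
        {"user_id": 2, "page": "/home", "ms_on_page": 800},
    ]

def clean(events):
    # Drop records that fail validation (here: missing dwell time).
    return [e for e in events if e["ms_on_page"] is not None]

def enrich(events, profiles):
    # Join events with user profile information.
    return [{**e, "plan": profiles[e["user_id"]]} for e in events]

def aggregate(events):
    # Compute per-page aggregate metrics ready to load into a warehouse.
    metrics = {}
    for e in events:
        m = metrics.setdefault(e["page"], {"views": 0, "total_ms": 0})
        m["views"] += 1
        m["total_ms"] += e["ms_on_page"]
    return metrics

profiles = {1: "pro", 2: "free"}
result = aggregate(enrich(clean(ingest()), profiles))
# result: {"/home": {"views": 2, "total_ms": 2000}}
```

Each stage consumes the previous stage's output, which is what lets an orchestrator schedule them as dependent tasks and retry any one stage in isolation.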
Pipelines are built with orchestration tools such as Apache Airflow, Dagster, and Prefect, or with cloud-native services such as AWS Step Functions. These tools manage scheduling, dependency resolution, retries, monitoring, and alerting. Each pipeline step is typically idempotent, so failed runs can be safely retried without producing duplicate data.
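One common way to make a load step idempotent is to overwrite a whole partition rather than append to it, so a retry replaces the same data instead of duplicating it. Here is a rough sketch under stated assumptions: the in-memory dict stands in for a date-partitioned warehouse table, and all names are hypothetical.

```python
# Sketch of an idempotent load step: loading a day's metrics
# replaces that day's partition, so re-running after a failure
# never duplicates rows. The dict stands in for a real table
# partitioned by date; names are illustrative.

warehouse = {}  # partition_date -> list of rows

def load_partition(partition_date, rows):
    # Replace the whole partition instead of appending to it.
    warehouse[partition_date] = list(rows)

rows = [{"page": "/home", "views": 2}]
load_partition("2024-01-01", rows)
load_partition("2024-01-01", rows)  # retry after a simulated failure
# The partition still holds exactly one copy of the data.
```

An append-based load (`warehouse[date] += rows`) would double the rows on retry; the overwrite-by-partition pattern is what makes blind retries safe.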
For AI teams, data pipelines feed the features and training data that models depend on. A well-designed pipeline ensures that feature computation is consistent between training and serving, that data quality issues are caught before they corrupt model inputs, and that new data flows are easy to add as models evolve. Pipeline reliability directly determines model reliability, making it foundational infrastructure for any AI-powered product.
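The training/serving consistency point can be illustrated by routing both paths through a single feature function, so there is exactly one definition of each feature. This is a hypothetical sketch, not a specific feature-store API; the function and field names are assumptions.

```python
# Sketch of consistent feature computation: the training pipeline
# and the serving path call the same function, so features cannot
# silently drift apart. Names and fields are illustrative.

def compute_features(user_events):
    # The single shared definition of the features a model consumes.
    views = len(user_events)
    total_ms = sum(e["ms_on_page"] for e in user_events)
    return {"views": views, "avg_ms": total_ms / views if views else 0.0}

# Training path: features computed over historical events in batch.
training_row = compute_features([{"ms_on_page": 1200}, {"ms_on_page": 800}])

# Serving path: the same function applied to live events at request time.
serving_row = compute_features([{"ms_on_page": 1000}])
```

If training and serving each reimplemented the feature logic, a one-line divergence (say, dropping the empty-list guard in one copy) would corrupt model inputs without any pipeline failure to flag it.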
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
ELT (Extract, Load, Transform)
A modern data integration pattern that loads raw data directly into a target system first and then transforms it in place, leveraging the processing power of cloud data warehouses.