Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
Data pipelines orchestrate the movement and transformation of data from where it is generated to where it is needed. A pipeline might ingest raw clickstream events from a web application, clean and validate the data, join it with user profile information, compute aggregate metrics, and load the results into a data warehouse for analysis.
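The steps above can be sketched as a chain of small, single-purpose functions. This is a minimal illustration, not any specific framework's API; the function names, event fields, and data shapes are all assumptions made for the example.

```python
# Minimal sketch of the clickstream pipeline described above.
# Step names, event fields, and data shapes are illustrative.

def ingest():
    # Raw clickstream events as they might arrive from a web app.
    return [
        {"user_id": 1, "page": "/home", "ms_on_page": 1200},
        {"user_id": 1, "page": "/pricing", "ms_on_page": None},  # bad record
        {"user_id": 2, "page": "/home", "ms_on_page": 800},
    ]

def clean(events):
    # Drop records that fail validation (here: missing dwell time).
    return [e for e in events if e["ms_on_page"] is not None]

def enrich(events, profiles):
    # Join events with user profile information.
    return [{**e, "plan": profiles[e["user_id"]]} for e in events]

def aggregate(events):
    # Compute per-page aggregate metrics ready to load into a warehouse.
    metrics = {}
    for e in events:
        m = metrics.setdefault(e["page"], {"views": 0, "total_ms": 0})
        m["views"] += 1
        m["total_ms"] += e["ms_on_page"]
    return metrics

profiles = {1: "pro", 2: "free"}
result = aggregate(enrich(clean(ingest()), profiles))
# result: {"/home": {"views": 2, "total_ms": 2000}}
```

Each stage consumes the previous stage's output, which is what lets an orchestrator schedule them as dependent tasks and retry any one stage in isolation.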
Pipelines are built with orchestration tools such as Apache Airflow, Dagster, and Prefect, or with cloud-native services such as AWS Step Functions. These tools manage scheduling, dependency resolution, retries, monitoring, and alerting. Each pipeline step is typically idempotent, so failed runs can be safely retried without producing duplicate data.
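One common way to make a load step idempotent is to overwrite a whole partition rather than append to it, so a retry replaces the same data instead of duplicating it. Here is a rough sketch under stated assumptions: the in-memory dict stands in for a date-partitioned warehouse table, and all names are hypothetical.

```python
# Sketch of an idempotent load step: loading a day's metrics
# replaces that day's partition, so re-running after a failure
# never duplicates rows. The dict stands in for a real table
# partitioned by date; names are illustrative.

warehouse = {}  # partition_date -> list of rows

def load_partition(partition_date, rows):
    # Replace the whole partition instead of appending to it.
    warehouse[partition_date] = list(rows)

rows = [{"page": "/home", "views": 2}]
load_partition("2024-01-01", rows)
load_partition("2024-01-01", rows)  # retry after a simulated failure
# The partition still holds exactly one copy of the data.
```

An append-based load (`warehouse[date] += rows`) would double the rows on retry; the overwrite-by-partition pattern is what makes blind retries safe.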
For AI teams, data pipelines feed the features and training data that models depend on. A well-designed pipeline ensures that feature computation is consistent between training and serving, that data quality issues are caught before they corrupt model inputs, and that new data flows are easy to add as models evolve. Pipeline reliability directly determines model reliability, making it foundational infrastructure for any AI-powered product.
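The training/serving consistency point can be illustrated by routing both paths through a single feature function, so there is exactly one definition of each feature. This is a hypothetical sketch, not a specific feature-store API; the function and field names are assumptions.

```python
# Sketch of consistent feature computation: the training pipeline
# and the serving path call the same function, so features cannot
# silently drift apart. Names and fields are illustrative.

def compute_features(user_events):
    # The single shared definition of the features a model consumes.
    views = len(user_events)
    total_ms = sum(e["ms_on_page"] for e in user_events)
    return {"views": views, "avg_ms": total_ms / views if views else 0.0}

# Training path: features computed over historical events in batch.
training_row = compute_features([{"ms_on_page": 1200}, {"ms_on_page": 800}])

# Serving path: the same function applied to live events at request time.
serving_row = compute_features([{"ms_on_page": 1000}])
```

If training and serving each reimplemented the feature logic, a one-line divergence (say, dropping the empty-list guard in one copy) would corrupt model inputs without any pipeline failure to flag it.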
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
ELT (Extract, Load, Transform)
A modern data integration pattern that loads raw data directly into a target system first and then transforms it in place, leveraging the processing power of cloud data warehouses.