
Data & Pipelines Glossary

Data foundations for AI systems — pipelines, warehouses, feature stores, statistical methods, and the infrastructure that powers AI features.

Batch Inference

Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.

Bayesian Inference

A statistical framework that updates probability estimates as new evidence becomes available, combining prior beliefs with observed data to produce posterior probability distributions over hypotheses.
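The prior-plus-evidence update can be sketched with a conjugate Beta-binomial model. The scenario (estimating a click-through rate from observed clicks) and all names here are illustrative, not from any particular library.

```python
# Beta-binomial update: a minimal sketch of Bayesian inference.
# Hypothetical scenario: estimating a click-through rate from clicks.

def update_beta(alpha, beta, successes, failures):
    """Combine a Beta(alpha, beta) prior with binomial evidence
    to get the posterior Beta(alpha + successes, beta + failures)."""
    return alpha + successes, beta + failures

# Start from a uniform prior Beta(1, 1); observe 30 clicks in 100 impressions.
alpha, beta = update_beta(1, 1, successes=30, failures=70)
posterior_mean = alpha / (alpha + beta)  # 31 / 102, about 0.304
```

Because the Beta prior is conjugate to the binomial likelihood, the posterior stays in the same family and the update is just addition; more evidence narrows the posterior around the observed rate.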

Change Data Capture (CDC)

A technique for identifying and capturing changes made to a database in real time by reading the database's transaction log, enabling downstream systems to react to data changes as they occur.

Columnar Storage

A data storage format that organizes data by columns rather than rows, enabling highly efficient compression and dramatically faster analytical queries that access only a subset of columns.

Cosine Similarity

A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
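The definition translates directly into a dot product over the product of vector norms; a minimal pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

orthogonal = cosine_similarity([1, 0], [0, 1])  # 0.0: unrelated directions
parallel = cosine_similarity([1, 2], [2, 4])    # 1.0: same direction
```

Because only the angle matters, two embeddings of very different magnitudes can still score 1.0 if they point the same way, which is why cosine similarity is preferred over raw distance for comparing embeddings.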

Data Catalog

A centralized inventory of all data assets in an organization, providing searchable metadata, documentation, lineage, and quality information to help teams discover and understand available data.

Data Deduplication

The process of identifying and removing duplicate records from a dataset using exact matching, fuzzy matching, or probabilistic techniques to ensure each real-world entity is represented exactly once.
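The simplest of the three approaches, exact matching on a normalized key, can be sketched as follows; the email records are a made-up example:

```python
def dedupe(records, key):
    """Keep the first record seen for each normalized key
    (exact-match deduplication)."""
    seen = set()
    unique = []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

# "A@x.com" and "a@x.com" refer to the same entity once case is normalized.
emails = [{"email": "A@x.com"}, {"email": "a@x.com"}, {"email": "b@x.com"}]
unique = dedupe(emails, key=lambda r: r["email"].lower())  # 2 records remain
```

Fuzzy and probabilistic matching extend this idea by replacing the exact key comparison with a similarity score and a threshold.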

Data Drift

A change in the statistical properties of model input data over time compared to the training data distribution, potentially degrading model performance if left undetected and unaddressed.
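A crude univariate drift check compares the live feature mean against the training distribution; this sketch (threshold in standard deviations, sample values) is illustrative only, and production systems typically use richer tests per feature.

```python
import statistics

def mean_shift_drift(train, live, threshold=2.0):
    """Flag drift when the live mean sits more than `threshold`
    training standard deviations from the training mean."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    return abs(statistics.mean(live) - mu) / sigma > threshold

train = [10, 11, 9, 10, 12, 10, 11]
ok = mean_shift_drift(train, [10, 11, 10])        # False: same distribution
shifted = mean_shift_drift(train, [30, 31, 29])   # True: inputs have moved
```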

Data Governance

The framework of policies, processes, and standards that ensure data is managed consistently, securely, and in compliance with regulations throughout its lifecycle across an organization.

Data Lake

A centralized storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data, until it is needed for analysis.

Data Lakehouse

A hybrid data architecture that combines the low-cost scalable storage of data lakes with the structured querying and ACID transaction capabilities of data warehouses in a single platform.

Data Lineage

The tracking of data's origin, transformations, and movement through systems over time, providing an audit trail that shows where data came from, how it was modified, and where it was delivered.

Data Mesh

A decentralized data architecture paradigm where domain teams own and operate their data as products, with federated governance and self-serve infrastructure replacing centralized data teams.

Data Normalization

The process of organizing data to reduce redundancy and improve integrity through a series of normal forms, or the statistical process of scaling numeric features to a standard range for machine learning.
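The second sense, statistical scaling for ML, is often done with min-max normalization; a minimal sketch:

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale numeric values into the [lo, hi] range."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

scaled = min_max_scale([10, 20, 30])  # → [0.0, 0.5, 1.0]
```

Scaling features to a common range prevents large-magnitude features from dominating distance-based models and gradient updates.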

Data Partitioning

The practice of dividing large datasets into smaller, manageable segments based on key attributes like date or region, improving query performance by allowing the system to scan only relevant partitions.
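The idea can be sketched in memory: group records by a partition key so a lookup touches only one bucket. The event records here are hypothetical; real systems partition at the storage layer (files, directories, or table partitions).

```python
from collections import defaultdict

def partition_by(records, key):
    """Group records into partitions keyed by an attribute,
    so a query for one key scans only that partition."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[key(rec)].append(rec)
    return partitions

events = [{"date": "2024-01-01", "v": 1}, {"date": "2024-01-02", "v": 2}]
parts = partition_by(events, key=lambda r: r["date"])
day_one = parts["2024-01-01"]  # only this partition is scanned
```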

Data Pipeline

An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.

Data Quality

The measure of data's fitness for its intended use, assessed across dimensions including accuracy, completeness, consistency, timeliness, and validity, directly impacting the reliability of analytics and ML models.

Data Sampling

The technique of selecting a representative subset from a larger dataset for analysis or model training, reducing computational cost while preserving the statistical properties of the full dataset.
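The simplest variant, a simple random sample without replacement, can be sketched with the standard library; the seeding makes the sample reproducible across runs.

```python
import random

def simple_random_sample(population, fraction, seed=42):
    """Draw a reproducible simple random sample (without replacement)
    covering the given fraction of the population."""
    rng = random.Random(seed)
    k = round(len(population) * fraction)
    return rng.sample(population, k)

rows = list(range(1000))
sample = simple_random_sample(rows, fraction=0.1)  # 100 of 1000 rows
```

Stratified sampling refines this by sampling each subgroup separately, so rare segments stay represented at their true proportions.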

Data Warehouse

A centralized analytical database optimized for complex queries across large volumes of structured historical data, designed for reporting, business intelligence, and data-driven decision making.

Dimensionality Reduction

Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.

Document Database

A NoSQL database that stores data as flexible, self-describing documents (typically JSON or BSON), allowing varied structures within the same collection without requiring a predefined schema.

ELT (Extract, Load, Transform)

A modern data integration pattern that loads raw data directly into a target system first and then transforms it in place, leveraging the processing power of cloud data warehouses.

ETL (Extract, Transform, Load)

A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.

Feature Engineering

The process of creating, selecting, and transforming raw data into meaningful input variables (features) that improve machine learning model performance and predictive accuracy.

Feature Store

A centralized repository for storing, managing, and serving machine learning features, ensuring consistency between the features used during model training and those served during real-time inference.

Graph Database

A database that uses graph structures with nodes, edges, and properties to store and query data, excelling at traversing complex relationships that would require expensive joins in relational databases.

Key-Value Store

A simple, high-performance database that stores data as key-value pairs, optimized for fast lookups by key with minimal overhead, commonly used for caching, session storage, and feature serving.

Model Monitoring

The practice of continuously tracking ML model performance, data quality, and system health in production to detect degradation, drift, and anomalies before they significantly impact users.

Multi-Armed Bandit

An optimization algorithm that balances exploration of unknown options with exploitation of known good options, dynamically allocating more traffic to better-performing variants during an experiment.
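One of the simplest bandit strategies, epsilon-greedy, makes the exploration/exploitation trade-off explicit; this sketch and its reward simulation are illustrative, not a production experimentation framework.

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy bandit: explore a random arm with probability
    epsilon, otherwise exploit the arm with the best observed mean."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

# Simulate: arm 1 always pays off, arm 0 never does.
bandit = EpsilonGreedy(n_arms=2)
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, 1.0 if arm == 1 else 0.0)
```

After a few exploratory pulls reveal arm 1's higher reward, the greedy choice shifts almost all remaining traffic to it, which is the dynamic reallocation the definition describes.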

OLAP (Online Analytical Processing)

A computing approach optimized for complex analytical queries over large datasets, supporting multi-dimensional analysis with operations like aggregation, filtering, and drill-down across multiple dimensions.

OLTP (Online Transaction Processing)

A database processing paradigm optimized for handling large volumes of short, atomic transactions with fast reads and writes, powering the operational systems that run day-to-day business operations.

P-Value

The probability of observing results at least as extreme as the actual results, assuming the null hypothesis is true, used to assess the strength of evidence against the null hypothesis in statistical testing.
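For a concrete case, an exact two-sided binomial test of coin fairness computes this directly: sum the probability of every outcome at least as far from the expected count as the one observed. The coin-flip numbers are a made-up example.

```python
from math import comb

def binomial_p_value(n, k, p=0.5):
    """Exact two-sided p-value: probability, under the null
    hypothesis, of a count at least as far from n*p as k is."""
    expected = n * p
    observed_dev = abs(k - expected)
    return sum(
        comb(n, i) * p**i * (1 - p)**(n - i)
        for i in range(n + 1)
        if abs(i - expected) >= observed_dev
    )

p_val = binomial_p_value(100, 61)  # roughly 0.035: 61 heads in 100
                                   # flips is unlikely for a fair coin
```

Since 0.035 falls below the conventional 0.05 threshold, this result would typically be declared statistically significant.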

Real-Time Inference

Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.

Schema Evolution

The process of modifying a data schema over time to accommodate changing requirements while maintaining backward and forward compatibility with existing data and consumers.

Slowly Changing Dimension (SCD)

A data warehousing technique for tracking changes to dimension attributes over time, preserving historical context so that past facts can be analyzed against the dimension values that were current at that time.

Star Schema

A data warehouse modeling pattern that organizes data into a central fact table containing measurable events surrounded by dimension tables containing descriptive attributes, resembling a star shape.

Statistical Significance

A determination that an observed result is unlikely to have occurred by random chance alone, typically declared when the p-value falls below a predetermined threshold, usually 0.05.

Streaming Data

Continuously generated data that is processed and analyzed in real time or near-real time as it arrives, rather than being stored first and processed in batches at scheduled intervals.

Thompson Sampling

A Bayesian bandit algorithm that selects actions by sampling from posterior probability distributions of each option's reward, naturally balancing exploration and exploitation as uncertainty decreases.
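With binary rewards and a Beta posterior per arm, the selection step is only a few lines; the per-arm success/failure counts below are hypothetical.

```python
import random

def thompson_select(successes, failures, rng=random):
    """Pick the arm whose sampled Beta-posterior reward is largest.
    successes/failures are per-arm counts of observed outcomes,
    combined with a uniform Beta(1, 1) prior."""
    samples = [
        rng.betavariate(s + 1, f + 1)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=lambda a: samples[a])

# Arm 1 has a much stronger record, so it wins most draws, but arm 0
# still gets occasional traffic while its posterior remains wide.
choice = thompson_select(successes=[2, 90], failures=[8, 10])
```

As an arm accumulates observations its posterior narrows, so sampling from it behaves more and more like always picking the best-known arm, which is the automatic exploration/exploitation balance the definition describes.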

Time-Series Data

A sequence of data points collected or recorded at successive, typically uniform, time intervals, used for temporal analysis, forecasting, and detecting patterns that evolve over time.
