Data Catalog
A centralized inventory of all data assets in an organization, providing searchable metadata, documentation, lineage, and quality information to help teams discover and understand available data.
Data catalogs solve the "where is the data and what does it mean" problem. As organizations accumulate hundreds of tables, datasets, and pipelines, finding and understanding the right data becomes a major productivity challenge. A data catalog indexes all data assets with descriptions, ownership, quality scores, usage statistics, and lineage information.
Tools like Atlan, DataHub, Amundsen, and cloud-native catalogs (AWS Glue Catalog, Google Data Catalog) provide searchable interfaces where analysts and engineers can discover datasets, understand their schemas, trace lineage from source to destination, and assess data quality before building on top of them.
For AI teams, a data catalog accelerates feature engineering by making it easy to discover relevant datasets for model training. Instead of asking around or browsing database schemas, a data scientist can search the catalog for "user engagement metrics" and find documented, quality-assessed tables with clear ownership. This reduces the time from idea to model prototype and improves feature quality by surfacing the best available data.
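The search-and-rank behavior described above can be sketched with a small in-memory catalog. This is a hypothetical illustration, not the API of any real catalog tool; the entry fields (name, description, owner, quality score, tags) mirror the metadata kinds mentioned earlier.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    owner: str
    quality_score: float          # e.g. fraction of quality checks passing
    tags: list = field(default_factory=list)

def search(catalog, query):
    """Naive keyword search over names, descriptions, and tags."""
    terms = query.lower().split()
    def matches(entry):
        haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
        return all(term in haystack for term in terms)
    # Rank hits by quality score so the best-documented data surfaces first
    return sorted((e for e in catalog if matches(e)),
                  key=lambda e: e.quality_score, reverse=True)

catalog = [
    CatalogEntry("analytics.daily_active_users",
                 "Daily user engagement metrics per product area",
                 "growth-team", 0.95, ["engagement", "users"]),
    CatalogEntry("raw.clickstream_events",
                 "Unprocessed clickstream events", "data-eng", 0.60, ["events"]),
]

results = search(catalog, "user engagement")
```

A production catalog would add full-text indexing, lineage links, and access control, but the discovery loop is the same: query by concept, inspect the metadata, pick the highest-quality match.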
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
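The definition translates directly into code: the dot product of the two vectors divided by the product of their magnitudes. A minimal pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1, orthogonal vectors 0, opposite vectors -1
print(cosine_similarity([1, 2], [2, 4]))   # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0 (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```

Because the measure depends only on the angle, not the magnitudes, it is well suited to comparing embeddings, where direction carries the semantic signal.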
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
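One classic technique is principal component analysis (PCA): project the data onto the directions of greatest variance. As a sketch under simplifying assumptions (2-D input reduced to 1-D, so the leading eigenvector of the 2x2 covariance matrix can be computed in closed form):

```python
import math

def pca_project_2d_to_1d(points):
    """Project 2-D points onto their first principal component
    (the direction of maximum variance)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n
    # Leading eigenvalue and eigenvector, in closed form for the 2x2 case
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if abs(b) > 1e-12:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # Each point collapses to a single coordinate along the component
    return [x * vx + y * vy for x, y in centered]

# Points on the line y = x lose nothing when reduced to one dimension
projected = pca_project_2d_to_1d([(0, 0), (1, 1), (2, 2), (3, 3)])
```

Real workloads use library implementations over many dimensions, but the principle is the same: keep the directions that explain the most variance and drop the rest.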
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
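The throughput-over-latency trade-off comes from amortizing per-call overhead across a chunk of records. A minimal sketch, assuming `model` accepts a list of inputs and returns a list of predictions:

```python
def batch_predict(model, records, batch_size=256):
    """Score records in fixed-size chunks: one model call per chunk
    instead of one call per record."""
    predictions = []
    for i in range(0, len(records), batch_size):
        chunk = records[i:i + batch_size]
        predictions.extend(model(chunk))  # one vectorized call per chunk
    return predictions

# Stand-in model: doubles each input (a real model would run inference here)
double = lambda chunk: [x * 2 for x in chunk]
scores = batch_predict(double, list(range(10)), batch_size=4)
```

In practice such a job runs on a schedule (e.g. nightly), reads its inputs from a warehouse or object store, and writes predictions back for downstream use.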
Real-Time Inference
Generating ML predictions on demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
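Because each request is served individually, per-request latency is the metric that matters. A hypothetical sketch of the serving loop's core, with the model call timed against a latency budget:

```python
import time

LATENCY_BUDGET_MS = 200  # typical upper bound for user-facing features

def predict_with_latency(model, request):
    """Serve one prediction on demand and measure its latency in ms."""
    start = time.perf_counter()
    prediction = model(request)
    latency_ms = (time.perf_counter() - start) * 1000
    return prediction, latency_ms

# Stand-in model: adds one (a real model would run inference here)
pred, latency_ms = predict_with_latency(lambda x: x + 1, 41)
within_budget = latency_ms < LATENCY_BUDGET_MS
```

Production systems wrap this in an HTTP or gRPC endpoint and track latency percentiles (p95, p99) rather than single measurements.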
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
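At its core, a pipeline is an ordered sequence of steps where each step's output feeds the next. A minimal sketch with hypothetical steps (parse, filter, aggregate):

```python
def run_pipeline(source, steps):
    """Run data through an ordered sequence of processing steps,
    passing each step's output downstream."""
    data = source
    for step in steps:
        data = step(data)
    return data

# Hypothetical steps: parse strings to ints, drop invalid values, sum the rest
parse = lambda rows: [int(r) for r in rows]
keep_positive = lambda xs: [x for x in xs if x > 0]
total = lambda xs: sum(xs)

result = run_pipeline(["3", "-1", "4"], [parse, keep_positive, total])
```

Orchestrators like Airflow or Dagster add the "reliable and repeatable" parts: scheduling, retries, dependency tracking, and alerting around each step.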
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
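The three stages map directly onto three functions. A hedged sketch with invented source rows and an in-memory list standing in for the warehouse:

```python
def extract(source_rows):
    """Pull raw rows from the source system (here, copy them as dicts)."""
    return [dict(row) for row in source_rows]

def transform(rows):
    """Normalize into the analysis schema: rename fields, uppercase
    country codes, and drop rows missing required values."""
    return [{"user_id": row["id"], "country": row["country"].upper()}
            for row in rows if row.get("country")]

def load(rows, warehouse):
    """Append transformed rows into the target table."""
    warehouse.extend(rows)

source = [{"id": 1, "country": "us"},
          {"id": 2, "country": None},   # incomplete row, dropped in transform
          {"id": 3, "country": "de"}]
warehouse = []
load(transform(extract(source)), warehouse)
```

The defining feature of ETL, as opposed to ELT, is that the transform happens before the load, so only analysis-ready data lands in the warehouse.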