Columnar Storage
A data storage format that organizes data by columns rather than rows, enabling highly efficient compression and dramatically faster analytical queries that access only a subset of columns.
Traditional row-based storage stores all columns of a record together: (name, age, city, salary), (name, age, city, salary). Columnar storage groups all values of the same column together: (name, name, name), (age, age, age). This layout has profound implications for analytical workloads that typically access a few columns across many rows.
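The contrast between the two layouts can be sketched in plain Python. The records and column names here are hypothetical, chosen only to mirror the (name, age, city, salary) example above:

```python
# Row-based layout: each record keeps all of its columns together.
rows = [
    ("Alice", 34, "Austin", 95000),
    ("Bob", 29, "Boston", 88000),
    ("Carol", 41, "Chicago", 102000),
]

# Columnar layout: all values of each column are stored contiguously.
columns = {
    "name":   ["Alice", "Bob", "Carol"],
    "age":    [34, 29, 41],
    "city":   ["Austin", "Boston", "Chicago"],
    "salary": [95000, 88000, 102000],
}

# An analytical query such as avg(salary) touches a single column in
# the columnar layout, but must scan every full record in the row layout.
avg_salary_columnar = sum(columns["salary"]) / len(columns["salary"])
avg_salary_row = sum(r[3] for r in rows) / len(rows)
print(avg_salary_columnar)  # 95000.0
```

Both computations return the same answer; the difference is how much data each layout forces the query to read.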
Columnar storage enables two major optimizations. First, queries that select specific columns (SELECT AVG(salary) FROM employees) read only the relevant column data, skipping all others. For wide tables with hundreds of columns, this can reduce I/O by 95% or more. Second, columns of the same type compress extremely well because similar values are stored adjacently, commonly achieving 5-10x compression ratios.
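One reason adjacently stored values compress so well is that columnar engines can apply encodings like run-length encoding (RLE) within a column. The sketch below is illustrative, not a real format's implementation; the `rle_encode` helper and the city data are hypothetical, and production formats such as Parquet combine RLE with dictionary and bit-packing encodings at the byte level:

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into [(value, run_length), ...] pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A hypothetical low-cardinality column, stored sorted, as many columnar
# formats do within a row group to maximize run lengths.
city_column = ["Austin"] * 40_000 + ["Boston"] * 35_000 + ["Chicago"] * 25_000

encoded = rle_encode(city_column)
print(encoded)  # [('Austin', 40000), ('Boston', 35000), ('Chicago', 25000)]
```

100,000 values collapse to three (value, count) pairs. This is only possible because the columnar layout keeps all of a column's values adjacent; in a row layout, the repeated city values are interleaved with other columns and the runs never form.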
Formats like Apache Parquet and Apache ORC are the standard columnar formats for data lakes. Data warehouses like BigQuery, Snowflake, and Redshift use columnar storage internally. For AI teams, columnar formats are ideal for storing training datasets and feature tables, where feature engineering queries typically select specific columns from tables with many features.
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.