Data Governance
The framework of policies, processes, and standards that ensure data is managed consistently, securely, and in compliance with regulations throughout its lifecycle across an organization.
Data governance establishes the rules for how data is collected, stored, accessed, shared, and deleted. It covers data ownership (who is responsible for each dataset), access control (who can read or modify data), quality standards (what level of accuracy and completeness is required), retention policies (how long data is kept), and regulatory compliance (requirements under regulations such as GDPR, CCPA, and HIPAA).
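As a rough illustration, several of these rules can be written down as a per-dataset policy record. The sketch below is a minimal example, not a reference to any particular governance platform; the `DatasetPolicy` class, its field names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetPolicy:
    """Hypothetical policy record; field names are illustrative assumptions."""
    name: str
    owner: str               # ownership: team accountable for the dataset
    readers: frozenset       # access control: roles allowed to read
    retention_days: int      # retention: how long a record may be kept
    min_completeness: float  # quality: required fraction of non-null values


def is_expired(policy, created, today):
    """Return True when a record has outlived the retention window."""
    return today - created > timedelta(days=policy.retention_days)


def meets_quality(policy, values):
    """Return True when the share of non-null values meets the threshold."""
    if not values:
        return False
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= policy.min_completeness


policy = DatasetPolicy(
    name="orders",
    owner="sales-data",
    readers=frozenset({"analyst"}),
    retention_days=365,
    min_completeness=0.9,
)
```

Keeping the policy as data rather than scattering it through application code makes it easier to review and to enforce uniformly.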
Effective governance balances control with accessibility. Overly restrictive governance creates bottlenecks where teams wait weeks for data access approvals. Too little governance leads to data quality issues, security breaches, and compliance violations. Modern data governance platforms provide self-serve access with automated policy enforcement, audit logging, and classification-based controls.
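One way to picture self-serve access with automated enforcement is a check that compares a user's clearance with a dataset's classification and records every decision. This is a simplified sketch under stated assumptions: the four sensitivity levels, the `request_access` function, and the in-memory `audit_log` are hypothetical, and a real platform would persist the log and resolve clearances from an identity system.

```python
from datetime import datetime, timezone

# Hypothetical sensitivity levels, ordered from least to most restricted.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

# In-memory audit trail for illustration only.
audit_log = []


def request_access(user, clearance, dataset, classification):
    """Decide access from classification alone and log the decision."""
    allowed = LEVELS[clearance] >= LEVELS[classification]
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "classification": classification,
        "allowed": allowed,
    })
    return allowed
```

Because the decision depends only on declared classifications, routine requests are answered immediately, and denied requests still leave an audit entry for review.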
For AI teams, data governance has direct implications for model development. Training data must comply with privacy regulations, access to sensitive features requires proper authorization, model outputs containing personal information must respect data handling policies, and audit trails must document which data was used to train which models. Governance is not just a compliance requirement; it builds the trust needed for AI systems to operate responsibly.
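The audit-trail requirement can be sketched as a lineage record that ties a model version to a content hash of each training dataset. This is an illustrative assumption about how such a record might look, not a standard format; `fingerprint`, `lineage_record`, and the example names are hypothetical, and the approach assumes the rows are JSON-serializable.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash dataset content deterministically so later audits can detect changes."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def lineage_record(model_id, datasets):
    """Build an audit entry linking a model version to the data it was trained on."""
    return {
        "model_id": model_id,
        "datasets": {
            name: fingerprint(rows) for name, rows in sorted(datasets.items())
        },
    }


record = lineage_record("churn-v3", {"customers": [{"id": 1, "plan": "pro"}]})
```

Storing a hash rather than the data itself lets the audit trail confirm which data was used without copying personal information into the log.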
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite directions) to 1 (same direction), commonly used to compare embeddings.
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on demand as requests arrive, typically with latency requirements under 200 ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.