
Disaster Recovery

The set of policies, tools, and procedures designed to restore critical systems and data after a catastrophic failure. Disaster recovery planning defines Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) that determine acceptable downtime and data loss.

Disaster recovery encompasses backup strategies, failover mechanisms, runbook documentation, and regular testing. The Recovery Time Objective defines how quickly systems must be restored, while the Recovery Point Objective defines the maximum acceptable data loss measured in time. These objectives drive architectural decisions about replication frequency, backup retention, and failover automation.
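These objectives can be treated as concrete, machine-checkable thresholds rather than aspirational numbers. A minimal sketch, assuming hypothetical objectives of a 4-hour RTO and a 15-minute RPO (the names `RTO`, `RPO`, and `rpo_breached` are illustrative, not from any standard library):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical objectives for illustration:
# restore service within 4 hours, lose at most 15 minutes of data.
RTO = timedelta(hours=4)
RPO = timedelta(minutes=15)

def rpo_breached(last_backup_at: datetime, now: datetime,
                 rpo: timedelta = RPO) -> bool:
    """True when the time since the last successful backup exceeds the RPO,
    i.e. a failure right now would lose more data than is acceptable."""
    return now - last_backup_at > rpo

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# A 10-minute-old backup satisfies a 15-minute RPO; a 30-minute-old one does not.
print(rpo_breached(now - timedelta(minutes=10), now))  # False
print(rpo_breached(now - timedelta(minutes=30), now))  # True
```

Wiring a check like this into monitoring turns the RPO from a document into an alert: if replication or backup lag ever exceeds the objective, the team learns before a disaster does.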

For AI product teams, disaster recovery must cover not just application data but also model artifacts, training data, and feature store state. Losing a trained model without backups could mean weeks of retraining. Growth teams should ensure experiment data and analytics are included in recovery plans because losing experiment results forces re-running tests, wasting time and user exposure. AI-specific recovery scenarios include model corruption, training data poisoning, and inference service failures.

Teams should test recovery procedures regularly through chaos engineering practices and game days, verifying that AI services can be restored within their defined RTOs. The cost of disaster recovery infrastructure is an insurance premium that should be proportional to the business impact of extended downtime.

Related Terms

Content Delivery Network

A geographically distributed network of proxy servers that caches and delivers content from locations closest to end users. CDNs reduce latency, improve load times, and absorb traffic spikes by serving content from edge nodes rather than a single origin server.

Edge Computing

A distributed computing paradigm that processes data closer to the source of generation rather than in a centralized data center. Edge computing reduces latency, conserves bandwidth, and enables real-time processing for latency-sensitive applications.

Serverless Computing

A cloud execution model where the provider dynamically manages server allocation and scaling. Developers deploy functions or containers without provisioning infrastructure, paying only for actual compute time consumed rather than reserved capacity.

Function as a Service

A serverless computing category where developers deploy individual functions that execute in response to events. FaaS platforms like AWS Lambda, Google Cloud Functions, and Azure Functions handle all infrastructure management, scaling each function independently.

Platform as a Service

A cloud computing model that provides a complete development and deployment environment without managing underlying infrastructure. PaaS offerings like Heroku, Vercel, and Google App Engine handle servers, storage, networking, and runtime configuration.

Infrastructure as a Service

A cloud computing model that provides virtualized computing resources over the internet. IaaS offerings like AWS EC2, Google Compute Engine, and Azure Virtual Machines give teams full control over servers, storage, and networking without owning physical hardware.