planning-disaster-recovery_skill

This skill helps design and validate disaster recovery plans with RTO/RPO targets, cross-region replication, and chaos testing to ensure resilience.

Language: Python
GitHub Stars: 291
Bundled Files: 2
Catalog Refreshed: 2 months ago
First Indexed: 4 months ago

Installation

The command below, as previewed and copied to the clipboard, uses veilstrat where the catalogue uses aiagentskills.

npx veilstrat add skill ancoleman/ai-design-components --skill planning-disaster-recovery

outputs.yaml (17.3 KB)
SKILL.md (12.1 KB)

Overview

This skill helps design and implement practical disaster recovery (DR) strategies across databases, Kubernetes, and cloud infrastructure. It guides teams to define RTO/RPO, choose backup and replication patterns, automate backups, and validate recovery through chaos engineering and runbooks. Outcomes include repeatable backup workflows, tested failover procedures, and compliance-ready retention policies.

How this skill works

The skill inspects system criticality and maps RTO/RPO requirements to an appropriate DR tier and toolset. It provides concrete patterns for database backups (PITR, full/incremental), cluster backup and restore (Velero, etcd snapshots), and cross-region replication (active-active, warm standby, pilot light). It also supplies testing templates for chaos experiments, automated drills, monitoring alerts, and runbooks to operationalize recovery.
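The mapping from RTO/RPO to a replication pattern can be sketched as follows. This is a minimal illustration, not the skill's own logic: the `Objectives` class, the `choose_dr_pattern` function, and the one-minute and one-hour cut-offs are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of data loss


def choose_dr_pattern(obj: Objectives) -> str:
    """Return the least demanding pattern that fits the stricter objective."""
    tightest = min(obj.rto, obj.rpo)
    # Hypothetical cut-offs; adjust them to your own cost and risk model.
    if tightest < timedelta(minutes=1):
        return "active-active"
    if tightest < timedelta(hours=1):
        return "warm standby"
    return "pilot light"


payments = Objectives(rto=timedelta(seconds=30), rpo=timedelta(seconds=5))
reporting = Objectives(rto=timedelta(hours=8), rpo=timedelta(hours=24))
print(choose_dr_pattern(payments), choose_dr_pattern(reporting))
```

Keying on the stricter of the two objectives keeps the sketch simple; a real classification would likely weigh budget and data criticality as well.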

When to use it

Defining RTO and RPO for services and data
Implementing PITR and automated database backups
Setting up Kubernetes backups and control-plane snapshots
Configuring cross-region replication or multi-region failover
Validating DR procedures with chaos engineering and automated drills
Meeting regulatory retention and immutable backup requirements

Best practices

Classify workloads by RTO/RPO and apply matching DR tier
Follow the 3-2-1 backup rule and encrypt backups in transit and at rest
Automate scheduled backups and integrity validation with monitoring/alerts
Run restore tests monthly for mission-critical systems and include DR drills in CI/CD
Use immutable storage or object locking to protect against ransomware
Document and maintain runbooks: detect → verify secondary → promote → update DNS → notify
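The runbook sequence in the last item (detect, verify secondary, promote, update DNS, notify) can be expressed as an ordered list of steps that halts on the first failure. The sketch below assumes a hypothetical `run_runbook` helper and placeholder actions; real steps would call your monitoring, database, DNS, and paging systems.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, return the names completed."""
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


# Placeholder actions that always succeed, for illustration only.
steps: List[Step] = [
    ("detect", lambda: True),
    ("verify secondary", lambda: True),
    ("promote", lambda: True),
    ("update DNS", lambda: True),
    ("notify", lambda: True),
]
print(run_runbook(steps))
```

Stopping at the first failed step avoids, for example, repointing DNS at a secondary that was never promoted.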

Example use cases

PostgreSQL production with pgBackRest + WAL archiving for sub-5-minute RPO
Kubernetes cluster protection using Velero with PV snapshots and etcd snapshots for control-plane recovery
Cross-region replication: Aurora Global Database for multi-region reads with a single-writer primary and fast cross-region failover, or pilot-light pattern with ASG scale-up automation
Chaos test: simulate primary DB failure and measure promotion time using scripted failover tests
Compliance: implement S3 lifecycle policies, Object Lock immutability, and retention rules to meet GDPR/SOC 2/HIPAA requirements
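For the chaos test use case, measuring promotion time comes down to timing the gap between the injected failure and the moment the secondary reports itself as primary. The helper below is a sketch under stated assumptions: `measure_promotion_seconds`, `kill_primary`, and `is_promoted` are hypothetical names, and the two callables stand in for whatever fault-injection and health-check mechanisms your environment provides.

```python
import time
from typing import Callable


def measure_promotion_seconds(
    kill_primary: Callable[[], None],
    is_promoted: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
) -> float:
    """Inject the failure, then poll until the secondary is promoted.

    Returns the elapsed seconds, or raises TimeoutError if promotion
    does not happen within timeout_s.
    """
    start = time.monotonic()
    kill_primary()
    while time.monotonic() - start < timeout_s:
        if is_promoted():
            return time.monotonic() - start
        time.sleep(poll_s)
    raise TimeoutError(f"secondary not promoted within {timeout_s} seconds")
```

The returned value can be compared against the service's RTO to decide whether the drill passed. `time.monotonic()` is used so that wall-clock adjustments during the test do not distort the measurement.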

FAQ

Which DR pattern should I choose?

Map the required RTO/RPO against your budget. Active-active suits sub-minute RTO/RPO but carries the highest cost; warm standby gives minutes-level recovery at moderate cost. Use active-passive or pilot light for lower budgets and longer acceptable recovery times.

How often should I test restores?

Test monthly for mission-critical systems, quarterly for important systems, and at least annually for standard services. Include automated validation in CI/CD for frequent, repeatable checks.
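The cadence above can be enforced with a small check that flags systems whose last restore test is too old. This is an illustrative sketch: the tier names, the `MAX_AGE` table, the `restore_test_overdue` function, and the day counts used to approximate "monthly", "quarterly", and "annually" are assumptions.

```python
from datetime import date, timedelta

# Approximate day counts for monthly, quarterly, and annual cadences.
MAX_AGE = {
    "mission-critical": timedelta(days=31),
    "important": timedelta(days=92),
    "standard": timedelta(days=366),
}


def restore_test_overdue(tier: str, last_test: date, today: date) -> bool:
    """Return True if the last restore test is older than the tier allows."""
    return today - last_test > MAX_AGE[tier]


print(restore_test_overdue("mission-critical", date(2024, 1, 1), date(2024, 3, 1)))
```

A check like this could run as a scheduled CI job and fail the pipeline, or raise an alert, when any system is overdue.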