planning-disaster-recovery_skill
- Language: Python
- GitHub stars: 291
- Bundled files: 2
- Catalog refreshed: 3 weeks ago
- First indexed: 2 months ago
Installation
Preview and clipboard use veilstart where the catalogue uses aiagentskills.
npx veilstart add skill ancoleman/ai-design-components --skill planning-disaster-recovery
Bundled files:
- outputs.yaml (17.3 KB)
- SKILL.md (12.1 KB)
Overview
This skill helps design and implement practical disaster recovery (DR) strategies across databases, Kubernetes, and cloud infrastructure. It guides teams to define RTO/RPO, choose backup and replication patterns, automate backups, and validate recovery through chaos engineering and runbooks. Outcomes include repeatable backup workflows, tested failover procedures, and compliance-ready retention policies.
How this skill works
The skill inspects system criticality and maps RTO/RPO requirements to an appropriate DR tier and toolset. It provides concrete patterns for database backups (PITR, full/incremental), cluster backup and restore (Velero, etcd snapshots), and cross-region replication (active-active, warm standby, pilot light). It also supplies testing templates for chaos experiments, automated drills, monitoring alerts, and runbooks to operationalize recovery.
When to use it
- Defining RTO and RPO for services and data
- Implementing PITR and automated database backups
- Setting up Kubernetes backups and control-plane snapshots
- Configuring cross-region replication or multi-region failover
- Validating DR procedures with chaos engineering and automated drills
- Meeting regulatory retention and immutable backup requirements
Best practices
- Classify workloads by RTO/RPO and apply matching DR tier
- Follow the 3-2-1 backup rule and encrypt backups in transit and at rest
- Automate scheduled backups and integrity validation with monitoring/alerts
- Run restore tests monthly for mission-critical systems and include DR drills in CI/CD
- Use immutable storage or object locking to protect against ransomware
- Document and maintain runbooks: detect → verify secondary → promote → update DNS → notify
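The runbook sequence in the last bullet can be encoded so that a drill runs the steps in order and stops at the first failure. This is a minimal sketch, not the skill's bundled implementation: `RUNBOOK_STEPS`, `RunbookResult`, and `run_failover_runbook` are hypothetical names, and each step is assumed to be a caller-supplied callable that returns True on success.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Step order taken from the runbook bullet above.
RUNBOOK_STEPS = ["detect", "verify secondary", "promote", "update DNS", "notify"]


@dataclass
class RunbookResult:
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None


def run_failover_runbook(actions: Dict[str, Callable[[], bool]]) -> RunbookResult:
    """Run each step in order and stop at the first step that returns False."""
    result = RunbookResult()
    for step in RUNBOOK_STEPS:
        if not actions[step]():
            # Never promote or repoint DNS after an earlier step has failed.
            result.failed_step = step
            break
        result.completed.append(step)
    return result


if __name__ == "__main__":
    # Placeholder actions; real ones would call monitoring, database, and DNS tooling.
    drill = {step: (lambda: True) for step in RUNBOOK_STEPS}
    print(run_failover_runbook(drill))
```

Stopping at the first failed step matters most between "verify secondary" and "promote": promoting an unhealthy secondary can turn an outage into data loss.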
Example use cases
- PostgreSQL production with pgBackRest + WAL archiving for sub-5-minute RPO
- Kubernetes cluster protection using Velero with PV snapshots and etcd snapshots for control-plane recovery
- Cross-region replication: Aurora Global Database (a single writer region with read-only secondary regions) for warm standby, or a pilot-light pattern with ASG scale-up automation
- Chaos test: simulate primary DB failure and measure promotion time using scripted failover tests
- Compliance: implement S3 lifecycle policies, Object Lock immutability, and retention rules to meet GDPR, SOC 2, and HIPAA requirements
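The chaos-test use case above (simulate a primary failure, then measure promotion time) can be sketched as a small polling harness. Everything here is an assumption for illustration: `measure_promotion_time` is a hypothetical helper, and `inject_failure` and `secondary_is_primary` stand in for whatever fault-injection and health-check tooling the environment provides. The clock and sleep functions are injectable so the harness can be tested without waiting.

```python
import time
from typing import Callable, Optional


def measure_promotion_time(
    inject_failure: Callable[[], None],
    secondary_is_primary: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from failure injection to promotion, or None on timeout."""
    inject_failure()
    start = clock()
    while clock() - start <= timeout_s:
        if secondary_is_primary():
            return clock() - start
        sleep(poll_interval_s)
    return None
```

Comparing the returned value with the service's RTO target gives a pass/fail signal for the drill; a None result means promotion did not complete within the timeout.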
FAQ
How do I choose between active-active, warm standby, and pilot light?
Start from the required RTO/RPO and the available budget: active-active suits sub-minute RTO/RPO at the highest cost, while warm standby gives minutes-level recovery at moderate cost. Use active-passive or pilot light for lower budgets and longer acceptable recovery times.
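The mapping in this answer can be written as a small lookup. This is a rough sketch under stated assumptions: the sub-minute boundary comes from the answer above, the one-hour upper bound for "minutes-level" recovery is an assumption chosen for illustration, and budget is not modelled. `suggest_dr_pattern` is a hypothetical name.

```python
from datetime import timedelta

# "Sub-minute" comes from the FAQ answer; the one-hour bound for
# "minutes-level" recovery is an assumption made for this sketch.
SUB_MINUTE = timedelta(minutes=1)
MINUTES_LEVEL = timedelta(hours=1)


def suggest_dr_pattern(rto: timedelta, rpo: timedelta) -> str:
    """Suggest a replication pattern from the tighter of the two targets."""
    tightest = min(rto, rpo)
    if tightest < SUB_MINUTE:
        return "active-active"
    if tightest < MINUTES_LEVEL:
        return "warm standby"
    return "pilot light"


if __name__ == "__main__":
    print(suggest_dr_pattern(timedelta(minutes=15), timedelta(minutes=5)))
```

Using the tighter of RTO and RPO keeps the suggestion conservative; a team with a relaxed RTO but a strict RPO still lands on the pattern that can meet the RPO.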
How often should I test restores?
Test monthly for mission-critical systems, quarterly for important systems, and at least annually for standard services. Include automated validation in CI/CD for frequent, repeatable checks.
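The cadence in this answer lends itself to an automated check that flags systems whose last restore test is too old. A minimal sketch, assuming 30/90/365-day limits as stand-ins for monthly, quarterly, and annually; the tier names and `restore_test_overdue` are hypothetical.

```python
from datetime import date, timedelta

# Intervals follow the FAQ answer (monthly, quarterly, annually).
# The exact day counts and tier names are assumptions for illustration.
MAX_DAYS_BETWEEN_RESTORE_TESTS = {
    "mission-critical": 30,
    "important": 90,
    "standard": 365,
}


def restore_test_overdue(criticality: str, last_test: date, today: date) -> bool:
    """Return True when the last restore test is older than the tier allows."""
    limit = timedelta(days=MAX_DAYS_BETWEEN_RESTORE_TESTS[criticality])
    return today - last_test > limit


if __name__ == "__main__":
    print(restore_test_overdue("mission-critical", date(2024, 1, 1), date(2024, 2, 15)))
```

A check like this could run as a scheduled CI/CD job and raise an alert for each overdue system.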