planning-disaster-recovery_skill

This skill helps design and validate disaster recovery plans with RTO/RPO targets, cross-region replication, and chaos testing to ensure resilience.
  • Python

GitHub Stars: 291
Bundled Files: 2
Catalog Refreshed: 3 weeks ago
First Indexed: 2 months ago


Installation

npx veilstart add skill ancoleman/ai-design-components --skill planning-disaster-recovery

  • outputs.yaml (17.3 KB)
  • SKILL.md (12.1 KB)

Overview

This skill helps design and implement practical disaster recovery (DR) strategies across databases, Kubernetes, and cloud infrastructure. It guides teams to define RTO/RPO, choose backup and replication patterns, automate backups, and validate recovery through chaos engineering and runbooks. Outcomes include repeatable backup workflows, tested failover procedures, and compliance-ready retention policies.

How this skill works

The skill inspects system criticality and maps RTO/RPO requirements to an appropriate DR tier and toolset. It provides concrete patterns for database backups (PITR, full/incremental), cluster backup and restore (Velero, etcd snapshots), and cross-region replication (active-active, warm standby, pilot light). It also supplies testing templates for chaos experiments, automated drills, monitoring alerts, and runbooks to operationalize recovery.
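The mapping from recovery targets to a replication pattern can be sketched as a small function. This is a minimal illustration, not the skill's own logic: the `RecoveryTargets` type, the `choose_dr_pattern` name, and the numeric thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    rto_minutes: float  # maximum tolerable downtime
    rpo_minutes: float  # maximum tolerable data loss, expressed as time


def choose_dr_pattern(targets: RecoveryTargets) -> str:
    """Pick a replication pattern from the tighter of the two targets.

    The thresholds below are illustrative assumptions, not values
    taken from the skill.
    """
    tightest = min(targets.rto_minutes, targets.rpo_minutes)
    if tightest < 1:
        return "active-active"
    if tightest <= 60:
        return "warm standby"
    return "pilot light"
```

In practice the thresholds would come from the workload classification and budget, so treat them as placeholders to tune.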

When to use it

  • Defining RTO and RPO for services and data
  • Implementing PITR and automated database backups
  • Setting up Kubernetes backups and control-plane snapshots
  • Configuring cross-region replication or multi-region failover
  • Validating DR procedures with chaos engineering and automated drills
  • Meeting regulatory retention and immutable backup requirements

Best practices

  • Classify workloads by RTO/RPO and apply matching DR tier
  • Follow the 3-2-1 backup rule and encrypt backups in transit and at rest
  • Automate scheduled backups and integrity validation with monitoring/alerts
  • Run restore tests monthly for mission-critical systems and include DR drills in CI/CD
  • Use immutable storage or object locking to protect against ransomware
  • Document and maintain runbooks: detect → verify secondary → promote → update DNS → notify
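The runbook sequence in the last item (detect, verify secondary, promote, update DNS, notify) could be encoded as an ordered list of steps that halts at the first failure. The sketch below assumes each step is a callable returning `True` on success; `run_runbook` and the placeholder lambdas are hypothetical and stand in for real health checks and API calls.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


# Placeholder actions: each lambda stands in for a real check or API call.
failover_runbook: List[Step] = [
    ("detect", lambda: True),
    ("verify secondary", lambda: True),
    ("promote", lambda: True),
    ("update DNS", lambda: True),
    ("notify", lambda: True),
]
```

Stopping early matters here: promoting a secondary that failed verification would turn an outage into a data-integrity problem.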

Example use cases

  • PostgreSQL production with pgBackRest + WAL archiving for sub-5-minute RPO
  • Kubernetes cluster protection using Velero with PV snapshots and etcd snapshots for control-plane recovery
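  • Cross-region replication: Aurora Global Database for low-lag replication to read-only secondary regions with fast promotion (writes go to a single primary region), or a pilot-light pattern with Auto Scaling group (ASG) scale-up automation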
  • Chaos test: simulate primary DB failure and measure promotion time using scripted failover tests
  • Compliance: implement S3 lifecycle rules, immutability, and retention policies to support GDPR, SOC 2, and HIPAA requirements
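For the chaos-test use case, measuring promotion time amounts to starting a clock, injecting the failure, and polling until the standby reports that it has been promoted. The sketch below assumes the caller supplies both hooks; `measure_promotion_seconds`, `trigger_failure`, and `is_promoted` are hypothetical names, not part of any specific tool.

```python
import time
from typing import Callable


def measure_promotion_seconds(
    trigger_failure: Callable[[], None],
    is_promoted: Callable[[], bool],
    timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 1.0,
) -> float:
    """Trigger a failure, then poll until the standby reports promotion.

    Raises TimeoutError if promotion is not observed within the timeout.
    """
    start = time.monotonic()  # monotonic clock is unaffected by wall-clock changes
    trigger_failure()
    while True:
        if is_promoted():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_seconds:
            raise TimeoutError("standby was not promoted within the timeout")
        time.sleep(poll_interval_seconds)
```

The returned duration can then be compared against the service's RTO; the polling interval bounds the measurement's resolution.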

FAQ

How do I choose a replication pattern?

Start from the required RTO/RPO and the available budget. Active-active suits sub-minute RTO/RPO but carries the highest cost; warm standby gives minutes-level recovery at moderate cost. Use active-passive or pilot light when the budget is lower and a longer recovery time is acceptable.

How often should I test restores?

Test monthly for mission-critical systems, quarterly for important systems, and at least annually for standard services. Include automated validation in CI/CD for frequent, repeatable checks.
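The cadence above can be turned into a simple overdue check, for example in a scheduled job. This is a sketch under stated assumptions: the tier names, the `restore_test_overdue` function, and the day counts used to approximate "monthly", "quarterly", and "annually" are illustrative.

```python
from datetime import date, timedelta

# Day counts approximate "monthly", "quarterly", and "annually".
RESTORE_TEST_INTERVAL_DAYS = {
    "mission-critical": 30,
    "important": 90,
    "standard": 365,
}


def restore_test_overdue(tier: str, last_tested: date, today: date) -> bool:
    """Return True when the last restore test is older than the tier's interval."""
    interval = timedelta(days=RESTORE_TEST_INTERVAL_DAYS[tier])
    return today - last_tested > interval
```

For example, a mission-critical system last tested on 1 January is overdue by 15 February, while an important system with the same dates is not.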

planning-disaster-recovery skill by ancoleman/ai-design-components | VeilStrat