incident_skill

This skill helps you systematically manage incidents from initial containment to postmortem, minimizing impact and preventing recurrence.

Language: Python
GitHub stars: 1
Bundled files: 1
Catalog refreshed: 2 months ago
First indexed: 4 months ago

Installation

npx veilstrat add skill hikaruegashira/agent-skills --skill incident

Overview

This skill is an autonomous incident-handling meta-skill that drives structured response from initial detection through recovery and post-incident analysis. It reduces impact, shortens mean time to recovery, and embeds repeatable prevention measures. It operationalizes triage, root-cause analysis, phased recovery, and post-mortem documentation.

How this skill works

The skill inspects incident signals and executes a playbook: it visualizes affected components and users, performs layered why-why (root cause) analysis, recommends a recovery strategy (rollback, hotfix, feature-flag, or compatibility layer), and orchestrates phased restoration steps. After recovery it generates a post-mortem with timeline, root cause, detection/mitigation gaps, and prioritized preventive actions, plus an ADR-style decision record.
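The layered why-why step can be pictured as a chain in which each answer becomes the subject of the next question. The following is a minimal sketch, not the skill's actual implementation: the `WhyChain` class and its methods are hypothetical names, and the intermediate answers in the example are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """A layered why-why chain (hypothetical helper, not part of the skill)."""

    symptom: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        # Record one more layer; returning self allows chained calls.
        self.answers.append(answer)
        return self

    @property
    def root_cause(self) -> str:
        # The deepest recorded answer is treated as the candidate root cause.
        return self.answers[-1] if self.answers else self.symptom

    def render(self) -> str:
        # Indent each layer by its depth so the chain reads top-down.
        lines = [f"Symptom: {self.symptom}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why #{depth}: {answer}")
        return "\n".join(lines)


# Illustrative answers only; a real chain comes from incident evidence.
chain = (
    WhyChain("ECS tasks crash on start")
    .ask_why("Tasks cannot reach a required dependency")
    .ask_why("Security group blocks the required traffic")
)
print(chain.render())
```

Treating the last answer as the root cause is a simplification: in practice the chain stops when the answer is something a team can act on, which may take more or fewer layers.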

When to use it

On production outages or degraded service (SEV1/SEV2)
When errors or regressions appear after deployments or config changes
For recurring or unclear failures that need root-cause analysis
When multiple subsystems interact and failure patterns are composite
To formalize recovery and capture decisions after incident resolution

Best practices

Always start by visualizing impact: components, affected users, and severity
Use layered why-why analysis to drive to a single actionable root cause
Select recovery strategy based on immediacy and risk (rollback vs hotfix vs compatibility)
Define phased recovery steps with rollback criteria before making changes
Document decisions and trade-offs in an ADR-style record immediately after the incident
Run a post-mortem that focuses on system fixes and process improvements, not blame
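The first practice, visualizing impact, can be sketched as a small table of components and affected users with a severity derived from it. This is an illustration under stated assumptions: the `Impact` type, the function names, and especially the 50% and 5% thresholds are hypothetical and should be replaced with your own SEV definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Impact on one component (hypothetical structure)."""

    component: str
    affected_users: int
    total_users: int

    @property
    def ratio(self) -> float:
        # Fraction of users affected; 0.0 when the user count is unknown or zero.
        return self.affected_users / self.total_users if self.total_users else 0.0


def classify_severity(impacts: list[Impact]) -> str:
    # Assumed thresholds, chosen only for illustration.
    worst = max((i.ratio for i in impacts), default=0.0)
    if worst >= 0.5:
        return "SEV1"
    if worst >= 0.05:
        return "SEV2"
    return "SEV3"


def impact_table(impacts: list[Impact]) -> str:
    # One row per component: name, affected / total users, percentage.
    return "\n".join(
        f"{i.component:<14} {i.affected_users:>7} / {i.total_users:<7} {i.ratio:6.1%}"
        for i in impacts
    )


impacts = [
    Impact("api-gateway", 600, 1000),
    Impact("billing", 20, 1000),
]
print(impact_table(impacts))
print("Severity:", classify_severity(impacts))
```

Severity here follows the worst-hit component rather than an average, so one badly degraded component is not hidden by several healthy ones.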

Example use cases

API gateway change causes 502s for user endpoints: visualize impact, set the SEV level, and choose rollback or a compatibility layer
ECS tasks crash on start: perform why-why analysis to reveal a network or security group misconfiguration and restore connectivity
RDS migration fails: follow a phased restore from snapshots, test in staging, and validate metrics before cutover
Lambda timeouts under load: detect the composite pattern (VPC cold start plus connection pool exhaustion) and implement proxy and concurrency fixes
Post-incident, generate ADR noting chosen recovery approach, rationale, residual risks, and assigned follow-ups
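The ADR-style record in the last use case could be captured in a structure like the one below. It is a sketch only: the `DecisionRecord` class, its field names, and the plain-text layout are assumptions, and the sample values are invented. The fields mirror what the use case lists: chosen approach, rationale, residual risks, and assigned follow-ups.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """ADR-style record of an incident decision (hypothetical layout)."""

    title: str
    approach: str
    rationale: str
    residual_risks: list[str] = field(default_factory=list)
    follow_ups: dict[str, str] = field(default_factory=dict)  # action -> owner

    def render(self) -> str:
        lines = [
            f"ADR: {self.title}",
            "",
            f"Decision: {self.approach}",
            f"Rationale: {self.rationale}",
            "",
            "Residual risks:",
        ]
        # Fall back to an explicit placeholder so empty sections stay visible.
        lines += [f"- {risk}" for risk in self.residual_risks] or ["- none recorded"]
        lines += ["", "Follow-ups:"]
        lines += [
            f"- {action} (owner: {owner})" for action, owner in self.follow_ups.items()
        ] or ["- none recorded"]
        return "\n".join(lines)


# Sample values are illustrative, not taken from a real incident.
record = DecisionRecord(
    title="502s after API gateway change",
    approach="rollback",
    rationale="Fastest safe path to recovery; no data migration involved",
    residual_risks=["Original change still needs to ship"],
    follow_ups={"Add canary stage for gateway config": "platform team"},
)
print(record.render())
```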

FAQ

How do I choose a recovery strategy?

Pick the strategy that minimizes user impact and risk: prefer rollback for immediate recovery when it is safe, use a hotfix for small, well-contained fixes, and use a compatibility layer for broad client compatibility issues.
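The guidance above can be read as an ordered set of checks. The sketch below encodes one possible ordering; the `Strategy` enum, the `choose_strategy` function, and its boolean inputs are hypothetical, and the precedence of compatibility layer over hotfix is an assumption, not something the skill specifies.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    ROLLBACK = "rollback"
    HOTFIX = "hotfix"
    COMPATIBILITY_LAYER = "compatibility layer"


def choose_strategy(
    *,
    rollback_is_safe: bool,
    small_contained_fix: bool,
    broad_client_incompatibility: bool,
) -> Optional[Strategy]:
    # Rollback comes first: it gives immediate recovery whenever it is safe.
    if rollback_is_safe:
        return Strategy.ROLLBACK
    # Assumed ordering: broad client breakage is handled before a targeted fix.
    if broad_client_incompatibility:
        return Strategy.COMPATIBILITY_LAYER
    if small_contained_fix:
        return Strategy.HOTFIX
    # No option clearly fits; a human decision is needed.
    return None


choice = choose_strategy(
    rollback_is_safe=False,
    small_contained_fix=True,
    broad_client_incompatibility=False,
)
print(choice.value if choice else "escalate")
```

Returning `None` instead of a default keeps the ambiguous case explicit, so the caller has to escalate rather than silently apply a risky change.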

How do I know when to roll back versus continue fixes in prod?

Define rollback criteria beforehand: new errors, data integrity issues, or failure to meet recovery milestones should trigger immediate rollback.
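Criteria defined beforehand are easiest to apply when they are checked mechanically after each recovery phase. The following sketch assumes a hypothetical `PhaseObservation` record and `rollback_reasons` function; the three checks correspond to the triggers named above, while the field names and the deadline-based milestone test are illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseObservation:
    """What was observed after one recovery phase (hypothetical fields)."""

    new_error_types: int
    data_integrity_ok: bool
    milestone_reached: bool
    elapsed_minutes: float
    milestone_deadline_minutes: float


def rollback_reasons(obs: PhaseObservation) -> list[str]:
    """Return every triggered rollback criterion; an empty list means continue."""
    reasons = []
    if obs.new_error_types > 0:
        reasons.append("new errors observed")
    if not obs.data_integrity_ok:
        reasons.append("data integrity issue")
    # A milestone counts as missed only once its deadline has passed.
    if not obs.milestone_reached and obs.elapsed_minutes > obs.milestone_deadline_minutes:
        reasons.append("recovery milestone missed")
    return reasons


obs = PhaseObservation(
    new_error_types=0,
    data_integrity_ok=True,
    milestone_reached=False,
    elapsed_minutes=45,
    milestone_deadline_minutes=30,
)
reasons = rollback_reasons(obs)
print("ROLL BACK:" if reasons else "CONTINUE", ", ".join(reasons))
```

Collecting all triggered reasons, rather than stopping at the first, gives the post-mortem timeline a complete record of why the rollback happened.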