incident_skill
Language: Python
GitHub Stars: 1
Bundled Files: 1
Catalog Refreshed: 3 weeks ago
First Indexed: 2 months ago
Installation
Preview and clipboard use veilstart where the catalogue uses aiagentskills.
npx veilstart add skill hikaruegashira/agent-skills --skill incident-
Bundled file: SKILL.md (8.2 KB)
Overview
This skill is an autonomous incident-handling meta-skill that drives a structured response from initial detection through recovery and post-incident analysis. It aims to reduce impact, shorten mean time to recovery (MTTR), and embed repeatable prevention measures by operationalizing triage, root-cause analysis, phased recovery, and post-mortem documentation.
How this skill works
The skill inspects incident signals and executes a playbook: it visualizes affected components and users, performs layered why-why (root cause) analysis, recommends a recovery strategy (rollback, hotfix, feature-flag, or compatibility layer), and orchestrates phased restoration steps. After recovery it generates a post-mortem with timeline, root cause, detection/mitigation gaps, and prioritized preventive actions, plus an ADR-style decision record.
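The layered why-why step can be pictured as a chain in which each answer becomes the subject of the next question. The following is a minimal sketch, not the skill's actual implementation: the `WhyChain` class and the intermediate answers are hypothetical, built around the ECS use case listed further down.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """Hypothetical container for a layered why-why analysis."""
    symptom: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        # Record one more layer of causation; return self so calls can chain.
        self.answers.append(answer)
        return self

    def root_cause(self) -> str:
        # The deepest recorded answer is treated as the candidate root cause.
        return self.answers[-1] if self.answers else self.symptom

    def render(self) -> str:
        # Indent each layer one step further than the previous one.
        lines = [f"Symptom: {self.symptom}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why {depth}: {answer}")
        return "\n".join(lines)


# Illustrative answers only; a real analysis would come from incident evidence.
chain = (
    WhyChain("ECS tasks crash on start")
    .ask_why("Startup health checks time out")
    .ask_why("The task cannot reach its dependency over the network")
    .ask_why("The security group has no rule allowing the task's traffic")
)
print(chain.render())
print("Candidate root cause:", chain.root_cause())
```

Stopping at the first answer that names something a team can change is what makes the root cause actionable; the depth of three used here is arbitrary.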
When to use it
- On production outages or degraded service (SEV1/SEV2)
- When errors or regressions appear after deployments or config changes
- For recurring or unclear failures that need root-cause analysis
- When multiple subsystems interact and failure patterns are composite
- To formalize recovery and capture decisions after incident resolution
Best practices
- Always start by visualizing impact: components, affected users, and severity
- Use layered why-why analysis to drive to a single actionable root cause
- Select recovery strategy based on immediacy and risk (rollback vs hotfix vs compatibility)
- Define phased recovery steps with rollback criteria before making changes
- Document decisions and trade-offs in an ADR-style record immediately after incident
- Run a post-mortem that focuses on system fixes and process improvements, not blame
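The practice of defining phased recovery steps with rollback criteria up front could be sketched as below. This is an illustration under assumptions, not the skill's code: `Phase`, `run_phases`, and the traffic-shifting example are hypothetical names, and each phase carries its own health check and undo action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Phase:
    name: str
    apply: Callable[[], None]    # makes the change for this phase
    healthy: Callable[[], bool]  # rollback criterion: False aborts the recovery
    undo: Callable[[], None]     # reverts this phase's change


def run_phases(phases: list[Phase]) -> tuple[bool, list[str]]:
    """Apply phases in order; on a failed check, undo applied phases in reverse."""
    log: list[str] = []
    done: list[Phase] = []
    for phase in phases:
        phase.apply()
        done.append(phase)
        log.append(f"applied {phase.name}")
        if not phase.healthy():
            log.append(f"check failed after {phase.name}")
            for prior in reversed(done):
                prior.undo()
                log.append(f"rolled back {prior.name}")
            return False, log
    return True, log


# Hypothetical example: shift traffic in two steps; the second check fails.
traffic = {"percent": 0}


def set_traffic(percent: int) -> Callable[[], None]:
    def _set() -> None:
        traffic["percent"] = percent
    return _set


phases = [
    Phase("canary 10%", set_traffic(10), lambda: True, set_traffic(0)),
    Phase("half 50%", set_traffic(50), lambda: False, set_traffic(10)),
]
ok, log = run_phases(phases)
print(ok, traffic["percent"])
print("\n".join(log))
```

Because every phase declares its undo action before anything runs, the rollback path is decided in advance rather than improvised mid-incident.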
Example use cases
- API gateway change causes 502s for user endpoints: visualize impact, set the SEV level, and choose rollback or a compatibility layer
- ECS tasks crash on start: perform why-why analysis to reveal a network or security group (SG) misconfiguration and restore connectivity
- RDS migration fails: follow a phased restore from snapshots, test in staging, and validate metrics before cutover
- Lambda timeouts under load: detect the composite pattern (VPC cold start + connection pool exhaustion) and implement proxy/concurrency fixes
- Post-incident, generate ADR noting chosen recovery approach, rationale, residual risks, and assigned follow-ups
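The ADR-style record in the last use case could be modelled roughly as follows. The field names mirror the items listed above (chosen approach, rationale, residual risks, follow-ups), but the `DecisionRecord` class, its plain-text layout, and all example values are assumptions rather than the skill's real output format.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical ADR-style record captured after an incident."""
    title: str
    chosen_approach: str
    rationale: str
    residual_risks: list[str] = field(default_factory=list)
    follow_ups: dict[str, str] = field(default_factory=dict)  # action -> owner

    def to_text(self) -> str:
        lines = [
            f"ADR: {self.title}",
            f"Decision: {self.chosen_approach}",
            f"Rationale: {self.rationale}",
            "Residual risks:",
        ]
        # Fall back to an explicit placeholder so empty sections stay visible.
        lines += [f"- {risk}" for risk in self.residual_risks] or ["- none recorded"]
        lines.append("Follow-ups:")
        lines += [
            f"- {action} (owner: {owner})"
            for action, owner in self.follow_ups.items()
        ] or ["- none recorded"]
        return "\n".join(lines)


# Example values are invented for illustration.
record = DecisionRecord(
    title="API gateway change caused 502s on user endpoints",
    chosen_approach="Roll back the gateway change",
    rationale="Rollback was judged the fastest safe way to restore user endpoints",
    residual_risks=["The original change still needs to ship"],
    follow_ups={"Add a contract test for gateway routes": "platform-team"},
)
print(record.to_text())
```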
FAQ
How do I choose a recovery strategy?
Pick the strategy that minimizes user impact and risk: prefer rollback for immediate recovery when it is safe; use a hotfix for small, well-contained fixes; use a compatibility layer for broad client compatibility issues.
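The guidance above can be written down as a small decision function. This is a sketch under assumptions: the `Strategy` enum, the three boolean inputs, and the order in which the checks run are illustrative choices, and returning `None` stands for "no clear fit, escalate to a human".

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    ROLLBACK = "rollback"
    HOTFIX = "hotfix"
    COMPATIBILITY_LAYER = "compatibility layer"


def choose_strategy(
    rollback_is_safe: bool,
    fix_is_small_and_contained: bool,
    breaks_many_clients: bool,
) -> Optional[Strategy]:
    # Rollback comes first: it is the fastest path when it is safe.
    if rollback_is_safe:
        return Strategy.ROLLBACK
    # Broad client breakage is handled before a narrow hotfix (assumed ordering).
    if breaks_many_clients:
        return Strategy.COMPATIBILITY_LAYER
    if fix_is_small_and_contained:
        return Strategy.HOTFIX
    # None of the rules apply; leave the decision to the responders.
    return None


print(choose_strategy(True, False, False))
print(choose_strategy(False, True, False))
print(choose_strategy(False, False, True))
```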
How do I know when to roll back versus continue fixes in prod?
Define rollback criteria beforehand: new errors, data integrity issues, or failure to meet recovery milestones should trigger immediate rollback.
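As an illustration, the three triggers named in this answer can be checked with a single predicate. The `RecoveryStatus` fields and function names below are hypothetical; how each signal is actually measured is left open.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStatus:
    new_error_types: int     # error signatures not seen before the change
    data_integrity_ok: bool  # result of whatever integrity checks are in place
    milestones_met: bool     # recovery milestones reached on schedule


def rollback_reasons(status: RecoveryStatus) -> list[str]:
    """Return every pre-agreed rollback criterion that has been hit."""
    reasons: list[str] = []
    if status.new_error_types > 0:
        reasons.append("new errors")
    if not status.data_integrity_ok:
        reasons.append("data integrity issue")
    if not status.milestones_met:
        reasons.append("recovery milestone missed")
    return reasons


def should_roll_back(status: RecoveryStatus) -> bool:
    # Any single criterion is enough to trigger an immediate rollback.
    return bool(rollback_reasons(status))


status = RecoveryStatus(new_error_types=2, data_integrity_ok=True, milestones_met=False)
print(should_roll_back(status), rollback_reasons(status))
```

Returning the list of reasons alongside the boolean gives the post-mortem timeline a record of why the rollback was triggered.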