incident_skill

This skill helps you systematically manage incidents from initial containment to postmortem, minimizing impact and preventing recurrence.
  • Language: Python
  • GitHub stars: 1
  • Bundled files: 1
  • Catalog refreshed: 3 weeks ago
  • First indexed: 2 months ago

Installation

Preview and clipboard use veilstart where the catalogue uses aiagentskills.

npx veilstart add skill hikaruegashira/agent-skills --skill incident

  • SKILL.md (8.2 KB)

Overview

This skill is an autonomous incident-handling meta-skill that drives structured response from initial detection through recovery and post-incident analysis. It reduces impact, shortens mean time to recovery, and embeds repeatable prevention measures. It operationalizes triage, root-cause analysis, phased recovery, and post-mortem documentation.

How this skill works

The skill inspects incident signals and executes a playbook: it visualizes affected components and users, performs layered why-why (root cause) analysis, recommends a recovery strategy (rollback, hotfix, feature-flag, or compatibility layer), and orchestrates phased restoration steps. After recovery it generates a post-mortem with timeline, root cause, detection/mitigation gaps, and prioritized preventive actions, plus an ADR-style decision record.
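The layered why-why step described above can be pictured as a chain in which each new "why" questions the previous answer. The following is only a minimal sketch: the names WhyStep and WhyChain and the intermediate answers are hypothetical illustrations, not part of the skill's bundled SKILL.md.

```python
from dataclasses import dataclass, field


@dataclass
class WhyStep:
    question: str
    answer: str


@dataclass
class WhyChain:
    """Hypothetical container for a why-why analysis of one symptom."""
    symptom: str
    steps: list[WhyStep] = field(default_factory=list)

    def ask(self, answer: str) -> "WhyChain":
        # Each new "why" targets the previous answer (or the symptom at first).
        target = self.steps[-1].answer if self.steps else self.symptom
        self.steps.append(WhyStep(f"Why: {target}?", answer))
        return self

    def root_cause(self) -> str:
        # The deepest answer recorded so far is the candidate root cause.
        return self.steps[-1].answer if self.steps else self.symptom


# Illustrative answers only; a real analysis would come from incident evidence.
chain = (
    WhyChain("ECS tasks crash on start")
    .ask("tasks cannot reach the database")
    .ask("the security group blocks the database port")
)
print(chain.root_cause())
```

Keeping every intermediate answer, rather than only the final one, leaves a trail that can be copied into the post-mortem timeline.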

When to use it

  • On production outages or degraded service (SEV1/SEV2)
  • When errors or regressions appear after deployments or config changes
  • For recurring or unclear failures that need root-cause analysis
  • When multiple subsystems interact and failure patterns are composite
  • To formalize recovery and capture decisions after incident resolution

Best practices

  • Always start by visualizing impact: components, affected users, and severity
  • Use layered why-why analysis to drive to a single actionable root cause
  • Select recovery strategy based on immediacy and risk (rollback vs hotfix vs compatibility)
  • Define phased recovery steps with rollback criteria before making changes
  • Document decisions and trade-offs in an ADR-style record immediately after incident
  • Run a post-mortem that focuses on system fixes and process improvements, not blame
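One way to read the advice on phased recovery is that every phase carries its own health check and its own undo step, both decided before the change is made. The sketch below assumes a hypothetical Phase type and run_phases helper; the traffic percentages are invented for illustration and do not come from the skill.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Phase:
    name: str
    apply: Callable[[], None]    # makes the change for this phase
    undo: Callable[[], None]     # reverts the change for this phase
    healthy: Callable[[], bool]  # rollback criterion, fixed before the change


def run_phases(phases: list[Phase]) -> list[str]:
    """Apply phases in order; on the first failed check, undo in reverse order."""
    log: list[str] = []
    done: list[Phase] = []
    for phase in phases:
        phase.apply()
        done.append(phase)
        log.append(f"applied {phase.name}")
        if not phase.healthy():
            for applied in reversed(done):
                applied.undo()
                log.append(f"reverted {applied.name}")
            break
    return log


# Toy state standing in for a real traffic shift; the second phase fails its check.
state = {"traffic": 0}
phases = [
    Phase("10% traffic", lambda: state.update(traffic=10),
          lambda: state.update(traffic=0), lambda: True),
    Phase("100% traffic", lambda: state.update(traffic=100),
          lambda: state.update(traffic=10), lambda: False),
]
log = run_phases(phases)
print(log, state)
```

Undoing in reverse order is a design choice of this sketch: it returns the system to its pre-recovery state one step at a time, which keeps each revert as small as the change it reverses.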

Example use cases

  • API gateway change causes 502s for user endpoints — visualize impact, set SEV, and choose rollback or compatibility layer
  • ECS tasks crash on start — perform why-why to reveal network/SG misconfig and restore connectivity
  • RDS migration fails — follow phased restore from snapshots, test in staging, and validate metrics before cutover
  • Lambda timeouts under load — detect composite pattern (VPC cold start + connection pool exhaustion) and implement proxy/concurrency fixes
  • Post-incident, generate ADR noting chosen recovery approach, rationale, residual risks, and assigned follow-ups
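The ADR mentioned in the last use case could be captured as a small record with the four elements listed there: chosen approach, rationale, residual risks, and assigned follow-ups. This is a sketch under assumptions: DecisionRecord and its plain-text layout are hypothetical, and the example values are invented rather than taken from a real incident.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical ADR-style record for a recovery decision."""
    title: str
    approach: str
    rationale: str
    residual_risks: list[str] = field(default_factory=list)
    follow_ups: dict[str, str] = field(default_factory=dict)  # action -> owner

    def render(self) -> str:
        lines = [
            self.title,
            f"Decision: {self.approach}",
            f"Rationale: {self.rationale}",
            "Residual risks:",
        ]
        lines += [f"  - {risk}" for risk in self.residual_risks]
        lines.append("Follow-ups:")
        lines += [f"  - {action} (owner: {owner})"
                  for action, owner in self.follow_ups.items()]
        return "\n".join(lines)


# Invented example values, loosely based on the API gateway use case above.
record = DecisionRecord(
    title="ADR: API gateway 502s on user endpoints",
    approach="rollback",
    rationale="fastest safe path to recovery",
    residual_risks=["gateway change still needs to ship"],
    follow_ups={"add canary stage for gateway changes": "platform team"},
)
text = record.render()
print(text)
```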

FAQ

Which recovery strategy should I choose?

Pick the strategy that minimizes user impact and risk: prefer rollback for immediate recovery when it is safe, use a hotfix for small, well-contained fixes, and use a compatibility layer for broad client compatibility issues.
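The guidance above can be sketched as a small decision function. The function name, its boolean inputs, and the order in which hotfix and compatibility layer are checked are assumptions made for illustration. The source only states that rollback is preferred when safe, and it does not say what to do when no condition holds, so this sketch returns None in that case to signal that a human decision is needed.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    ROLLBACK = "rollback"
    HOTFIX = "hotfix"
    COMPATIBILITY_LAYER = "compatibility layer"


def choose_strategy(rollback_is_safe: bool,
                    fix_is_small_and_contained: bool,
                    breaks_many_clients: bool) -> Optional[Strategy]:
    # Rollback is preferred whenever it is safe, since it restores service fastest.
    if rollback_is_safe:
        return Strategy.ROLLBACK
    # Assumed ordering: a small, contained fix is tried before a compatibility layer.
    if fix_is_small_and_contained:
        return Strategy.HOTFIX
    if breaks_many_clients:
        return Strategy.COMPATIBILITY_LAYER
    # No listed condition applies: leave the call to the incident commander.
    return None


print(choose_strategy(rollback_is_safe=False,
                      fix_is_small_and_contained=False,
                      breaks_many_clients=True))
```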

How do I know when to roll back versus continue fixes in prod?

Define rollback criteria beforehand: new errors, data integrity issues, or failure to meet recovery milestones should trigger immediate rollback.
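The three triggers named above (new errors, data integrity issues, missed recovery milestones) can be written down as an explicit check before any change is made. This is a minimal sketch: RecoveryStatus, its field names, and the time-based reading of "failure to meet recovery milestones" are assumptions, not definitions from the skill.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStatus:
    """Hypothetical snapshot of a recovery attempt in progress."""
    new_error_count: int
    data_integrity_ok: bool
    milestone_reached: bool
    minutes_elapsed: float
    milestone_deadline_minutes: float


def rollback_reasons(status: RecoveryStatus) -> list[str]:
    """Return every triggered rollback criterion; an empty list means continue."""
    reasons: list[str] = []
    if status.new_error_count > 0:
        reasons.append("new errors")
    if not status.data_integrity_ok:
        reasons.append("data integrity issue")
    # Assumed interpretation: a milestone is missed once its deadline has passed.
    if (not status.milestone_reached
            and status.minutes_elapsed > status.milestone_deadline_minutes):
        reasons.append("recovery milestone missed")
    return reasons


# Invented numbers: no new errors, data intact, but the milestone is overdue.
status = RecoveryStatus(new_error_count=0, data_integrity_ok=True,
                        milestone_reached=False, minutes_elapsed=35,
                        milestone_deadline_minutes=30)
reasons = rollback_reasons(status)
print("roll back" if reasons else "continue", reasons)
```

Returning the list of reasons, rather than a bare boolean, gives the post-mortem a record of which criterion fired.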
