systematic-debugging_skill

This skill enables systematic debugging to identify root causes using a 4-phase approach, improving reproducibility and fix quality.

Python
0 GitHub Stars
1 Bundled File
Catalog Refreshed: 2 months ago
First Indexed: 4 months ago

Installation

Preview and clipboard use veilstrat where the catalogue uses aiagentskills.

npx veilstrat add skill chunkytortoise/enterprisehub --skill systematic-debugging

SKILL.md (11.9 KB)

Overview

This skill teaches a structured 4-phase approach to systematic debugging and root cause analysis. It helps you reproduce issues, gather evidence, form testable hypotheses, and validate fixes so problems are resolved reliably and without guesswork.

How this skill works

The method divides debugging into four phases: Reproduce, Gather, Hypothesize, and Test. You first create a minimal, repeatable reproduction; then collect logs, metrics, and change history; then form prioritized, testable hypotheses about the cause; and finally validate fixes, changing one variable at a time and documenting the results.
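As a rough illustration of the Reproduce phase, the sketch below measures how often a failure occurs under a fixed random seed, so the same sequence of runs can be repeated exactly. The names `failure_rate` and `flaky_handler`, and the 20% failure probability, are hypothetical stand-ins and not part of the skill itself.

```python
import random


def failure_rate(fn, runs, seed):
    """Run fn repeatedly with a seeded RNG and return the fraction of runs that raise."""
    rng = random.Random(seed)  # fixed seed makes the sequence of runs repeatable
    failures = 0
    for _ in range(runs):
        try:
            fn(rng)
        except Exception:
            failures += 1
    return failures / runs


def flaky_handler(rng):
    """Hypothetical stand-in for an intermittently failing code path."""
    if rng.random() < 0.2:
        raise RuntimeError("intermittent failure")


first = failure_rate(flaky_handler, runs=200, seed=42)
second = failure_rate(flaky_handler, runs=200, seed=42)
print(f"failure rate: {first:.2%} (repeat run: {second:.2%})")
```

Because the seed is fixed, both measurements are identical, which gives a stable baseline to compare against once a candidate fix is applied.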

When to use it

When you need to find the root cause of a recurring or hard-to-reproduce bug
When a production incident requires systematic investigation and clear evidence
When changes or fixes should be validated to avoid regressions
When troubleshooting complex interactions across services, DB, network, or async code
When you want to convert ad-hoc debugging into reproducible diagnostics

Best practices

Create a minimal reproducible case and document exact steps and environment
Collect direct, circumstantial, and historical evidence before changing code
Form multiple prioritized, testable hypotheses and define clear validation criteria
Change only one variable per test and log outcomes to avoid compounding changes
Use targeted tools: interactive debuggers, structured logging, profilers, and system metrics
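The "one variable per test" practice can be enforced mechanically. The following is a minimal sketch, assuming test configurations are plain dictionaries; `changed_keys`, `require_single_change`, and the configuration keys are illustrative names, not part of the skill.

```python
_MISSING = object()  # sentinel so a missing key differs from a key set to None


def changed_keys(baseline, candidate):
    """Return the sorted keys whose values differ between two configurations."""
    keys = baseline.keys() | candidate.keys()
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def require_single_change(baseline, candidate):
    """Return the one changed key, or raise if zero or several variables changed."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(
            f"expected exactly one changed variable, found {len(diff)}: {diff}"
        )
    return diff[0]


# Hypothetical configurations for two test runs.
baseline = {"pool_size": 5, "timeout_s": 30, "cache": True}
candidate = {"pool_size": 5, "timeout_s": 60, "cache": True}

variable_under_test = require_single_change(baseline, candidate)
print(f"this run isolates: {variable_under_test}")
```

Calling `require_single_change` before each experiment makes it harder to attribute a result to the wrong change.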

Example use cases

A web endpoint intermittently returns 500 — reproduce with a minimal request and inspect logs and DB queries
A background job leaks memory — gather memory profiles and trace object lifetimes to identify leaks
A race condition appears under load — simulate concurrent runs, add locks or atomic operations, and verify
An API integration fails for some users — collect request/response traces and check auth, timeouts, and schema
A slow query degrades performance — run EXPLAIN ANALYZE, check indexes, and profile hotspot code
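For the race-condition case, one possible approach is to force the problematic interleaving so the bug reproduces every time instead of only under load. The sketch below uses a `threading.Barrier` to make two threads read a shared value before either writes it, then shows a lock-based fix. The `Counter` class and function names are hypothetical.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0


def unsafe_increment(counter, barrier):
    current = counter.value      # read
    barrier.wait()               # both threads have read before either writes
    counter.value = current + 1  # write based on a stale read: one update is lost


def safe_increment(counter, lock):
    with lock:                   # read-modify-write happens atomically
        counter.value += 1


def run_pair(target, *args):
    """Run the same target in two threads and wait for both to finish."""
    threads = [threading.Thread(target=target, args=args) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


racy = Counter()
run_pair(unsafe_increment, racy, threading.Barrier(2, timeout=5))

fixed = Counter()
run_pair(safe_increment, fixed, threading.Lock())

print(f"without lock: {racy.value}, with lock: {fixed.value}")
```

The barrier exists only to make the reproduction deterministic; the fix itself is the lock around the read-modify-write step.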

FAQ

How do I prioritize hypotheses?

Prioritize hypotheses by likelihood and impact using the available evidence; start with low-cost, high-probability tests (configuration, recent changes, obvious null handling).
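One way to make this ordering explicit is to score each hypothesis. The sketch below assumes a simple score of likelihood times impact divided by test cost; that formula, the rating scales, and the example hypotheses are illustrative assumptions rather than anything the skill prescribes.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    likelihood: float  # estimated probability, 0.0 to 1.0
    impact: int        # 1 (minor) to 5 (severe) if this is the cause
    cost: int          # 1 (minutes) to 5 (days) of effort to test

    @property
    def score(self):
        # Assumed heuristic: favor likely, high-impact causes that are cheap to test.
        return self.likelihood * self.impact / self.cost


def prioritize(hypotheses):
    """Return hypotheses ordered from highest to lowest score."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


# Hypothetical candidates for an intermittent failure.
hypotheses = [
    Hypothesis("Rare network stack bug", likelihood=0.05, impact=5, cost=5),
    Hypothesis("Recent config change shortened a timeout", likelihood=0.6, impact=4, cost=1),
    Hypothesis("Unhandled null in an optional field", likelihood=0.5, impact=3, cost=1),
]

ordered = prioritize(hypotheses)
for h in ordered:
    print(f"{h.score:5.2f}  {h.description}")
```

The exact weights matter less than writing the estimates down, so the order of tests can be revisited as new evidence arrives.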

What if I can't reproduce the issue locally?

Capture detailed production evidence (logs, traces, metrics) and create a minimal simulation of the production environment; consider feature flags, toggling config, or recording network traffic to reproduce conditions.
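As a minimal sketch of recording production inputs for later replay, the decorator below writes each call's arguments and outcome as one JSON line, and `replay` feeds those lines back into the function locally. It assumes the arguments are JSON-serializable; `record_calls`, `replay`, and `parse_amount` are hypothetical names, and an in-memory buffer stands in for a real log sink.

```python
import functools
import io
import json


def record_calls(sink):
    """Decorator factory: log each call's arguments and error type to sink as JSON lines."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            entry = {"args": list(args), "kwargs": kwargs, "error": None}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                entry["error"] = type(exc).__name__
                raise
            finally:
                sink.write(json.dumps(entry) + "\n")
        return wrapper
    return decorator


def replay(fn, recorded_lines):
    """Re-run fn on recorded inputs and return the (entry, error name) pairs that fail."""
    failures = []
    for line in recorded_lines:
        entry = json.loads(line)
        try:
            fn(*entry["args"], **entry["kwargs"])
        except Exception as exc:
            failures.append((entry, type(exc).__name__))
    return failures


def parse_amount(text):
    """Hypothetical function under investigation."""
    return float(text)


sink = io.StringIO()  # stands in for a production log file or stream
recorded_parse = record_calls(sink)(parse_amount)
for text in ["12.5", "1,000", "7"]:
    try:
        recorded_parse(text)
    except ValueError:
        pass  # production code would handle or report this

failures = replay(parse_amount, sink.getvalue().splitlines())
print(failures)
```

Replaying the recorded lines isolates the inputs that trigger the failure, which can then be reduced to a minimal local reproduction.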