Self-Healing

Three layers of automated recovery, from quick one-shot fixes to deep root-cause analysis. Each step gets up to 6 fix attempts and 2 healing cycles before the pipeline requests human intervention.

Overview

Three Layers of Recovery

Each layer escalates from the previous. Most failures resolve at Layer 1. Complex issues reach Layer 2. Only persistent failures hit Layer 3.

Recovery Layers

Escalation path from first-responder fix to deep healing to coordinator retry.

Layer 1: dx-step-fix (targeted fix)
Layer 2: dx-step-fix escalation (root cause)
Layer 3: Coordinator (retry loop)

Layer 1: dx-step-fix

First responder. ONE fix attempt per invocation — apply a minimal fix, re-run the test command, report success or failure. No refactoring, no exploration.

Layer 2: dx-step-fix (escalation)

Triggered after 2 consecutive fix failures. dx-step-fix escalates to root cause analysis using extended thinking (ultrathink). Creates corrective steps with a DIFFERENT approach.

Layer 3: Coordinator

The coordinator loop (dx-step-all, dx-agent-all) orchestrates the retry cadence: fix, fix, escalate, execute corrective steps, repeat up to 2 escalation cycles.

Layer 1

dx-step-fix -- First Responder

One-shot fix attempt. Read the error, apply a minimal correction, re-verify, report.

Strategy

Read the blocked step’s error from implement.md
Apply a minimal fix (not refactoring)
Re-run the step’s test command
Success: mark step as done
Failure: mark step as blocked with diagnosis, STOP
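
The strategy above can be sketched roughly as follows. This is an illustrative outline only, not the skill's actual implementation: the `Step` class, `fix_once`, and the two callbacks are hypothetical names, and the real skill reads and writes implement.md rather than an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """Hypothetical in-memory view of one step recorded in implement.md."""
    name: str
    status: str = "blocked"           # "done" or "blocked"
    diagnosis: Optional[str] = None   # error text kept with a blocked step


def fix_once(step: Step,
             apply_minimal_fix: Callable[[Step], None],
             run_test: Callable[[Step], Optional[str]]) -> Step:
    """Make exactly one fix attempt, re-run the test command, and report.

    run_test returns None on success or an error message on failure.
    """
    apply_minimal_fix(step)           # minimal correction, no refactoring
    error = run_test(step)            # re-run the step's test command
    if error is None:
        step.status = "done"
        step.diagnosis = None
    else:
        step.status = "blocked"       # stop here; the coordinator decides next
        step.diagnosis = error
    return step
```

There is deliberately no loop inside `fix_once`: retrying is the coordinator's job, which keeps the single-attempt contract easy to reason about.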

Superpowers Integration

Optionally invokes superpowers:systematic-debugging for structured 4-phase diagnosis (Observe, Hypothesize, Test, Fix) when the error is ambiguous or multi-layered. Falls back to inline diagnosis when superpowers is not installed.

One Attempt Only

dx-step-fix makes exactly ONE attempt per invocation. If the fix does not resolve the error, it marks the step as blocked and stops. The coordinator decides what happens next.

Layer 2

dx-step-fix (escalation) -- Deep Root-Cause Analysis

After 2 consecutive fix failures, dx-step-fix switches to escalation mode (referred to below as the healer, or heal) and applies extended thinking and a fundamentally different approach.

Type A: Step Blocked

The step has Status: blocked with a diagnosis. The healer creates ONE corrective step with distinctive numbering: for an original Step 3, that is Step 3h in the first cycle and Step 3h2 in the second. The corrective step MUST use a different strategy from the original.

Type B: Review Failed

The full code review (dx-step-verify) still fails after 3 review-fix cycles. The healer groups the remaining Critical and Important issues by file and creates numbered corrective steps: R1, R2, and so on. A second healing iteration adds a b suffix.
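
The two numbering schemes can be illustrated with a small sketch. The function names and the issue tuple layout are hypothetical, and the exact form of the second-iteration label (written here as `R1b`) is an assumption; the text only states that a b suffix is used.

```python
from typing import Dict, List, Tuple


def blocked_step_label(step_number: int, cycle: int) -> str:
    """Type A label: 'Step 3h' in cycle 1, 'Step 3h2' in cycle 2."""
    suffix = "h" if cycle == 1 else f"h{cycle}"
    return f"Step {step_number}{suffix}"


def review_corrective_steps(
    issues: List[Tuple[str, str, str]],   # (file, severity, message)
    iteration: int = 1,
) -> List[Tuple[str, str, List[str]]]:
    """Type B: group Critical/Important issues by file, label R1, R2, ..."""
    by_file: Dict[str, List[str]] = {}
    for file, severity, message in issues:
        if severity in ("Critical", "Important"):   # Minor issues are skipped
            by_file.setdefault(file, []).append(message)
    suffix = "" if iteration == 1 else "b"          # assumed label format
    return [
        (f"R{n}{suffix}", file, messages)
        for n, (file, messages) in enumerate(by_file.items(), start=1)
    ]
```

Grouping by file means each corrective step touches one file, which is one plausible reading of "groups by file"; the real skill may group differently.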

Extended Thinking

Uses ultrathink mode for deep reasoning about the root cause. The healer sees the full error context, previous fix attempts, and the original step intent.

Never Writes Code

The healer NEVER writes source code directly. It only creates new steps in implement.md. The coordinator then executes those steps normally through dx-step.

Different Strategy

If the first fix failed, the corrective step MUST use a fundamentally different approach. The healer returns one of two results: healed (the coordinator continues) or unrecoverable (the coordinator stops).

Layer 3

Coordinator Loop

The full recovery flow orchestrated by dx-step-all and dx-agent-all.

Full Recovery Flow

Execute, fix (x2), heal, execute corrective steps, repeat. Maximum 2 healing cycles before human intervention.

Execute: dx-step (run step)
Fix #1: dx-step-fix (first attempt)
Fix #2: dx-step-fix (second attempt)
Heal #1: dx-step-fix escalation (new strategy)
Execute: corrective steps
Heal #2: last resort or STOP
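
The flow above can be read as a bounded retry loop. The following is a minimal sketch under stated assumptions, not the coordinator's real code: `recover` and its three callbacks are hypothetical, and each callback stands in for invoking the corresponding skill.

```python
from typing import Callable

MAX_FIXES_PER_ROUND = 2   # fix, fix
MAX_HEAL_CYCLES = 2       # at most two escalations per step


def recover(execute: Callable[[], bool],
            fix: Callable[[], bool],
            heal: Callable[[int], bool]) -> str:
    """Return 'done' or 'needs-human'.

    execute and fix return True when the step's test passes.
    heal(cycle) returns True for 'healed' (a corrective step was created)
    and False for 'unrecoverable'.
    """
    if execute():
        return "done"
    heal_cycles = 0
    while True:
        for _ in range(MAX_FIXES_PER_ROUND):
            if fix():
                return "done"
        if heal_cycles == MAX_HEAL_CYCLES:
            return "needs-human"          # budget exhausted
        heal_cycles += 1
        if not heal(heal_cycles):
            return "needs-human"          # healer judged it unrecoverable
        if execute():                     # run the corrective step
            return "done"
```

In the worst case this sketch calls `fix` six times and `heal` twice before giving up, which matches the stated per-step maximum.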

Maximum Recovery Attempts Per Step

Execution failures:

2 fix attempts (dx-step-fix)
2 healing cycles (dx-step-fix (escalation))
Each healing cycle: corrective step + 2 more fix attempts
Total: up to 6 fix attempts + 2 heal analyses

Code review failures:

3 review-fix cycles (dx-step-verify)
2 healing cycles (dx-step-fix (escalation) in coordinator)
Each healing: corrective steps + rebuild + re-review
Total: up to 9 review cycles
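
The two totals follow from simple arithmetic, shown here as a short sketch. The constant and function names are illustrative. The review total assumes that each healing cycle is followed by a full 3-cycle review loop, which is the reading that reproduces the stated figure of 9.

```python
FIX_ATTEMPTS_PER_ROUND = 2    # dx-step-fix attempts before escalating
REVIEW_CYCLES_PER_ROUND = 3   # review-fix cycles before escalating
HEALING_CYCLES = 2            # maximum escalations


def max_fix_attempts() -> int:
    # One initial round plus one round after each healing cycle.
    return FIX_ATTEMPTS_PER_ROUND * (1 + HEALING_CYCLES)


def max_review_cycles() -> int:
    # Assumes every re-review after healing may use all 3 cycles.
    return REVIEW_CYCLES_PER_ROUND * (1 + HEALING_CYCLES)


print(max_fix_attempts())    # 6
print(max_review_cycles())   # 9
```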

Verification

dx-step-verify -- The 6-Phase Gate

Runs after all steps complete, before commit. Five pre-review checks followed by deep code review.

Verification Phases

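Sequential quality gates. Each phase allows at most 2 fix attempts before escalating, with two exceptions described below: the secret scan (Phase 4) stops immediately, and the code review (Phase 6) runs up to 3 review-fix cycles.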

1. Compile: build command from config
2. Lint: lint command with auto-fix
3. Test: test command
4. Secrets: IMMEDIATE STOP
5. Architecture: convention checks
6. Code Review: Opus reviewer, 80%+ confidence

Secret Scan (Phase 4)

IMMEDIATE STOP if secrets are found. No override, no retry, no healing. The pipeline halts and requires human intervention to remove the leaked secret.
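
A minimal sketch of how the pre-review phases and the secret-scan rule might fit together. All names here (`run_gate`, `SecretsFound`, the phase tuples) are hypothetical, and the sketch covers only the five pre-review checks; the code review phase has its own loop.

```python
from typing import Callable, List, Tuple


class SecretsFound(Exception):
    """Halts the pipeline; never retried, overridden, or healed."""


Phase = Tuple[str, Callable[[], bool]]   # (name, check returning True on pass)


def run_gate(phases: List[Phase],
             try_fix: Callable[[str], None],
             max_fix_attempts: int = 2) -> str:
    """Run phases in order; return 'passed' or 'escalate:<phase name>'."""
    for name, check in phases:
        if name == "secrets":
            if not check():
                raise SecretsFound("secret detected; human intervention required")
            continue
        attempts = 0
        while not check():
            if attempts == max_fix_attempts:
                return f"escalate:{name}"   # fix budget for this phase is spent
            try_fix(name)
            attempts += 1
    return "passed"
```

Raising an exception for the secret scan, instead of returning a status, makes it hard for a caller to treat the result as just another retryable failure.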

Code Review (Phase 6)

Uses the dx-code-reviewer agent (Opus model). Issues are reported only when the reviewer's confidence is at least 80 on a 0-100 scale, so only findings the reviewer is CERTAIN about are surfaced. Severity levels: Critical > Important > Minor. The review-fix loop runs at most 3 cycles.
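
The confidence threshold and severity ordering can be illustrated with a small filter. The `Issue` class and `reportable` function are hypothetical; treating a confidence of exactly 80 as reportable follows the "80%+" wording used for this phase.

```python
from dataclasses import dataclass
from typing import List

CONFIDENCE_THRESHOLD = 80   # on a 0-100 scale
SEVERITY_ORDER = {"Critical": 0, "Important": 1, "Minor": 2}


@dataclass
class Issue:
    file: str
    severity: str      # "Critical", "Important", or "Minor"
    confidence: int    # reviewer confidence, 0-100
    message: str


def reportable(issues: List[Issue]) -> List[Issue]:
    """Drop low-confidence findings, then order by severity."""
    kept = [i for i in issues if i.confidence >= CONFIDENCE_THRESHOLD]
    # sorted() is stable, so issues of equal severity keep their input order.
    return sorted(kept, key=lambda i: SEVERITY_ORDER[i.severity])
```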

Final Verification Gate

After the review-fix loop concludes, dx-step-verify optionally invokes superpowers:verification-before-completion for a final cross-cutting check — confirming all acceptance criteria are met before marking the step done.

Data

Healing Data Captured

What gets recorded during recovery and where it lives.

Data Point	Where Stored	Used For
Step status (done/blocked)	implement.md	Flow control
Block diagnosis	implement.md (`Blocked:`)	Step-heal input
Fix attempts count	Coordinator memory (not persisted)	Strike counting
Healing cycles count	Coordinator memory (not persisted)	Max cycle enforcement
Review issues	step-verify output (transient)	Fix prioritization
Corrective step numbering	implement.md (3h, R1, etc.)	Audit trail

Known Gaps

Fix/heal counts are only in coordinator memory — lost between sessions. No pattern aggregation, no success rate tracking, and no cross-story learning. These gaps are addressed by the self-learning system (see Learning and Feedback page).

Superpowers

Superpowers Integration

Optional structured methodology hooks that enhance debugging and verification.

systematic-debugging (dx-step-fix)

When the error is ambiguous or multi-layered, dx-step-fix optionally invokes superpowers:systematic-debugging for a structured 4-phase diagnosis:

Observe — gather all error context
Hypothesize — form candidate root causes
Test — verify each hypothesis
Fix — apply the validated correction

Falls back to inline diagnosis when superpowers is not installed.

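verification-before-completion (dx-step-verify)

After the review-fix loop concludes, dx-step-verify optionally invokes superpowers:verification-before-completion for a final cross-cutting check that all acceptance criteria are met.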

Soft Dependency

All superpowers hooks use a soft-dependency pattern: if the superpowers plugin is installed, the structured methodology is invoked via the Skill tool. If not installed, the skill falls back to condensed inline guidance. No configuration needed — detection is automatic.
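
The soft-dependency pattern can be sketched as a lookup with a fallback. This is an assumption-laden illustration: the `installed_skills` dictionary is a hypothetical stand-in for whatever detection the Skill tool performs, and `inline_diagnosis` stands in for the condensed inline guidance.

```python
from typing import Callable, Dict

SKILL_NAME = "superpowers:systematic-debugging"


def inline_diagnosis(error: str) -> str:
    """Fallback used when the plugin is not installed."""
    return f"inline diagnosis: {error}"


def diagnose(error: str,
             installed_skills: Dict[str, Callable[[str], str]]) -> str:
    """Use the structured skill if present, otherwise fall back."""
    skill = installed_skills.get(SKILL_NAME)
    if skill is not None:
        return skill(error)           # structured methodology
    return inline_diagnosis(error)    # no configuration needed
```

Because the caller never checks a configuration flag, installing or removing the plugin changes behavior without any edits to the calling skill.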

Gaps

Gaps Identified

Self-learning opportunities discovered through healing analysis.

#	Gap	Impact
1	Fix/heal counts not persisted	Only in coordinator memory, lost between sessions
2	No pattern aggregation	Same fix types applied repeatedly without learning
3	No success rate tracking	Cannot identify which fix strategies work vs fail
4	No cross-story learning	Healing insights from Story A not available for Story B
5	Corrective step quality not measured	Cannot tell if heal creates better or worse steps
6	Review issue patterns not tracked	Same issues may recur across stories

Addressed by Self-Learning

These gaps are the motivation for the self-learning architecture. The .ai/learning/ directory, /dx-learn, and /dx-retro skills address gaps 1-6 by persisting fix patterns, aggregating success rates, and enabling cross-story knowledge transfer. See the Learning and Feedback page.