Self-Healing

Three layers of automated recovery, from quick one-shot fixes to deep root-cause analysis. A step gets up to 6 fix attempts and 2 healing cycles before the pipeline requests human intervention.

Overview

Three Layers of Recovery

Each layer escalates from the previous. Most failures resolve at Layer 1. Complex issues reach Layer 2. Only persistent failures hit Layer 3.

Recovery Layers

Escalation path from first-responder fix to deep healing to coordinator retry.

Layer 1: dx-step-fix (targeted fix)
Layer 2: dx-step-fix escalation (root cause)
Layer 3: Coordinator (retry loop)

Layer 1: dx-step-fix

First responder. ONE fix attempt per invocation — apply a minimal fix, re-run the test command, report success or failure. No refactoring, no exploration.

Layer 2: dx-step-fix (escalation)

Triggered after 2 consecutive fix failures. dx-step-fix escalates to root-cause analysis using extended thinking (ultrathink) and creates corrective steps that take a DIFFERENT approach.

Layer 3: Coordinator

The coordinator loop (dx-step-all, dx-agent-all) orchestrates the retry cadence: fix, fix, escalate, execute corrective steps, repeat up to 2 escalation cycles.

Layer 1

dx-step-fix -- First Responder

One-shot fix attempt. Read the error, apply a minimal correction, re-verify, report.

Strategy

  • Read the blocked step’s error from implement.md
  • Apply a minimal fix (not refactoring)
  • Re-run the step’s test command
  • Success: mark step as done
  • Failure: mark step as blocked with diagnosis, STOP
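The strategy above can be sketched as a single function that never loops. This is a minimal illustration, not the actual dx-step-fix implementation: the `Step` type and the `apply_minimal_fix` and `run_tests` callables are hypothetical stand-ins for reading implement.md, editing code, and running the step's test command.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """Hypothetical in-memory view of one step in implement.md."""
    name: str
    status: str = "blocked"
    diagnosis: str = ""


def fix_once(
    step: Step,
    apply_minimal_fix: Callable[[Step], None],
    run_tests: Callable[[Step], bool],
) -> Step:
    """Make exactly one fix attempt, re-run the tests, and report."""
    apply_minimal_fix(step)          # minimal correction only, no refactoring
    if run_tests(step):              # re-run the step's test command
        step.status = "done"
        step.diagnosis = ""
    else:
        step.status = "blocked"      # stop here; the coordinator decides next
        step.diagnosis = f"fix attempt did not resolve the error in {step.name}"
    return step
```

Because the function returns after one attempt either way, retry policy stays entirely with the caller.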

Superpowers Integration

Optionally invokes superpowers:systematic-debugging for structured 4-phase diagnosis (Observe, Hypothesize, Test, Fix) when the error is ambiguous or multi-layered. Falls back to inline diagnosis when superpowers is not installed.

One Attempt Only

dx-step-fix makes exactly ONE attempt per invocation. If the fix does not resolve the error, it marks the step as blocked and stops. The coordinator decides what happens next.

Layer 2

dx-step-fix (escalation) -- Deep Root-Cause Analysis

After 2 consecutive fix failures, dx-step-fix switches to escalation mode (the healer), which uses extended thinking and a fundamentally different approach.

Type A: Step Blocked

The step has Status: blocked with a diagnosis. The healer creates ONE corrective step with distinctive numbering: for a blocked Step 3, that is Step 3h (first cycle) and Step 3h2 (second cycle). The corrective step MUST use a different strategy from the original.

Type B: Review Failed

Full code review (dx-step-verify) failed after 3 cycles. The healer groups the remaining Critical/Important issues by file and creates numbered corrective steps: R1, R2, etc. The second iteration uses a b suffix.
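The two numbering schemes can be illustrated with a short sketch. The function names and the issue dictionary shape (`file`, `severity`, `message`) are hypothetical, and placing the b suffix after the number (R1b) is an assumption; the source only states that the second iteration uses a b suffix.

```python
from collections import defaultdict


def blocked_step_label(step_number: int, cycle: int) -> str:
    """Label for a Type A corrective step: cycle 1 -> '3h', cycle 2 -> '3h2'."""
    if cycle < 1:
        raise ValueError("cycle starts at 1")
    return f"{step_number}h" if cycle == 1 else f"{step_number}h{cycle}"


def review_corrective_steps(issues: list, iteration: int = 1) -> dict:
    """Group Critical/Important issues by file and label the groups R1, R2, ..."""
    by_file = defaultdict(list)
    for issue in issues:
        if issue["severity"] in ("Critical", "Important"):
            by_file[issue["file"]].append(issue["message"])
    suffix = "" if iteration == 1 else "b"   # assumed placement: R1b, R2b
    return {
        f"R{n}{suffix}": {"file": path, "issues": messages}
        for n, (path, messages) in enumerate(by_file.items(), start=1)
    }
```

Minor issues are dropped from the grouping, so a file that only has Minor findings produces no corrective step.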

Extended Thinking

Uses ultrathink mode for deep reasoning about the root cause. The healer sees the full error context, previous fix attempts, and the original step intent.

Never Writes Code

The healer NEVER writes source code directly. It only creates new steps in implement.md. The coordinator then executes those steps normally through dx-step.

Different Strategy

If the first fix failed, the corrective step MUST use a fundamentally different approach. Returns healed (continue) or unrecoverable (stop).

Layer 3

Coordinator Loop

The full recovery flow orchestrated by dx-step-all and dx-agent-all.

Full Recovery Flow

Execute, fix (x2), heal, execute corrective steps, repeat. Maximum 2 healing cycles before human intervention.

Execute: dx-step (run step)
Fix #1: dx-step-fix (first attempt)
Fix #2: dx-step-fix (second attempt)
Heal #1: dx-step-fix escalation (new strategy)
Execute: corrective steps
Heal #2: last resort or STOP
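The flow above can be sketched as a small loop. This is an illustration under assumptions, not the coordinator's actual code: `execute`, `fix`, and `heal` are hypothetical callables standing in for dx-step, dx-step-fix, and dx-step-fix in escalation mode, and `execute` is reused here to run the corrective step.

```python
from typing import Callable

MAX_FIX_ATTEMPTS = 2   # fix attempts before each escalation
MAX_HEAL_CYCLES = 2    # healing cycles before asking a human


def run_with_recovery(
    execute: Callable[[], bool],
    fix: Callable[[], bool],
    heal: Callable[[], bool],
) -> str:
    """Return 'done' on success or 'needs-human' when recovery is exhausted."""
    if execute():
        return "done"
    heal_cycles = 0
    while True:
        for _ in range(MAX_FIX_ATTEMPTS):
            if fix():                  # one-shot fix attempt
                return "done"
        if heal_cycles == MAX_HEAL_CYCLES:
            return "needs-human"       # both healing cycles are used up
        heal_cycles += 1
        if not heal():                 # healer reported "unrecoverable"
            return "needs-human"
        if execute():                  # run the corrective step
            return "done"
```

In the worst case this performs 6 fix attempts and 2 heal analyses, which matches the per-step limits stated in this section.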

Maximum Recovery Attempts Per Step

Execution failures:
  • 2 fix attempts (dx-step-fix)
  • 2 healing cycles (dx-step-fix (escalation))
  • Each healing cycle: corrective step + 2 more fix attempts
  • Total: up to 6 fix attempts + 2 heal analyses
Code review failures:
  • 3 review-fix cycles (dx-step-verify)
  • 2 healing cycles (dx-step-fix (escalation) in coordinator)
  • Each healing: corrective steps + rebuild + re-review
  • Total: up to 9 review cycles

Verification

dx-step-verify -- The 6-Phase Gate

Runs after all steps complete, before commit. Five pre-review checks followed by deep code review.

Verification Phases

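Sequential quality gates. Each phase allows at most 2 fix attempts before escalating, with two exceptions described below: the secret scan (Phase 4) stops immediately, and code review (Phase 6) runs up to 3 review-fix cycles.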

1. Compile: build command from config
2. Lint: lint command with auto-fix
3. Test: test command
4. Secrets: IMMEDIATE STOP
5. Architecture: convention checks
6. Code Review: Opus reviewer, 80%+ confidence

Secret Scan (Phase 4)

IMMEDIATE STOP if secrets are found. No override, no retry, no healing. The pipeline halts and requires human intervention to remove the leaked secret.
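A sequential gate with a hard stop can be sketched as follows. The `Phase` type and `run_gate` function are hypothetical; the sketch covers the pre-review phases only and models the secret scan as a phase with no fix callable, so a failure there halts without any retry.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    name: str
    check: Callable[[], bool]                  # True means the phase passed
    fix: Optional[Callable[[], None]] = None   # None means no fix is allowed


def run_gate(phases: List[Phase], max_fix_attempts: int = 2) -> str:
    """Run phases in order; return 'passed', 'stop: <phase>' or 'escalate: <phase>'."""
    for phase in phases:
        attempts = 0
        while not phase.check():
            if phase.fix is None:
                return f"stop: {phase.name}"       # no retry, no healing
            if attempts == max_fix_attempts:
                return f"escalate: {phase.name}"   # fix budget exhausted
            phase.fix()
            attempts += 1
    return "passed"
```

Later phases never run once an earlier phase stops or escalates, which is what makes the gate sequential.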

Code Review (Phase 6)

Uses the dx-code-reviewer agent (Opus model) with a confidence threshold of 80 on a 0-100 scale, so it only reports issues the reviewer is CERTAIN about. Severity: Critical > Important > Minor. The review-fix loop runs for at most 3 cycles.
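The threshold and severity ordering amount to a filter followed by a sort. The sketch below assumes a hypothetical issue shape (`severity`, `confidence`) and treats the threshold as inclusive, in line with the "80%+" wording; it is not the reviewer's actual output format.

```python
CONFIDENCE_THRESHOLD = 80   # on a 0-100 scale
SEVERITY_RANK = {"Critical": 0, "Important": 1, "Minor": 2}


def reportable_issues(issues: list) -> list:
    """Keep issues at or above the threshold, most severe first."""
    kept = [i for i in issues if i["confidence"] >= CONFIDENCE_THRESHOLD]
    # sorted() is stable, so issues of equal severity keep their original order
    return sorted(kept, key=lambda i: SEVERITY_RANK[i["severity"]])
```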

Final Verification Gate

After the review-fix loop concludes, dx-step-verify optionally invokes superpowers:verification-before-completion for a final cross-cutting check — confirming all acceptance criteria are met before marking the step done.

Data

Healing Data Captured

What gets recorded during recovery and where it lives.

Data Point | Where Stored | Used For
Step status (done/blocked) | implement.md | Flow control
Block diagnosis | implement.md (**Blocked:**) | Step-heal input
Fix attempts count | Coordinator memory (not persisted) | Strike counting
Healing cycles count | Coordinator memory (not persisted) | Max cycle enforcement
Review issues | step-verify output (transient) | Fix prioritization
Corrective step numbering | implement.md (3h, R1, etc.) | Audit trail
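As an illustration of how the block diagnosis could be read back, the sketch below extracts every line that starts with the **Blocked:** marker. The exact layout of implement.md is not specified here, so the one-diagnosis-per-line assumption and the `blocked_diagnoses` helper are hypothetical.

```python
import re

# Assumes each diagnosis sits on a single line starting with "**Blocked:**".
BLOCKED_RE = re.compile(r"^\*\*Blocked:\*\*\s*(.+)$", re.MULTILINE)


def blocked_diagnoses(implement_md: str) -> list:
    """Return the diagnosis text of every **Blocked:** line, in file order."""
    return [text.strip() for text in BLOCKED_RE.findall(implement_md)]
```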

Known Gaps

Fix/heal counts are only in coordinator memory — lost between sessions. No pattern aggregation, no success rate tracking, and no cross-story learning. These gaps are addressed by the self-learning system (see Learning and Feedback page).

Statistics

Recovery Statistics

Maximum recovery attempts before requesting human intervention.

Per-Step Execution Failures

  • 2 fix attempts (dx-step-fix)
  • 2 healing cycles (dx-step-fix (escalation))
  • Each healing cycle: new corrective step + 2 more fix attempts
  • Total: up to 6 fix attempts + 2 heal analyses

Code Review Failures

  • 3 review-fix cycles (dx-step-verify)
  • 2 healing cycles (dx-step-fix (escalation) in coordinator)
  • Each healing: corrective steps + rebuild + re-review
  • Total: up to 9 review cycles
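The two totals follow from the same arithmetic: one initial round plus one more round after each healing cycle. The constant names below are illustrative, and the review total assumes each re-review may again take up to 3 cycles, which is one reading of the figures above.

```python
FIX_ATTEMPTS_PER_ROUND = 2    # dx-step-fix attempts before each escalation
REVIEW_CYCLES_PER_ROUND = 3   # dx-step-verify review-fix cycles per round
HEALING_CYCLES = 2

# One initial round plus one more round after each healing cycle.
rounds = 1 + HEALING_CYCLES

max_fix_attempts = FIX_ATTEMPTS_PER_ROUND * rounds     # 2 * 3
max_review_cycles = REVIEW_CYCLES_PER_ROUND * rounds   # 3 * 3

print(max_fix_attempts, max_review_cycles)  # prints: 6 9
```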

Superpowers

Superpowers Integration

Optional structured methodology hooks that enhance debugging and verification.

systematic-debugging (dx-step-fix)

When the error is ambiguous or multi-layered, dx-step-fix optionally invokes superpowers:systematic-debugging for a structured 4-phase diagnosis:

  1. Observe — gather all error context
  2. Hypothesize — form candidate root causes
  3. Test — verify each hypothesis
  4. Fix — apply the validated correction

Falls back to inline diagnosis when superpowers is not installed.

verification-before-completion (dx-step-verify)

After the review-fix loop concludes, dx-step-verify optionally invokes superpowers:verification-before-completion for a final cross-cutting check — confirming all acceptance criteria are met before marking the step done. This acts as a last safety net before commit.

Soft Dependency

All superpowers hooks use a soft-dependency pattern: if the superpowers plugin is installed, the structured methodology is invoked via the Skill tool. If not installed, the skill falls back to condensed inline guidance. No configuration needed — detection is automatic.
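The soft-dependency pattern boils down to "use the plugin if present, otherwise fall back". The sketch below is a hypothetical model of that decision: `installed_skills` and `invoke_skill` stand in for however detection and the Skill tool actually work, which this page does not specify.

```python
from typing import Callable, Set

DEBUG_SKILL = "superpowers:systematic-debugging"


def inline_diagnosis(error: str) -> str:
    """Condensed fallback guidance used when the plugin is absent."""
    return f"inline diagnosis: {error}"


def diagnose(
    error: str,
    installed_skills: Set[str],
    invoke_skill: Callable[[str, str], str],
) -> str:
    """Prefer the structured skill; fall back silently if it is not installed."""
    if DEBUG_SKILL in installed_skills:
        return invoke_skill(DEBUG_SKILL, error)
    return inline_diagnosis(error)
```

The caller never has to configure anything: the same call works with or without the plugin, and only the depth of the diagnosis changes.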

Gaps

Gaps Identified

Self-learning opportunities discovered through healing analysis.

# | Gap | Impact
1 | Fix/heal counts not persisted | Only in coordinator memory, lost between sessions
2 | No pattern aggregation | Same fix types applied repeatedly without learning
3 | No success rate tracking | Cannot identify which fix strategies work vs fail
4 | No cross-story learning | Healing insights from Story A not available for Story B
5 | Corrective step quality not measured | Cannot tell if heal creates better or worse steps
6 | Review issue patterns not tracked | Same issues may recur across stories

Addressed by Self-Learning

These gaps are the motivation for the self-learning architecture. The .ai/learning/ directory, /dx-learn, and /dx-retro skills address gaps 1-6 by persisting fix patterns, aggregating success rates, and enabling cross-story knowledge transfer. See the Learning and Feedback page.

KAI by Dragan Filipovic