analysis_claude_code/s11_error_recovery
gui-yue 1baf1aca5a Follow up PR #265: refine chapters, diagrams, and add S20 (#283)
* feat: s01-s14 docs quality overhaul — tool pipeline, single-agent, knowledge & resilience

Rewrite code.py and README (zh/en/ja) for s01-s14, each chapter building
incrementally on the previous. Key fixes across chapters:

- s01-s04: agent loop, tool dispatch, permission pipeline, hooks
- s05-s08: todo write, subagent, skill loading, context compact
- s09-s11: memory system, system prompt assembly, error recovery
- s12-s14: task graph, background tasks, cron scheduler

All chapters CC source-verified. Code inherits fixes forward (PROMPT_SECTIONS,
json.dumps cache, real-state context, can_start dep protection, etc.).

* feat: s15-s19 docs quality overhaul — multi-agent platform: teams, protocols, autonomy, worktree, MCP tools

Rewrite code.py and README (zh/en/ja) for s15-s19, the multi-agent platform
chapters. Each chapter inherits all previous fixes and adds one mechanism:

- s15: agent teams (TeamCreate, teammate threads, shared task list)
- s16: team protocols (plan approval, shutdown handshake, consume_inbox)
- s17: autonomous agents (idle polling, auto-claim, consume_lead_inbox)
- s18: worktree isolation (git worktree, bind_task, cwd switching, safety)
- s19: MCP tools (MCPClient, normalize_mcp_name, assemble_tool_pool, no cache)

All appendix source code references verified against CC source. Config priority
corrected: claude.ai < plugin < user < project < local.

* fix: 5 regressions across s05-s19 — glob safety, todo validation, memory extraction, protocol types, dep crash

- s05-s09: glob results now filter with is_relative_to(WORKDIR) (inherited from s02)
- s06-s08: todo_write validates content/status required fields (inherited from s05)
- s09: extract_memories uses pre-compression snapshot instead of compacted messages
- s16: submit_plan docstring clarifies protocol-only (not code-level gate)
- s17-s19: match_response restores type mismatch validation (from s16)
- s17-s19: claim_task deps list handles missing dep files without crashing

* fix: s12 Todo V2 logic reversal, s14/s15 cron range validation, s18/s19 worktree name validation

- s12 README (zh/en/ja): fix Todo V2 direction — interactive defaults to Task,
  non-interactive/SDK defaults to TodoWrite. Fix env var name to
  CLAUDE_CODE_ENABLE_TASKS (not TODO_V2).
- s14/s15: add _validate_cron_field with per-field range checks (minute 0-59,
  hour 0-23, dom 1-31, month 1-12, dow 0-6), step > 0, range lo <= hi.
  Replace old try/except validation that only caught exceptions.
- s18/s19: add validate_worktree_name() to remove_worktree and keep_worktree,
  not just create_worktree.

* fix: align s16-s19 teaching tool consistency

* fix pr265 chapter diagrams

* Add comprehensive s20 harness chapter

* Fix chapter smoke test regressions

* Clarify README tutorial track transition

---------

Co-authored-by: Haoran <bill-billion@outlook.com>
2026-05-20 21:45:38 +08:00

s11: Error Recovery — Errors aren't the end, they're the start of a retry

中文 · English · 日本語

s01 → ... → s09 → s10 → s11 → s12 → s13 → ... → s20

"Errors aren't the end, they're the start of a retry" — escalate tokens, compact context, switch models.

Harness layer: Resilience — classify and recover when the main loop hits errors.


The Problem

The Agent is running along and then errors out:

Error: 529 overloaded

The Agent crashes. It doesn't retry, doesn't switch models, doesn't reduce context — it just crashes.

In production, API errors are the norm. The three most common failure modes: truncated output (the model runs out of tokens mid-sentence), context overflow (still too long even after compaction), and transient failures (429 rate limiting / 529 overload). An Agent that doesn't handle errors is like a car that stalls at the slightest touch.


Solution

Error Recovery Overview

The loop and prompt assembly from s10 are fully preserved. The only change: the LLM call is wrapped in try/except, with different recovery paths based on error type. After recovery, continue loops back to the top to call the LLM again.

The table below lists the three most common recovery patterns. For transient failures the teaching version only handles 429/529; real systems also cover connection errors, timeouts, cloud vendor credential caches, and more. CC actually has 13+ reason codes; see the Deep Dive for the rest.

| Pattern | Trigger | Recovery Action |
|---|---|---|
| Output truncated | max_tokens | Escalate 8K→64K / continuation prompt |
| Context overflow | prompt_too_long | Reactive compact → retry |
| Transient failure | 429 / 529 | Exponential backoff + jitter, fallback model on consecutive 529 |
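The three rows can be read as a small dispatch function. The sketch below is illustrative only: `Recovery` and `classify_failure` are hypothetical names that do not appear in the chapter's code, and it assumes the caller has already extracted a stop reason, an HTTP status code, and an error message from the response or exception.

```python
from enum import Enum
from typing import Optional


class Recovery(Enum):
    ESCALATE_OR_CONTINUE = "escalate max_tokens, then continuation prompt"
    REACTIVE_COMPACT = "reactive compact, then retry"
    BACKOFF = "exponential backoff with jitter"
    NONE = "nothing to recover in this sketch"


def classify_failure(stop_reason: Optional[str] = None,
                     status_code: Optional[int] = None,
                     error_message: str = "") -> Recovery:
    """Map the three triggers from the table to a recovery action (hypothetical helper)."""
    if stop_reason == "max_tokens":
        return Recovery.ESCALATE_OR_CONTINUE
    # Assumption: an overflow error can be recognised by this marker in its message.
    if "prompt_too_long" in error_message:
        return Recovery.REACTIVE_COMPACT
    if status_code in (429, 529):
        return Recovery.BACKOFF
    return Recovery.NONE
```

In the chapter's actual loop these checks live in three different places (a `stop_reason` check, an outer `except`, and `with_retry`); collapsing them into one function is only a way to see the mapping at a glance.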

How It Works

Path 1: Output Truncated

The model runs out of tokens mid-sentence — max_tokens is exhausted. The default 8000 tokens isn't enough for a complete response.

On the first occurrence, escalate max_tokens from 8K to 64K (8x the space) and retry the same request — the truncated output is NOT appended to messages, keeping the original request intact. If 64K is still not enough, save the truncated output and inject a continuation prompt telling the model to pick up where it left off, up to 3 times:

if response.stop_reason == "max_tokens":
    # First escalation: don't append truncated output, retry same request
    if not state.has_escalated:
        max_tokens = ESCALATED_MAX_TOKENS
        state.has_escalated = True
        continue  # messages unchanged, same request with more tokens
    # 64K still truncated: save output + continuation prompt
    messages.append({"role": "assistant", "content": response.content})
    if state.recovery_count < MAX_RECOVERY_RETRIES:
        messages.append({"role": "user", "content":
            "Output token limit hit. Resume directly — "
            "no apology, no recap. Pick up mid-thought."})
        state.recovery_count += 1
        continue
    return  # still truncated after 3 continuations
# Normal: append after max_tokens check
messages.append({"role": "assistant", "content": response.content})

Escalation gets one chance; continuation gets up to 3. After that, exit — further continuations won't produce meaningful output.

Path 2: Context Overflow

The LLM says "your context is too long" (prompt_too_long). All four compaction layers from s08 have already run, and it's still over the limit.

Trigger reactive compact, which is more aggressive than auto compact. The teaching version simulates compaction by keeping only the last 5 messages; real CC generates a compact summary via LLM. Either way, the request is retried with the compacted message list. If it's still over the limit after one compaction, the only option is to exit, because compacting again won't make it any smaller:

except PromptTooLongError:
    if not state.has_attempted_reactive_compact:
        messages[:] = reactive_compact(messages)
        state.has_attempted_reactive_compact = True
        continue
    return  # Already compacted and still over limit — must exit
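The two helpers used above are not shown in this section. A minimal sketch of what they might look like, assuming the "keep the last 5 messages" behavior described for the teaching version; the function bodies, the `keep_last` parameter, and the error-message markers are assumptions, not the chapter's actual code:

```python
def reactive_compact(messages, keep_last=5):
    """Teaching-style compaction: keep only the most recent messages."""
    kept = list(messages[-keep_last:])
    # Assumption: the conversation should open with a user turn, so drop
    # any leading non-user messages left over from the cut.
    while kept and kept[0].get("role") != "user":
        kept.pop(0)
    return kept


def is_prompt_too_long_error(exc):
    """Hypothetical check: look for an overflow marker in the error text."""
    text = str(exc).lower()
    return "prompt is too long" in text or "prompt_too_long" in text
```

A tail cut like this can also separate a tool result from the tool call it answers; a real implementation would need to keep such pairs together, which this sketch does not attempt.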

Path 3: Transient Failures

Network blips, 429 rate limiting, 529 overload — these aren't bugs, they're normal in distributed systems.

Both 429 and 529 use exponential backoff + jitter: wait about 0.5 seconds after the first failure, 1 second after the second, 2 seconds after the third, and so on, for up to 10 attempts. Random jitter prevents concurrent requests from all retrying at the same instant. After three consecutive 529 overload errors, switch to the fallback model (if the FALLBACK_MODEL_ID environment variable is configured):

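def retry_delay(attempt, retry_after=None):
    # A server-provided Retry-After value (seconds) takes priority
    if retry_after:
        return retry_after
    base = min(500 * (2 ** attempt), 32000) / 1000  # seconds, capped at 32s
    return base + random.uniform(0, base * 0.25)    # plus 0-25% jitter

def with_retry(fn, state, max_retries=10):
    for attempt in range(max_retries):
        try:
            return fn()
        except (RateLimitError, OverloadedError) as e:
            # Pass retry_after here if the error exposes the header value
            delay = retry_delay(attempt)
            time.sleep(delay)
            if isinstance(e, OverloadedError):  # 529 only; 429 just backs off
                state.consecutive_529 += 1
                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
                    # fn reads state.current_model, so the next attempt uses the fallback
                    state.current_model = FALLBACK_MODEL
    raise MaxRetriesExceeded()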

Backoff formula: min(500 × 2^attempt, 32000) ms + random(0~25%), with attempt counted from 0. If the server returns a Retry-After header, that value takes priority.
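To make the formula concrete, the base delays for all ten attempts can be tabulated directly. This is a worked example of the arithmetic only; `base_delay_seconds` and `jitter_range_seconds` are helper names introduced here for illustration.

```python
def base_delay_seconds(attempt):
    """Base delay before jitter, for a 0-indexed attempt number."""
    return min(500 * (2 ** attempt), 32000) / 1000


def jitter_range_seconds(attempt):
    """Jitter is drawn uniformly from 0 up to 25% of the base delay."""
    base = base_delay_seconds(attempt)
    return (0.0, base * 0.25)


schedule = [base_delay_seconds(a) for a in range(10)]
print(schedule)
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 32.0, 32.0, 32.0]
```

The cap is reached on the seventh attempt (index 6), and the ten base delays add up to 159.5 seconds before jitter.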

Putting It All Together

def agent_loop(messages, context):
    system = get_system_prompt(context)
    state = RecoveryState()
    max_tokens = 8000

    while True:
        try:
            response = with_retry(
                lambda: client.messages.create(
                    model=state.current_model, system=system,
                    messages=messages, tools=TOOLS,
                    max_tokens=max_tokens),
                state)
        except Exception as e:
            if is_prompt_too_long_error(e):
                if not state.has_attempted_reactive_compact:
                    messages[:] = reactive_compact(messages)
                    state.has_attempted_reactive_compact = True
                    continue
                return
            log_error(e)
            return

        # max_tokens check BEFORE appending to messages
        if response.stop_reason == "max_tokens":
            if not state.has_escalated:
                max_tokens = ESCALATED_MAX_TOKENS  # 64000
                state.has_escalated = True
                continue  # retry same request, messages unchanged
            # save truncated output, then ask for a continuation (up to 3 times)
            messages.append({"role": "assistant", "content": response.content})
            if state.recovery_count < MAX_RECOVERY_RETRIES:
                messages.append({"role": "user", "content": CONTINUATION_PROMPT})
                state.recovery_count += 1
                continue
            return  # still truncated after 3 continuations
        # Normal completion
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            return
        # ... tool execution ...

The outer try/except catches API exceptions (prompt_too_long, etc.), with_retry handles transient errors (429/529), and stop_reason checks handle truncation. Three recovery mechanisms, each handling its own error type.
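All three mechanisms share one `RecoveryState` object, whose definition is not shown in this section. A plausible sketch, inferred from the attributes the snippets above read and write; the dataclass form, the defaults, and the `MODEL_ID` variable with its placeholder value are assumptions (only `FALLBACK_MODEL_ID` is named in the chapter):

```python
import os
from dataclasses import dataclass

# Assumption: the primary model id comes from an environment variable.
PRIMARY_MODEL = os.environ.get("MODEL_ID", "primary-model-placeholder")


@dataclass
class RecoveryState:
    has_escalated: bool = False                   # Path 1: 8K -> 64K used up
    recovery_count: int = 0                       # Path 1: continuation prompts sent
    has_attempted_reactive_compact: bool = False  # Path 2: one-shot compaction used up
    consecutive_529: int = 0                      # Path 3: overload counter
    current_model: str = PRIMARY_MODEL            # Path 3: may switch to the fallback
```

Because the loop shown above never resets the two boolean flags, escalation and reactive compaction each get exactly one attempt per `agent_loop` call, which is what keeps the `continue` statements from looping forever.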


Changes from s10

| Component | Before (s10) | After (s11) |
|---|---|---|
| Error handling | None (crashes on any error) | Three recovery patterns + exponential backoff |
| New constants | | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
| New functions | | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
| Tools | bash, read_file, write_file (3) | bash, read_file, write_file (3), unchanged |
| Loop | Bare LLM call | Wrapped in try/except + continue retry |

Try It

cd learn-claude-code
python s11_error_recovery/code.py

Try these prompts:

  1. Ask the Agent to generate a very long piece of code, and observe whether it automatically continues after truncation (look for the [max_tokens] escalating log)
  2. Read many files consecutively to bloat the context, and observe reactive compact
  3. If you encounter 429/529, observe the exponential backoff log output
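429/529 responses are hard to trigger on demand. One way to exercise the backoff path offline is to wrap a stub that fails a fixed number of times. The sketch below is a standalone demo, not the chapter's code.py: `OverloadedError`, `make_flaky`, and the injectable `sleep` parameter are stand-ins introduced here so the example runs instantly without an API key.

```python
import random


class OverloadedError(Exception):
    """Stand-in for a 529 response (hypothetical class for this demo)."""


def retry_delay(attempt):
    base = min(500 * (2 ** attempt), 32000) / 1000
    return base + random.uniform(0, base * 0.25)


def with_retry(fn, max_retries=10, sleep=lambda seconds: None):
    """Retry fn on OverloadedError; returns (result, list of delays used)."""
    delays = []
    for attempt in range(max_retries):
        try:
            return fn(), delays
        except OverloadedError:
            delay = retry_delay(attempt)
            delays.append(delay)
            sleep(delay)  # no-op by default; pass time.sleep to really wait
    raise RuntimeError("max retries exceeded")


def make_flaky(failures):
    """Return a function that raises OverloadedError `failures` times, then succeeds."""
    remaining = {"n": failures}

    def call():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise OverloadedError("529 overloaded")
        return "ok"

    return call


result, delays = with_retry(make_flaky(3))
print(result, [round(d, 2) for d in delays])
```

The printed delays vary from run to run because of jitter, but each should fall between the base delay and 125% of it.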

What's Next

The Agent can now automatically recover from errors. But the tasks it handles are still one-shot — you give it a task, it finishes, it's done.

What if the Agent could manage a task list — with dependencies, persisted to disk, resumable across sessions? A TODO list is not a task system.

s12 Task System → Tasks form a dependency graph with state and persistence. This is the foundation for multi-Agent collaboration.

Deep Dive into CC Source

The following is based on CC source code: query.ts (1729 lines), services/api/withRetry.ts (822 lines), query/tokenBudget.ts (93 lines), and utils/tokenBudget.ts (73 lines).

1. A Dozen-Plus Reason/Transition Codes (Not Just 3)

The teaching version covers 3 of the most common recovery patterns. CC actually has a dozen-plus reason/transition codes, evaluated after every LLM call:

| Reason/Transition | Teaching Version | CC Behavior |
|---|---|---|
| completed | Normal completion | Return result |
| next_turn | Normal tool call | Continue to next tool execution round |
| max_output_tokens_escalate | Path 1 | 8K→64K escalation |
| max_output_tokens_recovery | Path 1 continuation | Continuation prompt (up to 3 times) |
| reactive_compact_retry | Path 2 | Reactive compact → retry |
| prompt_too_long | Path 2 | Same as above |
| collapse_drain_retry | Not covered | Context collapse: commit staged content first |
| model_error | Not covered | Retry |
| image_error | Not covered | ImageSizeError / ImageResizeError handled specifically |
| aborted_streaming | Not covered | Streaming abort recovery |
| aborted_tools | Not covered | Tool abort |
| stop_hook_blocking | Not covered | Inject blocking error → model self-corrects |
| stop_hook_prevented | Not covered | Hooks prevent execution |
| hook_stopped | Not covered | Hook stopped execution |
| token_budget_continuation | Not covered | Continue when token usage < 90% |
| blocking_limit | Not covered | Blocking limit reached |
| max_turns | Not covered | Maximum turns reached |

The teaching version only expands on the most common ones (the rows above that are not marked "Not covered"); each of the rest has its own dedicated handling logic.

2. Precise Exponential Backoff Formula

CC's backoff delay (withRetry.ts:530-548):

delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)

| Attempt | Base Delay | + Jitter |
|---|---|---|
| 1 | 500ms | 0-125ms |
| 2 | 1000ms | 0-250ms |
| 4 | 4000ms | 0-1000ms |
| 7+ | 32000ms (cap) | 0-8000ms |

If the server returns a Retry-After header, that value takes priority.

3. Original CONTINUATION Prompt

CC's continuation prompt (query.ts:1225-1227):

Output token limit hit. Resume directly — no apology, no recap of what
you were doing. Pick up mid-thought if that is where the cut happened.
Break remaining work into smaller pieces.

Token budget nudge prompt (tokenBudget.ts:72):

Stopped at {pct}% of token target. Keep working — do not summarize.

4. Streaming Error Handling

In CC's streaming path, recoverable errors (413, max_tokens, media errors) are withheld from display during streaming (query.ts:788-822) — SDK consumers don't see them, only the recovery logic does. After streaming ends, the system determines whether recovery is needed.

5. 529 → Fallback Model Switch

After 3 consecutive 529 overload errors (MAX_529_RETRIES = 3), CC automatically switches to the fallback model (e.g., Opus → Sonnet). On switch, all pending messages and tool results are cleared, and the user sees "Switched to {model} due to high demand".

6. Diminishing Returns Detection

Token budget "continuations" aren't unlimited. When there are 3 consecutive continuations with a token increment < 500, the system determines "continuing won't produce meaningful output" and stops continuation (tokenBudget.ts:60-62).
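The stopping rule can be sketched as a check over the most recent token increments. This illustrates the rule as described above and is not CC's tokenBudget.ts code; the function name, the list-based representation, and the parameter names are assumptions.

```python
def should_stop_continuation(token_increments, window=3, min_increment=500):
    """True when the last `window` continuations each added fewer than `min_increment` tokens."""
    if len(token_increments) < window:
        return False  # not enough history to call it diminishing returns
    return all(delta < min_increment for delta in token_increments[-window:])


print(should_stop_continuation([4000, 100, 200, 300]))  # True
print(should_stop_continuation([100, 200, 900]))        # False
```

A single productive continuation inside the window is enough to keep going, since the rule requires three small increments in a row.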