Follow up PR #265: refine chapters, diagrams, and add S20 (#283)

* feat: s01-s14 docs quality overhaul — tool pipeline, single-agent, knowledge & resilience

  Rewrite code.py and README (zh/en/ja) for s01-s14, each chapter building incrementally on the previous. Key fixes across chapters:

  - s01-s04: agent loop, tool dispatch, permission pipeline, hooks
  - s05-s08: todo write, subagent, skill loading, context compact
  - s09-s11: memory system, system prompt assembly, error recovery
  - s12-s14: task graph, background tasks, cron scheduler

  All chapters CC source-verified. Code inherits fixes forward (PROMPT_SECTIONS, json.dumps cache, real-state context, can_start dep protection, etc.).

* feat: s15-s19 docs quality overhaul — multi-agent platform: teams, protocols, autonomy, worktree, MCP tools

  Rewrite code.py and README (zh/en/ja) for s15-s19, the multi-agent platform chapters. Each chapter inherits all previous fixes and adds one mechanism:

  - s15: agent teams (TeamCreate, teammate threads, shared task list)
  - s16: team protocols (plan approval, shutdown handshake, consume_inbox)
  - s17: autonomous agents (idle polling, auto-claim, consume_lead_inbox)
  - s18: worktree isolation (git worktree, bind_task, cwd switching, safety)
  - s19: MCP tools (MCPClient, normalize_mcp_name, assemble_tool_pool, no cache)

  All appendix source code references verified against CC source. Config priority corrected: claude.ai < plugin < user < project < local.

* fix: 5 regressions across s05-s19 — glob safety, todo validation, memory extraction, protocol types, dep crash

  - s05-s09: glob results now filter with is_relative_to(WORKDIR) (inherited from s02)
  - s06-s08: todo_write validates content/status required fields (inherited from s05)
  - s09: extract_memories uses pre-compression snapshot instead of compacted messages
  - s16: submit_plan docstring clarifies protocol-only (not code-level gate)
  - s17-s19: match_response restores type mismatch validation (from s16)
  - s17-s19: claim_task deps list handles missing dep files without crashing

* fix: s12 Todo V2 logic reversal, s14/s15 cron range validation, s18/s19 worktree name validation

  - s12 README (zh/en/ja): fix Todo V2 direction — interactive defaults to Task, non-interactive/SDK defaults to TodoWrite. Fix env var name to CLAUDE_CODE_ENABLE_TASKS (not TODO_V2).
  - s14/s15: add _validate_cron_field with per-field range checks (minute 0-59, hour 0-23, dom 1-31, month 1-12, dow 0-6), step > 0, range lo <= hi. Replace old try/except validation that only caught exceptions.
  - s18/s19: add validate_worktree_name() to remove_worktree and keep_worktree, not just create_worktree.

* fix: align s16-s19 teaching tool consistency

* fix pr265 chapter diagrams

* Add comprehensive s20 harness chapter

* Fix chapter smoke test regressions

* Clarify README tutorial track transition

---------

Co-authored-by: Haoran <bill-billion@outlook.com>
2026-06-21 04:33:36 +08:00 · 2026-05-20 21:45:38 +08:00
parent c354cf7721
commit 1baf1aca5a
174 changed files with 35833 additions and 353 deletions
--- a/s11_error_recovery/README.en.md
+++ b/s11_error_recovery/README.en.md
@@ -0,0 +1,277 @@
+# s11: Error Recovery — Errors aren't the end, they're the start of a retry
+
+[中文](README.md) · [English](README.en.md) · [日本語](README.ja.md)
+
+s01 → ... → s09 → s10 → `s11` → [s12](../s12_task_system/) → s13 → ... → s20
+> *"Errors aren't the end, they're the start of a retry"* — escalate tokens, compact context, switch models.
+>
+> **Harness layer**: Resilience — classify and recover when the main loop hits errors.
+
+---
+
+## The Problem
+
+The Agent is running along and then errors out:
+
+```
+Error: 529 overloaded
+```
+
+The Agent crashes. It doesn't retry, doesn't switch models, doesn't reduce context — it just crashes.
+
+In production, API errors are the norm. The three most common failure modes: **truncated output** (the model runs out of tokens mid-sentence), **context overflow** (still too long even after compaction), and **transient failures** (429 rate limiting / 529 overload). An Agent that doesn't handle errors is like a car that stalls at the slightest touch.
+
+---
+
+## Solution
+
+![Error Recovery Overview](images/error-recovery-overview.en.svg)
+
+The loop and prompt assembly from s10 are fully preserved. The only change: the LLM call is wrapped in try/except, with different recovery paths based on error type. After recovery, `continue` loops back to the top to call the LLM again.
+
+The table below lists the three most common recovery patterns. For transient failures, the teaching version handles only 429/529; real systems also cover connection errors, timeouts, cloud vendor credential caches, and more. CC actually has 13+ reason codes; see the Deep Dive for the rest.
+
+| Pattern | Trigger | Recovery Action |
+|----------|---------|-----------------|
+| Output truncated | `max_tokens` | Escalate 8K→64K / continuation prompt |
+| Context overflow | `prompt_too_long` | Reactive compact → retry |
+| Transient failure | 429 / 529 | Exponential backoff + jitter, fallback model on consecutive 529 |
+
+---
+
+## How It Works
+
+### Path 1: Output Truncated
+
+The model runs out of tokens mid-sentence — `max_tokens` is exhausted. The default 8000 tokens isn't enough for a complete response.
+
+On the first occurrence, escalate `max_tokens` from 8K to 64K (8x the space) and retry the same request — the truncated output is NOT appended to messages, keeping the original request intact. If 64K is still not enough, save the truncated output and inject a continuation prompt telling the model to pick up where it left off, up to 3 times:
+
+```python
+if response.stop_reason == "max_tokens":
+    # First escalation: don't append truncated output, retry same request
+    if not state.has_escalated:
+        max_tokens = ESCALATED_MAX_TOKENS
+        state.has_escalated = True
+        continue  # messages unchanged, same request with more tokens
+    # 64K still truncated: save output + continuation prompt
+    messages.append({"role": "assistant", "content": response.content})
+    if state.recovery_count < MAX_RECOVERY_RETRIES:
+        messages.append({"role": "user", "content":
+            "Output token limit hit. Resume directly — "
+            "no apology, no recap. Pick up mid-thought."})
+        state.recovery_count += 1
+        continue
+    return  # still truncated after 3 continuations
+# Normal: append after max_tokens check
+messages.append({"role": "assistant", "content": response.content})
+```
+
+Escalation gets one chance; continuation gets up to 3. After that, exit — further continuations won't produce meaningful output.
+
+### Path 2: Context Overflow
+
+The LLM says "your context is too long" (`prompt_too_long`). All four compaction layers from s08 have already run, and it's still over the limit.
+
+Trigger reactive compact, which is more aggressive than auto compact. The teaching version simulates compaction by keeping only the last 5 messages; real CC generates a compact summary via LLM. Either way, the request is retried with the compacted message list. If it's still over the limit after one compaction, the only option is to exit, since compacting again won't make it any smaller:
+
+```python
+except PromptTooLongError:
+    if not state.has_attempted_reactive_compact:
+        messages[:] = reactive_compact(messages)
+        state.has_attempted_reactive_compact = True
+        continue
+    return  # Already compacted and still over limit — must exit
+```
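
The "keep only the last 5 messages" behaviour described above can be sketched roughly as follows. This is a minimal stand-in, not the chapter's actual `reactive_compact`: the `keep_last` parameter and the step that trims leading assistant turns are assumptions added for illustration (the latter on the assumption that the API expects the list to open with a user turn).

```python
def reactive_compact(messages, keep_last=5):
    """Teaching-style compaction: keep only the most recent messages.

    Real CC would replace the dropped history with an LLM-written summary.
    """
    tail = list(messages[-keep_last:])
    # Assumption: the message list should start with a user turn, so drop
    # any assistant messages that the cut left at the front.
    while tail and tail[0]["role"] != "user":
        tail.pop(0)
    return tail


# Eight alternating turns: user, assistant, user, assistant, ...
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(8)
]
compacted = reactive_compact(history)
print([m["content"] for m in compacted])
```

Because the cut here lands on an assistant turn, the sketch returns four messages rather than five. A version that also has to keep `tool_use` / `tool_result` pairs together would need extra care that this sketch does not attempt.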
+
+### Path 3: Transient Failures
+
+Network blips, 429 rate limiting, 529 overload — these aren't bugs, they're normal in distributed systems.
+
+Both 429 and 529 use exponential backoff + jitter: wait 0.5 seconds on the first attempt, 1 second on the second, 2 seconds on the third, up to 10 retries. Random jitter prevents concurrent requests from all retrying at the same instant. Three consecutive 529 overload errors → switch to the fallback model (if `FALLBACK_MODEL_ID` environment variable is configured):
+
+```python
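+import random
+import time
+
+def retry_delay(attempt, retry_after=None):
+    # A server-provided Retry-After value (seconds) overrides the computed backoff
+    if retry_after:
+        return retry_after
+    base = min(500 * (2 ** attempt), 32000) / 1000  # seconds, capped at 32s
+    return base + random.uniform(0, base * 0.25)    # add up to 25% jitter
+
+def with_retry(fn, state, max_retries=10):
+    for attempt in range(max_retries):
+        try:
+            return fn()
+        except (RateLimitError, OverloadedError) as e:
+            if isinstance(e, OverloadedError):
+                state.consecutive_529 += 1
+                # fn() reads state.current_model, so the next attempt uses the fallback
+                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
+                    state.current_model = FALLBACK_MODEL
+            else:
+                state.consecutive_529 = 0  # a 429 breaks the run of consecutive 529s
+            # Assumption: the exception exposes the parsed header as e.retry_after
+            time.sleep(retry_delay(attempt, getattr(e, "retry_after", None)))
+    raise MaxRetriesExceeded()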
+```
+
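+Backoff formula (in milliseconds, with `attempt` counted from 0): `min(500 × 2^attempt, 32000) + random(0~25%)`, where the jitter is a random 0-25% of the capped base delay. If the server returns a `Retry-After` header, that value takes priority.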
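
To make the formula concrete, the sketch below prints the base delay for each of the 10 attempts. `BASE_DELAY_MS` comes from this chapter's constants; `MAX_DELAY_MS` and the two helper names are hypothetical, introduced only for this illustration.

```python
import random

BASE_DELAY_MS = 500    # first retry waits about half a second
MAX_DELAY_MS = 32000   # hypothetical name for the 32-second cap


def base_delay_ms(attempt):
    """Capped exponential delay for a 0-based attempt index."""
    return min(BASE_DELAY_MS * (2 ** attempt), MAX_DELAY_MS)


def jittered_delay_ms(attempt):
    """Base delay plus a random 0-25% jitter."""
    base = base_delay_ms(attempt)
    return base + random.uniform(0, base * 0.25)


schedule = [base_delay_ms(a) for a in range(10)]
print(schedule)
print(sum(schedule) / 1000, "seconds of base delay across 10 retries")
```

The delay doubles until it reaches the cap on the seventh attempt and stays at 32 seconds after that. Before jitter, ten failed attempts add up to 159.5 seconds of waiting, which is worth keeping in mind when choosing `max_retries`.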
+
+### Putting It All Together
+
+```python
+def agent_loop(messages, context):
+    system = get_system_prompt(context)
+    state = RecoveryState()
+    max_tokens = 8000
+
+    while True:
+        try:
+            response = with_retry(
+                lambda: client.messages.create(
+                    model=state.current_model, system=system,
+                    messages=messages, tools=TOOLS,
+                    max_tokens=max_tokens),
+                state)
+        except Exception as e:
+            if is_prompt_too_long_error(e):
+                if not state.has_attempted_reactive_compact:
+                    messages[:] = reactive_compact(messages)
+                    state.has_attempted_reactive_compact = True
+                    continue
+                return
+            log_error(e)
+            return
+
+        # max_tokens check BEFORE appending to messages
+        if response.stop_reason == "max_tokens":
+            if not state.has_escalated:
+                max_tokens = 64000
+                state.has_escalated = True
+                continue  # retry same request, messages unchanged
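+            # save truncated output + continuation prompt (at most MAX_RECOVERY_RETRIES times)
+            messages.append({"role": "assistant", "content": response.content})
+            if state.recovery_count >= MAX_RECOVERY_RETRIES:
+                return  # still truncated after 3 continuations
+            messages.append({"role": "user", "content": CONTINUATION_PROMPT})
+            state.recovery_count += 1
+            continue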
+        # Normal completion
+        messages.append({"role": "assistant", "content": response.content})
+
+        if response.stop_reason != "tool_use":
+            return
+        # ... tool execution ...
+```
+
+The outer try/except catches API exceptions (prompt_too_long, etc.), `with_retry` handles transient errors (429/529), and `stop_reason` checks handle truncation. Three recovery mechanisms, each handling its own error type.
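
The loop above relies on `RecoveryState` and `is_prompt_too_long_error` without showing them. One possible shape is sketched below. The field names are taken from the snippets in this chapter, but the default model string and the string-matching check are assumptions for illustration, not the chapter's actual `code.py`.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    """Per-run bookkeeping for the three recovery paths."""
    current_model: str = "primary-model"          # placeholder, not a real model ID
    has_escalated: bool = False                   # Path 1: 8K -> 64K used once
    recovery_count: int = 0                       # Path 1: continuation prompts sent
    has_attempted_reactive_compact: bool = False  # Path 2: one compaction only
    consecutive_529: int = 0                      # Path 3: overload streak


def is_prompt_too_long_error(exc):
    # Assumption: the overflow is recognisable from the exception text.
    text = str(exc).lower()
    return "prompt_too_long" in text or "prompt is too long" in text


state = RecoveryState()
print(state)
print(is_prompt_too_long_error(Exception("prompt_too_long: context exceeds limit")))
print(is_prompt_too_long_error(Exception("529 overloaded")))
```

Keeping all flags in one object makes each "only once" rule explicit: every recovery path checks and sets its own field, so no path can loop forever by accident.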
+
+---
+
+## Changes from s10
+
+| Component | Before (s10) | After (s11) |
+|-----------|-------------|-------------|
+| Error handling | None (crashes on any error) | Three recovery patterns + exponential backoff |
+| New constants | — | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
+| New functions | — | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
+| Tools | bash, read_file, write_file (3) | bash, read_file, write_file (3) — unchanged |
+| Loop | Bare LLM call | Wrapped in try/except + continue retry |
+
+---
+
+## Try It
+
+```sh
+cd learn-claude-code
+python s11_error_recovery/code.py
+```
+
+Try these prompts:
+
+1. Ask the Agent to generate a very long piece of code, and observe whether it automatically continues after truncation (look for the `[max_tokens] escalating` log)
+2. Read many files consecutively to bloat the context, and observe reactive compact
+3. If you encounter 429/529, observe the exponential backoff log output
+
+---
+
+## What's Next
+
+The Agent can now automatically recover from errors. But the tasks it handles are still one-shot — you give it a task, it finishes, it's done.
+
+What if the Agent could manage a **task list** — with dependencies, persisted to disk, resumable across sessions? A TODO list is not a task system.
+
+s12 Task System → Tasks form a dependency graph with state and persistence. This is the foundation for multi-Agent collaboration.
+
+<details>
+<summary>Deep Dive into CC Source</summary>
+
+> The following is based on CC source code: `query.ts` (1729 lines), `services/api/withRetry.ts` (822 lines), `query/tokenBudget.ts` (93 lines), and `utils/tokenBudget.ts` (73 lines).
+
+### 1. A Dozen-Plus Reason/Transition Codes (Not Just 3)
+
+The teaching version covers 3 of the most common recovery patterns. CC actually has a dozen-plus reason/transition codes, evaluated after every LLM call:
+
+| Reason/Transition | Teaching Version | CC Behavior |
+|---|---|---|
+| `completed` | Normal completion | Return result |
+| `next_turn` | Normal tool call | Continue to next tool execution round |
+| `max_output_tokens_escalate` | Path 1 | 8K→64K escalation |
+| `max_output_tokens_recovery` | Path 1 continuation | Continuation prompt (up to 3 times) |
+| `reactive_compact_retry` | Path 2 | Reactive compact → retry |
+| `prompt_too_long` | Path 2 | Same as above |
+| `collapse_drain_retry` | Not covered | Context collapse — commit staged content first |
+| `model_error` | Not covered | Retry |
+| `image_error` | Not covered | `ImageSizeError` / `ImageResizeError` handled specifically |
+| `aborted_streaming` | Not covered | Streaming abort recovery |
+| `aborted_tools` | Not covered | Tool abort |
+| `stop_hook_blocking` | Not covered | Inject blocking error → model self-corrects |
+| `stop_hook_prevented` | Not covered | Hooks prevent execution |
+| `hook_stopped` | Not covered | Hook stopped execution |
+| `token_budget_continuation` | Not covered | Continue when token usage < 90% |
+| `blocking_limit` | Not covered | Blocking limit reached |
+| `max_turns` | Not covered | Maximum turns reached |
+
+The teaching version covers only the rows at the top of the table (the most common cases); each code marked "Not covered" has its own dedicated handling logic in CC.
+
+### 2. Precise Exponential Backoff Formula
+
+CC's backoff delay (`withRetry.ts:530-548`):
+
+```
+delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)
+```
+
+| Attempt | Base Delay | + Jitter |
+|---------|-----------|----------|
+| 1 | 500ms | 0-125ms |
+| 2 | 1000ms | 0-250ms |
+| 4 | 4000ms | 0-1000ms |
+| 7+ | 32000ms (cap) | 0-8000ms |
+
+If the server returns a `Retry-After` header, that value takes priority.
+
+### 3. Original CONTINUATION Prompt
+
+CC's continuation prompt (`query.ts:1225-1227`):
+
+```
+Output token limit hit. Resume directly — no apology, no recap of what
+you were doing. Pick up mid-thought if that is where the cut happened.
+Break remaining work into smaller pieces.
+```
+
+Token budget nudge prompt (`tokenBudget.ts:72`):
+
+```
+Stopped at {pct}% of token target. Keep working — do not summarize.
+```
+
+### 4. Streaming Error Handling
+
+In CC's streaming path, recoverable errors (413, max_tokens, media errors) are **withheld from display** during streaming (`query.ts:788-822`) — SDK consumers don't see them, only the recovery logic does. After streaming ends, the system determines whether recovery is needed.
+
+### 5. 529 → Fallback Model Switch
+
+After 3 consecutive 529 overload errors (`MAX_529_RETRIES = 3`), CC automatically switches to the fallback model (e.g., Opus → Sonnet). On switch, all pending messages and tool results are cleared, and the user sees "Switched to {model} due to high demand".
+
+### 6. Diminishing Returns Detection
+
+Token budget "continuations" aren't unlimited. When there are 3 consecutive continuations with a token increment < 500, the system determines "continuing won't produce meaningful output" and stops continuation (`tokenBudget.ts:60-62`).
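
The stopping rule described above can be sketched as a small check over the token gains of recent continuations. The function and constant names are hypothetical and this is not CC's implementation; only the thresholds (3 consecutive continuations, fewer than 500 tokens each) come from the text.

```python
MIN_TOKEN_INCREMENT = 500         # a continuation below this counts as low-yield
MAX_LOW_YIELD_CONTINUATIONS = 3   # this many in a row means "stop"


def should_stop_continuing(increments):
    """increments: tokens gained by each continuation, oldest first."""
    recent = increments[-MAX_LOW_YIELD_CONTINUATIONS:]
    return (
        len(recent) == MAX_LOW_YIELD_CONTINUATIONS
        and all(n < MIN_TOKEN_INCREMENT for n in recent)
    )


print(should_stop_continuing([2000, 300, 120, 40]))  # last three are all small
print(should_stop_continuing([300, 2000, 100]))      # streak broken by 2000
```

Only the most recent run matters: a single productive continuation resets the streak, so a slow start does not end the work early.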
+
+</details>
+
+<!-- translation-sync: zh@v1, en@v1, ja@v1 -->