Follow up PR #265: refine chapters, diagrams, and add S20 (#283)

* feat: s01-s14 docs quality overhaul — tool pipeline, single-agent, knowledge & resilience

  Rewrite code.py and README (zh/en/ja) for s01-s14, each chapter building incrementally on the previous. Key fixes across chapters:
  - s01-s04: agent loop, tool dispatch, permission pipeline, hooks
  - s05-s08: todo write, subagent, skill loading, context compact
  - s09-s11: memory system, system prompt assembly, error recovery
  - s12-s14: task graph, background tasks, cron scheduler

  All chapters CC source-verified. Code inherits fixes forward (PROMPT_SECTIONS, json.dumps cache, real-state context, can_start dep protection, etc.).

* feat: s15-s19 docs quality overhaul — multi-agent platform: teams, protocols, autonomy, worktree, MCP tools

  Rewrite code.py and README (zh/en/ja) for s15-s19, the multi-agent platform chapters. Each chapter inherits all previous fixes and adds one mechanism:
  - s15: agent teams (TeamCreate, teammate threads, shared task list)
  - s16: team protocols (plan approval, shutdown handshake, consume_inbox)
  - s17: autonomous agents (idle polling, auto-claim, consume_lead_inbox)
  - s18: worktree isolation (git worktree, bind_task, cwd switching, safety)
  - s19: MCP tools (MCPClient, normalize_mcp_name, assemble_tool_pool, no cache)

  All appendix source code references verified against CC source. Config priority corrected: claude.ai < plugin < user < project < local.

* fix: 5 regressions across s05-s19 — glob safety, todo validation, memory extraction, protocol types, dep crash
  - s05-s09: glob results now filter with is_relative_to(WORKDIR) (inherited from s02)
  - s06-s08: todo_write validates content/status required fields (inherited from s05)
  - s09: extract_memories uses pre-compression snapshot instead of compacted messages
  - s16: submit_plan docstring clarifies protocol-only (not code-level gate)
  - s17-s19: match_response restores type mismatch validation (from s16)
  - s17-s19: claim_task deps list handles missing dep files without crashing

* fix: s12 Todo V2 logic reversal, s14/s15 cron range validation, s18/s19 worktree name validation
  - s12 README (zh/en/ja): fix Todo V2 direction — interactive defaults to Task, non-interactive/SDK defaults to TodoWrite. Fix env var name to CLAUDE_CODE_ENABLE_TASKS (not TODO_V2).
  - s14/s15: add _validate_cron_field with per-field range checks (minute 0-59, hour 0-23, dom 1-31, month 1-12, dow 0-6), step > 0, range lo <= hi. Replace old try/except validation that only caught exceptions.
  - s18/s19: add validate_worktree_name() to remove_worktree and keep_worktree, not just create_worktree.

* fix: align s16-s19 teaching tool consistency
* fix pr265 chapter diagrams
* Add comprehensive s20 harness chapter
* Fix chapter smoke test regressions
* Clarify README tutorial track transition

---------

Co-authored-by: Haoran <bill-billion@outlook.com>
2026-06-21 04:33:36 +08:00 · 2026-05-20 21:45:38 +08:00
parent c354cf7721
commit 1baf1aca5a
174 changed files with 35833 additions and 353 deletions
--- a/s11_error_recovery/README.en.md
+++ b/s11_error_recovery/README.en.md
@@ -0,0 +1,277 @@
+# s11: Error Recovery — Errors aren't the end, they're the start of a retry
+
+[中文](README.md) · [English](README.en.md) · [日本語](README.ja.md)
+
+s01 → ... → s09 → s10 → `s11` → [s12](../s12_task_system/) → s13 → ... → s20
+> *"Errors aren't the end, they're the start of a retry"* — escalate tokens, compact context, switch models.
+>
+> **Harness layer**: Resilience — classify and recover when the main loop hits errors.
+
+---
+
+## The Problem
+
+The Agent is running along and then errors out:
+
+```
+Error: 529 overloaded
+```
+
+The Agent crashes. It doesn't retry, doesn't switch models, doesn't reduce context — it just crashes.
+
+In production, API errors are the norm. The three most common failure modes: **truncated output** (the model runs out of tokens mid-sentence), **context overflow** (still too long even after compaction), and **transient failures** (429 rate limiting / 529 overload). An Agent that doesn't handle errors is like a car that stalls at the slightest touch.
+
+---
+
+## Solution
+
+![Error Recovery Overview](images/error-recovery-overview.en.svg)
+
+The loop and prompt assembly from s10 are fully preserved. The only change: the LLM call is wrapped in try/except, with different recovery paths based on error type. After recovery, `continue` loops back to the top to call the LLM again.
+
+The three most common recovery patterns are listed below. For transient failures, the teaching version handles only 429/529, while real systems also cover connection errors, timeouts, cloud vendor credential caches, and more. CC itself has 13+ reason codes; see the Deep Dive for the rest:
+
+| Pattern | Trigger | Recovery Action |
+|----------|---------|-----------------|
+| Output truncated | `max_tokens` | Escalate 8K→64K / continuation prompt |
+| Context overflow | `prompt_too_long` | Reactive compact → retry |
+| Transient failure | 429 / 529 | Exponential backoff + jitter, fallback model on consecutive 529 |
+
+---
+
+## How It Works
+
+### Path 1: Output Truncated
+
+The model runs out of tokens mid-sentence — `max_tokens` is exhausted. The default 8000 tokens isn't enough for a complete response.
+
+On the first occurrence, escalate `max_tokens` from 8K to 64K (8x the space) and retry the same request — the truncated output is NOT appended to messages, keeping the original request intact. If 64K is still not enough, save the truncated output and inject a continuation prompt telling the model to pick up where it left off, up to 3 times:
+
+```python
+if response.stop_reason == "max_tokens":
+    # First escalation: don't append truncated output, retry same request
+    if not state.has_escalated:
+        max_tokens = ESCALATED_MAX_TOKENS
+        state.has_escalated = True
+        continue  # messages unchanged, same request with more tokens
+    # 64K still truncated: save output + continuation prompt
+    messages.append({"role": "assistant", "content": response.content})
+    if state.recovery_count < MAX_RECOVERY_RETRIES:
+        messages.append({"role": "user", "content":
+            "Output token limit hit. Resume directly — "
+            "no apology, no recap. Pick up mid-thought."})
+        state.recovery_count += 1
+        continue
+    return  # still truncated after 3 continuations
+# Normal: append after max_tokens check
+messages.append({"role": "assistant", "content": response.content})
+```
+
+Escalation gets one chance; continuation gets up to 3. After that, exit — further continuations won't produce meaningful output.
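
To see the escalate-once, continue-up-to-three-times policy in isolation, here is a minimal simulation. It is a sketch under stated assumptions, not the chapter's `code.py`: `FakeResponse`, `fake_model`, and `run_once` are hypothetical stand-ins for the real client, and the fake model simply reports truncation whenever the requested budget is below what the answer needs.

```python
from dataclasses import dataclass

ESCALATED_MAX_TOKENS = 64000   # values taken from the text above
MAX_RECOVERY_RETRIES = 3

@dataclass
class RecoveryState:
    has_escalated: bool = False
    recovery_count: int = 0

@dataclass
class FakeResponse:            # hypothetical stand-in for an API response
    stop_reason: str
    content: str

def fake_model(needed_tokens, max_tokens):
    """Pretend model: truncated whenever the budget is below what it needs."""
    if max_tokens < needed_tokens:
        return FakeResponse("max_tokens", "partial...")
    return FakeResponse("end_turn", "complete answer")

def run_once(needed_tokens):
    """Drive the Path 1 policy and return (events, messages)."""
    state, max_tokens = RecoveryState(), 8000
    messages, events = [{"role": "user", "content": "write a lot"}], []
    while True:
        response = fake_model(needed_tokens, max_tokens)
        if response.stop_reason == "max_tokens":
            if not state.has_escalated:
                max_tokens = ESCALATED_MAX_TOKENS
                state.has_escalated = True
                events.append("escalate")
                continue           # messages untouched on escalation
            messages.append({"role": "assistant", "content": response.content})
            if state.recovery_count < MAX_RECOVERY_RETRIES:
                messages.append({"role": "user", "content": "Resume directly."})
                state.recovery_count += 1
                events.append("continue")
                continue
            events.append("give_up")
            return events, messages
        messages.append({"role": "assistant", "content": response.content})
        events.append("done")
        return events, messages

events_ok, msgs_ok = run_once(needed_tokens=20000)
events_bad, msgs_bad = run_once(needed_tokens=100000)
print(events_ok)
print(events_bad)
```

In the first run a single escalation is enough and the message list gains only the final answer. In the second run the fake model never fits, so the loop escalates once, injects three continuation prompts, and then gives up.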
+
+### Path 2: Context Overflow
+
+The LLM says "your context is too long" (`prompt_too_long`). All four compaction layers from s08 have already run, and it's still over the limit.
+
+Trigger reactive compact, which is more aggressive than auto compact. The teaching version keeps only the last 5 messages to simulate compaction; real CC generates a compact summary via LLM. Either way, the loop then retries with the compacted message list. If the request is still over the limit after one compaction, the only option is to exit, because compacting again won't make it any smaller:
+
+```python
+except PromptTooLongError:
+    if not state.has_attempted_reactive_compact:
+        messages[:] = reactive_compact(messages)
+        state.has_attempted_reactive_compact = True
+        continue
+    return  # Already compacted and still over limit — must exit
+```
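
The teaching-version compaction described above (keep only the last 5 messages) might look like the sketch below. The `keep_last` parameter and the step that drops leading non-user messages are assumptions added for illustration; the actual `reactive_compact` in `code.py` may differ, and real CC summarizes with an LLM instead.

```python
def reactive_compact(messages, keep_last=5):
    """Keep only the most recent messages (teaching-style compaction sketch)."""
    tail = list(messages[-keep_last:])
    # Assumption: the conversation should open with a user turn,
    # so drop any leading assistant messages left at the cut point.
    while tail and tail[0]["role"] != "user":
        tail.pop(0)
    return tail

history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(12)
]
compacted = reactive_compact(history)
print([m["content"] for m in compacted])
```

A cut like this can also separate a `tool_result` from the `tool_use` it answers, which a real implementation would need to handle; the sketch ignores that case.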
+
+### Path 3: Transient Failures
+
+Network blips, 429 rate limiting, 529 overload — these aren't bugs, they're normal in distributed systems.
+
+Both 429 and 529 use exponential backoff + jitter: wait 0.5 seconds on the first attempt, 1 second on the second, 2 seconds on the third, up to 10 retries. Random jitter prevents concurrent requests from all retrying at the same instant. Three consecutive 529 overload errors → switch to the fallback model (if `FALLBACK_MODEL_ID` environment variable is configured):
+
+```python
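+import random
+import time
+
+def retry_delay(attempt, retry_after=None):
+    # A server-provided Retry-After value (in seconds) takes priority
+    if retry_after is not None:
+        return retry_after
+    base = min(500 * (2 ** attempt), 32000) / 1000  # seconds, capped at 32s
+    return base + random.uniform(0, base * 0.25)    # plus up to 25% jitter
+
+def with_retry(fn, state, max_retries=10):
+    for attempt in range(max_retries):
+        try:
+            result = fn()
+            state.consecutive_529 = 0  # a success breaks the 529 streak
+            return result
+        except (RateLimitError, OverloadedError) as e:
+            if isinstance(e, OverloadedError):
+                state.consecutive_529 += 1
+                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
+                    state.current_model = FALLBACK_MODEL
+            else:
+                state.consecutive_529 = 0  # a 429 breaks the 529 streak
+            # pass retry_after here if the error exposes a Retry-After value
+            time.sleep(retry_delay(attempt))
+    raise MaxRetriesExceeded()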
+
+Backoff formula: `min(500 × 2^attempt, 32000)` milliseconds plus 0-25% random jitter, where `attempt` starts at 0. If the server returns a `Retry-After` header, that value takes priority.
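
A quick way to check the retry behaviour without hitting a real API is to inject a fake call and a fake sleep. In this sketch the exception classes, `RecoveryState`, the `sleep` parameter, and the model names are all placeholders defined locally for the demo, not the chapter's real definitions; jitter is left out so the recorded delays are deterministic.

```python
class RateLimitError(Exception): ...
class OverloadedError(Exception): ...
class MaxRetriesExceeded(Exception): ...

FALLBACK_MODEL = "fallback-model"       # hypothetical model name

class RecoveryState:
    def __init__(self):
        self.current_model = "primary-model"   # hypothetical model name
        self.consecutive_529 = 0

def base_delay(attempt):
    """Backoff without jitter: 0.5s, 1s, 2s, ... capped at 32s."""
    return min(500 * (2 ** attempt), 32000) / 1000

def with_retry(fn, state, max_retries=10, sleep=lambda s: None):
    for attempt in range(max_retries):
        try:
            return fn()
        except (RateLimitError, OverloadedError) as e:
            if isinstance(e, OverloadedError):
                state.consecutive_529 += 1
                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
                    state.current_model = FALLBACK_MODEL
            else:
                state.consecutive_529 = 0
            sleep(base_delay(attempt))
    raise MaxRetriesExceeded()

state = RecoveryState()
delays = []   # seconds the loop would have slept
calls = []    # model used for each attempt

def flaky_call():
    """Fails with a 529 until the fallback model is in use."""
    calls.append(state.current_model)
    if state.current_model != FALLBACK_MODEL:
        raise OverloadedError("529 overloaded")
    return "ok"

result = with_retry(flaky_call, state, sleep=delays.append)
print(result, delays, calls)
```

Three overloaded responses in a row trigger the switch, so the fourth attempt runs against the fallback model after waits of 0.5, 1, and 2 seconds.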
+
+### Putting It All Together
+
+```python
+def agent_loop(messages, context):
+    system = get_system_prompt(context)
+    state = RecoveryState()
+    max_tokens = 8000
+
+    while True:
+        try:
+            response = with_retry(
+                lambda: client.messages.create(
+                    model=state.current_model, system=system,
+                    messages=messages, tools=TOOLS,
+                    max_tokens=max_tokens),
+                state)
+        except Exception as e:
+            if is_prompt_too_long_error(e):
+                if not state.has_attempted_reactive_compact:
+                    messages[:] = reactive_compact(messages)
+                    state.has_attempted_reactive_compact = True
+                    continue
+                return
+            log_error(e)
+            return
+
+        # max_tokens check BEFORE appending to messages
+        if response.stop_reason == "max_tokens":
+            if not state.has_escalated:
+                max_tokens = ESCALATED_MAX_TOKENS  # 64000
+                state.has_escalated = True
+                continue  # retry same request, messages unchanged
+            # save truncated output + continuation prompt
+            messages.append({"role": "assistant", "content": response.content})
+            if state.recovery_count >= MAX_RECOVERY_RETRIES:
+                return  # still truncated after 3 continuations
+            messages.append({"role": "user", "content": CONTINUATION_PROMPT})
+            state.recovery_count += 1
+            continue
+        # Normal completion
+        messages.append({"role": "assistant", "content": response.content})
+
+        if response.stop_reason != "tool_use":
+            return
+        # ... tool execution ...
+```
+
+The outer try/except catches API exceptions (prompt_too_long, etc.), `with_retry` handles transient errors (429/529), and `stop_reason` checks handle truncation. Three recovery mechanisms, each handling its own error type.
+
+---
+
+## Changes from s10
+
+| Component | Before (s10) | After (s11) |
+|-----------|-------------|-------------|
+| Error handling | None (crashes on any error) | Three recovery patterns + exponential backoff |
+| New constants | — | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
+| New functions | — | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
+| Tools | bash, read_file, write_file (3) | bash, read_file, write_file (3) — unchanged |
+| Loop | Bare LLM call | Wrapped in try/except + continue retry |
+
+---
+
+## Try It
+
+```sh
+cd learn-claude-code
+python s11_error_recovery/code.py
+```
+
+Try these prompts:
+
+1. Ask the Agent to generate a very long piece of code, and observe whether it automatically continues after truncation (look for the `[max_tokens] escalating` log)
+2. Read many files consecutively to bloat the context, and observe reactive compact
+3. If you encounter 429/529, observe the exponential backoff log output
+
+---
+
+## What's Next
+
+The Agent can now automatically recover from errors. But the tasks it handles are still one-shot — you give it a task, it finishes, it's done.
+
+What if the Agent could manage a **task list** — with dependencies, persisted to disk, resumable across sessions? A TODO list is not a task system.
+
+s12 Task System → Tasks form a dependency graph with state and persistence. This is the foundation for multi-Agent collaboration.
+
+<details>
+<summary>Deep Dive into CC Source</summary>
+
+> The following is based on CC source code: `query.ts` (1729 lines), `services/api/withRetry.ts` (822 lines), `query/tokenBudget.ts` (93 lines), and `utils/tokenBudget.ts` (73 lines).
+
+### 1. A Dozen-Plus Reason/Transition Codes (Not Just 3)
+
+The teaching version covers 3 of the most common recovery patterns. CC actually has a dozen-plus reason/transition codes, evaluated after every LLM call:
+
+| Reason/Transition | Teaching Version | CC Behavior |
+|---|---|---|
+| `completed` | Normal completion | Return result |
+| `next_turn` | Normal tool call | Continue to next tool execution round |
+| `max_output_tokens_escalate` | Path 1 | 8K→64K escalation |
+| `max_output_tokens_recovery` | Path 1 continuation | Continuation prompt (up to 3 times) |
+| `reactive_compact_retry` | Path 2 | Reactive compact → retry |
+| `prompt_too_long` | Path 2 | Same as above |
+| `collapse_drain_retry` | Not covered | Context collapse — commit staged content first |
+| `model_error` | Not covered | Retry |
+| `image_error` | Not covered | `ImageSizeError` / `ImageResizeError` handled specifically |
+| `aborted_streaming` | Not covered | Streaming abort recovery |
+| `aborted_tools` | Not covered | Tool abort |
+| `stop_hook_blocking` | Not covered | Inject blocking error → model self-corrects |
+| `stop_hook_prevented` | Not covered | Hooks prevent execution |
+| `hook_stopped` | Not covered | Hook stopped execution |
+| `token_budget_continuation` | Not covered | Continue when token usage < 90% |
+| `blocking_limit` | Not covered | Blocking limit reached |
+| `max_turns` | Not covered | Maximum turns reached |
+
+The teaching version only expands on the rows at the top of the table (the most common ones); each of the rest has its own dedicated handling logic.
+
+### 2. Precise Exponential Backoff Formula
+
+CC's backoff delay (`withRetry.ts:530-548`):
+
+```
+delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)
+```
+
+| Attempt | Base Delay | + Jitter |
+|---------|-----------|----------|
+| 1 | 500ms | 0-125ms |
+| 2 | 1000ms | 0-250ms |
+| 4 | 4000ms | 0-1000ms |
+| 7+ | 32000ms (cap) | 0-8000ms |
+
+If the server returns a `Retry-After` header, that value takes priority.
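
The table rows can be reproduced from the formula. The sketch below uses 1-based attempts, as in the formula above, and returns the base delay and the maximum jitter in milliseconds; `backoff_bounds_ms` is a hypothetical helper name, not a function from `withRetry.ts`.

```python
BASE_DELAY_MS = 500
MAX_DELAY_MS = 32000

def backoff_bounds_ms(attempt):
    """Return (base_delay, max_jitter) in ms for a 1-based attempt number."""
    base = min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS)
    return base, base // 4      # jitter is drawn from 0..25% of the base

for attempt in range(1, 9):
    base, jitter = backoff_bounds_ms(attempt)
    print(f"attempt {attempt}: {base}ms + 0-{jitter}ms jitter")
```

Attempts 3, 5, and 6, which the table skips, come out as 2000ms, 8000ms, and 16000ms; from attempt 7 onward the 32000ms cap applies.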
+
+### 3. Original CONTINUATION Prompt
+
+CC's continuation prompt (`query.ts:1225-1227`):
+
+```
+Output token limit hit. Resume directly — no apology, no recap of what
+you were doing. Pick up mid-thought if that is where the cut happened.
+Break remaining work into smaller pieces.
+```
+
+Token budget nudge prompt (`tokenBudget.ts:72`):
+
+```
+Stopped at {pct}% of token target. Keep working — do not summarize.
+```
+
+### 4. Streaming Error Handling
+
+In CC's streaming path, recoverable errors (413, max_tokens, media errors) are **withheld from display** during streaming (`query.ts:788-822`) — SDK consumers don't see them, only the recovery logic does. After streaming ends, the system determines whether recovery is needed.
+
+### 5. 529 → Fallback Model Switch
+
+After 3 consecutive 529 overload errors (`MAX_529_RETRIES = 3`), CC automatically switches to the fallback model (e.g., Opus → Sonnet). On switch, all pending messages and tool results are cleared, and the user sees "Switched to {model} due to high demand".
+
+### 6. Diminishing Returns Detection
+
+Token budget "continuations" aren't unlimited. When there are 3 consecutive continuations with a token increment < 500, the system determines "continuing won't produce meaningful output" and stops continuation (`tokenBudget.ts:60-62`).
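
One way to express that rule is a check over the token increments of recent continuations. This is only an illustrative sketch of the rule as stated here, reading it as "each of the last three continuations added fewer than 500 tokens"; `should_stop_continuing` and its parameters are hypothetical and do not mirror the actual code in `tokenBudget.ts`.

```python
def should_stop_continuing(token_increments, window=3, min_increment=500):
    """True when the last `window` continuations each added < min_increment tokens."""
    if len(token_increments) < window:
        return False
    return all(delta < min_increment for delta in token_increments[-window:])

print(should_stop_continuing([4000, 1200, 300]))      # False: only one small step
print(should_stop_continuing([4000, 300, 120, 80]))   # True: three small steps in a row
```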
+
+</details>
+
+<!-- translation-sync: zh@v1, en@v1, ja@v1 -->
--- a/s11_error_recovery/README.ja.md
+++ b/s11_error_recovery/README.ja.md
@@ -0,0 +1,277 @@
+# s11: Error Recovery — エラーは終わりではなく、リトライの始まり
+
+[中文](README.md) · [English](README.en.md) · [日本語](README.ja.md)
+
+s01 → ... → s09 → s10 → `s11` → [s12](../s12_task_system/) → s13 → ... → s20
+> *"エラーは終わりではなく、リトライの始まり"* — トークン拡張、コンテキスト圧縮、モデル切り替え。
+>
+> **Harness 層**: 耐障害性 — メインループのエラーを分類し復旧。
+
+---
+
+## 課題
+
+Agent が動いている途中でエラーが出た：
+
+```
+Error: 529 overloaded
+```
+
+Agent がクラッシュした。リトライもしない、モデルも切り替えない、コンテキストも減らさない——そのままクラッシュ。
+
+本番環境では API エラーが日常茶飯事。最も一般的な 3 つの障害パターン：**出力の切り詰め**（モデルが途中まで出力して token が尽きた）、**コンテキスト超過**（圧縮後も長すぎる）、**一時的障害**（429 レート制限 / 529 過負荷）。エラーを処理しない Agent は、一度触れただけで止まる車のようなものだ。
+
+---
+
+## 解決策
+
+![Error Recovery Overview](images/error-recovery-overview.ja.svg)
+
+s10 のループ、prompt 組み立てはすべてそのまま。唯一の変更点：LLM 呼び出しを try/except で包み、エラータイプに応じて異なる復旧パスに振り分ける。復旧後は `continue` でループ先頭に戻り、再度 LLM を呼び出す。
+
+最も一般的な 3 つの復旧パターン（教学版は 429/529 のみ対応；実際のシステムは接続エラー、タイムアウト、クラウドベンダーの認証キャッシュ等もカバー。CC には実際 13 以上の reason code があるが、残りは Deep dive で解説）：
+
+| パターン | トリガー | 復旧アクション |
+|----------|----------|---------------|
+| 出力切り詰め | `max_tokens` | 8K→64K に拡張 / 続きのプロンプト注入 |
+| コンテキスト超過 | `prompt_too_long` | reactive compact → リトライ |
+| 一時的障害 | 429 / 529 | 指数バックオフ + ジッター、連続 529 でフォールバックモデルに切り替え可能 |
+
+---
+
+## 仕組み
+
+### パス 1: 出力が切り詰められた
+
+モデルが途中まで出力して、`max_tokens` に達した。デフォルトの 8000 token では完全な回答を出力しきれない。
+
+初回発生時、`max_tokens` を 8K から 64K に拡張（8 倍の空間）し、同じリクエストをリトライする——この時、切り詰められた出力は messages に追加せず、元のリクエストをそのまま維持する。64K でも足りない場合にのみ、切り詰められた出力を保存し、続きのプロンプトを注入してモデルに先ほどの続きを出力させる。最大 3 回まで：
+
+```python
+if response.stop_reason == "max_tokens":
+    # First escalation: don't append truncated output, retry same request
+    if not state.has_escalated:
+        max_tokens = ESCALATED_MAX_TOKENS
+        state.has_escalated = True
+        continue  # messages unchanged, same request with more tokens
+    # 64K still truncated: save output + continuation prompt
+    messages.append({"role": "assistant", "content": response.content})
+    if state.recovery_count < MAX_RECOVERY_RETRIES:
+        messages.append({"role": "user", "content":
+            "Output token limit hit. Resume directly — "
+            "no apology, no recap. Pick up mid-thought."})
+        state.recovery_count += 1
+        continue
+    return  # still truncated after 3 continuations
+# Normal: append after max_tokens check
+messages.append({"role": "assistant", "content": response.content})
+```
+
+拡張は 1 回だけ、続きの出力は最大 3 回。超過したら終了——これ以上続けても実質的な出力は得られない。
+
+### パス 2: コンテキスト超過
+
+LLM が「コンテキストが長すぎる」と返す（`prompt_too_long`）。s08 の 4 層圧縮をすべて実行したのに、まだ超えている。
+
+reactive compact をトリガー——auto compact よりも積極的。教学版は最後の 5 メッセージだけを残して圧縮をシミュレート；実際の CC は LLM で compact サマリを生成してからリトライする。圧縮後にリトライ。ただし、一度圧縮してもまだ超過している場合は終了するしかない——再度圧縮しても小さくはならない：
+
+```python
+except PromptTooLongError:
+    if not state.has_attempted_reactive_compact:
+        messages[:] = reactive_compact(messages)
+        state.has_attempted_reactive_compact = True
+        continue
+    return  # 圧縮済みでも超過、終了するしかない
+```
+
+### パス 3: 一時的障害
+
+ネットワークの揺らぎ、429 レート制限、529 過負荷——これらはバグではなく、分散システムの日常だ。
+
+429 と 529 は統一して指数バックオフ + ジッターを使用：1 回目は 0.5 秒待機、2 回目は 1 秒、3 回目は 2 秒、最大 10 回。ランダムジッターを加えることで、並行リクエストが同時にリトライするのを防ぐ。3 回連続で 529 過負荷 → フォールバックモデルに切り替え（`FALLBACK_MODEL_ID` 環境変数が設定されている場合）：
+
+```python
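+import random
+import time
+
+def retry_delay(attempt, retry_after=None):
+    # A server-provided Retry-After value (in seconds) takes priority
+    if retry_after is not None:
+        return retry_after
+    base = min(500 * (2 ** attempt), 32000) / 1000  # seconds, capped at 32s
+    return base + random.uniform(0, base * 0.25)    # plus up to 25% jitter
+
+def with_retry(fn, state, max_retries=10):
+    for attempt in range(max_retries):
+        try:
+            result = fn()
+            state.consecutive_529 = 0  # a success breaks the 529 streak
+            return result
+        except (RateLimitError, OverloadedError) as e:
+            if isinstance(e, OverloadedError):
+                state.consecutive_529 += 1
+                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
+                    state.current_model = FALLBACK_MODEL
+            else:
+                state.consecutive_529 = 0  # a 429 breaks the 529 streak
+            # pass retry_after here if the error exposes a Retry-After value
+            time.sleep(retry_delay(attempt))
+    raise MaxRetriesExceeded()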
+
+バックオフの公式：`min(500 × 2^attempt, 32000) + random(0~25%)`。サーバーが `Retry-After` ヘッダーを返した場合、その値を優先して使用する。
+
+### 統合して実行
+
+```python
+def agent_loop(messages, context):
+    system = get_system_prompt(context)
+    state = RecoveryState()
+    max_tokens = 8000
+
+    while True:
+        try:
+            response = with_retry(
+                lambda: client.messages.create(
+                    model=state.current_model, system=system,
+                    messages=messages, tools=TOOLS,
+                    max_tokens=max_tokens),
+                state)
+        except Exception as e:
+            if is_prompt_too_long_error(e):
+                if not state.has_attempted_reactive_compact:
+                    messages[:] = reactive_compact(messages)
+                    state.has_attempted_reactive_compact = True
+                    continue
+                return
+            log_error(e)
+            return
+
+        # max_tokens check BEFORE appending to messages
+        if response.stop_reason == "max_tokens":
+            if not state.has_escalated:
+                max_tokens = ESCALATED_MAX_TOKENS  # 64000
+                state.has_escalated = True
+                continue  # retry same request, messages unchanged
+            # save truncated output + continuation prompt
+            messages.append({"role": "assistant", "content": response.content})
+            if state.recovery_count >= MAX_RECOVERY_RETRIES:
+                return  # still truncated after 3 continuations
+            messages.append({"role": "user", "content": CONTINUATION_PROMPT})
+            state.recovery_count += 1
+            continue
+        # Normal completion
+        messages.append({"role": "assistant", "content": response.content})
+
+        if response.stop_reason != "tool_use":
+            return
+        # ... tool execution ...
+```
+
+外側の try/except が API 例外（prompt_too_long 等）を捕捉し、`with_retry` が一時的エラー（429/529）を処理し、`stop_reason` のチェックが切り詰めを処理する。3 つの復旧メカニズムがそれぞれ異なるエラータイプを担当する。
+
+---
+
+## s10 からの変更点
+
+| コンポーネント | 変更前 (s10) | 変更後 (s11) |
+|---------------|-------------|-------------|
+| エラー処理 | なし（エラーで即クラッシュ） | 3 つの復旧パターン + 指数バックオフ |
+| 新規定数 | — | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
+| 新規関数 | — | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
+| ツール | bash, read_file, write_file (3) | bash, read_file, write_file (3) — 変更なし |
+| ループ | LLM を直接呼び出し | try/except で包み + continue でリトライ |
+
+---
+
+## 試してみる
+
+```sh
+cd learn-claude-code
+python s11_error_recovery/code.py
+```
+
+以下の prompt を試してみよう：
+
+1. Agent に長いコードを生成させ、切り詰め後に自動で続きが出力されるか観察する（`[max_tokens] escalating` ログを確認）
+2. 連続して大量のファイルを読み込みコンテキストを肥大化させ、reactive compact の動作を観察する
+3. 429/529 が発生した場合、指数バックオフのログ出力を観察する
+
+---
+
+## 次のステップ
+
+Agent はエラーから自動的に復旧できるようになった。しかし、まだ処理するタスクは「使い捨て」だ——タスクを与えると実行し、終わる。
+
+Agent に**タスクリスト**を管理させられないだろうか——依存関係があり、ディスクに永続化され、セッションをまたいで復旧できる？TODO リストはタスクシステムではない。
+
+s12 Task System → タスクとは依存関係があり、状態があり、永続化されたグラフだ。これはマルチ Agent 協調の基盤となる。
+
+<details>
+<summary>CC ソースコード深掘り</summary>
+
+> 以下は CC ソースコード `query.ts`（1729 行）、`services/api/withRetry.ts`（822 行）、`query/tokenBudget.ts`（93 行）、`utils/tokenBudget.ts`（73 行）の分析に基づく。
+
+### 一、十数種の reason/transition（3 つだけではない）
+
+教学版では最も一般的な 3 つの復旧パターンを解説した。CC には実際十数種の reason/transition があり、毎回の LLM 呼び出し後に判定される：
+
+| reason/transition | 教学版の対応 | CC の動作 |
+|---|---|---|
+| `completed` | 正常終了 | 結果を返す |
+| `next_turn` | 通常のツール呼び出し | 次のツール実行ラウンドへ |
+| `max_output_tokens_escalate` | パス 1 | 8K→64K に拡張 |
+| `max_output_tokens_recovery` | パス 1 続き出力 | 続きのプロンプト注入（最大 3 回） |
+| `reactive_compact_retry` | パス 2 | reactive compact → リトライ |
+| `prompt_too_long` | パス 2 | 同上 |
+| `collapse_drain_retry` | 未展開 | context collapse 時にまず保留中の内容をコミット |
+| `model_error` | 未展開 | リトライ |
+| `image_error` | 未展開 | `ImageSizeError` / `ImageResizeError` の専用処理 |
+| `aborted_streaming` | 未展開 | ストリーミング中断の復旧 |
+| `aborted_tools` | 未展開 | ツール中断 |
+| `stop_hook_blocking` | 未展開 | blocking error を注入 → モデルが自己修正 |
+| `stop_hook_prevented` | 未展開 | hooks によるブロック |
+| `hook_stopped` | 未展開 | hook による実行停止 |
+| `token_budget_continuation` | 未展開 | token 使用量 < 90% の時に継続 |
+| `blocking_limit` | 未展開 | ブロック制限 |
+| `max_turns` | 未展開 | 最大ターン数に到達 |
+
+教学版では最初の 5 つ（最も一般的なもの）だけを展開した。残りはそれぞれ専用の処理ロジックを持つ。
+
+### 二、指数バックオフの正確な公式
+
+CC のバックオフ遅延（`withRetry.ts:530-548`）：
+
+```
+delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)
+```
+
+| 試行 | 基本遅延 | + ジッター |
+|------|---------|-----------|
+| 1 | 500ms | 0-125ms |
+| 2 | 1000ms | 0-250ms |
+| 4 | 4000ms | 0-1000ms |
+| 7+ | 32000ms（上限） | 0-8000ms |
+
+サーバーが `Retry-After` ヘッダーを返した場合、その値を優先して使用する。
+
+### 三、CONTINUATION プロンプト原文
+
+CC の続き出力プロンプト（`query.ts:1225-1227`）：
+
+```
+Output token limit hit. Resume directly — no apology, no recap of what
+you were doing. Pick up mid-thought if that is where the cut happened.
+Break remaining work into smaller pieces.
+```
+
+Token budget のナッジプロンプト（`tokenBudget.ts:72`）：
+
+```
+Stopped at {pct}% of token target. Keep working — do not summarize.
+```
+
+### 四、ストリーミングエラー処理
+
+CC のストリーミングパスでは、復旧可能なエラー（413、max_tokens、media error）はストリーミング中**表示を保留される**（`query.ts:788-822`）——SDK コンシューマーには見えず、復旧ロジックだけが認識できる。ストリーミング終了後に復旧が必要かどうかを判断する。
+
+### 五、529 → フォールバックモデル切り替え
+
+3 回連続で 529 過負荷エラーが発生した後（`MAX_529_RETRIES = 3`）、CC は自動的にフォールバックモデルに切り替える（例：Opus → Sonnet）。切り替え時にすべての保留中のメッセージと tool 結果をクリアし、ユーザーに "Switched to {model} due to high demand" と表示する。
+
+### 六、収穫逓減の検出
+
+Token budget の「継続」は無限ではない。連続 3 回の continuation で token 増分が 500 未満の場合、システムは「続けても実質的な出力は得られない」と判断し、continuation を停止する（`tokenBudget.ts:60-62`）。
+
+</details>
+
+<!-- translation-sync: zh@v1, en@v1, ja@v1 -->
--- a/s11_error_recovery/README.md
+++ b/s11_error_recovery/README.md
@@ -0,0 +1,277 @@
+# s11: Error Recovery — 错误不是结束，是重试的开始
+
+[中文](README.md) · [English](README.en.md) · [日本語](README.ja.md)
+
+s01 → ... → s09 → s10 → `s11` → [s12](../s12_task_system/) → s13 → ... → s20
+> *"错误不是终点, 是重试的起点"* — 升级 token、压缩上下文、切换模型。
+>
+> **Harness 层**: 韧性 — 主循环遇到错误时分类并恢复。
+
+---
+
+## 问题
+
+Agent 跑着跑着报错了：
+
+```
+Error: 529 overloaded
+```
+
+Agent 崩溃了。它没有重试，没有换模型，没有减少上下文——直接崩溃。
+
+生产环境中 API 错误是常态。三种最常见的故障模式：**输出被截断**（模型话说一半 token 用完了）、**上下文超限**（压缩后还是太长）、**临时故障**（429 限流 / 529 过载）。一个不处理错误的 Agent 就像一个一碰就熄火的车。
+
+---
+
+## 解决方案
+
+![Error Recovery Overview](images/error-recovery-overview.svg)
+
+s10 的循环、prompt 组装全部保留。唯一的变动：LLM 调用包裹在 try/except 里，根据错误类型走不同的恢复路径。恢复后 `continue` 回到循环开头重新调用 LLM。
+
+三种最常见的恢复模式（教学版只处理 429/529；真实系统还覆盖连接错误、超时、云厂商认证缓存等。CC 实际有 13+ reason code，其余见 Deep dive）：
+
+| 模式 | 触发 | 恢复动作 |
+|------|------|---------|
+| 输出截断 | `max_tokens` | 升级 8K→64K / 续写提示 |
+| 上下文超限 | `prompt_too_long` | reactive compact → 重试 |
+| 临时故障 | 429 / 529 | 指数退避 + 抖动，连续 529 可切换备用模型 |
+
+---
+
+## 工作原理
+
+### 路径 1: 输出被截断
+
+模型话说一半，`max_tokens` 用完了。默认 8000 token 不够它输出完整回答。
+
+第一次发生时，直接把 `max_tokens` 从 8K 升级到 64K（8 倍空间），重试同一请求——此时不追加截断输出到 messages，保持原始请求不变。如果 64K 还是不够，才保存截断输出并注入续写提示让模型接着刚才的话继续说，最多 3 次：
+
+```python
+if response.stop_reason == "max_tokens":
+    # First escalation: don't append truncated output, retry same request
+    if not state.has_escalated:
+        max_tokens = ESCALATED_MAX_TOKENS
+        state.has_escalated = True
+        continue  # messages unchanged, same request with more tokens
+    # 64K still truncated: save output + continuation prompt
+    messages.append({"role": "assistant", "content": response.content})
+    if state.recovery_count < MAX_RECOVERY_RETRIES:
+        messages.append({"role": "user", "content":
+            "Output token limit hit. Resume directly — "
+            "no apology, no recap. Pick up mid-thought."})
+        state.recovery_count += 1
+        continue
+    return  # still truncated after 3 continuations
+# Normal: append after max_tokens check
+messages.append({"role": "assistant", "content": response.content})
+```
+
+升级只有一次机会，续写最多 3 次。超过就退出——继续续写也不会有实质产出。
+
+### 路径 2: 上下文超限
+
+LLM 说"你的上下文太长了"（`prompt_too_long`）。s08 的四层压缩全跑过了，还是超。
+
+触发 reactive compact——比 auto compact 更激进。教学版只保留最后 5 条消息模拟压缩效果；真实实现会调用 LLM 生成 compact 摘要再重试。压缩后重试。但如果压缩过一次还是超限，只能退出——再压缩也不会变小：
+
+```python
+except PromptTooLongError:
+    if not state.has_attempted_reactive_compact:
+        messages[:] = reactive_compact(messages)
+        state.has_attempted_reactive_compact = True
+        continue
+    return  # 压缩过了还是超限，只能退出
+```
+
+### 路径 3: 临时故障
+
+网络抖动、429 限流、529 过载——这些不是 bug，是分布式系统的常态。
+
+429 和 529 统一走指数退避 + 抖动：第一次等 0.5 秒，第二次等 1 秒，第三次等 2 秒，最多 10 次。加随机抖动让并发请求不在同一时刻重试。连续 3 次 529 过载 → 切换到备用模型（若配置了 `FALLBACK_MODEL_ID` 环境变量）：
+
+```python
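+import random
+import time
+
+def retry_delay(attempt, retry_after=None):
+    # A server-provided Retry-After value (in seconds) takes priority
+    if retry_after is not None:
+        return retry_after
+    base = min(500 * (2 ** attempt), 32000) / 1000  # seconds, capped at 32s
+    return base + random.uniform(0, base * 0.25)    # plus up to 25% jitter
+
+def with_retry(fn, state, max_retries=10):
+    for attempt in range(max_retries):
+        try:
+            result = fn()
+            state.consecutive_529 = 0  # a success breaks the 529 streak
+            return result
+        except (RateLimitError, OverloadedError) as e:
+            if isinstance(e, OverloadedError):
+                state.consecutive_529 += 1
+                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
+                    state.current_model = FALLBACK_MODEL
+            else:
+                state.consecutive_529 = 0  # a 429 breaks the 529 streak
+            # pass retry_after here if the error exposes a Retry-After value
+            time.sleep(retry_delay(attempt))
+    raise MaxRetriesExceeded()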
+
+退避公式：`min(500 × 2^attempt, 32000) + random(0~25%)`。如果服务器返回 `Retry-After` header，优先用那个值。
+
+### 合起来跑
+
+```python
+def agent_loop(messages, context):
+    system = get_system_prompt(context)
+    state = RecoveryState()
+    max_tokens = 8000
+
+    while True:
+        try:
+            response = with_retry(
+                lambda: client.messages.create(
+                    model=state.current_model, system=system,
+                    messages=messages, tools=TOOLS,
+                    max_tokens=max_tokens),
+                state)
+        except Exception as e:
+            if is_prompt_too_long_error(e):
+                if not state.has_attempted_reactive_compact:
+                    messages[:] = reactive_compact(messages)
+                    state.has_attempted_reactive_compact = True
+                    continue
+                return
+            log_error(e)
+            return
+
+        # max_tokens check BEFORE appending to messages
+        if response.stop_reason == "max_tokens":
+            if not state.has_escalated:
+                max_tokens = ESCALATED_MAX_TOKENS  # 64000
+                state.has_escalated = True
+                continue  # retry same request, messages unchanged
+            # save truncated output + continuation prompt
+            messages.append({"role": "assistant", "content": response.content})
+            if state.recovery_count >= MAX_RECOVERY_RETRIES:
+                return  # still truncated after 3 continuations
+            messages.append({"role": "user", "content": CONTINUATION_PROMPT})
+            state.recovery_count += 1
+            continue
+        # Normal completion
+        messages.append({"role": "assistant", "content": response.content})
+
+        if response.stop_reason != "tool_use":
+            return
+        # ... tool execution ...
+```
+
+外层 try/except 捕获 API 异常（prompt_too_long 等），`with_retry` 处理瞬态错误（429/529），`stop_reason` 检查处理截断。三种恢复机制各管各的错误类型。
+
+---
+
+## 相对 s10 的变更
+
+| 组件 | 之前 (s10) | 之后 (s11) |
+|------|-----------|-----------|
+| 错误处理 | 无（一碰就崩溃） | 三种恢复模式 + 指数退避 |
+| 新常量 | — | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
+| 新函数 | — | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
+| 工具 | bash, read_file, write_file (3) | bash, read_file, write_file (3) — 不变 |
+| 循环 | 裸调用 LLM | try/except 包裹 + continue 重试 |
+
+---
+
+## 试一下
+
+```sh
+cd learn-claude-code
+python s11_error_recovery/code.py
+```
+
+试试这些 prompt：
+
+1. 让 Agent 生成一段很长的代码，观察截断后是否自动续写（看 `[max_tokens] escalating` 日志）
+2. 连续读取大量文件撑大上下文，观察 reactive compact
+3. 如果遇到 429/529，观察指数退避的日志输出
+
+---
+
+## 接下来
+
+Agent 现在能在错误中自动恢复了。但它处理的任务仍然是"一次性"的——你给它一个任务，它做完，结束。
+
+能不能让 Agent 管理一个**任务列表**——有依赖关系、持久化到磁盘、跨会话能恢复？TODO 列表不是任务系统。
+
+s12 Task System → 任务是有依赖、有状态、持久化的图。这是多 Agent 协作的基础。
+
+<details>
+<summary>深入 CC 源码</summary>
+
+> 以下基于 CC 源码 `query.ts`（1729 行）、`services/api/withRetry.ts`（822 行）、`query/tokenBudget.ts`（93 行）、`utils/tokenBudget.ts`（73 行）的分析。
+
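+### 1. More than a dozen reasons/transitions (not just 3 paths)
+
+The teaching version covers the 3 most common recovery modes. CC actually has more than a dozen reasons/transitions, evaluated after every LLM call:
+
+| reason/transition | Teaching-version counterpart | CC behavior |
+|---|---|---|
+| `completed` | Normal completion | Return the result |
+| `next_turn` | Normal tool call | Continue to the next round of tool execution |
+| `max_output_tokens_escalate` | Path 1 | Escalate 8K→64K |
+| `max_output_tokens_recovery` | Path 1 continuation | Continuation prompt (at most 3 times) |
+| `reactive_compact_retry` | Path 2 | reactive compact → retry |
+| `prompt_too_long` | Path 2 | Same as above |
+| `collapse_drain_retry` | Not covered | context collapse commits staged content first |
+| `model_error` | Not covered | Retry |
+| `image_error` | Not covered | Dedicated handling for `ImageSizeError` / `ImageResizeError` |
+| `aborted_streaming` | Not covered | Recover from an aborted stream |
+| `aborted_tools` | Not covered | Tools aborted |
+| `stop_hook_blocking` | Not covered | Inject a blocking error → model self-corrects |
+| `stop_hook_prevented` | Not covered | Blocked by hooks |
+| `hook_stopped` | Not covered | Hook stops execution |
+| `token_budget_continuation` | Not covered | Continue while token usage is < 90% |
+| `blocking_limit` | Not covered | Blocking limit |
+| `max_turns` | Not covered | Maximum number of turns reached |
+
+The teaching version covers only the first 6 rows (the most common ones); each of the others has its own dedicated handling.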
+
+### 2. The exact exponential backoff formula
+
+CC's backoff delay (`withRetry.ts:530-548`):
+
+```
+delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)
+```
+
+| Attempt | Base delay | + Jitter |
+|------|---------|--------|
+| 1 | 500ms | 0-125ms |
+| 2 | 1000ms | 0-250ms |
+| 4 | 4000ms | 0-1000ms |
+| 7+ | 32000ms (cap) | 0-8000ms |
+
+If the server returns a `Retry-After` header, that value takes priority.
+
+### 3. The continuation prompt, verbatim
+
+CC's continuation prompt (`query.ts:1225-1227`):
+
+```
+Output token limit hit. Resume directly — no apology, no recap of what
+you were doing. Pick up mid-thought if that is where the cut happened.
+Break remaining work into smaller pieces.
+```
+
+The token budget nudge prompt (`tokenBudget.ts:72`):
+
+```
+Stopped at {pct}% of token target. Keep working — do not summarize.
+```
+
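+### 4. Streaming error handling
+
+On CC's streaming path, recoverable errors (413, max_tokens, media error) are **withheld and not shown** while streaming is in progress (`query.ts:788-822`): SDK consumers never see them, only the recovery logic does. Whether recovery is needed is decided only after streaming ends.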
+
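+### 5. 529 → fallback model switch
+
+After 3 consecutive 529 overloaded errors (`MAX_529_RETRIES = 3`), CC automatically switches to the fallback model (for example Opus → Sonnet). On the switch it clears all pending messages and tool results and shows the user "Switched to {model} due to high demand".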
+
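+### 6. Diminishing-returns detection
+
+Token-budget "continue" is not unlimited. After 3 consecutive continuations in which the token increment is < 500, the system concludes that continuing yields no substantive output and stops continuing (`tokenBudget.ts:60-62`).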
+
+</details>
+
+<!-- translation-sync: zh@v1, en@v1, ja@v1 -->
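The backoff formula quoted from `withRetry.ts` in the README above can be checked numerically. The sketch below is not CC's code: `backoff_bounds` is a hypothetical helper that applies `min(500 × 2^(attempt-1), 32000)` and adds up to 25% jitter, assuming the jitter is computed on the capped base, as the README's table implies. It returns the smallest and largest possible delay for a 1-indexed attempt.

```python
BASE_DELAY_MS = 500        # first-attempt base delay, from the README formula
MAX_DELAY_MS = 32_000      # cap on the base delay
JITTER_FRACTION = 0.25     # jitter is at most 25% of the capped base


def backoff_bounds(attempt: int) -> tuple[int, int]:
    """Return (min_ms, max_ms) of the delay for a 1-indexed attempt."""
    if attempt < 1:
        raise ValueError("attempt is 1-indexed")
    base = min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS)
    return base, base + int(base * JITTER_FRACTION)


# Print the rows shown in the README table, plus one attempt past the cap.
for attempt in (1, 2, 4, 7, 10):
    low, high = backoff_bounds(attempt)
    print(f"attempt {attempt}: {low}-{high} ms")
```

The teaching code in `code.py` counts attempts from 0 and therefore uses `2 ** attempt`; both forms produce the same schedule.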
--- a/s11_error_recovery/code.py
+++ b/s11_error_recovery/code.py
@@ -0,0 +1,361 @@
+#!/usr/bin/env python3
+"""
+s11: Error Recovery — three recovery paths + exponential backoff.
+
+Run:  python s11_error_recovery/code.py
+Need: pip install anthropic python-dotenv + .env with ANTHROPIC_API_KEY
+
+Changes from s10:
+  - LLM call wrapped in try/except with three recovery paths
+  - Path 1: max_tokens -> escalate 8K->64K (no append on first escalation),
+            then continuation prompt (max 3)
+  - Path 2: prompt_too_long -> reactive compact -> retry (once)
+  - Path 3: 429/529 -> exponential backoff with jitter (max 10),
+            fallback model on consecutive 529
+  - with_retry wrapper for transient errors
+  - RecoveryState tracks escalation / compact / 529 / model
+
+ASCII flow:
+  messages -> prompt assembly -> compress+load -> [try] LLM [except] -> tools -> loop
+                                                    |          |
+                                              stop_reason   error type
+                                              max_tokens?   prompt_too_long? -> compact
+                                              escalate /    429/529? -> backoff
+                                              continue      other? -> log + exit
+"""
+
+import os, subprocess, time, random, json
+from pathlib import Path
+
+try:
+    import readline
+    readline.parse_and_bind('set bind-tty-special-chars off')
+except ImportError:
+    pass
+
+from anthropic import Anthropic
+from dotenv import load_dotenv
+
+load_dotenv(override=True)
+if os.getenv("ANTHROPIC_BASE_URL"):
+    os.environ.pop("ANTHROPIC_AUTH_TOKEN", None)
+
+WORKDIR = Path.cwd()
+MEMORY_DIR = WORKDIR / ".memory"
+MEMORY_INDEX = MEMORY_DIR / "MEMORY.md"
+client = Anthropic(base_url=os.getenv("ANTHROPIC_BASE_URL"))
+PRIMARY_MODEL = os.environ["MODEL_ID"]
+FALLBACK_MODEL = os.getenv("FALLBACK_MODEL_ID")
+
+# ── Constants ──
+
+ESCALATED_MAX_TOKENS = 64000
+DEFAULT_MAX_TOKENS = 8000
+MAX_RECOVERY_RETRIES = 3
+MAX_RETRIES = 10
+BASE_DELAY_MS = 500
+MAX_CONSECUTIVE_529 = 3
+CONTINUATION_PROMPT = (
+    "Output token limit hit. Resume directly — "
+    "no apology, no recap. Pick up mid-thought."
+)
+
+# ── Prompt Assembly (from s10, synced) ──
+
+PROMPT_SECTIONS = {
+    "identity": "You are a coding agent. Act, don't explain.",
+    "tools": "Available tools: bash, read_file, write_file.",
+    "workspace": f"Working directory: {WORKDIR}",
+    "memory": "Relevant memories are injected below when available.",
+}
+
+
+def assemble_system_prompt(context: dict) -> str:
+    sections = [PROMPT_SECTIONS["identity"],
+                PROMPT_SECTIONS["tools"],
+                PROMPT_SECTIONS["workspace"]]
+    memories = context.get("memories", "")
+    if memories:
+        sections.append(f"Relevant memories:\n{memories}")
+    return "\n\n".join(sections)
+
+
+_last_context_key, _last_prompt = None, None
+
+
+def get_system_prompt(context: dict) -> str:
+    global _last_context_key, _last_prompt
+    key = json.dumps(context, sort_keys=True, ensure_ascii=False, default=str)
+    if key == _last_context_key and _last_prompt:
+        print("  \033[90m[cache hit] system prompt unchanged\033[0m")
+        return _last_prompt
+    _last_context_key = key
+    _last_prompt = assemble_system_prompt(context)
+
+    loaded = ["identity", "tools", "workspace"]
+    if context.get("memories"):
+        loaded.append("memory")
+    print(f"  \033[32m[assembled] sections: {', '.join(loaded)}\033[0m")
+    return _last_prompt
+
+
+# ── Tools (unchanged) ──
+
+def safe_path(p: str) -> Path:
+    path = (WORKDIR / p).resolve()
+    if not path.is_relative_to(WORKDIR):
+        raise ValueError(f"Path escapes workspace: {p}")
+    return path
+
+
+def run_bash(command: str) -> str:
+    try:
+        r = subprocess.run(command, shell=True, cwd=WORKDIR,
+                           capture_output=True, text=True, timeout=120)
+        out = (r.stdout + r.stderr).strip()
+        return out[:50000] if out else "(no output)"
+    except subprocess.TimeoutExpired:
+        return "Error: Timeout (120s)"
+
+
+def run_read(path: str, limit: int | None = None) -> str:
+    try:
+        lines = safe_path(path).read_text().splitlines()
+        if limit and limit < len(lines):
+            lines = lines[:limit] + [f"... ({len(lines) - limit} more lines)"]
+        return "\n".join(lines)
+    except Exception as e:
+        return f"Error: {e}"
+
+
+def run_write(path: str, content: str) -> str:
+    try:
+        file_path = safe_path(path)
+        file_path.parent.mkdir(parents=True, exist_ok=True)
+        file_path.write_text(content)
+        return f"Wrote {len(content)} bytes to {path}"
+    except Exception as e:
+        return f"Error: {e}"
+
+
+TOOLS = [
+    {"name": "bash", "description": "Run a shell command.",
+     "input_schema": {"type": "object",
+                      "properties": {"command": {"type": "string"}},
+                      "required": ["command"]}},
+    {"name": "read_file", "description": "Read file contents.",
+     "input_schema": {"type": "object",
+                      "properties": {"path": {"type": "string"},
+                                     "limit": {"type": "integer"}},
+                      "required": ["path"]}},
+    {"name": "write_file", "description": "Write content to a file.",
+     "input_schema": {"type": "object",
+                      "properties": {"path": {"type": "string"},
+                                     "content": {"type": "string"}},
+                      "required": ["path", "content"]}},
+]
+
+TOOL_HANDLERS = {"bash": run_bash, "read_file": run_read, "write_file": run_write}
+
+
+# ── Error Recovery (s11 new) ──
+
+class RecoveryState:
+    """Track recovery attempts across the loop."""
+    def __init__(self):
+        self.has_escalated = False
+        self.recovery_count = 0
+        self.consecutive_529 = 0
+        self.has_attempted_reactive_compact = False
+        self.current_model = PRIMARY_MODEL
+
+
+def retry_delay(attempt, retry_after=None):
+    """Backoff in seconds for a 0-indexed attempt, with jitter. retry_after (seconds), if given, takes priority."""
+    if retry_after:
+        return retry_after
+    base = min(BASE_DELAY_MS * (2 ** attempt), 32000) / 1000
+    jitter = random.uniform(0, base * 0.25)
+    return base + jitter
+
+
+def with_retry(fn, state: RecoveryState):
+    """Exponential backoff for transient errors (429/529).
+    Non-transient errors are re-raised for the outer handler."""
+    for attempt in range(MAX_RETRIES):
+        try:
+            result = fn()
+            state.consecutive_529 = 0
+            return result
+        except Exception as e:
+            name = type(e).__name__
+            msg = str(e).lower()
+
+            # 429 rate limit -> exponential backoff
+            if "ratelimit" in name.lower() or "429" in msg:
+                delay = retry_delay(attempt)
+                print(f"  \033[33m[429 rate limit] retry {attempt+1}/{MAX_RETRIES},"
+                      f" wait {delay:.1f}s\033[0m")
+                time.sleep(delay)
+                continue
+
+            # 529 overloaded -> exponential backoff + fallback model
+            if "overloaded" in name.lower() or "529" in msg or "overloaded" in msg:
+                state.consecutive_529 += 1
+                if state.consecutive_529 >= MAX_CONSECUTIVE_529:
+                    if FALLBACK_MODEL:
+                        state.current_model = FALLBACK_MODEL
+                        state.consecutive_529 = 0
+                        print(f"  \033[31m[529 x{MAX_CONSECUTIVE_529}]"
+                              f" switching to {FALLBACK_MODEL}\033[0m")
+                    else:
+                        state.consecutive_529 = 0
+                        print(f"  \033[31m[529 x{MAX_CONSECUTIVE_529}]"
+                              f" no FALLBACK_MODEL_ID configured, continuing retry\033[0m")
+                delay = retry_delay(attempt)
+                print(f"  \033[33m[529 overloaded] retry {attempt+1}/{MAX_RETRIES},"
+                      f" wait {delay:.1f}s\033[0m")
+                time.sleep(delay)
+                continue
+
+            # Not transient -> re-raise for outer try/except
+            raise
+    raise RuntimeError(f"Max retries ({MAX_RETRIES}) exceeded")
+
+
+def is_prompt_too_long_error(e: Exception) -> bool:
+    """Check whether an API error indicates prompt/context too long."""
+    msg = str(e).lower()
+    return (("prompt" in msg and "long" in msg)
+            or "prompt_is_too_long" in msg
+            or "context_length_exceeded" in msg
+            or "max_context_window" in msg)
+
+
+def reactive_compact(messages: list) -> list:
+    """Emergency compact — teaching version keeps last N messages.
+    Real CC generates a compact summary via LLM, then retries with
+    the compacted message list. Teaching version simplifies to tail
+    retention since s08/s09 already cover LLM-based compact."""
+    print("  \033[31m[reactive compact] trimming to last 5 messages\033[0m")
+    tail = messages[-5:]
+    return [{"role": "user",
+             "content": "[Reactive compact] Earlier conversation trimmed. "
+                        "Continue from where you left off."}, *tail]
+
+
+# ── Context ──
+
+def update_context(context: dict, messages: list) -> dict:
+    """Derive context from real state: which tools exist, whether memory files exist."""
+    memories = ""
+    if MEMORY_INDEX.exists():
+        content = MEMORY_INDEX.read_text().strip()
+        if content:
+            memories = content
+    return {
+        "enabled_tools": list(TOOL_HANDLERS.keys()),
+        "workspace": str(WORKDIR),
+        "memories": memories,
+    }
+
+
+# ── Agent Loop ──
+
+def agent_loop(messages: list, context: dict):
+    """Main loop with error recovery wrapping LLM calls."""
+    system = get_system_prompt(context)
+    state = RecoveryState()
+    max_tokens = DEFAULT_MAX_TOKENS
+
+    while True:
+        # ── LLM call: with_retry handles 429/529, outer handles rest ──
+        try:
+            response = with_retry(
+                # Late-bind model so a fallback switch applies to the next retry.
+                lambda: client.messages.create(
+                    model=state.current_model, system=system, messages=messages,
+                    tools=TOOLS, max_tokens=max_tokens),
+                state)
+        except Exception as e:
+            # Path 2: prompt_too_long -> reactive compact (once)
+            if is_prompt_too_long_error(e):
+                if not state.has_attempted_reactive_compact:
+                    messages[:] = reactive_compact(messages)
+                    state.has_attempted_reactive_compact = True
+                    continue
+                print("  \033[31m[unrecoverable] still too long after compact\033[0m")
+                messages.append({"role": "assistant", "content": [
+                    {"type": "text",
+                     "text": "[Error] Context too large, cannot continue."}]})
+                return
+
+            # Unrecoverable
+            name = type(e).__name__
+            print(f"  \033[31m[unrecoverable] {name}: {str(e)[:100]}\033[0m")
+            messages.append({"role": "assistant", "content": [
+                {"type": "text", "text": f"[Error] {name}: {str(e)[:200]}"}]})
+            return
+
+        # ── Path 1: max_tokens -> escalate or continue ──
+        if response.stop_reason == "max_tokens":
+            # First escalation: don't append truncated output, retry same request
+            if not state.has_escalated:
+                max_tokens = ESCALATED_MAX_TOKENS
+                state.has_escalated = True
+                print(f"  \033[33m[max_tokens] escalating"
+                      f" {DEFAULT_MAX_TOKENS} -> {ESCALATED_MAX_TOKENS}\033[0m")
+                continue
+            # 64K still truncated: save truncated output + continuation prompt
+            messages.append({"role": "assistant", "content": response.content})
+            if state.recovery_count < MAX_RECOVERY_RETRIES:
+                messages.append({"role": "user", "content": CONTINUATION_PROMPT})
+                state.recovery_count += 1
+                print(f"  \033[33m[max_tokens] continuation"
+                      f" {state.recovery_count}/{MAX_RECOVERY_RETRIES}\033[0m")
+                continue
+            print("  \033[31m[max_tokens] recovery limit reached\033[0m")
+            return
+
+        # Normal completion: append assistant response
+        messages.append({"role": "assistant", "content": response.content})
+
+        if response.stop_reason != "tool_use":
+            return
+
+        # ── Tool execution ──
+        results = []
+        for block in response.content:
+            if block.type != "tool_use":
+                continue
+            print(f"\033[36m> {block.name}\033[0m")
+            handler = TOOL_HANDLERS.get(block.name)
+            output = handler(**block.input) if handler else f"Unknown: {block.name}"
+            print(str(output)[:200])
+            results.append({"type": "tool_result",
+                            "tool_use_id": block.id, "content": output})
+        messages.append({"role": "user", "content": results})
+
+        context = update_context(context, messages)
+        system = get_system_prompt(context)
+
+
+if __name__ == "__main__":
+    print("s11: error recovery")
+    print("Enter a question, press Enter to send. Type q to quit.\n")
+    history = []
+    context = update_context({}, [])
+    while True:
+        try:
+            query = input("\033[36ms11 >> \033[0m")
+        except (EOFError, KeyboardInterrupt):
+            break
+        if query.strip().lower() in ("q", "exit", ""):
+            break
+        history.append({"role": "user", "content": query})
+        agent_loop(history, context)
+        context = update_context(context, history)
+        for block in history[-1]["content"]:
+            if getattr(block, "type", None) == "text":
+                print(block.text)
+        print()
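`retry_delay` in `code.py` accepts a `retry_after` argument, but `with_retry` never supplies one. A minimal sketch of how the value could be obtained follows. It assumes the raised exception exposes an httpx-style `response` object with a `headers` mapping; `retry_after_seconds` and `FakeStatusError` are hypothetical names used only for illustration and are not part of the chapter's code or of any SDK.

```python
from types import SimpleNamespace
from typing import Optional


def retry_after_seconds(error: Exception) -> Optional[float]:
    """Best-effort read of a Retry-After value (in seconds) from an error.

    Returns None when the header is absent, not a plain number of seconds
    (for example an HTTP-date), or not positive, so the caller can fall
    back to computed exponential backoff.
    """
    response = getattr(error, "response", None)
    headers = getattr(response, "headers", None) or {}
    # Assumption: the key is lowercase; a plain dict is case-sensitive.
    raw = headers.get("retry-after")
    if raw is None:
        return None
    try:
        seconds = float(raw)
    except (TypeError, ValueError):
        return None
    return seconds if seconds > 0 else None


class FakeStatusError(Exception):
    """Stand-in for an API status error, used only to exercise the helper."""

    def __init__(self, headers: dict):
        super().__init__("429 rate limited")
        self.response = SimpleNamespace(headers=headers)


print(retry_after_seconds(FakeStatusError({"retry-after": "2"})))
print(retry_after_seconds(ValueError("no response attached")))
```

Inside `with_retry`, the result could be passed along as `retry_delay(attempt, retry_after_seconds(e))`, which leaves the computed backoff in place whenever the header is missing or unusable.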
--- a/s11_error_recovery/images/error-recovery-overview.en.svg
+++ b/s11_error_recovery/images/error-recovery-overview.en.svg
@@ -0,0 +1,98 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 760 440" font-family="system-ui, -apple-system, sans-serif">
+  <defs>
+    <linearGradient id="header" x1="0" y1="0" x2="1" y2="0">
+      <stop offset="0%" stop-color="#1e3a5f"/><stop offset="100%" stop-color="#dc2626"/>
+    </linearGradient>
+    <marker id="arrow" viewBox="0 0 10 10" refX="10" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
+      <path d="M 0 0 L 10 5 L 0 10 z" fill="#555"/>
+    </marker>
+    <marker id="arrow-red" viewBox="0 0 10 10" refX="10" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M 0 0 L 10 5 L 0 10 z" fill="#dc2626"/>
+    </marker>
+    <linearGradient id="l1" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fef3c7"/><stop offset="100%" stop-color="#fde68a"/>
+    </linearGradient>
+    <linearGradient id="l2" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fed7aa"/><stop offset="100%" stop-color="#fdba74"/>
+    </linearGradient>
+    <linearGradient id="l3" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fecaca"/><stop offset="100%" stop-color="#fca5a5"/>
+    </linearGradient>
+  </defs>
+
+  <rect width="760" height="440" fill="#fafbfc" rx="8"/>
+
+  <!-- Title -->
+  <rect x="0" y="0" width="760" height="44" fill="url(#header)" rx="8"/>
+  <rect x="0" y="36" width="760" height="8" fill="url(#header)"/>
+  <text x="380" y="28" fill="#fff" font-size="15" font-weight="700" text-anchor="middle">Error Recovery — try/except wrapping LLM calls, three recovery modes</text>
+
+  <!-- Legend -->
+  <rect x="40" y="56" width="12" height="10" rx="2" fill="#f0f4ff" stroke="#2563eb" stroke-width="1"/>
+  <text x="58" y="66" fill="#2563eb" font-size="10" font-weight="600">s10 retained</text>
+  <rect x="140" y="56" width="12" height="10" rx="2" fill="#fef3c7" stroke="#d97706" stroke-width="1"/>
+  <text x="158" y="66" fill="#d97706" font-size="10" font-weight="600">s11 new</text>
+
+  <!-- ===== s10 loop (compact) ===== -->
+  <rect x="30" y="92" width="80" height="40" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="70" y="116" fill="#1e3a5f" font-size="10" font-weight="600" text-anchor="middle">messages</text>
+
+  <line x1="110" y1="112" x2="128" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="131" y="86" width="90" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="176" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">prompt assembly</text>
+  <text x="176" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">(s10)</text>
+
+  <line x1="221" y1="112" x2="239" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="242" y="86" width="100" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="292" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">compress + load</text>
+  <text x="292" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">(s08-s09)</text>
+
+  <line x1="342" y1="112" x2="360" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <!-- LLM (wrapped in try/except) -->
+  <rect x="363" y="86" width="80" height="52" rx="8" fill="#fef2f2" stroke="#dc2626" stroke-width="2"/>
+  <text x="403" y="108" fill="#991b1b" font-size="11" font-weight="700" text-anchor="middle">LLM</text>
+  <text x="403" y="122" fill="#dc2626" font-size="8" text-anchor="middle">try/except</text>
+
+  <line x1="443" y1="112" x2="461" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="464" y="86" width="110" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="519" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">TOOL_HANDLERS</text>
+  <text x="519" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">bash · read · write</text>
+
+  <!-- Arrow: LLM → Recovery -->
+  <path d="M 403 138 L 403 178" fill="none" stroke="#dc2626" stroke-width="1.5" marker-end="url(#arrow-red)"/>
+  <text x="415" y="164" fill="#dc2626" font-size="9">error</text>
+
+  <!-- ===== Recovery Section ===== -->
+  <rect x="20" y="182" width="720" height="22" rx="4" fill="#f1f5f9"/>
+  <text x="55" y="197" fill="#64748b" font-size="11" font-weight="600">Error Recovery (classify, recover, retry LLM)</text>
+
+  <!-- Layer 1: max_tokens -->
+  <rect x="40" y="210" width="680" height="48" rx="7" fill="url(#l1)" stroke="#d97706" stroke-width="1.5"/>
+  <text x="60" y="230" fill="#92400e" font-size="12" font-weight="600">Path 1</text>
+  <text x="112" y="230" fill="#92400e" font-size="11" font-weight="700">max_tokens</text>
+  <text x="200" y="230" fill="#92400e" font-size="11">Output truncated → escalate 8K→64K (once) / continuation prompt (max 3)</text>
+  <text x="200" y="246" fill="#b45309" font-size="9">Trigger: stop_reason == "max_tokens" · Cost: 0-1 API · Recover then continue</text>
+
+  <!-- Layer 2: prompt_too_long -->
+  <rect x="40" y="266" width="680" height="48" rx="7" fill="url(#l2)" stroke="#ea580c" stroke-width="1.5"/>
+  <text x="60" y="286" fill="#9a3412" font-size="12" font-weight="600">Path 2</text>
+  <text x="112" y="286" fill="#9a3412" font-size="11" font-weight="700">prompt_too_long</text>
+  <text x="230" y="286" fill="#9a3412" font-size="11">Context overflow → reactive compact → retry (one chance)</text>
+  <text x="200" y="302" fill="#c2410c" font-size="9">Trigger: API returns 413 · Cost: 1 API · Still over after compact → exit</text>
+
+  <!-- Layer 3: 429/529 -->
+  <rect x="40" y="322" width="680" height="48" rx="7" fill="url(#l3)" stroke="#dc2626" stroke-width="1.5"/>
+  <text x="60" y="342" fill="#991b1b" font-size="12" font-weight="600">Path 3</text>
+  <text x="112" y="342" fill="#991b1b" font-size="11" font-weight="700">429/529</text>
+  <text x="170" y="342" fill="#991b1b" font-size="11">Transient failure → exponential backoff + jitter (max 10) / 3×529 → switch model</text>
+  <text x="200" y="358" fill="#b91c1c" font-size="9">Trigger: RateLimitError / OverloadedError · Formula: min(500×2^n, 32s) + jitter</text>
+
+  <!-- ===== Bottom notes ===== -->
+  <rect x="40" y="388" width="680" height="40" rx="6" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
+  <text x="60" y="406" fill="#475569" font-size="10">Three most common recovery modes. CC has 13+ reason codes (image_error, aborted_streaming, etc.), each with dedicated handling.</text>
+  <text x="60" y="422" fill="#94a3b8" font-size="9">All paths after recovery → continue back to LLM · Normal flow: tool results → messages → loop</text>
+</svg>
--- a/s11_error_recovery/images/error-recovery-overview.ja.svg
+++ b/s11_error_recovery/images/error-recovery-overview.ja.svg
@@ -0,0 +1,98 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 760 440" font-family="system-ui, -apple-system, sans-serif">
+  <defs>
+    <linearGradient id="header" x1="0" y1="0" x2="1" y2="0">
+      <stop offset="0%" stop-color="#1e3a5f"/><stop offset="100%" stop-color="#dc2626"/>
+    </linearGradient>
+    <marker id="arrow" viewBox="0 0 10 10" refX="10" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
+      <path d="M 0 0 L 10 5 L 0 10 z" fill="#555"/>
+    </marker>
+    <marker id="arrow-red" viewBox="0 0 10 10" refX="10" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M 0 0 L 10 5 L 0 10 z" fill="#dc2626"/>
+    </marker>
+    <linearGradient id="l1" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fef3c7"/><stop offset="100%" stop-color="#fde68a"/>
+    </linearGradient>
+    <linearGradient id="l2" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fed7aa"/><stop offset="100%" stop-color="#fdba74"/>
+    </linearGradient>
+    <linearGradient id="l3" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fecaca"/><stop offset="100%" stop-color="#fca5a5"/>
+    </linearGradient>
+  </defs>
+
+  <rect width="760" height="440" fill="#fafbfc" rx="8"/>
+
+  <!-- Title -->
+  <rect x="0" y="0" width="760" height="44" fill="url(#header)" rx="8"/>
+  <rect x="0" y="36" width="760" height="8" fill="url(#header)"/>
+  <text x="380" y="28" fill="#fff" font-size="15" font-weight="700" text-anchor="middle">Error Recovery — try/except で LLM 呼び出しをラップ、3 つの復旧モード</text>
+
+  <!-- Legend -->
+  <rect x="40" y="56" width="12" height="10" rx="2" fill="#f0f4ff" stroke="#2563eb" stroke-width="1"/>
+  <text x="58" y="66" fill="#2563eb" font-size="10" font-weight="600">s10 維持</text>
+  <rect x="140" y="56" width="12" height="10" rx="2" fill="#fef3c7" stroke="#d97706" stroke-width="1"/>
+  <text x="158" y="66" fill="#d97706" font-size="10" font-weight="600">s11 新規</text>
+
+  <!-- ===== s10 loop (compact) ===== -->
+  <rect x="30" y="92" width="80" height="40" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="70" y="116" fill="#1e3a5f" font-size="10" font-weight="600" text-anchor="middle">messages</text>
+
+  <line x1="110" y1="112" x2="128" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="131" y="86" width="90" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="176" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">prompt assembly</text>
+  <text x="176" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">(s10)</text>
+
+  <line x1="221" y1="112" x2="239" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="242" y="86" width="100" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="292" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">compress + load</text>
+  <text x="292" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">(s08-s09)</text>
+
+  <line x1="342" y1="112" x2="360" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <!-- LLM (wrapped in try/except) -->
+  <rect x="363" y="86" width="80" height="52" rx="8" fill="#fef2f2" stroke="#dc2626" stroke-width="2"/>
+  <text x="403" y="108" fill="#991b1b" font-size="11" font-weight="700" text-anchor="middle">LLM</text>
+  <text x="403" y="122" fill="#dc2626" font-size="8" text-anchor="middle">try/except</text>
+
+  <line x1="443" y1="112" x2="461" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="464" y="86" width="110" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="519" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">TOOL_HANDLERS</text>
+  <text x="519" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">bash · read · write</text>
+
+  <!-- Arrow: LLM → Recovery -->
+  <path d="M 403 138 L 403 178" fill="none" stroke="#dc2626" stroke-width="1.5" marker-end="url(#arrow-red)"/>
+  <text x="415" y="164" fill="#dc2626" font-size="9">エラー</text>
+
+  <!-- ===== Recovery Section ===== -->
+  <rect x="20" y="182" width="720" height="22" rx="4" fill="#f1f5f9"/>
+  <text x="55" y="197" fill="#64748b" font-size="11" font-weight="600">エラー復旧（分類処理、復旧後 LLM に戻りリトライ）</text>
+
+  <!-- Layer 1: max_tokens -->
+  <rect x="40" y="210" width="680" height="48" rx="7" fill="url(#l1)" stroke="#d97706" stroke-width="1.5"/>
+  <text x="60" y="230" fill="#92400e" font-size="12" font-weight="600">パス 1</text>
+  <text x="112" y="230" fill="#92400e" font-size="11" font-weight="700">max_tokens</text>
+  <text x="200" y="230" fill="#92400e" font-size="11">出力が途切れた → 8K→64K に拡張（1 回）/ 続行プロンプト（最大 3 回）</text>
+  <text x="200" y="246" fill="#b45309" font-size="9">トリガー: stop_reason == "max_tokens" · コスト: 0-1 API · 復旧後 continue</text>
+
+  <!-- Layer 2: prompt_too_long -->
+  <rect x="40" y="266" width="680" height="48" rx="7" fill="url(#l2)" stroke="#ea580c" stroke-width="1.5"/>
+  <text x="60" y="286" fill="#9a3412" font-size="12" font-weight="600">パス 2</text>
+  <text x="112" y="286" fill="#9a3412" font-size="11" font-weight="700">prompt_too_long</text>
+  <text x="230" y="286" fill="#9a3412" font-size="11">コンテキスト超過 → reactive compact → リトライ（1 回のみ）</text>
+  <text x="200" y="302" fill="#c2410c" font-size="9">トリガー: API が 413 返却 · コスト: 1 API · 圧縮後も超過 → 終了</text>
+
+  <!-- Layer 3: 429/529 -->
+  <rect x="40" y="322" width="680" height="48" rx="7" fill="url(#l3)" stroke="#dc2626" stroke-width="1.5"/>
+  <text x="60" y="342" fill="#991b1b" font-size="12" font-weight="600">パス 3</text>
+  <text x="112" y="342" fill="#991b1b" font-size="11" font-weight="700">429/529</text>
+  <text x="170" y="342" fill="#991b1b" font-size="11">一時障害 → 指数バックオフ + ジッター（最大 10 回）/ 3 回 529 → モデル切替</text>
+  <text x="200" y="358" fill="#b91c1c" font-size="9">トリガー: RateLimitError / OverloadedError · 式: min(500×2^n, 32s) + jitter</text>
+
+  <!-- ===== Bottom notes ===== -->
+  <rect x="40" y="388" width="680" height="40" rx="6" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
+  <text x="60" y="406" fill="#475569" font-size="10">最も一般的な 3 つの復旧モード。CC は実際に 13+ の reason code を持ち（image_error, aborted_streaming 等）、それぞれ専用の処理がある。</text>
+  <text x="60" y="422" fill="#94a3b8" font-size="9">全パス復旧後 → continue で LLM に戻る · 正常フロー: ツール結果 → messages → ループ</text>
+</svg>
--- a/s11_error_recovery/images/error-recovery-overview.svg
+++ b/s11_error_recovery/images/error-recovery-overview.svg
@@ -0,0 +1,98 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 760 440" font-family="system-ui, -apple-system, sans-serif">
+  <defs>
+    <linearGradient id="header" x1="0" y1="0" x2="1" y2="0">
+      <stop offset="0%" stop-color="#1e3a5f"/><stop offset="100%" stop-color="#dc2626"/>
+    </linearGradient>
+    <marker id="arrow" viewBox="0 0 10 10" refX="10" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
+      <path d="M 0 0 L 10 5 L 0 10 z" fill="#555"/>
+    </marker>
+    <marker id="arrow-red" viewBox="0 0 10 10" refX="10" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M 0 0 L 10 5 L 0 10 z" fill="#dc2626"/>
+    </marker>
+    <linearGradient id="l1" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fef3c7"/><stop offset="100%" stop-color="#fde68a"/>
+    </linearGradient>
+    <linearGradient id="l2" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fed7aa"/><stop offset="100%" stop-color="#fdba74"/>
+    </linearGradient>
+    <linearGradient id="l3" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0%" stop-color="#fecaca"/><stop offset="100%" stop-color="#fca5a5"/>
+    </linearGradient>
+  </defs>
+
+  <rect width="760" height="440" fill="#fafbfc" rx="8"/>
+
+  <!-- Title -->
+  <rect x="0" y="0" width="760" height="44" fill="url(#header)" rx="8"/>
+  <rect x="0" y="36" width="760" height="8" fill="url(#header)"/>
+  <text x="380" y="28" fill="#fff" font-size="15" font-weight="700" text-anchor="middle">Error Recovery — try/except 包裹 LLM 调用，三种恢复模式</text>
+
+  <!-- Legend -->
+  <rect x="40" y="56" width="12" height="10" rx="2" fill="#f0f4ff" stroke="#2563eb" stroke-width="1"/>
+  <text x="58" y="66" fill="#2563eb" font-size="10" font-weight="600">s10 保留</text>
+  <rect x="140" y="56" width="12" height="10" rx="2" fill="#fef3c7" stroke="#d97706" stroke-width="1"/>
+  <text x="158" y="66" fill="#d97706" font-size="10" font-weight="600">s11 新增</text>
+
+  <!-- ===== s10 loop (compact) ===== -->
+  <rect x="30" y="92" width="80" height="40" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="70" y="116" fill="#1e3a5f" font-size="10" font-weight="600" text-anchor="middle">messages</text>
+
+  <line x1="110" y1="112" x2="128" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="131" y="86" width="90" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="176" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">prompt assembly</text>
+  <text x="176" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">(s10)</text>
+
+  <line x1="221" y1="112" x2="239" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="242" y="86" width="100" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="292" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">compress + load</text>
+  <text x="292" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">(s08-s09)</text>
+
+  <line x1="342" y1="112" x2="360" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <!-- LLM (wrapped in try/except) -->
+  <rect x="363" y="86" width="80" height="52" rx="8" fill="#fef2f2" stroke="#dc2626" stroke-width="2"/>
+  <text x="403" y="108" fill="#991b1b" font-size="11" font-weight="700" text-anchor="middle">LLM</text>
+  <text x="403" y="122" fill="#dc2626" font-size="8" text-anchor="middle">try/except</text>
+
+  <line x1="443" y1="112" x2="461" y2="112" stroke="#555" stroke-width="1.5" marker-end="url(#arrow)"/>
+
+  <rect x="464" y="86" width="110" height="52" rx="8" fill="#f0f4ff" stroke="#2563eb" stroke-width="1.5"/>
+  <text x="519" y="108" fill="#1e3a5f" font-size="9" font-weight="600" text-anchor="middle">TOOL_HANDLERS</text>
+  <text x="519" y="122" fill="#94a3b8" font-size="8" text-anchor="middle">bash · read · write</text>
+
+  <!-- Arrow: LLM → Recovery -->
+  <path d="M 403 138 L 403 178" fill="none" stroke="#dc2626" stroke-width="1.5" marker-end="url(#arrow-red)"/>
+  <text x="415" y="164" fill="#dc2626" font-size="9">error</text>
+
+  <!-- ===== Recovery Section ===== -->
+  <rect x="20" y="182" width="720" height="22" rx="4" fill="#f1f5f9"/>
+  <text x="55" y="197" fill="#64748b" font-size="11" font-weight="600">Error recovery (handled by error type; after recovery, retry the LLM call)</text>
+
+  <!-- Path 1: max_tokens -->
+  <rect x="40" y="210" width="680" height="48" rx="7" fill="url(#l1)" stroke="#d97706" stroke-width="1.5"/>
+  <text x="60" y="230" fill="#92400e" font-size="12" font-weight="600">Path 1</text>
+  <text x="112" y="230" fill="#92400e" font-size="11" font-weight="700">max_tokens</text>
+  <text x="200" y="230" fill="#92400e" font-size="11">output truncated → escalate 8K→64K (once) / continuation prompt (max 3 times)</text>
+  <text x="200" y="246" fill="#b45309" font-size="9">Trigger: stop_reason == "max_tokens" · Cost: 0-1 API calls · continue after recovery</text>
+
+  <!-- Path 2: prompt_too_long -->
+  <rect x="40" y="266" width="680" height="48" rx="7" fill="url(#l2)" stroke="#ea580c" stroke-width="1.5"/>
+  <text x="60" y="286" fill="#9a3412" font-size="12" font-weight="600">Path 2</text>
+  <text x="112" y="286" fill="#9a3412" font-size="11" font-weight="700">prompt_too_long</text>
+  <text x="230" y="286" fill="#9a3412" font-size="11">context over limit → reactive compact → retry (one attempt)</text>
+  <text x="200" y="302" fill="#c2410c" font-size="9">Trigger: API returns 413 · Cost: 1 API call · still too long after compaction → exit</text>
+
+  <!-- Path 3: 429/529 -->
+  <rect x="40" y="322" width="680" height="48" rx="7" fill="url(#l3)" stroke="#dc2626" stroke-width="1.5"/>
+  <text x="60" y="342" fill="#991b1b" font-size="12" font-weight="600">Path 3</text>
+  <text x="112" y="342" fill="#991b1b" font-size="11" font-weight="700">429/529</text>
+  <text x="170" y="342" fill="#991b1b" font-size="11">transient failure → exponential backoff + jitter (max 10 tries) / 3× 529 → switch model</text>
+  <text x="200" y="358" fill="#b91c1c" font-size="9">Trigger: RateLimitError / OverloadedError · Delay: min(500 ms × 2^n, 32 s) + jitter</text>
+
+  <!-- ===== Bottom notes ===== -->
+  <rect x="40" y="388" width="680" height="40" rx="6" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
+  <text x="60" y="406" fill="#475569" font-size="10">The three most common recovery modes. CC has 13+ reason codes (image_error, aborted_streaming, etc.), each handled separately.</text>
+  <text x="60" y="422" fill="#94a3b8" font-size="9">After recovery, every path → continue back to the LLM · Normal flow: tool results → messages → loop</text>
+</svg>