Follow up PR #265: refine chapters, diagrams, and add S20 (#283)

* feat: s01-s14 docs quality overhaul — tool pipeline, single-agent, knowledge & resilience

Rewrite code.py and README (zh/en/ja) for s01-s14, each chapter building
incrementally on the previous. Key fixes across chapters:

- s01-s04: agent loop, tool dispatch, permission pipeline, hooks
- s05-s08: todo write, subagent, skill loading, context compact
- s09-s11: memory system, system prompt assembly, error recovery
- s12-s14: task graph, background tasks, cron scheduler

All chapters CC source-verified. Code inherits fixes forward (PROMPT_SECTIONS,
json.dumps cache, real-state context, can_start dep protection, etc.).

* feat: s15-s19 docs quality overhaul — multi-agent platform: teams, protocols, autonomy, worktree, MCP tools

Rewrite code.py and README (zh/en/ja) for s15-s19, the multi-agent platform
chapters. Each chapter inherits all previous fixes and adds one mechanism:

- s15: agent teams (TeamCreate, teammate threads, shared task list)
- s16: team protocols (plan approval, shutdown handshake, consume_inbox)
- s17: autonomous agents (idle polling, auto-claim, consume_lead_inbox)
- s18: worktree isolation (git worktree, bind_task, cwd switching, safety)
- s19: MCP tools (MCPClient, normalize_mcp_name, assemble_tool_pool, no cache)

All appendix source code references verified against CC source. Config priority
corrected: claude.ai < plugin < user < project < local.

* fix: 5 regressions across s05-s19 — glob safety, todo validation, memory extraction, protocol types, dep crash

- s05-s09: glob results now filter with is_relative_to(WORKDIR) (inherited from s02)
- s06-s08: todo_write validates content/status required fields (inherited from s05)
- s09: extract_memories uses pre-compression snapshot instead of compacted messages
- s16: submit_plan docstring clarifies protocol-only (not code-level gate)
- s17-s19: match_response restores type mismatch validation (from s16)
- s17-s19: claim_task deps list handles missing dep files without crashing

* fix: s12 Todo V2 logic reversal, s14/s15 cron range validation, s18/s19 worktree name validation

- s12 README (zh/en/ja): fix Todo V2 direction — interactive defaults to Task,
  non-interactive/SDK defaults to TodoWrite. Fix env var name to
  CLAUDE_CODE_ENABLE_TASKS (not TODO_V2).
- s14/s15: add _validate_cron_field with per-field range checks (minute 0-59,
  hour 0-23, dom 1-31, month 1-12, dow 0-6), step > 0, range lo <= hi.
  Replace old try/except validation that only caught exceptions.
- s18/s19: add validate_worktree_name() to remove_worktree and keep_worktree,
  not just create_worktree.

* fix: align s16-s19 teaching tool consistency

* fix pr265 chapter diagrams

* Add comprehensive s20 harness chapter

* Fix chapter smoke test regressions

* Clarify README tutorial track transition

---------

Co-authored-by: Haoran <bill-billion@outlook.com>
gui-yue
2026-05-20 21:45:38 +08:00
committed by GitHub
parent c354cf7721
commit 1baf1aca5a
174 changed files with 35833 additions and 353 deletions


@@ -0,0 +1,277 @@
# s11: Error Recovery — Errors aren't the end, they're the start of a retry
[中文](README.md) · [English](README.en.md) · [日本語](README.ja.md)
s01 → ... → s09 → s10 → `s11` → [s12](../s12_task_system/) → s13 → ... → s20
> *"Errors aren't the end, they're the start of a retry"* — escalate tokens, compact context, switch models.
>
> **Harness layer**: Resilience — classify and recover when the main loop hits errors.
---
## The Problem
The Agent is running along and then errors out:
```
Error: 529 overloaded
```
The Agent crashes. It doesn't retry, doesn't switch models, doesn't reduce context — it just crashes.
In production, API errors are the norm. The three most common failure modes: **truncated output** (the model runs out of tokens mid-sentence), **context overflow** (still too long even after compaction), and **transient failures** (429 rate limiting / 529 overload). An Agent that doesn't handle errors is like a car that stalls at the slightest touch.
---
## Solution
![Error Recovery Overview](images/error-recovery-overview.en.svg)
The loop and prompt assembly from s10 are fully preserved. The only change: the LLM call is wrapped in try/except, with different recovery paths based on error type. After recovery, `continue` loops back to the top to call the LLM again.
The three most common recovery patterns are listed below. For transient failures, the teaching version handles only 429/529; real systems also cover connection errors, timeouts, cloud vendor credential caches, and so on. CC actually has a dozen-plus reason/transition codes; see the Deep Dive for the rest:
| Pattern | Trigger | Recovery Action |
|----------|---------|-----------------|
| Output truncated | `max_tokens` | Escalate 8K→64K / continuation prompt |
| Context overflow | `prompt_too_long` | Reactive compact → retry |
| Transient failure | 429 / 529 | Exponential backoff + jitter, fallback model on consecutive 529 |
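The table can be read as a small dispatch function. The sketch below is illustrative only: `classify_failure` and the `Recovery` enum are hypothetical names, and recognizing the overflow case by substring is an assumption about how that error surfaces, not something taken from the chapter's code.

```python
from enum import Enum
from typing import Optional


class Recovery(Enum):
    """Hypothetical labels for the three recovery patterns in the table."""
    ESCALATE_OR_CONTINUE = "output_truncated"
    REACTIVE_COMPACT = "context_overflow"
    BACKOFF = "transient_failure"


def classify_failure(stop_reason: Optional[str] = None,
                     status_code: Optional[int] = None,
                     error_text: str = "") -> Optional[Recovery]:
    """Map what the loop observed to a recovery pattern, or None if unhandled."""
    if stop_reason == "max_tokens":
        return Recovery.ESCALATE_OR_CONTINUE
    # Assumption: the overflow error mentions one of these phrases.
    lowered = error_text.lower()
    if "prompt_too_long" in lowered or "prompt is too long" in lowered:
        return Recovery.REACTIVE_COMPACT
    if status_code in (429, 529):
        return Recovery.BACKOFF
    return None
```

Anything that returns `None` here falls outside the three patterns and would simply be logged before the loop exits.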
---
## How It Works
### Path 1: Output Truncated
The model runs out of tokens mid-sentence — `max_tokens` is exhausted. The default 8000 tokens isn't enough for a complete response.
On the first occurrence, escalate `max_tokens` from 8K to 64K (8x the space) and retry the same request — the truncated output is NOT appended to messages, keeping the original request intact. If 64K is still not enough, save the truncated output and inject a continuation prompt telling the model to pick up where it left off, up to 3 times:
```python
# Fragment from inside the agent loop, right after the LLM call returns
if response.stop_reason == "max_tokens":
    # First escalation: don't append truncated output, retry same request
    if not state.has_escalated:
        max_tokens = ESCALATED_MAX_TOKENS
        state.has_escalated = True
        continue  # messages unchanged, same request with more tokens
    # 64K still truncated: save output + continuation prompt
    messages.append({"role": "assistant", "content": response.content})
    if state.recovery_count < MAX_RECOVERY_RETRIES:
        messages.append({"role": "user", "content":
            "Output token limit hit. Resume directly — "
            "no apology, no recap. Pick up mid-thought."})
        state.recovery_count += 1
        continue
    return  # still truncated after 3 continuations
# Normal: append after max_tokens check
messages.append({"role": "assistant", "content": response.content})
```
Escalation gets one chance; continuation gets up to 3. After that, exit — further continuations won't produce meaningful output.
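To see the "one escalation, then up to three continuations" budget in isolation, the sketch below replays a sequence of `stop_reason` values without calling any API. `TruncationState` and `run_truncation_path` are hypothetical names introduced for illustration; only the limit of 3 comes from the text above.

```python
from dataclasses import dataclass

MAX_RECOVERY_RETRIES = 3  # continuation budget described above


@dataclass
class TruncationState:
    """Hypothetical subset of the recovery state used by Path 1."""
    has_escalated: bool = False
    recovery_count: int = 0


def run_truncation_path(stop_reasons):
    """Replay a list of stop_reason values and return the actions taken."""
    state, actions = TruncationState(), []
    for reason in stop_reasons:
        if reason != "max_tokens":
            actions.append("done")
            break
        if not state.has_escalated:
            state.has_escalated = True
            actions.append("escalate")      # retry with the larger limit
        elif state.recovery_count < MAX_RECOVERY_RETRIES:
            state.recovery_count += 1
            actions.append("continue")      # inject the continuation prompt
        else:
            actions.append("give_up")       # budget exhausted, exit the loop
            break
    return actions
```

Feeding it six truncations in a row yields one `escalate`, three `continue`, and then `give_up` on the fifth, so the sixth is never reached.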
### Path 2: Context Overflow
The LLM says "your context is too long" (`prompt_too_long`). All four compaction layers from s08 have already run, and it's still over the limit.
Trigger reactive compact, which is more aggressive than auto compact. The teaching version simulates it by keeping only the last 5 messages; real CC generates a compact summary via LLM and retries with the compacted message list. If the request is still over the limit after one compaction, the only option is to exit, because compacting again won't make it any smaller:
```python
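# Fragment from inside the agent loop, wrapped around the LLM call
try:
    response = client.messages.create(...)  # arguments elided here
except PromptTooLongError:
    if not state.has_attempted_reactive_compact:
        messages[:] = reactive_compact(messages)  # replace contents in place
        state.has_attempted_reactive_compact = True
        continue
    return  # Already compacted and still over limit, must exit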
```
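The two helpers used on this path could look roughly like the sketch below. It follows the description above (keep only the last 5 messages), but the function bodies are an assumption rather than the chapter's actual `code.py`, and detecting the overflow from the exception text is a guess at how the error presents itself.

```python
KEEP_LAST = 5  # the teaching version keeps only the last 5 messages


def reactive_compact(messages, keep_last=KEEP_LAST):
    """Return a new list holding only the most recent messages."""
    if keep_last <= 0:
        return []
    return list(messages[-keep_last:])


def is_prompt_too_long_error(exc):
    """Assumption: the overflow error can be recognized from its message text."""
    text = str(exc).lower()
    return "prompt_too_long" in text or "prompt is too long" in text
```

Because `reactive_compact` returns a new list, the caller assigns it with `messages[:] = ...` so that every holder of the original list sees the compacted history. A real implementation would likely also need to avoid cutting between a tool call and its result.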
### Path 3: Transient Failures
Network blips, 429 rate limiting, 529 overload — these aren't bugs, they're normal in distributed systems.
Both 429 and 529 use exponential backoff + jitter: wait 0.5 seconds on the first attempt, 1 second on the second, 2 seconds on the third, up to 10 retries. Random jitter prevents concurrent requests from all retrying at the same instant. Three consecutive 529 overload errors → switch to the fallback model (if `FALLBACK_MODEL_ID` environment variable is configured):
```python
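import random
import time

def retry_delay(attempt, retry_after=None):
    # A server-supplied Retry-After value (seconds) takes priority
    if retry_after:
        return retry_after
    base = min(500 * (2 ** attempt), 32000) / 1000  # seconds, capped at 32s
    return base + random.uniform(0, base * 0.25)    # up to 25% jitter

def with_retry(fn, state, max_retries=10):
    for attempt in range(max_retries):
        try:
            result = fn()
            state.consecutive_529 = 0  # a success breaks the 529 streak
            return result
        except (RateLimitError, OverloadedError) as e:
            if isinstance(e, OverloadedError):
                state.consecutive_529 += 1
                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
                    state.current_model = FALLBACK_MODEL
            else:
                state.consecutive_529 = 0  # a 429 also breaks the streak
            # Assumption: the exception may expose the header as `retry_after`
            time.sleep(retry_delay(attempt, getattr(e, "retry_after", None)))
    raise MaxRetriesExceeded()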
```
Backoff formula, in milliseconds with `attempt` counted from 0: `min(500 × 2^attempt, 32000) + random(0~25%)`, where the jitter is up to 25% of the base delay. If the server returns a `Retry-After` header, that value takes priority.
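As a quick check of the formula, the deterministic part of the schedule can be tabulated on its own. The helper names below are hypothetical; the constants (500 ms base, 32000 ms cap, 25% jitter, 10 attempts) are the ones stated above.

```python
def base_delay_ms(attempt, base_ms=500, cap_ms=32000):
    """Deterministic part of the backoff, with attempt counted from 0."""
    return min(base_ms * (2 ** attempt), cap_ms)


def jitter_range_ms(attempt):
    """Bounds of the random jitter: 0 up to 25% of the base delay."""
    return (0, base_delay_ms(attempt) * 0.25)


# Base delays for the 10 attempts allowed by with_retry
schedule = [base_delay_ms(a) for a in range(10)]
# [500, 1000, 2000, 4000, 8000, 16000, 32000, 32000, 32000, 32000]
```

The cap is reached on the seventh attempt, and the ten base delays add up to 159.5 seconds before jitter, which gives a rough upper bound on how long a single call can stall when every attempt is followed by a wait.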
### Putting It All Together
```python
def agent_loop(messages, context):
    system = get_system_prompt(context)
    state = RecoveryState()
    max_tokens = 8000
    while True:
        try:
            # The lambda reads state.current_model and max_tokens at call time,
            # so a fallback switch or an escalation applies to the next attempt
            response = with_retry(
                lambda: client.messages.create(
                    model=state.current_model, system=system,
                    messages=messages, tools=TOOLS,
                    max_tokens=max_tokens),
                state)
        except Exception as e:
            if is_prompt_too_long_error(e):
                if not state.has_attempted_reactive_compact:
                    messages[:] = reactive_compact(messages)
                    state.has_attempted_reactive_compact = True
                    continue
                return
            log_error(e)
            return
        # max_tokens check BEFORE appending to messages
        if response.stop_reason == "max_tokens":
            if not state.has_escalated:
                max_tokens = 64000
                state.has_escalated = True
                continue  # retry same request, messages unchanged
            # save truncated output + continuation prompt (up to 3 times)
            messages.append({"role": "assistant", "content": response.content})
            if state.recovery_count >= MAX_RECOVERY_RETRIES:
                return  # continuation budget exhausted
            messages.append({"role": "user", "content": CONTINUATION_PROMPT})
            state.recovery_count += 1
            continue
        # Normal completion
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return
        # ... tool execution ...
```
The outer try/except catches API exceptions (prompt_too_long, etc.), `with_retry` handles transient errors (429/529), and `stop_reason` checks handle truncation. Three recovery mechanisms, each handling its own error type.
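All three mechanisms share one small state object. The chapter names `RecoveryState` but does not show its definition here, so the dataclass below is a sketch inferred from the fields the snippets read and write. The `MODEL_ID` variable name and the `"primary-model"` default are hypothetical placeholders; only `FALLBACK_MODEL_ID` is mentioned in the text.

```python
import os
from dataclasses import dataclass

# Assumption: the primary model comes from an env var; the name is hypothetical.
PRIMARY_MODEL = os.environ.get("MODEL_ID", "primary-model")
# None when FALLBACK_MODEL_ID is not configured, which disables the switch.
FALLBACK_MODEL = os.environ.get("FALLBACK_MODEL_ID")


@dataclass
class RecoveryState:
    """Per-run recovery bookkeeping, one instance per agent_loop call."""
    has_escalated: bool = False                   # Path 1: 8K -> 64K used up
    recovery_count: int = 0                       # Path 1: continuations so far
    has_attempted_reactive_compact: bool = False  # Path 2: one-shot compaction
    consecutive_529: int = 0                      # Path 3: overload streak
    current_model: str = PRIMARY_MODEL            # Path 3: may become the fallback
```

Creating the state inside `agent_loop` means every new run starts with a fresh budget for each path.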
---
## Changes from s10
| Component | Before (s10) | After (s11) |
|-----------|-------------|-------------|
| Error handling | None (crashes on any error) | Three recovery patterns + exponential backoff |
| New constants | — | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
| New functions | — | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
| Tools | bash, read_file, write_file (3) | bash, read_file, write_file (3) — unchanged |
| Loop | Bare LLM call | Wrapped in try/except + continue retry |
---
## Try It
```sh
cd learn-claude-code
python s11_error_recovery/code.py
```
Try these prompts:
1. Ask the Agent to generate a very long piece of code, and observe whether it automatically continues after truncation (look for the `[max_tokens] escalating` log)
2. Read many files consecutively to bloat the context, and observe reactive compact
3. If you encounter 429/529, observe the exponential backoff log output
---
## What's Next
The Agent can now automatically recover from errors. But the tasks it handles are still one-shot — you give it a task, it finishes, it's done.
What if the Agent could manage a **task list** — with dependencies, persisted to disk, resumable across sessions? A TODO list is not a task system.
s12 Task System → Tasks form a dependency graph with state and persistence. This is the foundation for multi-Agent collaboration.
<details>
<summary>Deep Dive into CC Source</summary>
> The following is based on CC source code: `query.ts` (1729 lines), `services/api/withRetry.ts` (822 lines), `query/tokenBudget.ts` (93 lines), and `utils/tokenBudget.ts` (73 lines).
### 1. A Dozen-Plus Reason/Transition Codes (Not Just 3)
The teaching version covers 3 of the most common recovery patterns. CC actually has a dozen-plus reason/transition codes, evaluated after every LLM call:
| Reason/Transition | Teaching Version | CC Behavior |
|---|---|---|
| `completed` | Normal completion | Return result |
| `next_turn` | Normal tool call | Continue to next tool execution round |
| `max_output_tokens_escalate` | Path 1 | 8K→64K escalation |
| `max_output_tokens_recovery` | Path 1 continuation | Continuation prompt (up to 3 times) |
| `reactive_compact_retry` | Path 2 | Reactive compact → retry |
| `prompt_too_long` | Path 2 | Same as above |
| `collapse_drain_retry` | Not covered | Context collapse — commit staged content first |
| `model_error` | Not covered | Retry |
| `image_error` | Not covered | `ImageSizeError` / `ImageResizeError` handled specifically |
| `aborted_streaming` | Not covered | Streaming abort recovery |
| `aborted_tools` | Not covered | Tool abort |
| `stop_hook_blocking` | Not covered | Inject blocking error → model self-corrects |
| `stop_hook_prevented` | Not covered | Hooks prevent execution |
| `hook_stopped` | Not covered | Hook stopped execution |
| `token_budget_continuation` | Not covered | Continue when token usage < 90% |
| `blocking_limit` | Not covered | Blocking limit reached |
| `max_turns` | Not covered | Maximum turns reached |
The teaching version only expands on the first 6 rows (the most common); each of the rest has its own dedicated handling logic.
### 2. Precise Exponential Backoff Formula
CC's backoff delay (`withRetry.ts:530-548`):
```
delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)
```
| Attempt | Base Delay | + Jitter |
|---------|-----------|----------|
| 1 | 500ms | 0-125ms |
| 2 | 1000ms | 0-250ms |
| 4 | 4000ms | 0-1000ms |
| 7+ | 32000ms (cap) | 0-8000ms |
If the server returns a `Retry-After` header, that value takes priority.
### 3. Original CONTINUATION Prompt
CC's continuation prompt (`query.ts:1225-1227`):
```
Output token limit hit. Resume directly — no apology, no recap of what
you were doing. Pick up mid-thought if that is where the cut happened.
Break remaining work into smaller pieces.
```
Token budget nudge prompt (`tokenBudget.ts:72`):
```
Stopped at {pct}% of token target. Keep working — do not summarize.
```
### 4. Streaming Error Handling
In CC's streaming path, recoverable errors (413, max_tokens, media errors) are **withheld from display** during streaming (`query.ts:788-822`) — SDK consumers don't see them, only the recovery logic does. After streaming ends, the system determines whether recovery is needed.
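The withholding idea can be illustrated with a toy event list. This is not CC's implementation: the event shape, the `kind` labels, and `partition_stream` are all hypothetical, chosen only to show recoverable errors being held back from the consumer while staying available to the recovery logic.

```python
# Hypothetical labels for the recoverable cases named above
RECOVERABLE_KINDS = {"http_413", "max_tokens", "media_error"}


def partition_stream(events):
    """Split events into those shown to the consumer and those withheld."""
    visible, withheld = [], []
    for event in events:
        if event.get("kind") in RECOVERABLE_KINDS:
            withheld.append(event)   # only the recovery logic sees these
        else:
            visible.append(event)    # passed through to the consumer
    return visible, withheld
```

After the stream ends, a non-empty `withheld` list is the signal that one of the recovery paths should run before anything is reported as a failure.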
### 5. 529 → Fallback Model Switch
After 3 consecutive 529 overload errors (`MAX_529_RETRIES = 3`), CC automatically switches to the fallback model (e.g., Opus → Sonnet). On switch, all pending messages and tool results are cleared, and the user sees "Switched to {model} due to high demand".
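The switch can be simulated without a network. In the sketch below, `OverloadedError` is a local stand-in for a 529 response and `call_with_fallback` is a hypothetical helper; backoff sleeps and the clearing of pending messages are deliberately left out to keep the counter logic visible.

```python
class OverloadedError(Exception):
    """Stand-in for a 529 overloaded response."""


MAX_529_RETRIES = 3  # threshold quoted above


def call_with_fallback(call, primary, fallback, max_attempts=10):
    """Call `call(model)`, switching to `fallback` after 3 consecutive 529s."""
    model, consecutive_529 = primary, 0
    for _ in range(max_attempts):
        try:
            return model, call(model)
        except OverloadedError:
            consecutive_529 += 1
            can_switch = fallback is not None and model != fallback
            if consecutive_529 >= MAX_529_RETRIES and can_switch:
                model, consecutive_529 = fallback, 0  # fresh streak on new model
    raise RuntimeError("retries exhausted")
```

With a primary model that always answers 529, the fourth attempt goes to the fallback; with no fallback configured, the helper keeps retrying the primary until the attempts run out.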
### 6. Diminishing Returns Detection
Token budget "continuations" aren't unlimited. When there are 3 consecutive continuations with a token increment < 500, the system determines "continuing won't produce meaningful output" and stops continuation (`tokenBudget.ts:60-62`).
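The stopping rule can be expressed as a check over recent token increments. The function below is a sketch with a hypothetical name; the window of 3 and the threshold of 500 come from the paragraph above, while how CC actually measures each increment is not shown here.

```python
def should_stop_continuing(token_deltas, window=3, min_delta=500):
    """True when the last `window` continuations each added < `min_delta` tokens."""
    if len(token_deltas) < window:
        return False  # not enough history to call it diminishing returns
    return all(delta < min_delta for delta in token_deltas[-window:])
```

A single productive continuation inside the window resets the verdict, so `[100, 100, 900]` keeps going while `[1200, 300, 200, 100]` stops.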
</details>
<!-- translation-sync: zh@v1, en@v1, ja@v1 -->