Files
gui-yue 1baf1aca5a Follow up PR #265: refine chapters, diagrams, and add S20 (#283)
* feat: s01-s14 docs quality overhaul — tool pipeline, single-agent, knowledge & resilience

Rewrite code.py and README (zh/en/ja) for s01-s14, each chapter building
incrementally on the previous. Key fixes across chapters:

- s01-s04: agent loop, tool dispatch, permission pipeline, hooks
- s05-s08: todo write, subagent, skill loading, context compact
- s09-s11: memory system, system prompt assembly, error recovery
- s12-s14: task graph, background tasks, cron scheduler

All chapters CC source-verified. Code inherits fixes forward (PROMPT_SECTIONS,
json.dumps cache, real-state context, can_start dep protection, etc.).

* feat: s15-s19 docs quality overhaul — multi-agent platform: teams, protocols, autonomy, worktree, MCP tools

Rewrite code.py and README (zh/en/ja) for s15-s19, the multi-agent platform
chapters. Each chapter inherits all previous fixes and adds one mechanism:

- s15: agent teams (TeamCreate, teammate threads, shared task list)
- s16: team protocols (plan approval, shutdown handshake, consume_inbox)
- s17: autonomous agents (idle polling, auto-claim, consume_lead_inbox)
- s18: worktree isolation (git worktree, bind_task, cwd switching, safety)
- s19: MCP tools (MCPClient, normalize_mcp_name, assemble_tool_pool, no cache)

All appendix source code references verified against CC source. Config priority
corrected: claude.ai < plugin < user < project < local.

* fix: 5 regressions across s05-s19 — glob safety, todo validation, memory extraction, protocol types, dep crash

- s05-s09: glob results now filter with is_relative_to(WORKDIR) (inherited from s02)
- s06-s08: todo_write validates content/status required fields (inherited from s05)
- s09: extract_memories uses pre-compression snapshot instead of compacted messages
- s16: submit_plan docstring clarifies protocol-only (not code-level gate)
- s17-s19: match_response restores type mismatch validation (from s16)
- s17-s19: claim_task deps list handles missing dep files without crashing

* fix: s12 Todo V2 logic reversal, s14/s15 cron range validation, s18/s19 worktree name validation

- s12 README (zh/en/ja): fix Todo V2 direction — interactive defaults to Task,
  non-interactive/SDK defaults to TodoWrite. Fix env var name to
  CLAUDE_CODE_ENABLE_TASKS (not TODO_V2).
- s14/s15: add _validate_cron_field with per-field range checks (minute 0-59,
  hour 0-23, dom 1-31, month 1-12, dow 0-6), step > 0, range lo <= hi.
  Replace old try/except validation that only caught exceptions.
- s18/s19: add validate_worktree_name() to remove_worktree and keep_worktree,
  not just create_worktree.

* fix: align s16-s19 teaching tool consistency

* fix pr265 chapter diagrams

* Add comprehensive s20 harness chapter

* Fix chapter smoke test regressions

* Clarify README tutorial track transition

---------

Co-authored-by: Haoran <bill-billion@outlook.com>
2026-05-20 21:45:38 +08:00

278 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# s11: Error Recovery — Errors aren't the end, they're the start of a retry
[中文](README.md) · [English](README.en.md) · [日本語](README.ja.md)
s01 → ... → s09 → s10 → `s11` → [s12](../s12_task_system/) → s13 → ... → s20
> *"Errors aren't the end, they're the start of a retry"* — escalate tokens, compact context, switch models.
>
> **Harness layer**: Resilience — classify and recover when the main loop hits errors.
---
## The Problem
The Agent is running along and then errors out:
```
Error: 529 overloaded
```
The Agent crashes. It doesn't retry, doesn't switch models, doesn't reduce context — it just crashes.
In production, API errors are the norm. The three most common failure modes: **truncated output** (the model runs out of tokens mid-sentence), **context overflow** (still too long even after compaction), and **transient failures** (429 rate limiting / 529 overload). An Agent that doesn't handle errors is like a car that stalls at the slightest touch.
---
## Solution
![Error Recovery Overview](images/error-recovery-overview.en.svg)
The loop and prompt assembly from s10 are fully preserved. The only change: the LLM call is wrapped in try/except, with different recovery paths based on error type. After recovery, `continue` loops back to the top to call the LLM again.
The three most common recovery patterns (the teaching version only handles 429/529; real systems also cover connection errors, timeouts, cloud vendor credential caches, etc. CC actually has 13+ reason codes; see the Deep Dive for the rest):
| Pattern | Trigger | Recovery Action |
|----------|---------|-----------------|
| Output truncated | `max_tokens` | Escalate 8K→64K / continuation prompt |
| Context overflow | `prompt_too_long` | Reactive compact → retry |
| Transient failure | 429 / 529 | Exponential backoff + jitter, fallback model on consecutive 529 |
---
## How It Works
### Path 1: Output Truncated
The model runs out of tokens mid-sentence — `max_tokens` is exhausted. The default 8000 tokens isn't enough for a complete response.
On the first occurrence, escalate `max_tokens` from 8K to 64K (8x the space) and retry the same request — the truncated output is NOT appended to messages, keeping the original request intact. If 64K is still not enough, save the truncated output and inject a continuation prompt telling the model to pick up where it left off, up to 3 times:
```python
if response.stop_reason == "max_tokens":
# First escalation: don't append truncated output, retry same request
if not state.has_escalated:
max_tokens = ESCALATED_MAX_TOKENS
state.has_escalated = True
continue # messages unchanged, same request with more tokens
# 64K still truncated: save output + continuation prompt
messages.append({"role": "assistant", "content": response.content})
if state.recovery_count < MAX_RECOVERY_RETRIES:
messages.append({"role": "user", "content":
"Output token limit hit. Resume directly — "
"no apology, no recap. Pick up mid-thought."})
state.recovery_count += 1
continue
return # still truncated after 3 continuations
# Normal: append after max_tokens check
messages.append({"role": "assistant", "content": response.content})
```
Escalation gets one chance; continuation gets up to 3. After that, exit — further continuations won't produce meaningful output.
### Path 2: Context Overflow
The LLM says "your context is too long" (`prompt_too_long`). All four compaction layers from s08 have already run, and it's still over the limit.
Trigger reactive compact — more aggressive than auto compact. The teaching version keeps only the last 5 messages to simulate compaction; real CC generates a compact summary via LLM, then retries with the compacted message list. Retry after compacting. But if it's still over the limit after one compaction, the only option is to exit — compacting again won't make it any smaller:
```python
except PromptTooLongError:
if not state.has_attempted_reactive_compact:
messages[:] = reactive_compact(messages)
state.has_attempted_reactive_compact = True
continue
return # Already compacted and still over limit — must exit
```
### Path 3: Transient Failures
Network blips, 429 rate limiting, 529 overload — these aren't bugs, they're normal in distributed systems.
Both 429 and 529 use exponential backoff + jitter: wait 0.5 seconds on the first attempt, 1 second on the second, 2 seconds on the third, up to 10 retries. Random jitter prevents concurrent requests from all retrying at the same instant. Three consecutive 529 overload errors → switch to the fallback model (if `FALLBACK_MODEL_ID` environment variable is configured):
```python
def retry_delay(attempt, retry_after=None):
if retry_after:
return retry_after
base = min(500 * (2 ** attempt), 32000) / 1000
return base + random.uniform(0, base * 0.25)
def with_retry(fn, state, max_retries=10):
for attempt in range(max_retries):
try:
return fn()
except (RateLimitError, OverloadedError):
delay = retry_delay(attempt)
time.sleep(delay)
if is_overloaded:
state.consecutive_529 += 1
if state.consecutive_529 >= 3 and FALLBACK_MODEL:
state.current_model = FALLBACK_MODEL
raise MaxRetriesExceeded()
```
Backoff formula: `min(500 × 2^attempt, 32000) + random(0~25%)`. If the server returns a `Retry-After` header, that value takes priority.
### Putting It All Together
```python
def agent_loop(messages, context):
system = get_system_prompt(context)
state = RecoveryState()
max_tokens = 8000
while True:
try:
response = with_retry(
lambda: client.messages.create(
model=state.current_model, system=system,
messages=messages, tools=TOOLS,
max_tokens=max_tokens),
state)
except Exception as e:
if is_prompt_too_long_error(e):
if not state.has_attempted_reactive_compact:
messages[:] = reactive_compact(messages)
state.has_attempted_reactive_compact = True
continue
return
log_error(e)
return
# max_tokens check BEFORE appending to messages
if response.stop_reason == "max_tokens":
if not state.has_escalated:
max_tokens = 64000
state.has_escalated = True
continue # retry same request, messages unchanged
# save truncated output + continuation prompt
messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": CONTINUATION_PROMPT})
continue
# Normal completion
messages.append({"role": "assistant", "content": response.content})
if response.stop_reason != "tool_use":
return
# ... tool execution ...
```
The outer try/except catches API exceptions (prompt_too_long, etc.), `with_retry` handles transient errors (429/529), and `stop_reason` checks handle truncation. Three recovery mechanisms, each handling its own error type.
---
## Changes from s10
| Component | Before (s10) | After (s11) |
|-----------|-------------|-------------|
| Error handling | None (crashes on any error) | Three recovery patterns + exponential backoff |
| New constants | — | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
| New functions | — | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
| Tools | bash, read_file, write_file (3) | bash, read_file, write_file (3) — unchanged |
| Loop | Bare LLM call | Wrapped in try/except + continue retry |
---
## Try It
```sh
cd learn-claude-code
python s11_error_recovery/code.py
```
Try these prompts:
1. Ask the Agent to generate a very long piece of code, and observe whether it automatically continues after truncation (look for the `[max_tokens] escalating` log)
2. Read many files consecutively to bloat the context, and observe reactive compact
3. If you encounter 429/529, observe the exponential backoff log output
---
## What's Next
The Agent can now automatically recover from errors. But the tasks it handles are still one-shot — you give it a task, it finishes, it's done.
What if the Agent could manage a **task list** — with dependencies, persisted to disk, resumable across sessions? A TODO list is not a task system.
s12 Task System → Tasks form a dependency graph with state and persistence. This is the foundation for multi-Agent collaboration.
<details>
<summary>Deep Dive into CC Source</summary>
> The following is based on CC source code: `query.ts` (1729 lines), `services/api/withRetry.ts` (822 lines), `query/tokenBudget.ts` (93 lines), and `utils/tokenBudget.ts` (73 lines).
### 1. A Dozen-Plus Reason/Transition Codes (Not Just 3)
The teaching version covers 3 of the most common recovery patterns. CC actually has a dozen-plus reason/transition codes, evaluated after every LLM call:
| Reason/Transition | Teaching Version | CC Behavior |
|---|---|---|
| `completed` | Normal completion | Return result |
| `next_turn` | Normal tool call | Continue to next tool execution round |
| `max_output_tokens_escalate` | Path 1 | 8K→64K escalation |
| `max_output_tokens_recovery` | Path 1 continuation | Continuation prompt (up to 3 times) |
| `reactive_compact_retry` | Path 2 | Reactive compact → retry |
| `prompt_too_long` | Path 2 | Same as above |
| `collapse_drain_retry` | Not covered | Context collapse — commit staged content first |
| `model_error` | Not covered | Retry |
| `image_error` | Not covered | `ImageSizeError` / `ImageResizeError` handled specifically |
| `aborted_streaming` | Not covered | Streaming abort recovery |
| `aborted_tools` | Not covered | Tool abort |
| `stop_hook_blocking` | Not covered | Inject blocking error → model self-corrects |
| `stop_hook_prevented` | Not covered | Hooks prevent execution |
| `hook_stopped` | Not covered | Hook stopped execution |
| `token_budget_continuation` | Not covered | Continue when token usage < 90% |
| `blocking_limit` | Not covered | Blocking limit reached |
| `max_turns` | Not covered | Maximum turns reached |
The teaching version only expands on the first 5 (most common); each of the rest has its own dedicated handling logic.
### 2. Precise Exponential Backoff Formula
CC's backoff delay (`withRetry.ts:530-548`):
```
delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)
```
| Attempt | Base Delay | + Jitter |
|---------|-----------|----------|
| 1 | 500ms | 0-125ms |
| 2 | 1000ms | 0-250ms |
| 4 | 4000ms | 0-1000ms |
| 7+ | 32000ms (cap) | 0-8000ms |
If the server returns a `Retry-After` header, that value takes priority.
### 3. Original CONTINUATION Prompt
CC's continuation prompt (`query.ts:1225-1227`):
```
Output token limit hit. Resume directly — no apology, no recap of what
you were doing. Pick up mid-thought if that is where the cut happened.
Break remaining work into smaller pieces.
```
Token budget nudge prompt (`tokenBudget.ts:72`):
```
Stopped at {pct}% of token target. Keep working — do not summarize.
```
### 4. Streaming Error Handling
In CC's streaming path, recoverable errors (413, max_tokens, media errors) are **withheld from display** during streaming (`query.ts:788-822`) — SDK consumers don't see them, only the recovery logic does. After streaming ends, the system determines whether recovery is needed.
### 5. 529 → Fallback Model Switch
After 3 consecutive 529 overload errors (`MAX_529_RETRIES = 3`), CC automatically switches to the fallback model (e.g., Opus → Sonnet). On switch, all pending messages and tool results are cleared, and the user sees "Switched to {model} due to high demand".
### 6. Diminishing Returns Detection
Token budget "continuations" aren't unlimited. When there are 3 consecutive continuations with a token increment < 500, the system determines "continuing won't produce meaningful output" and stops continuation (`tokenBudget.ts:60-62`).
</details>
<!-- translation-sync: zh@v1, en@v1, ja@v1 -->