Files
gui-yue 1baf1aca5a Follow up PR #265: refine chapters, diagrams, and add S20 (#283)
* feat: s01-s14 docs quality overhaul — tool pipeline, single-agent, knowledge & resilience

Rewrite code.py and README (zh/en/ja) for s01-s14, each chapter building
incrementally on the previous. Key fixes across chapters:

- s01-s04: agent loop, tool dispatch, permission pipeline, hooks
- s05-s08: todo write, subagent, skill loading, context compact
- s09-s11: memory system, system prompt assembly, error recovery
- s12-s14: task graph, background tasks, cron scheduler

All chapters CC source-verified. Code inherits fixes forward (PROMPT_SECTIONS,
json.dumps cache, real-state context, can_start dep protection, etc.).

* feat: s15-s19 docs quality overhaul — multi-agent platform: teams, protocols, autonomy, worktree, MCP tools

Rewrite code.py and README (zh/en/ja) for s15-s19, the multi-agent platform
chapters. Each chapter inherits all previous fixes and adds one mechanism:

- s15: agent teams (TeamCreate, teammate threads, shared task list)
- s16: team protocols (plan approval, shutdown handshake, consume_inbox)
- s17: autonomous agents (idle polling, auto-claim, consume_lead_inbox)
- s18: worktree isolation (git worktree, bind_task, cwd switching, safety)
- s19: MCP tools (MCPClient, normalize_mcp_name, assemble_tool_pool, no cache)

All appendix source code references verified against CC source. Config priority
corrected: claude.ai < plugin < user < project < local.

* fix: 5 regressions across s05-s19 — glob safety, todo validation, memory extraction, protocol types, dep crash

- s05-s09: glob results now filter with is_relative_to(WORKDIR) (inherited from s02)
- s06-s08: todo_write validates content/status required fields (inherited from s05)
- s09: extract_memories uses pre-compression snapshot instead of compacted messages
- s16: submit_plan docstring clarifies protocol-only (not code-level gate)
- s17-s19: match_response restores type mismatch validation (from s16)
- s17-s19: claim_task deps list handles missing dep files without crashing

* fix: s12 Todo V2 logic reversal, s14/s15 cron range validation, s18/s19 worktree name validation

- s12 README (zh/en/ja): fix Todo V2 direction — interactive defaults to Task,
  non-interactive/SDK defaults to TodoWrite. Fix env var name to
  CLAUDE_CODE_ENABLE_TASKS (not TODO_V2).
- s14/s15: add _validate_cron_field with per-field range checks (minute 0-59,
  hour 0-23, dom 1-31, month 1-12, dow 0-6), step > 0, range lo <= hi.
  Replace old try/except validation that only caught exceptions.
- s18/s19: add validate_worktree_name() to remove_worktree and keep_worktree,
  not just create_worktree.

* fix: align s16-s19 teaching tool consistency

* fix pr265 chapter diagrams

* Add comprehensive s20 harness chapter

* Fix chapter smoke test regressions

* Clarify README tutorial track transition

---------

Co-authored-by: Haoran <bill-billion@outlook.com>
2026-05-20 21:45:38 +08:00

278 lines
11 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# s11: Error Recovery — 错误不是结束,是重试的开始
[中文](README.md) · [English](README.en.md) · [日本語](README.ja.md)
s01 → ... → s09 → s10 → `s11` → [s12](../s12_task_system/) → s13 → ... → s20
> *"错误不是终点, 是重试的起点"* — 升级 token、压缩上下文、切换模型。
>
> **Harness 层**: 韧性 — 主循环遇到错误时分类并恢复。
---
## 问题
Agent 跑着跑着报错了:
```
Error: 529 overloaded
```
Agent 崩溃了。它没有重试,没有换模型,没有减少上下文——直接崩溃。
生产环境中 API 错误是常态。三种最常见的故障模式:**输出被截断**(模型话说一半 token 用完了)、**上下文超限**(压缩后还是太长)、**临时故障**429 限流 / 529 过载)。一个不处理错误的 Agent 就像一个一碰就熄火的车。
---
## 解决方案
![Error Recovery Overview](images/error-recovery-overview.svg)
s10 的循环、prompt 组装全部保留。唯一的变动LLM 调用包裹在 try/except 里,根据错误类型走不同的恢复路径。恢复后 `continue` 回到循环开头重新调用 LLM。
三种最常见的恢复模式(教学版只处理 429/529真实系统还覆盖连接错误、超时、云厂商认证缓存等。CC 实际有 13+ reason code其余见 Deep dive
| 模式 | 触发 | 恢复动作 |
|------|------|---------|
| 输出截断 | `max_tokens` | 升级 8K→64K / 续写提示 |
| 上下文超限 | `prompt_too_long` | reactive compact → 重试 |
| 临时故障 | 429 / 529 | 指数退避 + 抖动,连续 529 可切换备用模型 |
---
## 工作原理
### 路径 1: 输出被截断
模型话说一半,`max_tokens` 用完了。默认 8000 token 不够它输出完整回答。
第一次发生时,直接把 `max_tokens` 从 8K 升级到 64K8 倍空间),重试同一请求——此时不追加截断输出到 messages保持原始请求不变。如果 64K 还是不够,才保存截断输出并注入续写提示让模型接着刚才的话继续说,最多 3 次:
```python
if response.stop_reason == "max_tokens":
# First escalation: don't append truncated output, retry same request
if not state.has_escalated:
max_tokens = ESCALATED_MAX_TOKENS
state.has_escalated = True
continue # messages unchanged, same request with more tokens
# 64K still truncated: save output + continuation prompt
messages.append({"role": "assistant", "content": response.content})
if state.recovery_count < MAX_RECOVERY_RETRIES:
messages.append({"role": "user", "content":
"Output token limit hit. Resume directly — "
"no apology, no recap. Pick up mid-thought."})
state.recovery_count += 1
continue
return # still truncated after 3 continuations
# Normal: append after max_tokens check
messages.append({"role": "assistant", "content": response.content})
```
升级只有一次机会,续写最多 3 次。超过就退出——继续续写也不会有实质产出。
### 路径 2: 上下文超限
LLM 说"你的上下文太长了"`prompt_too_long`。s08 的四层压缩全跑过了,还是超。
触发 reactive compact——比 auto compact 更激进。教学版只保留最后 5 条消息模拟压缩效果;真实实现会调用 LLM 生成 compact 摘要再重试。压缩后重试。但如果压缩过一次还是超限,只能退出——再压缩也不会变小:
```python
except PromptTooLongError:
if not state.has_attempted_reactive_compact:
messages[:] = reactive_compact(messages)
state.has_attempted_reactive_compact = True
continue
return # 压缩过了还是超限,只能退出
```
### 路径 3: 临时故障
网络抖动、429 限流、529 过载——这些不是 bug是分布式系统的常态。
429 和 529 统一走指数退避 + 抖动:第一次等 0.5 秒,第二次等 1 秒,第三次等 2 秒,最多 10 次。加随机抖动让并发请求不在同一时刻重试。连续 3 次 529 过载 → 切换到备用模型(若配置了 `FALLBACK_MODEL_ID` 环境变量):
```python
def retry_delay(attempt, retry_after=None):
if retry_after:
return retry_after
base = min(500 * (2 ** attempt), 32000) / 1000
return base + random.uniform(0, base * 0.25)
def with_retry(fn, state, max_retries=10):
for attempt in range(max_retries):
try:
return fn()
except (RateLimitError, OverloadedError):
delay = retry_delay(attempt)
time.sleep(delay)
if is_overloaded:
state.consecutive_529 += 1
if state.consecutive_529 >= 3 and FALLBACK_MODEL:
state.current_model = FALLBACK_MODEL
raise MaxRetriesExceeded()
```
退避公式:`min(500 × 2^attempt, 32000) + random(0~25%)`。如果服务器返回 `Retry-After` header优先用那个值。
### 合起来跑
```python
def agent_loop(messages, context):
system = get_system_prompt(context)
state = RecoveryState()
max_tokens = 8000
while True:
try:
response = with_retry(
lambda: client.messages.create(
model=state.current_model, system=system,
messages=messages, tools=TOOLS,
max_tokens=max_tokens),
state)
except Exception as e:
if is_prompt_too_long_error(e):
if not state.has_attempted_reactive_compact:
messages[:] = reactive_compact(messages)
state.has_attempted_reactive_compact = True
continue
return
log_error(e)
return
# max_tokens check BEFORE appending to messages
if response.stop_reason == "max_tokens":
if not state.has_escalated:
max_tokens = 64000
state.has_escalated = True
continue # retry same request, messages unchanged
# save truncated output + continuation prompt
messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": CONTINUATION_PROMPT})
continue
# Normal completion
messages.append({"role": "assistant", "content": response.content})
if response.stop_reason != "tool_use":
return
# ... tool execution ...
```
外层 try/except 捕获 API 异常prompt_too_long 等),`with_retry` 处理瞬态错误429/529`stop_reason` 检查处理截断。三种恢复机制各管各的错误类型。
---
## 相对 s10 的变更
| 组件 | 之前 (s10) | 之后 (s11) |
|------|-----------|-----------|
| 错误处理 | 无(一碰就崩溃) | 三种恢复模式 + 指数退避 |
| 新常量 | — | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
| 新函数 | — | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
| 工具 | bash, read_file, write_file (3) | bash, read_file, write_file (3) — 不变 |
| 循环 | 裸调用 LLM | try/except 包裹 + continue 重试 |
---
## 试一下
```sh
cd learn-claude-code
python s11_error_recovery/code.py
```
试试这些 prompt
1. 让 Agent 生成一段很长的代码,观察截断后是否自动续写(看 `[max_tokens] escalating` 日志)
2. 连续读取大量文件撑大上下文,观察 reactive compact
3. 如果遇到 429/529观察指数退避的日志输出
---
## 接下来
Agent 现在能在错误中自动恢复了。但它处理的任务仍然是"一次性"的——你给它一个任务,它做完,结束。
能不能让 Agent 管理一个**任务列表**——有依赖关系、持久化到磁盘、跨会话能恢复TODO 列表不是任务系统。
s12 Task System → 任务是有依赖、有状态、持久化的图。这是多 Agent 协作的基础。
<details>
<summary>深入 CC 源码</summary>
> 以下基于 CC 源码 `query.ts`1729 行)、`services/api/withRetry.ts`822 行)、`query/tokenBudget.ts`93 行)、`utils/tokenBudget.ts`73 行)的分析。
### 一、十几种 reason/transition不只是 3 条)
教学版讲了 3 种最常见的恢复模式。CC 实际有十几种 reason/transition每轮 LLM 调用后都会判断:
| reason/transition | 教学版对应 | CC 行为 |
|---|---|---|
| `completed` | 正常完成 | 返回结果 |
| `next_turn` | 正常工具调用 | 继续下一轮工具执行 |
| `max_output_tokens_escalate` | 路径 1 | 8K→64K 升级 |
| `max_output_tokens_recovery` | 路径 1 续写 | 续写提示(最多 3 次) |
| `reactive_compact_retry` | 路径 2 | reactive compact → 重试 |
| `prompt_too_long` | 路径 2 | 同上 |
| `collapse_drain_retry` | 未展开 | context collapse 先提交暂存 |
| `model_error` | 未展开 | 重试 |
| `image_error` | 未展开 | `ImageSizeError` / `ImageResizeError` 专门处理 |
| `aborted_streaming` | 未展开 | 流式中止恢复 |
| `aborted_tools` | 未展开 | 工具中止 |
| `stop_hook_blocking` | 未展开 | 注入 blocking error → 模型自纠 |
| `stop_hook_prevented` | 未展开 | hooks 阻止 |
| `hook_stopped` | 未展开 | hook 停止执行 |
| `token_budget_continuation` | 未展开 | token 用量 < 90% 时继续 |
| `blocking_limit` | 未展开 | 阻塞限制 |
| `max_turns` | 未展开 | 达到最大轮次 |
教学版只展开了前 5 种(最常见的),其余各有专门处理逻辑。
### 二、指数退避的精确公式
CC 的退避延迟(`withRetry.ts:530-548`
```
delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)
```
| 尝试 | 基础延迟 | + 抖动 |
|------|---------|--------|
| 1 | 500ms | 0-125ms |
| 2 | 1000ms | 0-250ms |
| 4 | 4000ms | 0-1000ms |
| 7+ | 32000ms上限 | 0-8000ms |
如果服务器返回 `Retry-After` header优先用那个值。
### 三、CONTINUATION 提示原文
CC 的续写提示(`query.ts:1225-1227`
```
Output token limit hit. Resume directly — no apology, no recap of what
you were doing. Pick up mid-thought if that is where the cut happened.
Break remaining work into smaller pieces.
```
Token budget 的 nudge 提示(`tokenBudget.ts:72`
```
Stopped at {pct}% of token target. Keep working — do not summarize.
```
### 四、流式错误处理
CC 的流式路径中可恢复的错误413、max_tokens、media error在 streaming 期间**被暂扣不展示**`query.ts:788-822`——SDK 消费者看不到,只有恢复逻辑能看到。等 streaming 结束后才判断是否需要恢复。
### 五、529 → Fallback Model 切换
连续 3 次 529 过载错误后(`MAX_529_RETRIES = 3`CC 自动切换到 fallback model如 Opus → Sonnet。切换时清除所有 pending 消息和 tool 结果,给用户展示 "Switched to {model} due to high demand"。
### 六、Diminishing Returns 检测
Token budget 的"继续"不是无限的。当连续 3 次 continuation 且 token 增量 < 500 时,系统判断"继续也没有实质性产出",停止 continuation`tokenBudget.ts:60-62`)。
</details>
<!-- translation-sync: zh@v1, en@v1, ja@v1 -->