gui-yue 1baf1aca5a Follow up PR #265: refine chapters, diagrams, and add S20 (#283)
* feat: s01-s14 docs quality overhaul — tool pipeline, single-agent, knowledge & resilience

Rewrite code.py and README (zh/en/ja) for s01-s14, each chapter building
incrementally on the previous. Key fixes across chapters:

- s01-s04: agent loop, tool dispatch, permission pipeline, hooks
- s05-s08: todo write, subagent, skill loading, context compact
- s09-s11: memory system, system prompt assembly, error recovery
- s12-s14: task graph, background tasks, cron scheduler

All chapters CC source-verified. Code inherits fixes forward (PROMPT_SECTIONS,
json.dumps cache, real-state context, can_start dep protection, etc.).

* feat: s15-s19 docs quality overhaul — multi-agent platform: teams, protocols, autonomy, worktree, MCP tools

Rewrite code.py and README (zh/en/ja) for s15-s19, the multi-agent platform
chapters. Each chapter inherits all previous fixes and adds one mechanism:

- s15: agent teams (TeamCreate, teammate threads, shared task list)
- s16: team protocols (plan approval, shutdown handshake, consume_inbox)
- s17: autonomous agents (idle polling, auto-claim, consume_lead_inbox)
- s18: worktree isolation (git worktree, bind_task, cwd switching, safety)
- s19: MCP tools (MCPClient, normalize_mcp_name, assemble_tool_pool, no cache)

All appendix source code references verified against CC source. Config priority
corrected: claude.ai < plugin < user < project < local.

* fix: 5 regressions across s05-s19 — glob safety, todo validation, memory extraction, protocol types, dep crash

- s05-s09: glob results now filter with is_relative_to(WORKDIR) (inherited from s02)
- s06-s08: todo_write validates content/status required fields (inherited from s05)
- s09: extract_memories uses pre-compression snapshot instead of compacted messages
- s16: submit_plan docstring clarifies protocol-only (not code-level gate)
- s17-s19: match_response restores type mismatch validation (from s16)
- s17-s19: claim_task deps list handles missing dep files without crashing

* fix: s12 Todo V2 logic reversal, s14/s15 cron range validation, s18/s19 worktree name validation

- s12 README (zh/en/ja): fix Todo V2 direction — interactive defaults to Task,
  non-interactive/SDK defaults to TodoWrite. Fix env var name to
  CLAUDE_CODE_ENABLE_TASKS (not TODO_V2).
- s14/s15: add _validate_cron_field with per-field range checks (minute 0-59,
  hour 0-23, dom 1-31, month 1-12, dow 0-6), step > 0, range lo <= hi.
  Replace old try/except validation that only caught exceptions.
- s18/s19: add validate_worktree_name() to remove_worktree and keep_worktree,
  not just create_worktree.

* fix: align s16-s19 teaching tool consistency

* fix pr265 chapter diagrams

* Add comprehensive s20 harness chapter

* Fix chapter smoke test regressions

* Clarify README tutorial track transition

---------

Co-authored-by: Haoran <bill-billion@outlook.com>
2026-05-20 21:45:38 +08:00


s11: Error Recovery - An Error Is Not the End but the Start of a Retry

中文 · English · 日本語

s01 → ... → s09 → s10 → s11 → s12 → s13 → ... → s20

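"An error is not the finish line; it is where the retry starts": escalate the token limit, compact the context, switch models.

Harness layer: resilience. When the main loop hits an error, it classifies the error and recovers.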


Problem

Partway through a run, the agent hits an error:

Error: 529 overloaded

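The agent crashes. It does not retry, it does not switch models, it does not shrink the context. It simply crashes.

In production, API errors are routine. The three most common failure modes are truncated output (the model runs out of tokens mid-sentence), context overflow (the prompt is still too long after compaction), and transient failures (429 rate limiting / 529 overload). An agent that does not handle errors is like a car that stalls at the slightest bump.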


Solution

Error Recovery Overview

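The loop and prompt assembly from s10 are kept unchanged. The only change: the LLM call is wrapped in try/except, and each error type takes its own recovery path. After recovery, `continue` jumps back to the top of the loop and calls the LLM again.

The three most common recovery modes are listed below. For transient failures the teaching version handles only 429/529; a real system also covers connection errors, timeouts, cloud-provider auth caches, and more. CC actually has 13+ reason codes; the rest are covered in the deep dive.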

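| Mode | Trigger | Recovery action |
|---|---|---|
| Truncated output | max_tokens | Escalate 8K→64K, then continuation prompt |
| Context overflow | prompt_too_long | reactive compact → retry |
| Transient failure | 429 / 529 | Exponential backoff + jitter; consecutive 529s can switch to a fallback model |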

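How it works

Path 1: Truncated output

The model is mid-sentence when max_tokens runs out. The default of 8000 tokens is not enough for it to finish the answer.

The first time this happens, max_tokens is escalated from 8K to 64K (8× the room) and the same request is retried. At this point the truncated output is not appended to messages, so the original request stays unchanged. Only if 64K is still not enough is the truncated output saved and a continuation prompt injected so the model picks up where it left off, at most 3 times: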

if response.stop_reason == "max_tokens":
    # First escalation: don't append truncated output, retry same request
    if not state.has_escalated:
        max_tokens = ESCALATED_MAX_TOKENS
        state.has_escalated = True
        continue  # messages unchanged, same request with more tokens
    # 64K still truncated: save output + continuation prompt
    messages.append({"role": "assistant", "content": response.content})
    if state.recovery_count < MAX_RECOVERY_RETRIES:
        messages.append({"role": "user", "content":
            "Output token limit hit. Resume directly — "
            "no apology, no recap. Pick up mid-thought."})
        state.recovery_count += 1
        continue
    return  # still truncated after 3 continuations
# Normal: append after max_tokens check
messages.append({"role": "assistant", "content": response.content})

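Escalation happens only once, and continuation at most 3 times. Past that the loop exits, because further continuation would not produce anything substantive.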
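To make the "one escalation, at most three continuations" budget concrete, here is a minimal, self-contained simulation of Path 1. It is a sketch under assumptions: `run_truncation_path`, `always_truncated`, and `fits_in_64k` are hypothetical names introduced for illustration, the stub models stand in for the real API client, and the continuation prompt is shortened.

```python
from types import SimpleNamespace

ESCALATED_MAX_TOKENS = 64000
MAX_RECOVERY_RETRIES = 3
CONTINUATION_PROMPT = "Output token limit hit. Resume directly."  # shortened for the sketch


def run_truncation_path(call_model, messages, max_tokens=8000):
    """Drive one turn through the truncation path; return the number of model calls."""
    state = SimpleNamespace(has_escalated=False, recovery_count=0)
    calls = 0
    while True:
        response = call_model(messages, max_tokens)
        calls += 1
        if response.stop_reason != "max_tokens":
            # Normal completion: keep the answer and stop.
            messages.append({"role": "assistant", "content": response.content})
            return calls
        if not state.has_escalated:
            # First truncation: retry the same request with a larger budget.
            max_tokens = ESCALATED_MAX_TOKENS
            state.has_escalated = True
            continue
        # Still truncated at 64K: keep the partial output.
        messages.append({"role": "assistant", "content": response.content})
        if state.recovery_count >= MAX_RECOVERY_RETRIES:
            return calls  # continuation budget used up
        messages.append({"role": "user", "content": CONTINUATION_PROMPT})
        state.recovery_count += 1


def always_truncated(messages, max_tokens):
    # Hypothetical stub: truncated no matter how large the budget is.
    return SimpleNamespace(stop_reason="max_tokens", content="partial")


def fits_in_64k(messages, max_tokens):
    # Hypothetical stub: completes once the escalated budget is in effect.
    done = max_tokens >= ESCALATED_MAX_TOKENS
    return SimpleNamespace(stop_reason="end_turn" if done else "max_tokens",
                           content="answer")
```

With `fits_in_64k` the turn takes 2 calls (the original request plus the escalated retry). With `always_truncated` it takes 5 calls: the original request, the escalated retry, and 3 continuations.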

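Path 2: Context overflow

The LLM says "your context is too long" (prompt_too_long). All four compaction layers from s08 have already run, and the prompt is still over the limit.

This triggers reactive compact, which is more aggressive than auto compact. The teaching version keeps only the last 5 messages to simulate the effect of compaction; a real implementation would call the LLM to generate a compact summary. After compacting, the request is retried. If the prompt is still too long after one compaction, the only option is to exit, because compacting again will not make it any smaller: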

except PromptTooLongError:
    if not state.has_attempted_reactive_compact:
        messages[:] = reactive_compact(messages)
        state.has_attempted_reactive_compact = True
        continue
    return  # still too long after one compaction: give up
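The two helpers used in this path can be sketched as follows. This is an illustration, not the chapter's actual code: the text only says the teaching version keeps the last 5 messages, so `KEEP_LAST` and the function bodies are assumptions, and matching on the phrase "prompt is too long" in the error text is likewise an assumed detection rule.

```python
KEEP_LAST = 5  # the teaching version keeps only the last 5 messages


def reactive_compact(messages, keep_last=KEEP_LAST):
    """Return a new list holding only the most recent messages.

    A blunt cut like this can separate a tool result from the tool call
    that produced it; a real implementation summarizes instead.
    """
    return list(messages[-keep_last:])


def is_prompt_too_long_error(error):
    # Assumption: the API error text contains "prompt is too long".
    return "prompt is too long" in str(error).lower()
```

Because the loop assigns with `messages[:] = reactive_compact(messages)`, the caller's list is updated in place and every other reference to it sees the compacted history.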

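Path 3: Transient failures

Network jitter, 429 rate limiting, 529 overload: these are not bugs, they are the normal state of a distributed system.

429 and 529 share one path, exponential backoff with jitter: the first retry waits 0.5 s, the second 1 s, the third 2 s, up to 10 attempts. Random jitter keeps concurrent requests from retrying at the same instant. After 3 consecutive 529 overloads, the loop switches to the fallback model (if the FALLBACK_MODEL_ID environment variable is set):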

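import random
import time

# RateLimitError, OverloadedError, MaxRetriesExceeded and FALLBACK_MODEL
# are assumed to be defined elsewhere in code.py.

def retry_delay(attempt, retry_after=None):
    # A server-provided Retry-After value (in seconds) takes priority.
    if retry_after is not None:
        return retry_after
    # 0.5 s, 1 s, 2 s, ... capped at 32 s, plus up to 25% random jitter.
    base = min(500 * (2 ** attempt), 32000) / 1000
    return base + random.uniform(0, base * 0.25)

def with_retry(fn, state, max_retries=10):
    for attempt in range(max_retries):
        try:
            result = fn()
            state.consecutive_529 = 0  # a success ends the 529 streak
            return result
        except (RateLimitError, OverloadedError) as e:
            if isinstance(e, OverloadedError):
                state.consecutive_529 += 1
                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
                    state.current_model = FALLBACK_MODEL
            # retry_after is an assumed attribute carrying the Retry-After value
            time.sleep(retry_delay(attempt, getattr(e, "retry_after", None)))
    raise MaxRetriesExceeded()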

Backoff formula: min(500 × 2^attempt, 32000) ms, plus random jitter of 0 to 25%. If the server returns a Retry-After header, that value takes priority.
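The formula can be worked through without the random term to see the deterministic part of the schedule. This is a small sketch: the constant names mirror the values in the formula, while `base_delay_seconds` and `delay_bounds` are hypothetical helper names.

```python
BASE_DELAY_MS = 500
MAX_DELAY_MS = 32000
JITTER_FRACTION = 0.25


def base_delay_seconds(attempt):
    """Deterministic part of the backoff for a 0-indexed attempt."""
    return min(BASE_DELAY_MS * (2 ** attempt), MAX_DELAY_MS) / 1000


def delay_bounds(attempt):
    """(minimum, maximum) delay once 0 to 25% jitter is added."""
    base = base_delay_seconds(attempt)
    return base, base * (1 + JITTER_FRACTION)


# Base delays for the 10 attempts allowed by max_retries=10.
schedule = [base_delay_seconds(a) for a in range(10)]
```

The cap is reached at the seventh attempt (index 6), so the base delays are 0.5, 1, 2, 4, 8, 16 and then 32 seconds for every later attempt. Over all 10 attempts they add up to 159.5 seconds before jitter.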

Putting it together

def agent_loop(messages, context):
    system = get_system_prompt(context)
    state = RecoveryState()
    max_tokens = 8000

    while True:
        try:
            response = with_retry(
                lambda: client.messages.create(
                    model=state.current_model, system=system,
                    messages=messages, tools=TOOLS,
                    max_tokens=max_tokens),
                state)
        except Exception as e:
            if is_prompt_too_long_error(e):
                if not state.has_attempted_reactive_compact:
                    messages[:] = reactive_compact(messages)
                    state.has_attempted_reactive_compact = True
                    continue
                return
            log_error(e)
            return

        # max_tokens check BEFORE appending to messages
        if response.stop_reason == "max_tokens":
            if not state.has_escalated:
                max_tokens = 64000
                state.has_escalated = True
                continue  # retry same request, messages unchanged
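            # save truncated output, then continue at most MAX_RECOVERY_RETRIES times
            messages.append({"role": "assistant", "content": response.content})
            if state.recovery_count >= MAX_RECOVERY_RETRIES:
                return  # still truncated after 3 continuations
            messages.append({"role": "user", "content": CONTINUATION_PROMPT})
            state.recovery_count += 1
            continue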
        # Normal completion
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            return
        # ... tool execution ...

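The outer try/except catches API exceptions (prompt_too_long and the like), with_retry handles transient errors (429/529), and the stop_reason check handles truncation. Each of the three recovery mechanisms handles its own error type.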
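All three mechanisms share one small piece of per-loop state. The field names below are the ones the snippets above read and write; the dataclass form, the defaults, and the `DEFAULT_MODEL` placeholder are assumptions made for this sketch.

```python
from dataclasses import dataclass

DEFAULT_MODEL = "primary-model"  # hypothetical placeholder, not a real model ID


@dataclass
class RecoveryState:
    has_escalated: bool = False                   # Path 1: the 8K→64K escalation was used
    recovery_count: int = 0                       # Path 1: continuation prompts sent so far
    has_attempted_reactive_compact: bool = False  # Path 2: reactive compact already tried
    consecutive_529: int = 0                      # Path 3: current streak of 529 errors
    current_model: str = DEFAULT_MODEL            # Path 3: becomes the fallback model on switch
```

Keeping the flags in one object makes each recovery a one-shot or bounded action: a path checks its own flag or counter before acting, so a persistent error cannot trap the loop in endless retries.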


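Changes from s10

| Component | Before (s10) | After (s11) |
|---|---|---|
| Error handling | None (crashes on the first error) | Three recovery modes + exponential backoff |
| New constants | - | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
| New functions | - | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
| Tools | bash, read_file, write_file (3) | bash, read_file, write_file (3), unchanged |
| Loop | Bare LLM call | Wrapped in try/except, with continue to retry |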

Try it

cd learn-claude-code
python s11_error_recovery/code.py

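Try these prompts:

  1. Ask the agent to generate a very long piece of code and watch whether it continues automatically after truncation (look for the [max_tokens] escalating log line)
  2. Read a large number of files in a row to inflate the context and watch reactive compact kick in
  3. If you hit a 429/529, watch the exponential backoff log output

What's next

The agent can now recover from errors on its own. But the tasks it handles are still "one-shot": you give it a task, it finishes, and that is the end.

Can the agent manage a task list with dependencies, persisted to disk, and recoverable across sessions? A TODO list is not a task system.

s12 Task System → tasks form a graph with dependencies, state, and persistence. This is the foundation for multi-agent collaboration.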

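Deep dive into the CC source

The following is based on an analysis of the CC source files query.ts (1729 lines), services/api/withRetry.ts (822 lines), query/tokenBudget.ts (93 lines), and utils/tokenBudget.ts (73 lines).

1. More than a dozen reasons/transitions (not just 3)

The teaching version covers the 3 most common recovery modes. CC actually has more than a dozen reasons/transitions, and it evaluates them after every LLM call: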

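| reason/transition | Teaching version | CC behavior |
|---|---|---|
| completed | Normal completion | Return the result |
| next_turn | Normal tool call | Continue to the next round of tool execution |
| max_output_tokens_escalate | Path 1 | Escalate 8K→64K |
| max_output_tokens_recovery | Path 1 (continuation) | Continuation prompt (at most 3 times) |
| reactive_compact_retry | Path 2 | reactive compact → retry |
| prompt_too_long | Path 2 | Same as above |
| collapse_drain_retry | Not covered | Context collapse: commit what is staged first |
| model_error | Not covered | Retry |
| image_error | Not covered | Dedicated handling for ImageSizeError / ImageResizeError |
| aborted_streaming | Not covered | Recover from an aborted stream |
| aborted_tools | Not covered | Tools aborted |
| stop_hook_blocking | Not covered | Inject a blocking error → the model corrects itself |
| stop_hook_prevented | Not covered | Blocked by hooks |
| hook_stopped | Not covered | A hook stopped execution |
| token_budget_continuation | Not covered | Continue while token usage is < 90% |
| blocking_limit | Not covered | Blocking limit |
| max_turns | Not covered | Maximum number of turns reached |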

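The teaching version covers only the first six rows (the most common ones); each of the others has its own dedicated handling logic.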

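2. The exact exponential backoff formula

CC's backoff delay (withRetry.ts:530-548):

delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)

| Attempt | Base delay | + Jitter |
|---|---|---|
| 1 | 500ms | 0-125ms |
| 2 | 1000ms | 0-250ms |
| 4 | 4000ms | 0-1000ms |
| 7+ | 32000ms (cap) | 0-8000ms |

Here attempt counts from 1, which is why the exponent is attempt-1; the teaching code counts from 0 and uses 2^attempt, so both produce the same delays. If the server returns a Retry-After header, that value takes priority.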

3. The original CONTINUATION prompt

CC's continuation prompt (query.ts:1225-1227):

Output token limit hit. Resume directly — no apology, no recap of what
you were doing. Pick up mid-thought if that is where the cut happened.
Break remaining work into smaller pieces.

The token budget nudge prompt (tokenBudget.ts:72):

Stopped at {pct}% of token target. Keep working — do not summarize.

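4. Streaming error handling

In CC's streaming path, recoverable errors (413, max_tokens, media errors) are withheld and not surfaced while streaming is in progress (query.ts:788-822). SDK consumers never see them; only the recovery logic does. Whether recovery is needed is decided only after streaming ends.

5. 529 → fallback model switch

After 3 consecutive 529 overload errors (MAX_529_RETRIES = 3), CC automatically switches to the fallback model (for example Opus → Sonnet). On switching, it clears all pending messages and tool results and shows the user "Switched to {model} due to high demand".

6. Diminishing returns detection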

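The token budget "continue" is not unlimited. After 3 consecutive continuations with a token increment of < 500, the system concludes that continuing will not produce anything substantive and stops the continuation (tokenBudget.ts:60-62).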
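The stopping rule can be sketched as a small predicate. This is an illustration built only from the numbers quoted in this deep dive (continue while usage is below 90% of the target, stop after 3 continuations with fewer than 500 new tokens); the function name, parameters, and the way the two checks are combined are assumptions, not CC's actual code.

```python
COMPLETION_THRESHOLD = 0.9  # keep going only while usage is below 90% of the target
MIN_CONTINUATIONS = 3       # judge diminishing returns only after 3 continuations
MIN_DELTA_TOKENS = 500      # fewer new tokens than this counts as no real progress


def should_continue(budget, used, continuation_count, last_delta):
    """Decide whether to nudge the model to keep working (hypothetical helper)."""
    if used >= budget * COMPLETION_THRESHOLD:
        return False  # close enough to the token target
    if continuation_count >= MIN_CONTINUATIONS and last_delta < MIN_DELTA_TOKENS:
        return False  # diminishing returns: more nudging adds little output
    return True
```

For example, `should_continue(10000, 5000, 3, 100)` returns False, because three continuations have already happened and the last one added only 100 tokens, while `should_continue(10000, 5000, 3, 800)` returns True.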