Harness Engineering — Case Studies & Frameworks

Summary

Real-world harness engineering implementations and formal frameworks. Covers the OpenAI, Anthropic, and LangChain case studies, the CAR decomposition model, the HarnessCard reporting standard, sandbox architecture, and Meta-Harness automation. Companion to harness-engineering, which covers core principles.

OpenAI’s Codex Case Study

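OpenAI used Codex to build a production system of 1 million lines of code in 5 months, with zero lines of hand-written code:

3 engineers, 1,500+ PRs, an average of 3.5 PRs per engineer per day, roughly 10x manual throughput
Three pillars: Context Engineering, Architectural Constraint Enforcement, Custom Linters & Feedback
Enforced dependency layers: Types → Config → Repo → Service → Runtime → UI
Linter error messages are written for the agent: they inject fix instructions into the agent's context
Periodic “garbage collection”: scan for architectural drift and have agents suggest fixes
Without a good harness → “AI slop”: code that is syntactically correct but violates architectural invariants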
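
A minimal sketch of an agent-facing layer linter, to make the "enforced dependency layers" and "error messages written for the agent" points concrete. It is not OpenAI's tooling. It assumes that a layer may only import from layers to its left in the chain above, and that each layer is a top-level package named after the layer. The function name and message wording are hypothetical.

```python
import ast

# Layer order taken from the note; a lower index means a lower layer.
# Assumption: each layer is a top-level package with this name.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def check_imports(module_layer: str, source: str) -> list[str]:
    """Return agent-facing messages for imports that reach into a higher layer."""
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for target in targets:
            top = target.split(".")[0]
            if top in RANK and RANK[top] > RANK[module_layer]:
                # The message carries a repair instruction, not just a diagnosis,
                # so it can be pasted straight into the agent's context.
                errors.append(
                    f"line {node.lineno}: layer '{module_layer}' must not import '{top}'. "
                    f"FIX: move the shared code into '{module_layer}' or a lower layer, "
                    f"or invert the dependency so that '{top}' imports '{module_layer}'."
                )
    return errors
```

Calling `check_imports("repo", "import ui.widgets")` yields one message that names the violation and the suggested repair, while imports of lower layers pass silently.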

Anthropic’s Multi-Agent Harness Design

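Anthropic's practice: use a GAN-style Generator-Evaluator separation to handle long-running tasks:

Core problems: self-evaluation bias (the agent over-praises its own work) and context anxiety
Three agents: Planner → Generator → Evaluator (evaluates with Playwright functional tests)
GAN-style feedback loop: the evaluator is biased toward skepticism and is tuned through iterative prompt refinement
Cost-quality tradeoff: Retro Game Maker, $9 / 20 min (broken functionality) vs. $200 / 6 h (polished and complete); about 20x the cost, but a qualitative leap
Key insight: as model capability improves, the harness can be simplified (sprint decomposition was removed with Opus 4.6)
Wording of the criteria affects output: the phrase “museum quality” directly changed the design direction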
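
The generator-evaluator separation can be sketched as a small control loop. This is an illustration, not Anthropic's implementation: `generate` and `evaluate` are hypothetical callables standing in for model calls, and the `max_rounds` budget is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what the skeptical evaluator wants changed


@dataclass
class LoopResult:
    artifact: str
    rounds: int
    passed: bool


def run_loop(
    plan: str,
    generate: Callable[[str, str], str],   # (plan, evaluator feedback) -> artifact
    evaluate: Callable[[str], Verdict],    # separate agent, never the generator itself
    max_rounds: int = 5,
) -> LoopResult:
    """Iterate generate -> evaluate until the evaluator passes or the budget runs out."""
    feedback = ""
    artifact = ""
    for round_no in range(1, max_rounds + 1):
        artifact = generate(plan, feedback)
        verdict = evaluate(artifact)
        if verdict.passed:
            return LoopResult(artifact, round_no, True)
        feedback = verdict.feedback  # fed back into the next generation round
    return LoopResult(artifact, max_rounds, False)
```

The point of the structure is that the pass/fail decision never comes from the agent that produced the artifact, which is what counters self-evaluation bias.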

Anthropic’s Session Continuity Harness (Justin Young)

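Addresses continuity across context windows: in each new session the agent does not remember what it did before:

Two failure modes: Over-ambition (attempting too much at once and exhausting the context midway) and Premature completion (seeing some progress and declaring the work done)
Two-phase approach: Initializer Agent (sets up the environment, the feature list, and init.sh) → Coding Agent (advances incrementally)
Feature list as contract: a JSON file enumerates every feature; the agent may change only the passes field and cannot delete requirements
One feature at a time: prevents context exhaustion, and git commits support rollback
Browser automation testing: end-to-end verification finds more bugs than unit tests do
Core analogy: essentially the handoff mechanism of human teams, carried over to the agent world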
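
The "feature list as contract" rule can be enforced mechanically. The sketch below assumes a hypothetical schema, a JSON array of objects with `id`, `description`, and a boolean `passes`; the note confirms only the `passes` field, so the other keys are illustrative.

```python
import json


def validate_update(before_json: str, after_json: str) -> list[str]:
    """Compare two versions of the feature list; return contract violations."""
    before = json.loads(before_json)
    after = json.loads(after_json)
    if len(before) != len(after):
        return [f"feature count changed from {len(before)} to {len(after)}; "
                "requirements may not be added or deleted"]
    problems = []
    for old, new in zip(before, after):
        # Everything except 'passes' must be byte-for-byte the same content.
        old_rest = {k: v for k, v in old.items() if k != "passes"}
        new_rest = {k: v for k, v in new.items() if k != "passes"}
        if old_rest != new_rest:
            problems.append(f"feature {old.get('id')!r}: only 'passes' may change")
        elif not isinstance(new.get("passes"), bool):
            problems.append(f"feature {old.get('id')!r}: 'passes' must be true or false")
    return problems
```

A harness could run this check before accepting each commit from the coding agent and reject the commit when the returned list is non-empty.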

LangChain Terminal Bench 2.0 Case Study

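The same model (GPT-5.2-Codex) rose from Top 30 to Top 5 through harness changes alone. Four improvements:

Forced self-verification: middleware intercepts the agent's exit, so it must check its own work first
Environment pre-scan: environment information is injected at startup, reducing exploration time
Loop detection: tracks file edits across iterations
Reasoning sandwich (high → medium → high): balances quality and latency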
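
Of the four improvements, loop detection is the easiest to sketch. The note says only that file edits are tracked across iterations; the threshold rule and the class name below are assumptions, not LangChain's implementation.

```python
from collections import Counter


class EditLoopDetector:
    """Flags files the agent keeps re-editing, a common sign of a stuck loop."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold   # assumed cutoff; tune per task
        self.edit_counts = Counter()

    def record_iteration(self, edited_files: list[str]) -> list[str]:
        """Record one iteration's edits; return files edited in >= threshold iterations."""
        # set() so several edits to one file within an iteration count once.
        self.edit_counts.update(set(edited_files))
        return sorted(f for f, n in self.edit_counts.items() if n >= self.threshold)
```

When the returned list is non-empty, the harness could inject a note into the context asking the agent to step back and reconsider its approach to those files.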

CAR Framework (He et al., 2026)

Harness Engineering for Language Agents (He et al., Alibaba-NTU, April 2026) proposes the CAR decomposition as a formal framework:

H = ⟨C, A, R⟩ — Control, Agency, Runtime

Control Layer (C)

Durable artifacts that shape behavior before action:

Repository maps, AGENTS.md, tool descriptions
System instructions, architecture rules
Tests, linters, permission policies, success criteria
Key insight: “Reliable agents are rarely bounded by prompt wording; they are often bounded by specifications”

Agency Layer (A)

How the model is allowed to act:

Action substrates (code execution, browser interaction)
Planner-verifier or orchestrator-worker structures
Reviewer roles and action space interfaces
Definition: “The mediated action surface and delegation structure that the harness permits”

Runtime Layer (R)

What happens as work unfolds over time:

Context assembly, memory and compaction
Checkpointing, retries, backtracking
Approval flows, budgets, trace collection, replay support
Key insight: “Many agent failures are runtime failures: stale state, brittle retry loops, overgrown context”

Mini-Cases from Paper

Repository Coding Agent: Two systems share the same frontier model but differ:

Control: Repository map, AGENTS.md, required tests, linter
Agency: Shell access, file-edit surface
Runtime: Progress file, retries, escalation logic

Browser/Research Agent: Same browsing model, different harness:

Control: Source hierarchy, citation rules, note-taking format
Agency: Search, browser, delegation surface
Runtime: Scratchpads, branching traces, recovery
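
One way to make H = ⟨C, A, R⟩ concrete is to treat a harness as a record with three component sets and compare two systems layer by layer. This is a sketch: the `Harness` type, the `harness_diff` helper, and the stripped-down `bare` baseline are illustrative assumptions, while the component names in `full` come from the repository coding mini-case above.

```python
from dataclasses import dataclass

LAYERS = ("control", "agency", "runtime")


@dataclass(frozen=True)
class Harness:
    model: str
    control: frozenset[str]   # C: durable artifacts that shape behavior before action
    agency: frozenset[str]    # A: how the model is allowed to act
    runtime: frozenset[str]   # R: what happens as work unfolds over time


def harness_diff(a: Harness, b: Harness) -> dict[str, frozenset[str]]:
    """For each CAR layer, the components present in exactly one of the two harnesses."""
    return {layer: getattr(a, layer) ^ getattr(b, layer) for layer in LAYERS}


# Hypothetical baseline: same model, almost no harness.
bare = Harness("frontier-model", frozenset(), frozenset({"shell access"}), frozenset())
full = Harness(
    "frontier-model",
    control=frozenset({"repository map", "AGENTS.md", "required tests", "linter"}),
    agency=frozenset({"shell access", "file-edit surface"}),
    runtime=frozenset({"progress file", "retries", "escalation logic"}),
)
```

Since `bare.model == full.model`, any performance gap between the two systems is attributable to what `harness_diff(bare, full)` reports, not to the model.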

HarnessCard Reporting Artifact

Also from He et al. (2026), HarnessCard is proposed as a lightweight reporting standard:

Field	Priority	What to Disclose
Base model(s)	Required	Model name, version, decoding settings, finetuning
Control artifacts	Required	System instructions, AGENTS.md, repo maps, rules
Runtime policy	Required	Memory type, compaction, checkpointing, retries
Action substrate	Required	Tools, APIs, browser, code execution, MCP
Execution topology	Required	Single vs multi-agent, planner/verifier roles
Feedback stack	Required	Tests, graders, reflection prompts, human review
Governance layer	Required	Permissions, sandboxing, escalation rules
Observability	Required	Traces, replay, latency/cost logs
Evaluation protocol	Required	Task set, runs, success criteria
Release artifacts	Recommended	Prompts, tool specs, traces, configs
Known limitations	Recommended	Failure modes, safety concerns

Purpose: Make agent claims comparable, auditable, and reproducible — like Model Cards but for the harness layer.
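
A HarnessCard lends itself to a simple completeness check. The sketch below encodes the fields from the table as snake_case keys. The key names and the `missing_fields` helper are assumptions for illustration, since no machine-readable schema is recorded in this note.

```python
# Field names follow the table above; the snake_case keys are hypothetical.
REQUIRED = [
    "base_models", "control_artifacts", "runtime_policy", "action_substrate",
    "execution_topology", "feedback_stack", "governance_layer",
    "observability", "evaluation_protocol",
]
RECOMMENDED = ["release_artifacts", "known_limitations"]


def missing_fields(card: dict[str, str]) -> dict[str, list[str]]:
    """Report which Required and Recommended fields are absent or blank."""
    def absent(names: list[str]) -> list[str]:
        return [name for name in names if not card.get(name, "").strip()]

    return {"required": absent(REQUIRED), "recommended": absent(RECOMMENDED)}
```

A report whose `required` list is non-empty would not yet be comparable with other agent results under this standard.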

Meta-Harness: Automated Harness Optimization

Meta-Harness (Lee et al. 2026) shifts harness engineering from manual craft to automated search:

Agent-as-optimizer: A coding agent iteratively proposes, evaluates, and refines harness code via filesystem access to source, traces, and metrics
Raw execution traces are critical: Full trace access (50%) dramatically outperforms scores-only (34.6%) — the proposer needs to see why things failed
Results: +7.7 points over SOTA on text classification with 75% fewer tokens; discovered harnesses transfer across unseen models
Implication: Manual harness engineering becomes the seed for automated search rather than the final product
See meta-harness for full details
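
The propose-evaluate-refine loop can be illustrated with a toy search. This is not the Meta-Harness algorithm from Lee et al.; it only shows the structural point that the proposer receives the full history, traces included, not bare scores. `propose` and `evaluate` are hypothetical callables.

```python
from typing import Any, Callable


def search(
    propose: Callable[[list[dict[str, Any]]], Any],   # sees all prior harnesses, scores, traces
    evaluate: Callable[[Any], tuple[float, str]],     # returns (score, raw execution trace)
    rounds: int,
) -> dict[str, Any]:
    """Run a fixed number of propose/evaluate rounds and return the best entry."""
    history: list[dict[str, Any]] = []
    for _ in range(rounds):
        candidate = propose(history)
        score, trace = evaluate(candidate)
        # The trace is kept next to the score so the next proposal can see
        # why a candidate failed, not just that it failed.
        history.append({"harness": candidate, "score": score, "trace": trace})
    return max(history, key=lambda entry: entry["score"])
```

In a real setup `propose` would be a coding agent reading source, traces, and metrics from the filesystem; here any function with the same shape will do.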

Sandbox: The Server of the Harness Era

The sandbox is the harness's execution environment, analogous to the role a server plays for a traditional application. See sandbox for details.

Sandboxes vs. Servers

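Applications run on servers: they use compute resources, write to the file system, and keep state across restarts
The harness and the sandbox have the same relationship: the harness handles reasoning and tool calls, and the sandbox provides an isolated execution environment
The two are independently replaceable: swap either one and the system still works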

The State Question

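What makes a harness restartable? Two things:

Trajectory: the complete record of what questions were asked, what tool calls were made, and what decisions were taken. This is the most valuable artifact a harness produces.
Local Data: everything the harness creates in the sandbox file system.

Together: Anthropic calls this “decoupling the brain from the hands”, where the brain is the harness and the hands are the sandbox. The trajectory plus a persisted sandbox file system forms the durable artifact, and the trajectory is the most important part.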
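
A minimal sketch of a restartable trajectory, assuming an append-only JSON Lines file as the storage format. The `Trajectory` class and its record layout are hypothetical and do not describe any particular vendor's session log.

```python
import json
from pathlib import Path


class Trajectory:
    """Append-only log of questions, tool calls, and decisions."""

    def __init__(self, path: Path):
        self.path = path

    def append(self, kind: str, payload: dict) -> None:
        # One JSON object per line, so a crash can lose at most the last record.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"kind": kind, "payload": payload}) + "\n")

    def replay(self) -> list[dict]:
        """Read every recorded step back, e.g. to rebuild state after a restart."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

A restarted harness would construct a new `Trajectory` over the same path, call `replay()`, and continue from the last recorded step, while the sandbox file system supplies the local data.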

Who Controls the State

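Version one: labs operate managed harnesses, and trajectories live in the labs' session logs
Version two: enterprises that do not allow that kind of connectivity run the sandbox in the customer's cloud, so state stays in customer infrastructure (Daytona, E2B, Stripe Minions)
Key question: who controls the trajectory? The reasoning history is what makes a harness's work reproducible, auditable, and improvable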

The Future: Swarm

Teams coordinated across multiple sandboxes, each with its own trajectory and state
Individual harnesses fail and restart. Sandboxes die and get reprovisioned. The swarm keeps working
Early examples: Stripe Minions, Browserbase

Visibility Gap

He et al. audit 63 harness-relevant works and find:

Academic papers often leave harness as “hidden implementation residue”
Public engineering notes (Anthropic, OpenAI) describe harness innovations not yet in papers
Many reported “agent gains” may be harness-sensitive rather than purely model-driven

Engineering Evolution (from CAR perspective)

Software Engineering → Prompt Engineering → Context Engineering → Harness Engineering

Prompt Engineering: Wording of instructions (Control subset)
Context Engineering: What information is provided (Control + Runtime subset)
Harness Engineering: Full H = ⟨C, A, R⟩ layer

Evidence Timeline

2026-04-07: OpenAI Codex case study — 1M lines, zero handwritten code, 3 engineers in 5 months
2026-04-07: Anthropic’s multi-agent harness design — GAN-inspired generator-evaluator separation
2026-04-07: Justin Young’s session continuity harness — Initializer/Coding agent pattern
2026-04-07: Meta-Harness — automated harness search outperforming manual engineering
2026-04-07: LangChain Terminal Bench 2.0 case study
2026-04-23: CAR Framework (Control, Agency, Runtime) from He et al. (Alibaba-NTU)
2026-04-23: HarnessCard reporting artifact
2026-04-23: Visibility Gap finding — audit of 63 works
2026-06-08: Split from harness-engineering — case studies and frameworks consolidated here
