HarnessCard

Summary

HarnessCard is a lightweight reporting artifact proposed by He et al. (2026) for disclosing the harness configuration of language-agent systems. Just as Model Cards document the properties of a model, a HarnessCard documents the apparatus around the model that makes an agent claim interpretable, comparable, and reproducible.

Key Points

Purpose

Make agent claims comparable across different systems
Enable auditable harness configurations
Support reproducible agent evaluations
Reveal the harness-sensitive component of reported gains

Required Fields

Field	What to Disclose
Base model(s)	Model name, version, decoding settings, finetuning or adapters
Control artifacts	System instructions, AGENTS.md, repo maps, architecture rules, schemas, tests, linters, done-when criteria
Runtime policy	Memory type, compaction/summarization policy, checkpointing, retry/rollback policy, budget limits
Action substrate	Tools, APIs, browser/GUI access, code execution, interface schemas, MCP usage
Execution topology	Single vs multi-agent structure, planner/verifier roles, reviewer loops, routing logic
Feedback stack	Tests, graders, reflection prompts, hidden checks, human interventions, repair loops
Governance layer	Permissions, sandboxing, escalation rules, policy checks, provenance logging, audit support
Observability	Stored traces, replay support, latency and cost logging, failure categories
Evaluation protocol	Task set, number of runs, success criteria, variance treatment, held-out checks, budget limits

Recommended Fields

Field	What to Disclose
Release artifacts	Prompts or programs, tool specs, traces, configs, environment setup, reproducibility notes
Known limitations	Unresolved failure modes, portability caveats, safety concerns, red-team findings
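
The required and recommended fields above can be read as a simple record schema. The following is a minimal sketch in Python, not a format defined by He et al.: the class name, the snake_case field names, and the `missing_required` helper are all hypothetical, chosen only to mirror the table rows. It shows how a card could be checked for undisclosed required fields before a result is reported.

```python
from dataclasses import dataclass, fields


@dataclass
class HarnessCard:
    # Required fields (hypothetical names mirroring the table rows above).
    base_models: str = ""
    control_artifacts: str = ""
    runtime_policy: str = ""
    action_substrate: str = ""
    execution_topology: str = ""
    feedback_stack: str = ""
    governance_layer: str = ""
    observability: str = ""
    evaluation_protocol: str = ""
    # Recommended fields: may be left blank without failing the check.
    release_artifacts: str = ""
    known_limitations: str = ""


RECOMMENDED = {"release_artifacts", "known_limitations"}


def missing_required(card: HarnessCard) -> list[str]:
    """Return the names of required fields left blank, in declaration order."""
    return [
        f.name
        for f in fields(card)
        if f.name not in RECOMMENDED and not getattr(card, f.name).strip()
    ]


# Illustrative, made-up disclosure with only two required fields filled in.
card = HarnessCard(
    base_models="example-model v1, temperature 0",
    evaluation_protocol="3 runs per task; pass@1 and pass@3",
)
print(missing_required(card))
```

Free-text strings are used here because the table describes what to disclose rather than a fixed vocabulary; a stricter schema would be a further assumption.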

Example: Repository Coding Agent

Field	Illustrative Disclosure
Base model(s)	Frontier coding model configured through repo or user profiles; effort tuned for long tasks
Control artifacts	Root-level AGENTS.md; repository map; build/test/lint commands; architecture rules; done-when criteria
Runtime policy	Repository treated as system of record; thread history; progress file; compaction near context limits; bounded retries
Action substrate	File edits, shell commands, test runs, diff generation, PR review, optional MCP tools
Execution topology	Plan → edit → run tools → observe → repair → update status → repeat; optional reviewer loop
Feedback stack	Failing tests, custom linter messages, self-review, grader checks, occasional human review
Governance layer	Sandbox mode, approval policy for privileged actions, least-privilege connectors, audit trail
Observability	Persisted thread events, replay support, latency and cost logs, categorized failures
Evaluation protocol	Standard coding benchmark; 3 runs per task; pass@1 and pass@3; variance reported
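
The evaluation protocol row above reports pass@1 and pass@3 from 3 runs per task. As a hedged illustration of how such numbers might be derived, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n runs with c successes. The source does not say which estimator the example card uses, so that choice is an assumption, and the task names and outcomes are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n runs, c of them successful."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(runs: dict[str, list[bool]], k: int) -> float:
    """Average the per-task pass@k estimates over all tasks."""
    scores = [pass_at_k(len(r), sum(r), k) for r in runs.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: task id -> success flag for each of 3 runs.
runs = {
    "task-a": [True, True, False],
    "task-b": [False, False, False],
    "task-c": [False, True, False],
}

p1 = benchmark_pass_at_k(runs, 1)
p3 = benchmark_pass_at_k(runs, 3)
print(f"pass@1={p1:.3f} pass@3={p3:.3f}")
```

With n = 3 and k = 3, pass@3 reduces to the fraction of tasks solved in at least one run, which is why the number of runs belongs in the card alongside the metric.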

Why It Matters

The Visibility Gap

He et al. audit 63 harness-relevant works and find:

Academic papers often treat the harness as “hidden implementation residue”
Public engineering notes (Anthropic, OpenAI) describe harness innovations that have not yet appeared in papers
Without a HarnessCard, it is unclear whether reported gains come from the model or from the harness

Comparison with Model Cards

Aspect	Model Card	HarnessCard
Focus	Model properties	Agent system apparatus
Disclosure	Training data, biases, capabilities	Control, agency, runtime configuration
Goal	Transparency about model	Reproducibility of agent claims
Analogy	“What is this model?”	“How does this agent work?”

Relationship to Other Concepts

CAR Framework: HarnessCard documents each layer of ⟨C, A, R⟩ (control, agency, runtime)
Harness Engineering: HarnessCard is the reporting artifact of harness engineering practice
Agent Evaluation: HarnessCard makes evaluation claims interpretable and comparable

Open Questions

Which fields should be truly required, and which only recommended?
How should proprietary harnesses that cannot be fully disclosed be handled?
Should HarnessCard become a community standard or remain a proposal?
How does HarnessCard interact with evaluation frameworks (HAL, ATBench, VeRO)?

Evidence Timeline

2026-04-23: Created from He et al. “Harness Engineering for Language Agents” (Alibaba-NTU, Preprints.org)
2026-04-23: HarnessCard proposed as lightweight reporting artifact with 9 required + 2 recommended fields
2026-04-23: Purpose: make agent claims comparable, auditable, reproducible
