The harness is the application. The sandbox is the server.
Sandboxes vs. Servers
An application runs on a server: it uses the compute, writes to the filesystem, maintains state across restarts. The harness and sandbox have the same relationship. The harness reasons and calls tools. The sandbox provides the isolated execution environment. Swap either one independently and the system still works.
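The "swap either one independently" claim can be illustrated with a minimal sketch. Everything here is hypothetical: the `Sandbox` interface, the two echo sandboxes, and the `Harness` class are stand-ins invented for illustration, not any provider's actual API. The point is only that the harness depends on an interface, not on a specific sandbox.

```python
from typing import Protocol


class Sandbox(Protocol):
    """Hypothetical interface: anything that can run a command in isolation."""

    def run(self, command: str) -> str: ...


class LocalEchoSandbox:
    """Stand-in sandbox that only echoes; a real one would execute in isolation."""

    def run(self, command: str) -> str:
        return f"local:{command}"


class RemoteEchoSandbox:
    """A second stand-in, representing a different provider."""

    def run(self, command: str) -> str:
        return f"remote:{command}"


class Harness:
    """Decides which tool calls to make; knows nothing about sandbox internals."""

    def __init__(self, sandbox: Sandbox) -> None:
        self.sandbox = sandbox

    def step(self, command: str) -> str:
        # The harness only ever talks to the Sandbox interface.
        return self.sandbox.run(command)


def run_plan(sandbox: Sandbox, plan: list[str]) -> list[str]:
    """Run the same plan against whichever sandbox is supplied."""
    harness = Harness(sandbox)
    return [harness.step(command) for command in plan]
```

Calling `run_plan` with `LocalEchoSandbox()` or `RemoteEchoSandbox()` runs the same harness logic against a different execution environment, which is the independence the analogy describes.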
Where it started: evals
Sandboxes began as disposable test environments. An entire cottage industry of eval companies sells benchmarks to labs like Anthropic and OpenAI, and the first step in proving an eval works is proving a coding agent struggles against it. Isolation matters: coding agents can use the internet to cheat. The sandbox provides the air gap.
Sandbox for Evals: An eval, a test environment, and the execution tools to run it. That is the basic sandbox pattern. Universities, startups, and teams benchmarking their own agents scale out hundreds or thousands of runs across providers like Daytona and E2B. Disposable, ephemeral, spun up and thrown away.
Where it’s heading: long-running harnesses
The more interesting use case is harnesses that run for hours. Fixing software, writing code across a large codebase, running analytical reports, operating parts of your business continuously. Most tasks worth doing are not one-and-done. They take time, analysis, and iteration.
Running a harness continuously is hard. Anthropic’s recent post on managed agents highlights the core problems: sandboxes die mid-task, state needs to survive failures, the execution environment needs to be robust enough for hours of work. The harness stops being disposable and starts being a job you depend on.
The state question
What makes a harness restartable? Two things:
- The trajectory: the full record of what questions were asked, what tool calls were made, what decisions were reached. Claude Code and Cursor can already restart from a trajectory alone. It is the most valuable artifact the harness produces.
- The local data: everything the harness created in the sandbox filesystem. This includes downloaded files, new skills written, generated analysis, and modified code. The state the trajectory operated on will sometimes reside there as well.
Together, these two pieces are everything needed to resume from any point. Anthropic frames this as “decoupling the brain from the hands.” The brain is the harness. The hands are the sandbox. The session log lives outside both. The sandbox becomes cattle. We believe the combination of the trajectory and a persisted sandbox filesystem is the durable artifact, with the trajectory being the more important of the two.
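As a rough sketch of how the two pieces of state might be kept, assume the trajectory is an append-only JSON Lines file stored outside the sandbox and the local data is a directory that can be copied to a snapshot. The function names, the event fields, and the file layout below are all assumptions made for illustration, not how any particular harness or provider does it.

```python
import json
import shutil
import tempfile
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one trajectory event as a JSON line to a log outside the sandbox."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def load_trajectory(log_path: Path) -> list[dict]:
    """Read the full trajectory back; an absent log means a fresh start."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def snapshot_workspace(workspace: Path, snapshot: Path) -> None:
    """Copy the sandbox's working directory to durable storage."""
    if snapshot.exists():
        shutil.rmtree(snapshot)
    shutil.copytree(workspace, snapshot)


def restore_workspace(snapshot: Path, workspace: Path) -> None:
    """Rebuild a fresh sandbox's working directory from the snapshot."""
    shutil.copytree(snapshot, workspace, dirs_exist_ok=True)


def demo() -> tuple[int, str]:
    """Simulate a sandbox dying and a new one resuming from saved state."""
    root = Path(tempfile.mkdtemp())
    log = root / "trajectory.jsonl"   # lives outside both sandboxes
    snapshot = root / "snapshot"

    sandbox_a = root / "sandbox_a"
    sandbox_a.mkdir()
    (sandbox_a / "report.txt").write_text("draft 1", encoding="utf-8")
    append_event(log, {"step": 1, "tool": "write_file", "path": "report.txt"})
    snapshot_workspace(sandbox_a, snapshot)

    shutil.rmtree(sandbox_a)          # the first sandbox is gone

    sandbox_b = root / "sandbox_b"    # a replacement sandbox
    restore_workspace(snapshot, sandbox_b)
    trajectory = load_trajectory(log)
    restored = (sandbox_b / "report.txt").read_text(encoding="utf-8")
    return len(trajectory), restored
```

In this sketch the replacement sandbox ends up with both the one recorded event and the restored file, so a harness could pick up at step 2 without the original sandbox ever coming back.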
Who controls the state
If trajectory data is the valuable asset, sandbox providers are not just selling compute. They are selling the environment in which this data is generated. Many providers today already save the trajectory alongside the persisted filesystem.
There is a version of this market where the labs run managed harnesses end to end. Your trajectories live in their session logs. The labs control the state.
There is another version where enterprises do not allow that connectivity, where giving managed harnesses access to production infrastructure is a non-starter. In this world, sandboxes run inside the customer’s cloud, using providers like Daytona and E2B or home-built options like Stripe Minions, and the state stays in the customer’s infrastructure.
Regardless of where the sandbox runs, the question is who owns the trajectory. The reasoning history is what makes a harness’s work reproducible, auditable, and improvable. Whoever controls that data, wherever it lives, will have significant leverage in the future AI market.
The future is swarms
The future of the sandbox market is swarms of long-running harnesses. Coordinated teams working across many sandboxes, each with its own trajectory and state. Individual harnesses fail and restart. Sandboxes die and get reprovisioned. The swarm keeps working.
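A toy supervisor loop gives a feel for "individual harnesses fail and restart." This is a simplified sketch under stated assumptions: the `Worker` class, the `SandboxDied` exception, and the `fail_at` failure injection are all hypothetical, and the workers run one after another here, whereas a real swarm would run them concurrently across separate sandboxes.

```python
from dataclasses import dataclass, field


class SandboxDied(Exception):
    """Raised by the simulated sandbox when it fails mid-task."""


@dataclass
class Worker:
    name: str
    steps: list[str]
    fail_at: set[int] = field(default_factory=set)  # step indexes that fail once
    trajectory: list[str] = field(default_factory=list)
    restarts: int = 0

    def run(self) -> None:
        # Resume from wherever the trajectory left off, not from step 0.
        for index in range(len(self.trajectory), len(self.steps)):
            if index in self.fail_at:
                self.fail_at.discard(index)  # the failure happens only once
                raise SandboxDied(f"{self.name} lost its sandbox at step {index}")
            self.trajectory.append(self.steps[index])


def run_swarm(workers: list[Worker], max_restarts: int = 3) -> dict[str, list[str]]:
    """Run every worker to completion, restarting any whose sandbox dies."""
    for worker in workers:
        while True:
            try:
                worker.run()
                break
            except SandboxDied:
                worker.restarts += 1  # stands in for reprovisioning a sandbox
                if worker.restarts > max_restarts:
                    raise
    return {worker.name: worker.trajectory for worker in workers}
```

Because each worker resumes from the length of its own trajectory, a restart repeats no completed steps, which is why the trajectory has to survive the sandbox.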
We’ve seen the future of swarms in long-running agents with: