The lifecycle of a run
Each run moves through these statuses:
- creating_sandbox — a fresh sandbox is provisioned and initialized with the task’s files, commands, and secrets
- running — the agent harness executes the instruction (runs can take up to 70 minutes)
- evaluating — each goal’s criteria are checked against the sandbox and transcript, producing a score and pass/fail verdict
- analyzing — signals are extracted and the transcript is analyzed for friction points
- completed — results are final; the sandbox is archived and destroyed
A run is marked failed after retries are exhausted.
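The status progression above can be sketched as a small state model. This is only an illustration: the status strings come from the list above, but the `RunStatus`, `is_terminal`, and `next_status` names are hypothetical, and treating completed and failed as the only terminal statuses is an assumption.

```python
from enum import Enum
from typing import Optional


class RunStatus(str, Enum):
    """Run statuses as named in the lifecycle list."""
    CREATING_SANDBOX = "creating_sandbox"
    RUNNING = "running"
    EVALUATING = "evaluating"
    ANALYZING = "analyzing"
    COMPLETED = "completed"
    FAILED = "failed"


# Happy-path order, taken from the lifecycle list.
PIPELINE = [
    RunStatus.CREATING_SANDBOX,
    RunStatus.RUNNING,
    RunStatus.EVALUATING,
    RunStatus.ANALYZING,
    RunStatus.COMPLETED,
]

# Assumption: a run never leaves these two statuses.
TERMINAL = {RunStatus.COMPLETED, RunStatus.FAILED}


def is_terminal(status: str) -> bool:
    """Return True once a run can no longer change status."""
    return RunStatus(status) in TERMINAL


def next_status(status: str) -> Optional[RunStatus]:
    """Return the next happy-path status, or None for a terminal status."""
    current = RunStatus(status)
    if current in TERMINAL:
        return None
    return PIPELINE[PIPELINE.index(current) + 1]
```

A polling script could call `is_terminal` on each status it sees and stop waiting once it returns True.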
What each run records
- Score and verdict — overall 0–100 score and whether the run passed its goals
- Transcript — the unified timeline of agent messages, thinking, tool calls, and results
- Goal results — per-goal, per-criterion scores with evaluation details
- Sandbox archive — a snapshot of the agent’s home directory after the run, downloadable for inspection
- Usage — tokens, estimated cost, and duration
- Snapshots — the exact task definition and environment config used, frozen at run time
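To make the shape of a run record concrete, here is a minimal sketch of how a script might model part of it. Every class and field name below is hypothetical and chosen for illustration; the real record layout may differ, and the transcript, sandbox archive, and snapshots are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CriterionResult:
    name: str
    score: float
    details: str = ""


@dataclass(frozen=True)
class GoalResult:
    name: str
    passed: bool
    criteria: Tuple[CriterionResult, ...] = ()


@dataclass(frozen=True)
class Usage:
    tokens: int
    estimated_cost: float
    duration_seconds: float


@dataclass(frozen=True)
class RunRecord:
    score: float   # overall score, 0 to 100
    passed: bool   # whether the run passed its goals
    goal_results: Tuple[GoalResult, ...]
    usage: Usage


def failed_goals(record: RunRecord) -> List[str]:
    """Return the names of goals that did not pass, in recorded order."""
    return [goal.name for goal in record.goal_results if not goal.passed]
```

The dataclasses are frozen to mirror the idea that a finished run's results are final.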
Iterations
Runs inside an experiment are grouped into numbered iterations. Iterations are immutable: the signal config and aggregated results are frozen when the iteration runs, so you can compare iteration 5 to iteration 1 and trust that each reflects its moment in time. See Experiments for how iterations are triggered and how their results are generated.
Inspecting runs from the CLI
Add --format json to any of these commands to feed results into scripts or CI.
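As one way to use JSON output in CI, the sketch below turns a run's JSON into a process exit code. The JSON shape is an assumption: the keys `status`, `passed`, and `score` are hypothetical stand-ins, so check them against the actual output before relying on this. The `gate` function and its `min_score` threshold are likewise illustrative.

```python
import json


def gate(run_json: str, min_score: float = 80.0) -> int:
    """Return 0 if the run is acceptable for CI, 1 otherwise.

    Assumes (hypothetically) that the JSON object has "status",
    "passed", and "score" keys.
    """
    run = json.loads(run_json)
    if run.get("status") != "completed":
        return 1  # run failed or has not finished
    if not run.get("passed", False):
        return 1  # run did not pass its goals
    return 0 if run.get("score", 0) >= min_score else 1
```

In a pipeline, a wrapper script could pipe the JSON into it and call `sys.exit(gate(sys.stdin.read()))`, so the CI step fails whenever the run does.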