How it works
- Define a task — what the agent should accomplish with your product (e.g. “install the SDK and send a first event”), plus the goals that define success
- Define environments — which agent harness and model attempt the task, and what sandbox it runs in
- Run an experiment — execute every task across every environment in parallel, in fresh sandboxes
- Review the results — per-run scores, full transcripts, extracted signals, and failure clusters with root causes and recommended fixes
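The "every task across every environment" step amounts to a cross product of tasks and environments, with one isolated attempt per pair. The following is a minimal sketch of that idea in Python, not the product's implementation. The task and environment names are hypothetical, `run_once` is a stand-in for a real agent attempt, and a temporary directory stands in for a fresh sandbox.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import tempfile

# Hypothetical task and environment names, for illustration only.
TASKS = ["install-sdk", "send-first-event"]
ENVIRONMENTS = ["harness-a/model-x", "harness-b/model-y"]


def run_once(task: str, env: str) -> dict:
    """Stand-in for one agent attempt; a real run would launch the agent."""
    # Each attempt gets its own empty directory, mimicking a fresh sandbox.
    with tempfile.TemporaryDirectory(prefix="sandbox-") as sandbox:
        return {"task": task, "env": env, "sandbox": sandbox}


def run_experiment(tasks, envs) -> list:
    """Run every (task, environment) pair concurrently."""
    pairs = list(product(tasks, envs))
    # max(1, ...) keeps the pool valid even when there are no pairs.
    with ThreadPoolExecutor(max_workers=max(1, len(pairs))) as pool:
        return list(pool.map(lambda pair: run_once(*pair), pairs))


results = run_experiment(TASKS, ENVIRONMENTS)
print(len(results))  # one result per (task, environment) pair
```

With two tasks and two environments this produces four independent attempts, and adding an environment adds one attempt per task.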
How the pieces fit together
| Concept | Command group | Common commands |
|---|---|---|
| Tasks | tpc sim task | create --file task.json, list, get, update, run |
| Environments | tpc sim env | create, list, update, task attach/detach, secret set |
| Runs & iterations | tpc sim run | list, get, logs, actions |
| Experiments | tpc sim experiment | create, run, run status --watch, results |
| Signals | tpc sim experiment | validate-signal-config, signals |
What you get from each run
- A pass/fail verdict and a 0–100 score against your goals
- The complete agent transcript: every message, tool call, and thinking step
- An archive of the sandbox after the run, so you can inspect exactly what the agent built
- Token usage, cost, and duration
- Signal values — custom metrics you define in YAML, extracted from the run automatically
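To make the per-run outputs concrete, here is a hedged sketch of how such records might be modeled and aggregated once exported. The `RunRecord` fields, their names, and the sample values are all assumptions made for illustration. They do not reflect the actual result schema.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical shape of one run's results; field names are assumptions."""
    passed: bool          # pass/fail verdict
    score: int            # 0-100 score against the task's goals
    tokens: int           # token usage
    cost: float           # cost, in whatever unit the platform reports
    duration_s: float     # wall-clock duration in seconds
    signals: dict = field(default_factory=dict)  # custom signal values


def summarize(runs: list) -> dict:
    """Aggregate a non-empty list of runs into headline numbers."""
    if not runs:
        raise ValueError("summarize() needs at least one run")
    n = len(runs)
    return {
        "pass_rate": sum(r.passed for r in runs) / n,
        "mean_score": sum(r.score for r in runs) / n,
        "total_cost": sum(r.cost for r in runs),
    }


# Made-up sample data, only to exercise the functions above.
runs = [
    RunRecord(True, 90, 12000, 0.25, 80.0, {"retries": 0}),
    RunRecord(False, 40, 30000, 0.75, 200.0, {"retries": 3}),
]
summary = summarize(runs)
print(summary)
```

Keeping signals in a free-form dictionary mirrors the fact that they are user-defined, while the fixed fields cover what every run reports.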
Try it from the CLI
Everything below can be driven end-to-end with the tpc CLI under tpc sim.
Pass --format json for scripting. The guides below include the CLI workflow for each abstraction.
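For scripting, one option is a thin Python wrapper around the CLI. The sketch below builds command lines from the command groups in the table above and appends `--format json`. It assumes the flag can be placed at the end of the command and that the CLI then prints a single JSON document to stdout. Neither assumption is confirmed here, so check the per-abstraction guides before relying on it.

```python
import json
import subprocess


def tpc_sim_argv(group: str, *args: str) -> list[str]:
    """Build an argument list such as: tpc sim task list --format json."""
    # Assumption: --format json is accepted after the subcommand arguments.
    return ["tpc", "sim", group, *args, "--format", "json"]


def tpc_sim(group: str, *args: str):
    """Run the command and parse its output.

    Assumption: stdout holds exactly one JSON document.
    """
    proc = subprocess.run(
        tpc_sim_argv(group, *args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(proc.stdout)


# Building the command line needs no installed CLI, so it is safe to print.
print(tpc_sim_argv("task", "list"))
```

Separating the argument builder from the function that executes it makes the command lines easy to log or test without invoking the CLI.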
Where to start
- Tasks — defining what agents should do and how success is measured
- Environments — configuring agents and sandboxes
- Runs & iterations — what happens during an attempt and what it records
- Experiments — running iterations and reading results
- Signals — extracting custom metrics from runs