A task tells the agent what to accomplish and tells the platform how to score the attempt. Both halves live on the task itself, so every environment that runs it is measured the same way.

Anatomy of a task

| Part | Purpose |
| --- | --- |
| Instruction | The prompt the agent receives: what to build, install, or accomplish |
| Goals | Success criteria, each with weighted scoring and a passing threshold |
| Init files | Files placed in the sandbox before the run: zip uploads or git repositories |
| Init commands | Shell commands run during sandbox setup (installing dependencies, seeding data) |
| Secrets | Named credentials exposed to the run as environment variables |

Writing instructions

Write instructions the way a real user would brief an agent — not the way you wish they would. The goal is to measure your product’s agent experience, so avoid embedding hints the average user wouldn’t provide:
  • Good: “Set up error tracking for this Express app using Acme.”
  • Too helpful: “Install @acme/sdk@2.1, then call acme.init() with the DSN from the dashboard.”

Goals and criteria

Each goal contains one or more criteria. A criterion has an evaluation type, a weight, and a max score. Goals are scored by weighted_average, binary, or percentage, against a passing threshold from 0–100. Criteria are evaluated by one of:
| Evaluation | What it checks |
| --- | --- |
| comparison | A run metric against a threshold (duration, tokens, cost) |
| file_exists | A file is present in the sandbox after the run |
| file_content_match | File contents match a pattern |
| json_schema | Output validates against a JSON Schema |
| bash_command | A shell command exits 0 in the post-run sandbox |
| python_script | A custom Python check passes |
| script_judge | Your own verification script (exit 0 = pass) |
| llm_judge | An LLM judges the transcript against a rubric |
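To make the scoring arithmetic concrete, here is a minimal sketch of how a weighted_average goal score could be computed and compared against a passing threshold. The formula is an assumption about what "weighted average" means here (each criterion's score divided by its max score, weighted, scaled to 0-100), and the `CriterionResult` class and both function names are hypothetical; the platform's actual computation may differ.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One evaluated criterion (hypothetical shape, not the platform's schema)."""
    score: float      # points awarded by the evaluator
    max_score: float  # the criterion's max score, assumed to be > 0
    weight: float     # the criterion's weight within its goal


def weighted_average(results: list[CriterionResult]) -> float:
    """Return a 0-100 goal score from weighted score/max_score ratios."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    weighted = sum(r.weight * (r.score / r.max_score) for r in results)
    return 100.0 * weighted / total_weight


def goal_passes(results: list[CriterionResult], passing_threshold: float) -> bool:
    """Assumption: a goal passes when its score is at or above the threshold."""
    return weighted_average(results) >= passing_threshold


results = [
    CriterionResult(score=10, max_score=10, weight=3),  # fully satisfied
    CriterionResult(score=5, max_score=10, weight=1),   # half satisfied
]
print(weighted_average(results))  # 87.5
print(goal_passes(results, 70))   # True
```

Under this reading, a heavily weighted criterion dominates the goal score, so a partial miss on a low-weight criterion can still leave the goal above its threshold.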
Prefer deterministic checks (file_exists, bash_command, script_judge) where possible — they’re cheaper and more reproducible. Reserve llm_judge for genuinely subjective criteria like “did the agent follow the documented approach?”
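As an illustration of a deterministic check, the sketch below tests whether a Node project declares a dependency under the `@acme` scope and maps the result onto the exit-code convention (0 = pass). The helper names, the `@acme/` prefix rule, and the idea that the check runs against a project directory are all assumptions for illustration; how the platform invokes a verification script is not specified here.

```python
import json
import tempfile
from pathlib import Path


def has_acme_dependency(project_dir: Path) -> bool:
    """Return True if package.json declares a dependency under the @acme scope."""
    manifest = project_dir / "package.json"
    if not manifest.is_file():
        return False
    try:
        data = json.loads(manifest.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    names = list(data.get("dependencies") or {}) + list(data.get("devDependencies") or {})
    return any(name.startswith("@acme/") for name in names)


def exit_code(project_dir: Path) -> int:
    """Map the check onto the exit-code convention: 0 means pass, 1 means fail."""
    return 0 if has_acme_dependency(project_dir) else 1


# Local demonstration against a throwaway project directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    print(exit_code(project))  # 1: no package.json yet
    (project / "package.json").write_text(
        json.dumps({"dependencies": {"express": "^4.19.0", "@acme/sdk": "^2.1.0"}}),
        encoding="utf-8",
    )
    print(exit_code(project))  # 0: an @acme dependency is declared
```

Saved as a standalone script, the demonstration block would be replaced by a final `sys.exit(exit_code(Path.cwd()))`, assuming the script runs from the project root.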

Sandbox setup

Init files and commands prepare the sandbox before the agent starts:
  • Zip upload — extract an archive into the sandbox (default target: the agent’s home directory)
  • Git clone — clone a repository at a specific ref; private repos authenticate via a secret reference
  • Init commands — run shell commands in order, each with a working directory and timeout
Use these to start the agent from a realistic state, like a half-finished app that needs your product integrated.
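The init-command behavior described above (ordered execution, each command with its own working directory and timeout) can be modeled locally to rehearse a setup sequence before putting it in a task. This is a rough sketch, not the platform's implementation: the `InitCommand` fields are hypothetical names, and stopping at the first failure or timeout is an assumption.

```python
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class InitCommand:
    """Hypothetical shape for one init command; field names are illustrative."""
    command: str            # shell command line
    cwd: str                # working directory for this command
    timeout_seconds: float  # per-command time limit


def run_init_commands(commands: list[InitCommand]) -> bool:
    """Run commands in order; return False on the first failure or timeout."""
    for step in commands:
        try:
            completed = subprocess.run(
                step.command, shell=True, cwd=step.cwd, timeout=step.timeout_seconds
            )
        except subprocess.TimeoutExpired:
            return False
        if completed.returncode != 0:
            return False  # assumption: later commands are skipped
    return True


# Rehearse a two-step setup in a scratch directory (requires a POSIX shell).
with tempfile.TemporaryDirectory() as workdir:
    ok = run_init_commands([
        InitCommand("echo seeded > data.txt", cwd=workdir, timeout_seconds=10),
        InitCommand("test -f data.txt", cwd=workdir, timeout_seconds=10),
    ])
    print(ok)
```

Rehearsing the sequence this way catches ordering mistakes, such as a seed step that depends on a dependency install that has not run yet.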

Secrets

Secrets are encrypted credentials scoped to your organization (or to a single environment). Reference them by name in a task and they’re injected as environment variables during the run — API keys never appear in the task definition or transcripts. Secrets are managed on environments — see Environments for the CLI workflow.
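Because secrets arrive as environment variables, any code that runs inside the sandbox can read them by name. A minimal sketch follows, assuming a secret named `ACME_API_KEY` is referenced by the task; the `require_secret` helper is hypothetical and simply fails early with a message that never includes the secret's value.

```python
import os


def require_secret(name: str) -> str:
    """Return the named secret from the environment, or raise if it is missing."""
    value = os.environ.get(name)
    if not value:
        # Report only the name, never the value, so nothing sensitive is logged.
        raise RuntimeError(f"secret {name!r} is not set in this run's environment")
    return value


# Typical use inside a check or setup script:
#     api_key = require_secret("ACME_API_KEY")
```

Failing early on a missing secret tends to produce a clearer run failure than an authentication error surfacing later in the transcript.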

Working with tasks from the CLI

Tasks are created from a JSON file. Run tpc sim spec to print the full contract; a minimal task.json looks like:
```json
{
  "name": "Checkout regression",
  "description": "Validate checkout flow",
  "category": "coding",
  "prompt": "Open the app, add an item to cart, and complete checkout successfully.",
  "goals": [
    {
      "name": "Checkout succeeds",
      "description": "The agent should add an item to the cart, proceed through the checkout flow, and reach a confirmation page without errors.",
      "evaluationType": "llm_judge",
      "model": "claude-sonnet-4-6",
      "passingThreshold": 70,
      "scoringMethod": "weighted_average"
    }
  ]
}
```
category is one of coding, research, documentation, or analysis. Don’t include a product field — the CLI injects your active product (set with tpc product switch).
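The constraints stated here are easy to check locally before calling the CLI. The sketch below is a hypothetical pre-flight lint, not part of the tpc CLI: it covers only the rules mentioned in this page (allowed categories, no product field, passing threshold between 0 and 100). The authoritative contract remains whatever tpc sim spec prints.

```python
ALLOWED_CATEGORIES = ("coding", "research", "documentation", "analysis")


def lint_task(task: dict) -> list[str]:
    """Return a list of problems found in a task definition (empty if none)."""
    problems = []
    if task.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"category must be one of {list(ALLOWED_CATEGORIES)}")
    if "product" in task:
        problems.append("remove 'product': the CLI injects the active product")
    for goal in task.get("goals", []):
        threshold = goal.get("passingThreshold")
        if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 100:
            problems.append(f"goal {goal.get('name')!r}: passingThreshold must be 0-100")
    return problems


task = {
    "name": "Checkout regression",
    "category": "coding",
    "prompt": "Open the app, add an item to cart, and complete checkout successfully.",
    "goals": [
        {
            "name": "Checkout succeeds",
            "evaluationType": "llm_judge",
            "passingThreshold": 70,
            "scoringMethod": "weighted_average",
        }
    ],
}
print(lint_task(task))  # []
```

In practice the dictionary would come from `json.load` on the task.json file, and a non-empty result would stop the script before the create command runs.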
```shell
# Create the task
tpc sim task create --file task.json

# Browse and inspect tasks
tpc sim task list --search checkout
tpc sim task get task_123

# Update metadata, or replace goals/definition from a file
tpc sim task update task_123 --name "Checkout regression v2"
tpc sim task update task_123 --file task.json

# Queue runs across the task's linked, enabled environments
tpc sim task run task_123

# ...or only on specific ones (comma-separate or repeat the flag)
tpc sim task run task_123 --environment-id env_a,env_b
```
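When runs are queued from a script rather than typed by hand, the run command can be assembled as an argument list. This is a small sketch using only the command and flag shown above; the `task_run_argv` helper is hypothetical, and it uses the comma-separated form of --environment-id.

```python
from typing import Sequence


def task_run_argv(task_id: str, environment_ids: Sequence[str] = ()) -> list[str]:
    """Build the argument list for `tpc sim task run`, optionally scoped to environments."""
    argv = ["tpc", "sim", "task", "run", task_id]
    if environment_ids:
        # Comma-separated form; repeating the flag is the documented alternative.
        argv += ["--environment-id", ",".join(environment_ids)]
    return argv


print(task_run_argv("task_123", ["env_a", "env_b"]))
# ['tpc', 'sim', 'task', 'run', 'task_123', '--environment-id', 'env_a,env_b']
```

Passing the list to `subprocess.run(argv, check=True)` avoids shell quoting issues and raises if the CLI exits non-zero.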
To create a task and attach it to existing environments in a single step — including secrets sourced from your local shell — use tpc sim create with a simulation.json:
```json
{
  "product": "my-product-slug",
  "environmentIds": ["env_a", "env_b"],
  "task": {
    "name": "Checkout with promo code",
    "category": "coding",
    "prompt": "Open the app, add an item to cart, apply PROMO10, and verify the discount is reflected at checkout.",
    "goals": [
      {
        "name": "Promo is applied",
        "description": "The final checkout state shows the expected 10% discount.",
        "evaluationType": "llm_judge",
        "model": "claude-sonnet-4-6",
        "passingThreshold": 70,
        "scoringMethod": "weighted_average"
      }
    ]
  },
  "secrets": [
    { "name": "ACME_API_KEY", "valueFromEnv": "ACME_API_KEY" }
  ]
}
```
```shell
tpc sim create --file simulation.json
```
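Since valueFromEnv reads from the local shell, it can help to confirm that every referenced variable is set before running the command. The sketch below is a hypothetical pre-flight check, not part of the CLI; how tpc itself reacts to an unset variable is not covered here.

```python
import os
from typing import Mapping


def missing_secret_sources(simulation: dict, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return env var names referenced by valueFromEnv that are unset or empty."""
    missing = []
    for secret in simulation.get("secrets", []):
        source = secret.get("valueFromEnv")
        if source and not environ.get(source):
            missing.append(source)
    return missing


simulation = {
    "environmentIds": ["env_a", "env_b"],
    "secrets": [{"name": "ACME_API_KEY", "valueFromEnv": "ACME_API_KEY"}],
}
# An explicit mapping stands in for the shell environment in this demonstration.
print(missing_secret_sources(simulation, environ={}))                         # ['ACME_API_KEY']
print(missing_secret_sources(simulation, environ={"ACME_API_KEY": "dummy"}))  # []
```

With the default `environ` argument the check reads the real shell environment, and the dictionary would normally come from `json.load` on simulation.json.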