This guide walks you through the complete workflow from agent configuration to final evaluation results.

Overview

  1. Select an agent
  2. Configure the agent
  3. Choose a benchmark
  4. Run tasks
  5. Evaluate results
  6. Inspect outputs

1. Select an Agent

browseruse-bench supports multiple agents. Choose one based on your needs:
  • browser-use: Programmable browser agent with vision capabilities
  • Agent-TARS: Reasoning-focused agent driven via a Node.js CLI
  • Skyvern: Browser automation powered by the Skyvern SDK
  • Claude Code: Anthropic's Claude CLI with Playwright MCP

2. Configure the Agent

All agent runtime settings live in the root config.yaml under agents.<agent-name>. This is the recommended approach. Copy the example and fill in your credentials:
cp config.example.yaml config.yaml
Then edit the agents section in config.yaml:
agents:
  browser-use:
    active_model: gpt          # model profile to use by default
    models:
      gpt:
        model_type: OPENAI
        model_id: gpt-4.1
        api_key: $OPENAI_API_KEY
        base_url: $OPENAI_BASE_URL
    browser:
      browser_id: Chrome-Local
    defaults:
      use_vision: false
      max_steps: 40
      timeout: 600
Switch the active model at runtime without editing the file:
bubench run --agent browser-use --model gpt ...
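Multiple model profiles can coexist under models, and --model selects between them at run time. A sketch with a second profile added (the claude profile name, its model_type value, and the model IDs are illustrative, not confirmed values):

```yaml
agents:
  browser-use:
    active_model: gpt              # default profile
    models:
      gpt:
        model_type: OPENAI
        model_id: gpt-4.1
        api_key: $OPENAI_API_KEY
      claude:                      # hypothetical second profile
        model_type: ANTHROPIC      # assumed enum value
        model_id: claude-sonnet-4-5
        api_key: $ANTHROPIC_API_KEY
```

Running with --model claude would then select the second profile without editing active_model.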
Not recommended: configs/agents/<agent>/config.yaml
Per-agent config files under configs/agents/ are no longer the recommended approach and may be removed in a future release. Use the root config.yaml instead (see above). They can still be passed explicitly via --agent-config configs/agents/<agent>/config.yaml.

3. Select Benchmark

Choose a benchmark based on your evaluation needs:

LexBench-Browser

  • Evaluation Method: Visual assessment (screenshot sequence analysis)
  • Scoring: 0-100 scale, default threshold: 60
  • Use Case: Visual understanding and multi-step reasoning

Online-Mind2Web

  • Evaluation Method: WebJudge multi-round evaluation
  • Scoring: 3-point scale, default threshold: 3
  • Use Case: Web navigation and task completion

BrowseComp

  • Evaluation Method: Text answer accuracy
  • Scoring: Binary (correct/incorrect)
  • Use Case: Factual accuracy and information extraction

4. Run Tasks

Basic Command

bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split All

All Parameters

  • --agent: Agent name. Defaults to config.yaml default.agent (fallback: Agent-TARS).
  • --benchmark: Benchmark name. Defaults to config.yaml default.benchmark (fallback: Online-Mind2Web).
  • --split: Dataset split. Defaults to All.
  • --data-source: Dataset source, local (default) or huggingface.
  • --force-download: Re-download the dataset. Only applies to huggingface.
  • --mode: Task selection mode: single, first_n, sample_n, specific, by_id, or all.
  • --count: Task count for first_n/sample_n. Defaults to 1.
  • --task-ids: Task IDs for specific mode. Space-separated list.
  • --id: Single task ID for by_id mode. Numeric ID field.
  • --timeout: Timeout per task, in seconds. Overrides TIMEOUT in config.
  • --skip-completed: Skip tasks with existing results. Useful when resuming.
  • --agent-config: Path to an explicit external agent config YAML. Optional; by default the runtime config is loaded from the root config.yaml.
  • --timestamp: Run or resume in a specific directory (YYYYMMDD_HHmmss).
  • --dry-run: Show the command without executing it. Useful as a configuration check.
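The default.agent / default.benchmark fallbacks mentioned above live at the top level of the root config.yaml; a minimal sketch:

```yaml
default:
  agent: browser-use             # used when --agent is omitted
  benchmark: LexBench-Browser    # used when --benchmark is omitted
```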

Output Structure

Results are saved to:
experiments/{benchmark}/{split}/{agent}/{timestamp}/
├── tasks/
│   ├── <task_id>/
│   │   ├── result.json
│   │   └── trajectory/
│   │       ├── screenshot-1.png
│   │       └── ...
└── tasks_eval_result/
    └── *_summary.json
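A quick sanity check after a run is to count how many tasks produced a result.json. The snippet below first recreates a minimal copy of the layout shown above as a fixture (demo only; a real run populates this tree itself), then counts results in the newest run directory:

```shell
# Fixture: minimal copy of the results layout shown above (demo only)
run="experiments/LexBench-Browser/All/browser-use/20240101_000000"
mkdir -p "$run/tasks/task-1/trajectory" "$run/tasks_eval_result"
echo '{}' > "$run/tasks/task-1/result.json"

# Pick the newest run directory (trailing slash comes from the glob) and count completed tasks
run_dir=$(ls -dt experiments/LexBench-Browser/All/browser-use/*/ | head -n 1)
find "${run_dir}tasks" -name result.json | wc -l
```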

Monitoring Progress

Log files are created under output/logs/run/:
ls -t output/logs/run | head -n 1                                # newest log file
tail -f "output/logs/run/$(ls -t output/logs/run | head -n 1)"   # follow it live

5. Evaluate Results

Run Evaluation

bubench eval \
  --benchmark LexBench-Browser \
  --agent browser-use \
  --model-id bu-2-0 \
  --split All
The eval command automatically finds the latest results under the --agent / --model-id output directory.

Evaluation Parameters

  • --model-id: The model_id used at run time (output subdirectory). Default: the model_id of agents.<agent>.active_model.
  • --model: Evaluation LLM model. Default: eval.model in config.yaml.
  • --score-threshold: Success threshold. Default: 60 (LexBench), 3 (others).
  • --force-reeval: Force re-evaluation. Default: false.
  • --timestamp: Evaluate a specific run. Default: latest (auto-detected).
  • --data-source: Dataset source (LexBench only). Default: local.
  • --force-download: Re-download the dataset (LexBench only). Default: false.
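The eval.model default referenced above lives in the same root config.yaml; a minimal sketch (the model ID shown is illustrative):

```yaml
eval:
  model: gpt-4.1   # evaluation LLM used when --model is omitted
```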

Output Files

Evaluation results are written to tasks_eval_result/:
  • Detailed Results: *_eval_results.json
  • Summary Statistics: *_summary.json

Review Results

cat experiments/{benchmark}/{split}/{agent}/{timestamp}/tasks_eval_result/*_summary.json
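Summary files are plain JSON, so they can be post-processed with standard tools. The snippet below writes a small sample summary as a fixture (the total/success field names are illustrative; the real keys may differ) and computes a success rate from it:

```shell
# Fixture: a sample summary file (field names are illustrative)
mkdir -p tasks_eval_result
printf '{"total": 10, "success": 7}\n' > tasks_eval_result/demo_summary.json

# Compute a success rate from the summary
python3 -c "import json; d = json.load(open('tasks_eval_result/demo_summary.json')); print(d['success'] / d['total'])"
```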

Complete Example

# 1. Configure agent in root config.yaml
vim config.yaml   # edit agents.browser-use section

# 2. Run inference (first 10 tasks)
bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split All \
  --mode first_n \
  --count 10

# 3. Evaluate results
bubench eval \
  --benchmark LexBench-Browser \
  --agent browser-use \
  --model-id bu-2-0 \
  --split All

# 4. View summary
ls -lh experiments/LexBench-Browser/All/browser-use/*/tasks_eval_result/

Common Issues

Timeout Errors

Problem: Tasks exceed the configured timeout.
Solution: Increase timeout in the agent's defaults section of config.yaml, or pass --timeout on the command line.

Missing Screenshots (LexBench-Browser)

Problem: Evaluation fails due to missing screenshots.
Solution: Confirm that tasks/<task_id>/trajectory/ contains screenshots, and check the run logs for task failures.

Model API Errors

Problem: LLM API calls fail.
Solution: Verify the API keys in config.yaml (use $ENV_VAR references and set the values in .env). For evaluation runs, confirm the .env values are set as well.
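A quick preflight for this failure mode is to check that the referenced environment variables are actually set before launching a run (the variable names follow the config example earlier in this guide):

```shell
# Report which of the given environment variables are unset or empty
check_env() {
  for var in "$@"; do
    [ -n "$(printenv "$var")" ] || echo "missing: $var"
  done
}

check_env OPENAI_API_KEY OPENAI_BASE_URL
```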

Next Steps

  • Custom Benchmarks: Learn how to create your own benchmark (Guide)
  • Leaderboard: Submit results to the public leaderboard (Details)
  • Advanced Configuration: Explore advanced agent settings (Documentation)