
Prerequisites

  • Python 3.11+
  • Node.js 18+ (only for Agent-TARS)
  • uv (recommended Python package manager)

Installation

1. Clone the repository

git clone https://github.com/lexmount/browseruse-bench.git
cd browseruse-bench
2. Install Python dependencies

# Install core dependencies and register the bubench CLI
uv sync

Activate the venv so bubench is on PATH:

source .venv/bin/activate

On Windows PowerShell:

.venv\Scripts\Activate.ps1
bubench run will create the agent venv defined in config.yaml (built-in defaults: .venvs/browser_use, .venvs/skyvern, .venvs/agent_tars) and install the matching dependencies on first use. The agent venv must be configured explicitly; there is no fallback to .venv. If uv is not available, venv creation and dependency installation fall back to python -m venv and pip.
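Based on the defaults above, a per-agent venv entry in config.yaml looks roughly like this (a sketch only; the venv key is described in this guide, but check config.example.yaml for the exact surrounding structure):

```yaml
# Sketch: per-agent venv paths. bubench run creates each venv on first use
# (via uv, or python -m venv / pip as a fallback).
agents:
  browser-use:
    venv: .venvs/browser_use
  skyvern:
    venv: .venvs/skyvern
```

Leaving venv undefined for an agent is an error, not a fallback to the root .venv.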
3. Configure environment (.env)

cp .env.example .env
Edit .env. .env.example is the source of truth; the keys below are the common ones. $VAR placeholders in config.yaml are resolved from this file at runtime.
### OpenAI API (used by agents and the eval pipeline)
OPENAI_API_KEY=your-openai-api-key-here
# OPENAI_BASE_URL=https://api.openai.com/v1    # Optional: custom / OpenAI-compatible gateway

### OpenAI-compatible gateway (Skyvern, and any agent using openai_compatible_*)
# OPENAI_COMPATIBLE_API_KEY=your-gateway-api-key
# OPENAI_COMPATIBLE_API_BASE=https://your-gateway.example.com/v1

### Anthropic API (Claude Code agent)
ANTHROPIC_API_KEY=your-anthropic-api-key-here
# ANTHROPIC_BASE_URL=https://api.anthropic.com

### Lexmount cloud browser
LEXMOUNT_API_KEY=your-lexmount-api-key
LEXMOUNT_PROJECT_ID=your-project-id
# LEXMOUNT_BASE_URL=https://api.lexmount.cn   # Optional. Official endpoints:
#   https://api.lexmount.cn               — production (mainland China / 国内)
#   https://api.lexmount.com              — production (international / 国外)

### Browser Use API (only when models.*.model_type: BROWSER_USE)
BROWSER_USE_API_KEY=your-browser-use-api-key

### AgentBay (only when browser_id: agentbay)
# AGENTBAY_API_KEY=your-agentbay-api-key

### HuggingFace mirror (optional, faster downloads in China)
# HF_ENDPOINT=https://hf-mirror.com
Lexmount credentials: Apply for LEXMOUNT_API_KEY and LEXMOUNT_PROJECT_ID at browser.lexmount.cn (mainland China) or browser.lexmount.com (international). See Lexmount Cloud Browser for the full setup.
4. Configure config.yaml

cp config.example.yaml config.yaml
The root config.yaml is the canonical runtime config. $VAR placeholders are resolved from .env at runtime. There are three parts to set up:

1. Agent (model) — pick the active model for each agent under agents.<agent>:
agents:
  browser-use:
    active_model: gpt-5.4         # pick any key from models below
    models:
      gpt-5.4:
        model_type: OPENAI        # OPENAI | GEMINI | ANTHROPIC | AZURE | BROWSER_USE
        model_id: gpt-5.4
        api_key: $OPENAI_API_KEY
        base_url: $OPENAI_BASE_URL
      browser-use:
        model_type: BROWSER_USE
        model_id: bu-2-0
        api_key: $BROWSER_USE_API_KEY
Override the active model per run with --model <name> — no need to edit the file.
2. Browser backend — set browser_id under agents.<agent>.browser. Only one backend is active at a time; fill in its required keys and leave others commented:
agents:
  browser-use:
    browser:
      browser_id: lexmount          # lexmount | Chrome-Local | agentbay | browser-use-cloud | cdp
      # --- lexmount ---
      lexmount_browser_mode: normal
      lexmount_api_key: $LEXMOUNT_API_KEY
      lexmount_project_id: $LEXMOUNT_PROJECT_ID
      # --- Chrome-Local ---
      # browser_id: Chrome-Local   # no extra params
      # --- agentbay ---
      # browser_id: agentbay       # set AGENTBAY_API_KEY in .env
See Lexmount Cloud Browser and each agent's page for backend-specific options.

3. Eval model — used by bubench eval for LLM-as-judge scoring:
eval:
  model: gpt-5.4
  api_key: $OPENAI_API_KEY
  base_url: $OPENAI_BASE_URL   # remove if using api.openai.com directly
Per-agent config files under configs/agents/<agent>/config.yaml are a legacy path and may be removed in a future release. Prefer the root config.yaml.
5. Install Agent-TARS CLI (optional)

npm install -g @agent-tars/cli@0.3.0
6. Install skills (optional)

bubench skills

Quick Run

Run your first benchmark

# Run the first 3 tasks of LexBench-Browser (L1 no-login subset)
bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split L1 \
  --mode first_n \
  --count 3
Add --dry-run to validate config loading and task resolution without executing tasks:
bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --mode single \
  --dry-run
--dry-run checks that config.yaml / .env parse cleanly and that at least one task matches your --benchmark / --split / --mode. It does not create the agent venv, call model APIs, or open a browser — those only happen on a real run.

Evaluate results

# Uses active_model's model_id from config.yaml by default
bubench eval --agent browser-use --benchmark LexBench-Browser

# Override with --model-id when evaluating a non-active model's run
bubench eval --agent browser-use --benchmark LexBench-Browser --model-id gpt-5.4

# Custom score threshold
bubench eval --agent browser-use --benchmark LexBench-Browser --score-threshold 70
--model-id tells eval which output subdirectory to score (experiments/<benchmark>/<split>/<agent>/<model_id>/). Omitting it falls back to the model_id of agents.<agent>.active_model in config.yaml, matching what bubench run writes by default.
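The layout eval scans can be illustrated with a throwaway fixture; the directories below are created on the spot purely to mirror the documented experiments/<benchmark>/<split>/<agent>/<model_id>/ convention:

```shell
# Fixture only: this layout is created here, not by bubench.
root=$(mktemp -d)
mkdir -p "$root/LexBench-Browser/L1/browser-use/gpt-5.4"
mkdir -p "$root/LexBench-Browser/L1/browser-use/bu-2-0"

# Each subdirectory name under the agent is a model_id you could pass
# to `bubench eval --model-id`.
model_dirs=$(ls -1 "$root/LexBench-Browser/L1/browser-use" | sort)
echo "$model_dirs"
```

If a run was made with a model that is no longer active_model, its subdirectory is still there; just name it explicitly with --model-id.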
Logs: Script execution logs are saved in output/logs/.
  • run.py: output/logs/run/
  • eval.py: output/logs/eval/
  • leaderboard: output/logs/leaderboard/

Generate leaderboard

# Collect all evaluation results and generate the HTML leaderboard
bubench leaderboard

# Start a local server to view it
bubench server
# Visit http://localhost:8000

Run Modes

| Mode | Description | Example |
| --- | --- | --- |
| single | Run the first task (sanity check) | --mode single |
| first_n | Run the first N tasks | --mode first_n --count 5 |
| sample_n | Randomly sample N tasks | --mode sample_n --count 10 |
| specific | Run specified task IDs | --mode specific --task-ids id1 id2 |
| by_id | Run one task by its numeric ID field | --mode by_id --id 123 |
| all | Run all tasks | --mode all |
Note: --task-ids expects a space-separated list.

Common Parameters

bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split All \
  --mode first_n \
  --count 5 \
  --timeout 600 \
  --skip-completed \
  --dry-run
Additional flags:
  • --data-source: local or huggingface.
  • --force-download: Force re-download in HuggingFace mode.
  • --agent-config: Optional external agent config YAML path. By default the runtime config is loaded from root config.yaml.
  • --timestamp: Resume or run in a specific directory (YYYYMMDD_HHmmss).
--timeout overrides TIMEOUT in the agent config.

Resume an Interrupted Run

If a run is interrupted, use --timestamp to point to the same output directory and --skip-completed to skip tasks that already have results:
bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --timestamp 20260101_120000 \
  --skip-completed \
  --mode all
Tip: Find your timestamp under experiments/{benchmark}/{split}/{agent}/.
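Because YYYYMMDD_HHmmss names sort lexicographically in time order, the newest run can be picked with plain sort. A sketch (the fixture directories stand in for experiments/{benchmark}/{split}/{agent}/):

```shell
# Fixture: two run directories, as bubench would have created them.
exp_dir=$(mktemp -d)
mkdir -p "$exp_dir/20260101_120000" "$exp_dir/20260102_093000"

# Timestamp names sort chronologically as strings, so `sort | tail` finds the latest.
latest=$(ls -1 "$exp_dir" | sort | tail -n 1)
echo "$latest"   # pass this value as --timestamp to resume
```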

Running Multiple Agents in Parallel

bubench run uses the venv specified by the agent entry in config.yaml and will auto-create/install dependencies on first use. By default each built-in agent has a dedicated venv:
  • browser-use -> .venvs/browser_use
  • skyvern -> .venvs/skyvern
  • Agent-TARS -> .venvs/agent_tars
If an agent entry does not define venv, bubench run exits with an error instead of falling back to .venv. If you need to run conflicting agents at the same time, open two terminals and run each agent with its own venv.

Parallel Task Execution (Split by Task IDs)

To speed up a large benchmark, split tasks across multiple terminals using --mode specific --task-ids:
# Terminal 1
bubench run --agent browser-use --benchmark LexBench-Browser \
  --mode specific --task-ids task_001 task_002 task_003

# Terminal 2
bubench run --agent browser-use --benchmark LexBench-Browser \
  --mode specific --task-ids task_004 task_005 task_006
Use the same --timestamp in both terminals to write results to the same output directory.
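Splitting an ID list in half can be done with standard shell tools; the task IDs below are hypothetical placeholders:

```shell
# Sketch: divide six hypothetical task IDs between two terminals.
ids="task_001 task_002 task_003 task_004 task_005 task_006"
first=$(echo "$ids" | cut -d' ' -f1-3)    # terminal 1's share
second=$(echo "$ids" | cut -d' ' -f4-6)   # terminal 2's share
echo "terminal 1 --task-ids: $first"
echo "terminal 2 --task-ids: $second"
```

Paste each share after --mode specific --task-ids in its own terminal, keeping --timestamp identical so results land in one directory.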

Node.js Agents (No Conflicts)

Agent-TARS runs via a Node.js CLI and does not share Python dependencies with other agents. You can run it in any terminal after installing the CLI.
bubench run --agent Agent-TARS ...

Next Steps

  • Supported Agents: explore the available browser agents
  • Benchmarks: learn about each benchmark
  • Cloud Browser Setup: configure the Lexmount cloud browser
  • View Leaderboard: compare agent performance