# Install core dependencies and register the bubench CLI
uv sync
Activate the venv so bubench is on PATH:
source .venv/bin/activate
Windows PowerShell:
.venv\Scripts\Activate.ps1
bubench run will create the agent venv defined in config.yaml (built-in defaults:
.venvs/browser_use, .venvs/skyvern, .venvs/agent_tars) and install the matching
dependencies on first use. Each agent's venv must be configured explicitly; there is no fallback to .venv.
If uv is not available, creation/install falls back to python -m venv and pip.
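The uv-or-venv fallback described above can be sketched roughly as follows. This is an illustration only, not bubench's actual implementation: the helper name `venv_create_command` and its `uv_path` parameter are hypothetical, and the function only builds the command instead of running it.

```python
import shutil
import sys


def venv_create_command(venv_dir, uv_path=None):
    """Return the command that would create venv_dir (hypothetical helper).

    Prefers `uv venv` when a uv executable is found on PATH; otherwise
    falls back to the standard-library `python -m venv`.
    """
    # uv_path=None means "look uv up on PATH"; an empty string forces the fallback.
    uv = shutil.which("uv") if uv_path is None else uv_path
    if uv:
        return [uv, "venv", str(venv_dir)]
    return [sys.executable, "-m", "venv", str(venv_dir)]
```

Passing the result to `subprocess.run(cmd, check=True)` would create the environment; dependency installation would follow the same pattern (uv if present, pip otherwise).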
3. Configure environment (.env)
cp .env.example .env
Edit .env — .env.example is the source of truth; keys below are the common ones.
$VAR placeholders in config.yaml are resolved from this file at runtime.
### OpenAI API (used by agents and the eval pipeline)
OPENAI_API_KEY=your-openai-api-key-here
# OPENAI_BASE_URL=https://api.openai.com/v1   # Optional: custom / OpenAI-compatible gateway

### OpenAI-compatible gateway (Skyvern, and any agent using openai_compatible_*)
# OPENAI_COMPATIBLE_API_KEY=your-gateway-api-key
# OPENAI_COMPATIBLE_API_BASE=https://your-gateway.example.com/v1

### Anthropic API (Claude Code agent)
ANTHROPIC_API_KEY=your-anthropic-api-key-here
# ANTHROPIC_BASE_URL=https://api.anthropic.com

### Lexmount cloud browser
LEXMOUNT_API_KEY=your-lexmount-api-key
LEXMOUNT_PROJECT_ID=your-project-id
# LEXMOUNT_BASE_URL=https://api.lexmount.cn   # Optional. Official endpoints:
#   https://api.lexmount.cn   production (mainland China)
#   https://api.lexmount.com  production (international)

### Browser Use API (only when models.*.model_type: BROWSER_USE)
BROWSER_USE_API_KEY=your-browser-use-api-key

### AgentBay (only when browser_id: agentbay)
# AGENTBAY_API_KEY=your-agentbay-api-key

### HuggingFace mirror (optional, faster downloads in China)
# HF_ENDPOINT=https://hf-mirror.com
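To make the $VAR resolution concrete, here is a minimal sketch of how placeholders in a loaded config could be filled from .env-style text. It is an assumption-laden illustration, not bubench's own loader: `parse_env` and `resolve` are hypothetical names, the parser ignores quoting and inline comments, and unknown placeholders are left untouched because the source does not say how bubench treats them.

```python
from string import Template


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def resolve(node, env):
    """Recursively substitute $VAR in every string of a nested config."""
    if isinstance(node, dict):
        return {key: resolve(value, env) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, env) for value in node]
    if isinstance(node, str):
        # safe_substitute leaves placeholders with no matching key as-is
        return Template(node).safe_substitute(env)
    return node
```

For example, `resolve({"lexmount_api_key": "$LEXMOUNT_API_KEY"}, parse_env("LEXMOUNT_API_KEY=abc"))` yields a dict whose `lexmount_api_key` is `"abc"`.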
The root config.yaml is the canonical runtime config. $VAR placeholders are
resolved from .env at runtime. Three parts to set up:
1. Agent (model): pick the active model for each agent under agents.<agent> (the active_model key).
Override the active model per run with --model <name>; no need to edit the file.
2. Browser backend — set browser_id under agents.<agent>.browser. Only one
backend is active at a time; fill in its required keys and leave others commented:
agents:
  browser-use:
    browser:
      browser_id: lexmount  # lexmount | Chrome-Local | agentbay | browser-use-cloud | cdp

      # --- lexmount ---
      lexmount_browser_mode: normal
      lexmount_api_key: $LEXMOUNT_API_KEY
      lexmount_project_id: $LEXMOUNT_PROJECT_ID

      # --- Chrome-Local ---
      # browser_id: Chrome-Local
      # local_proxy_server: ""  # http://127.0.0.1:7890 etc.; local Chromium does NOT inherit OS / clash / v2ray proxy

      # --- agentbay ---
      # browser_id: agentbay
      # set AGENTBAY_API_KEY in .env
See Local Chromium Browser and Lexmount Cloud Browser, plus each agent's page, for backend-specific options.
Important: the local Chromium does NOT pick up your OS / clash / v2ray proxy. Set local_proxy_server if you need to reach sites that a direct connection can't reach (e.g. bloomberg.com, archive.org).
3. Eval model, used by bubench eval for LLM-as-judge scoring:
eval:
  model: gpt-5.4
  api_key: $OPENAI_API_KEY
  base_url: $OPENAI_BASE_URL  # remove if using api.openai.com directly
Add --dry-run to validate config loading and task resolution without executing tasks:
bubench run \
  --agent browser-use \
  --data LexBench-Browser \
  --mode single \
  --dry-run
--dry-run checks that config.yaml / .env parse cleanly and that at least one
task matches your --data / --split / --mode. It does not create the
agent venv, call model APIs, or open a browser — those only happen on a real run.
# Uses active_model's model_id from config.yaml by default
bubench eval --agent browser-use --data LexBench-Browser

# Override with --model-id when evaluating a non-active model's run
bubench eval --agent browser-use --data LexBench-Browser --model-id gpt-5.4

# Custom score threshold
bubench eval --agent browser-use --data LexBench-Browser --score-threshold 70
--model-id tells eval which output subdirectory to score
(experiments/<benchmark>/<split>/<agent>/<model_id>/). Omitting it falls back to
the model_id of agents.<agent>.active_model in config.yaml, matching what
bubench run writes by default.
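The directory lookup described above amounts to the following sketch. It only mirrors the path pattern and the fallback rule stated in this section; the function name `eval_output_dir` is hypothetical, and the active model's model_id is passed in directly because the layout of the model definitions in config.yaml is not shown here.

```python
from pathlib import PurePosixPath


def eval_output_dir(benchmark, split, agent, active_model_id, model_id=None):
    """Build experiments/<benchmark>/<split>/<agent>/<model_id>/ (hypothetical helper).

    model_id stands in for --model-id; when it is omitted, the active
    model's model_id is used instead.
    """
    chosen = model_id or active_model_id
    return PurePosixPath("experiments") / benchmark / split / agent / chosen
```

Scoring a run produced with a non-active model therefore requires passing that model's id explicitly; otherwise eval would look in the active model's directory.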
Logs: Script execution logs are saved in output/logs/.
# Collect all evaluation results and generate the HTML leaderboard
bubench leaderboard

# Start a local server to view it
bubench server
# Visit http://localhost:8000
bubench run uses the venv specified by the agent entry in config.yaml and will auto-create/install dependencies
on first use. By default each built-in agent has a dedicated venv:
browser-use -> .venvs/browser_use
skyvern -> .venvs/skyvern
Agent-TARS -> .venvs/agent_tars
If an agent entry does not define venv, bubench run exits with an error instead of falling back to .venv.
If you need to run conflicting agents at the same time, open two terminals and run each agent with its own venv.
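The no-fallback rule for agent venvs can be pictured with a small sketch. It assumes the config has already been loaded into a dict shaped like agents.<agent>.venv; the function name `agent_venv` and the error messages are hypothetical and do not reproduce bubench's real output.

```python
def agent_venv(config, agent):
    """Return the venv path configured for an agent, or fail loudly.

    Deliberately never falls back to the project-level .venv.
    """
    agents = config.get("agents", {})
    if agent not in agents:
        raise KeyError(f"unknown agent: {agent!r}")
    venv = (agents[agent] or {}).get("venv")
    if not venv:
        raise ValueError(
            f"agent {agent!r} has no venv configured; refusing to fall back to .venv"
        )
    return venv
```

Failing early like this keeps agents with conflicting dependencies from silently sharing one environment.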