BrowseComp is a benchmark of competition-style browser operation tasks. It evaluates an agent’s browser operation capabilities across a broad range of skills.

Overview

Attribute    Value
Task Type    Browser operations
Evaluation   Grader-based scoring
Difficulty   Medium-High

Features

Competition-grade Tasks

High-difficulty tasks drawn from browser operation competitions

Comprehensive Skills

Tests a wide range of browser operation capabilities

Quick Start

Run Tasks

# Run first 3 tasks
bubench run \
  --agent browser-use \
  --benchmark BrowseComp \
  --mode first_n \
  --count 3

# Run with Agent-TARS
bubench run \
  --agent Agent-TARS \
  --benchmark BrowseComp \
  --mode first_n \
  --count 3
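
When the same run has to be repeated for several agents, the commands above can be assembled programmatically. The following is a minimal sketch, not part of bubench itself: `build_run_command` is a hypothetical helper, and it uses only the flags shown above (`--agent`, `--benchmark`, `--mode`, `--count`).

```python
import shlex


def build_run_command(agent, benchmark="BrowseComp", mode="first_n", count=3):
    """Build the argv list for a `bubench run` call.

    Hypothetical helper: it only mirrors the flags shown in this page.
    """
    return [
        "bubench", "run",
        "--agent", agent,
        "--benchmark", benchmark,
        "--mode", mode,
        "--count", str(count),
    ]


# Print one shell-ready command per agent without executing anything.
for agent in ("browser-use", "Agent-TARS"):
    print(shlex.join(build_run_command(agent)))
```

To actually launch a run, the returned list could be passed to `subprocess.run(cmd, check=True)`, assuming `bubench` is on the `PATH`.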

Evaluate Results

bubench eval --agent browser-use --benchmark BrowseComp --model-id bu-2-0

Data Loading

BrowseComp tasks can be loaded from local JSONL files or downloaded from HuggingFace. To use HuggingFace:

bubench run --agent browser-use --benchmark BrowseComp \
  --data-source huggingface

The HuggingFace Parquet file is converted to JSONL in the HF cache before use.

Evaluation Metrics

Metric            Description
Task Completion   Percentage of tasks completed
Accuracy          Result accuracy
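
As a rough illustration of how the two metrics relate, the sketch below computes both from a list of per-task results. The `TaskResult` shape and its `completed` and `correct` fields are assumptions made for this example, and accuracy is assumed here to be the share of all tasks that the grader judged correct. The actual bubench result schema and scoring rules may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record shape; the real bubench result schema may differ.
    task_id: str
    completed: bool  # the agent finished the task
    correct: bool    # the grader judged the result correct


def summarize(results):
    """Return (task_completion_pct, accuracy_pct) over all results."""
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    completed = sum(r.completed for r in results)
    correct = sum(r.correct for r in results)
    return 100.0 * completed / total, 100.0 * correct / total
```

Under these assumptions, a run in which three of four tasks complete and two are graded correct yields a task completion of 75% and an accuracy of 50%.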

Data Format

Task data is stored in benchmarks/BrowseComp/data/. Each task record has the following fields:
{
  "task_id": "browsecomp_001",
  "task": "Navigate to the website and complete the registration form",
  "expected_result": "Registration successful"
}
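
Because local task data is read from JSONL, each record would occupy a single line in the file. The sketch below shows one way to parse such lines and check that the three fields shown above are present. `parse_tasks` and `REQUIRED_FIELDS` are hypothetical names for this example and not part of bubench, and treating all three fields as required is an assumption.

```python
import json

# Field names taken from the record shown above.
REQUIRED_FIELDS = ("task_id", "task", "expected_result")


def parse_tasks(lines):
    """Parse an iterable of JSONL lines into task dicts.

    Blank lines are skipped; a record missing a required field raises
    ValueError with the offending line number.
    """
    tasks = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        tasks.append(record)
    return tasks
```

An open file handle can be passed directly, since iterating over a text file yields its lines.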