Online-Mind2Web is an online evaluation benchmark based on the Mind2Web dataset, testing agents’ navigation and interaction capabilities on real websites.

Overview

| Attribute | Value |
| --- | --- |
| Source | Mind2Web dataset |
| Task Type | Web navigation and interaction |
| Target Websites | Real-world English websites |
| Evaluation | WebJudge semantic matching |

Features

Real Websites

Tests operation on real websites, not simulated environments

Multi-step Tasks

Requires multiple sequential steps to complete complex goals

Semantic Evaluation

Uses WebJudge for semantic matching evaluation

No Login Required

All tasks can be executed without login

Quick Start

Run Tasks

# Run first 3 tasks
bubench run \
  --agent browser-use \
  --benchmark Online-Mind2Web \
  --mode first_n \
  --count 3

# Run all tasks
bubench run \
  --agent Agent-TARS \
  --benchmark Online-Mind2Web \
  --mode all \
  --skip-completed
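
When the same runs are launched for several agents from a script, the two invocations above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above; the helper name `build_run_command` is hypothetical and not part of bubench, and the command is only constructed here, not executed.

```python
from typing import List, Optional


def build_run_command(
    agent: str,
    mode: str,
    count: Optional[int] = None,
    skip_completed: bool = False,
    benchmark: str = "Online-Mind2Web",
) -> List[str]:
    """Assemble a `bubench run` argument list (hypothetical helper).

    Only the flags that appear in the examples above are used.
    """
    cmd = ["bubench", "run", "--agent", agent, "--benchmark", benchmark, "--mode", mode]
    if mode == "first_n":
        # The first_n example above always passes --count.
        if count is None:
            raise ValueError("first_n mode needs a count")
        cmd += ["--count", str(count)]
    if skip_completed:
        cmd.append("--skip-completed")
    return cmd


# Reproduce the two invocations shown above as printable command lines.
print(" ".join(build_run_command("browser-use", "first_n", count=3)))
print(" ".join(build_run_command("Agent-TARS", "all", skip_completed=True)))
```

The resulting list can be passed to `subprocess.run` if the script should also launch the run.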

Evaluate Results

bubench eval --agent browser-use --benchmark Online-Mind2Web --model-id bu-2-0

Evaluation Metrics

| Metric | Description |
| --- | --- |
| Task Success Rate | Percentage of tasks completed |
| Action Accuracy | Accuracy of individual actions |
| Element Accuracy | Accuracy of element targeting |
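
As a rough illustration of the first metric, Task Success Rate can be computed as the share of tasks judged successful, expressed as a percentage. The sketch below assumes each task has already been reduced to a single pass/fail verdict; how bubench stores those verdicts is not described here, so the list of booleans and the function name `task_success_rate` are assumptions. Action Accuracy and Element Accuracy are left out because their exact definitions are not given above.

```python
from typing import Iterable


def task_success_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of tasks whose verdict is True.

    `outcomes` is an assumed representation: one boolean per task.
    """
    results = list(outcomes)
    if not results:
        # No tasks evaluated; report 0.0 instead of dividing by zero.
        return 0.0
    return 100.0 * sum(results) / len(results)


# Three of four tasks succeeded.
print(task_success_rate([True, False, True, True]))  # 75.0
```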

Data Format

Task data is stored in benchmarks/Online-Mind2Web/data/. Each task is a JSON object of the following form:
{
  "task_id": "b7258ee05d75e6c50673a59914db412e_110325",
  "confirmed_task": "Find the store location and hours of the closest Trader Joe's to zip code 90028 and set it as my home store.",
  "website": "https://www.traderjoes.com/",
  "reference_length": 6,
  "level": "medium"
}
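
A record of this shape can be checked and summarised with the standard library alone. The sketch below parses the example record from an inline string, because the file names inside the data directory are not specified here; the helper names `parse_task` and `count_by_level` are hypothetical, and treating all five fields as required is an assumption based on this single example.

```python
import json
from collections import Counter
from typing import Dict, Iterable

# Fields seen in the example record; assumed to be present in every task.
REQUIRED_FIELDS = ("task_id", "confirmed_task", "website", "reference_length", "level")

SAMPLE = """
{
  "task_id": "b7258ee05d75e6c50673a59914db412e_110325",
  "confirmed_task": "Find the store location and hours of the closest Trader Joe's to zip code 90028 and set it as my home store.",
  "website": "https://www.traderjoes.com/",
  "reference_length": 6,
  "level": "medium"
}
"""


def parse_task(raw: str) -> Dict:
    """Parse one task record and fail loudly if an expected field is absent."""
    task = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in task]
    if missing:
        raise ValueError(f"task record is missing fields: {missing}")
    return task


def count_by_level(tasks: Iterable[Dict]) -> Counter:
    """Count tasks per value of the `level` field."""
    return Counter(task["level"] for task in tasks)


task = parse_task(SAMPLE)
print(task["website"], task["reference_length"])
print(count_by_level([task]))
```

Grouping by `level` is a convenient way to pick a small subset of tasks before a full run.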