Mind2Web E2E Performance Failure Report

Run: mind2web-e2e-perf/2026-05-05-1715 · Screenshots embedded as data URIs

Total Tasks: 10
Passed: 6
Failed: 4
Pass Rate: 60%
Avg Duration: 222s
Avg Steps: 15.7
Avg Tool Calls: 15.7

Metrics

Duration by Task (seconds)

87f4c512: 623s
816851ff: 609s
cfafe377: 335s
905cb530: 236s
bbbc243b: 107s
a7a73c8f: 91s
608c595e: 91s
3c1ffc3f: 51s
56f8890a: 39s
d71be72a: 33s

Tool Calls by Task

816851ff: 32
905cb530: 32
cfafe377: 22
608c595e: 21
87f4c512: 15
a7a73c8f: 15
3c1ffc3f: 11
d71be72a: 5
bbbc243b: 3
56f8890a: 1

Task Summary

Task ID | Status | Score | Duration | Steps | Tool Calls | Errors | Prompt (truncated)
608c595eec27 | PASS | 0.867 | 91s | 21 | 21 | 0 | Calculate a FedEx Ground shipping rate for a 3-pound package from zip code 10019
816851ff92ff | TIMEOUT | 0.255 | 609s | 32 | 32 | 0 | Search for boys' infant pajamas below $40.
d71be72aa25c | PASS | 0.920 | 33s | 5 | 5 | 0 | Find technical specs for the latest Macbook Air on Apple.
3c1ffc3f494e | COMPLETED | 0.714 | 51s | 11 | 11 | 0 | Find 12 Monkeys community and view the latest posts mentioning James Cole.
56f8890a837c | PASS | 0.930 | 39s | 1 | 1 | 0 | Show crazy credits for the movie "Prometheus" on IMDb.
905cb53061c3 | COMPLETED | 0.693 | 236s | 32 | 32 | 0 | Browse dermatologists within 10 miles of zip code 10019 and filter by only those
87f4c5128e36 | TIMEOUT | 0.430 | 623s | 15 | 15 | 0 | View the speakers that are bluetooth and wireless and filter the results to only
bbbc243b4f18 | PASS | 0.851 | 107s | 3 | 3 | 0 | Find out how many assists Chris Paul has been averaging in the current season.
cfafe3771369 | PASS | 0.802 | 335s | 22 | 22 | 0 | Find the highest-rated dealer for Cadillac with a rating above 4 stars within 20
a7a73c8fa754 | PASS | 0.798 | 91s | 15 | 15 | 0 | Estimate the cost of a photographer in 07055 for a 4-hour project.
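
The headline figures can be re-derived from the Task Summary rows. The sketch below is a hand transcription of those rows (status, duration, tool calls), not output from the evaluation harness. The field names are illustrative, and it assumes that only PASS counts toward the pass rate, which is consistent with the reported 6 passed / 4 failed split.

```python
# Rows transcribed by hand from the Task Summary: (task_id, status, duration_s, tool_calls).
ROWS = [
    ("608c595e", "PASS", 91, 21),
    ("816851ff", "TIMEOUT", 609, 32),
    ("d71be72a", "PASS", 33, 5),
    ("3c1ffc3f", "COMPLETED", 51, 11),
    ("56f8890a", "PASS", 39, 1),
    ("905cb530", "COMPLETED", 236, 32),
    ("87f4c512", "TIMEOUT", 623, 15),
    ("bbbc243b", "PASS", 107, 3),
    ("cfafe377", "PASS", 335, 22),
    ("a7a73c8f", "PASS", 91, 15),
]


def summarize(rows):
    """Recompute the summary metrics from per-task rows."""
    total = len(rows)
    # Assumption: only PASS counts as passed; TIMEOUT and COMPLETED count as failed.
    passed = sum(1 for _, status, _, _ in rows if status == "PASS")
    tool_calls = sum(calls for _, _, _, calls in rows)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total,
        "avg_duration_s": sum(d for _, _, d, _ in rows) / total,
        "total_tool_calls": tool_calls,
        "avg_tool_calls": tool_calls / total,
    }


summary = summarize(ROWS)
print(summary)
```

Under these assumptions the average duration comes out at 221.5s, which the report displays as 222s, and the tool-call total matches the 157 quoted under Cross-Cutting Observations.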

Failure Analysis

TIMEOUT Search for boys' infant pajamas below $40.

ID: 816851ff92ff0219acf4364dcc2c4692 · Score: 0.255 · Duration: 609s · Steps: 32

Diagnosis
A bot-protection wall on Macy's blocked all navigation attempts. The agent got stuck in a loop of direct URLs, menu navigation, evaluate_script, and Google site-search, all of which hit 'Access Denied'. It never reached the search results page.
Evidence
Final page content returned 'Access Denied' (Reference: 0.adc83017). Agent tried evaluate_script to navigate to '/shop/featured/boys-infant-pajamas', got Access Denied. Tried Google site-search which also failed. 32 tool calls over 10 minutes with at least 4 Access Denied responses.
Recent Trajectory
navigate_page(macys.com) → fill(search) → Access Denied → navigate_page(homepage) → hover(Kids menu) → click(Boys Infant) → Access Denied → evaluate_script(location.href) → Access Denied → navigate_page(google site:macys.com) → blocked → timeout
Next Check
Add a bot-protection detection heuristic: if Access Denied is hit twice on the same domain, abandon that domain and either try an alternative site or report inability. Consider a cookie/session warm-up before filtered navigations.
Final Answer (truncated)
I notice the search input is not visible in this snapshot. Let me look for it more carefully.
Final screenshot for 816851ff
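
The Next Check for this task (abandon a domain after two Access Denied hits) could be sketched roughly as below. This is not the agent's actual code. `BlockTracker`, `BLOCK_MARKERS`, and the plain substring match are hypothetical; real block pages vary by site and CDN, so the marker list would need tuning.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical markers; only 'Access Denied' is confirmed by this run.
BLOCK_MARKERS = ("access denied",)
MAX_BLOCKS_PER_DOMAIN = 2  # threshold suggested in the Next Check


class BlockTracker:
    """Counts block pages per domain and signals when to stop retrying."""

    def __init__(self, max_blocks=MAX_BLOCKS_PER_DOMAIN):
        self.max_blocks = max_blocks
        self.blocks = Counter()

    def record(self, url, page_text):
        """Record one page load; return True once the domain should be abandoned."""
        domain = urlparse(url).netloc.lower()
        if any(marker in page_text.lower() for marker in BLOCK_MARKERS):
            self.blocks[domain] += 1
        return self.blocks[domain] >= self.max_blocks


tracker = BlockTracker()
tracker.record("https://macys.com/", "Welcome")                                             # not blocked
tracker.record("https://macys.com/shop/featured/boys-infant-pajamas", "Access Denied")      # first block
give_up = tracker.record("https://macys.com/", "Access Denied")                             # second block
print(give_up)
```

Counting per domain rather than per URL is the point of the design: in this run the agent varied the URL on every attempt, so a per-URL counter would never have tripped.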

COMPLETED Find 12 Monkeys community and view the latest posts mentioning James Cole.

ID: 3c1ffc3f494e423b3c434c79e35da8f3 · Score: 0.714 · Duration: 51s · Steps: 11

Diagnosis
Reddit blocked non-authenticated automated access. Agent pivoted to Google site-search workaround and found post titles/snippets, but couldn't actually 'view' the posts as required. Task completion penalized because only search snippets were shown, not actual post content.
Evidence
Direct Reddit navigation returned login/blocking page. old.reddit.com also blocked. Google search with 'site:reddit.com/r/12Monkeys James Cole' returned snippets. Agent provided comprehensive list of 10 posts with excerpts from Google results.
Recent Trajectory
navigate_page(reddit.com/r/12Monkeys) → blocked → navigate_page(old.reddit.com) → blocked → navigate_page(google site:reddit.com/r/12Monkeys James Cole) → get_page_content → take_snapshot → navigate_page(reddit post link) → blocked → done
Next Check
For Reddit tasks, consider using Reddit's public JSON API (append .json to URLs) or ensure authentication cookies are available. Could also try accessing cached versions.
Final Answer (truncated)
Found r/12Monkeys posts mentioning James Cole via Google search. Listed 10 posts with titles and snippets. Reddit blocked direct access.
Final screenshot for 3c1ffc3f
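
The `.json` suggestion in the Next Check could look roughly like the sketch below. It only builds the URL and filters an already-fetched payload. It does not perform the request, and it is not known whether the JSON endpoint would have been reachable in this run, since it may be blocked or rate-limited like the HTML pages were. The helper names are hypothetical, and the payload shape (`data.children[].data` with `title` and `selftext`) is an assumption to verify against a real response.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Append .json to the path of a Reddit listing or post URL, keeping the query string."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def posts_mentioning(listing, phrase):
    """Return titles of posts whose title or body contains phrase (case-insensitive).

    Assumes a listing payload shaped like {"data": {"children": [{"data": {...}}]}}.
    """
    needle = phrase.lower()
    titles = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
        if needle in text:
            titles.append(post.get("title", ""))
    return titles


json_url = to_json_url("https://www.reddit.com/r/12Monkeys/new/")
print(json_url)
```

Because `posts_mentioning` reads the post body as well as the title, it would address the grader's objection that only search snippets were shown, provided the endpoint is actually accessible.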

COMPLETED Browse dermatologists within 10 miles of zip code 10019 and filter by only those who accept Blue Medicare Advantage.

ID: 905cb53061c33aa2d77e485fe1fca516 · Score: 0.693 · Duration: 236s · Steps: 32

Diagnosis
Target filter 'Blue Medicare Advantage' does not exist on Healthgrades. Agent made reasonable substitution (Blue Cross Blue Shield + HMO) but this doesn't exactly match the task requirement. Wasted time on 4-5 URL manipulation attempts that returned 404 before falling back to UI filters.
Evidence
Agent tried URL patterns like 'insurance=BCBS-Medicare+Advantage' and 'planType=Medicare+Advantage' — all 404. Final screenshot shows 194 results with BCBS HMO filter applied. Grader scored task_completion 70/100 due to filter mismatch.
Recent Trajectory
navigate_page(healthgrades dermatologists 10019) → fill(zip) → click(search) → navigate_page(URL with insurance param) → 404 → navigate_page(another URL) → 404 → navigate_page(back) → click(Insurance filter) → select_option(BCBS) → select_option(HMO) → done
Next Check
Prefer UI-first filter exploration over URL guessing. When exact filter option unavailable, report inability clearly rather than substituting — or verify substitution matches task intent.
Final Answer (truncated)
Healthgrades does not have 'Blue Medicare Advantage' as a specific plan type option. Selected HMO as closest match since Blue Medicare Advantage plans are typically HMO-based networks.
Final screenshot for 905cb530
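
One way to act on the Next Check is to resolve the requested filter against the options the UI actually exposes, and to treat anything short of an exact match as something to report rather than apply. The sketch below assumes the option labels have already been read from the filter panel. `resolve_filter`, `FilterResolution`, and the sample option list are hypothetical; the report only confirms that Blue Cross Blue Shield and HMO were selectable.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FilterResolution:
    requested: str
    match: Optional[str]                                   # exact option to apply, or None
    near_misses: List[str] = field(default_factory=list)   # options to report, not apply


def resolve_filter(requested, available):
    """Return an exact (case-insensitive) match if one exists, else only near misses."""
    wanted = requested.strip().lower()
    for option in available:
        if option.strip().lower() == wanted:
            return FilterResolution(requested, option)
    # No exact match: collect options sharing at least one word, for the final report.
    wanted_words = set(wanted.split())
    near = [o for o in available if wanted_words & set(o.lower().split())]
    return FilterResolution(requested, None, near)


# Hypothetical option list for illustration only.
options = ["Aetna", "Blue Cross Blue Shield", "Cigna", "Medicare"]
result = resolve_filter("Blue Medicare Advantage", options)
print(result.match, result.near_misses)
```

With `match` set to `None`, the agent would state that the exact filter is unavailable and list the near misses, instead of silently applying a substitute as it did here.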

TIMEOUT View the speakers that are bluetooth and wireless and filter the results to only show models that are on sale and cost less than $50.

ID: 87f4c5128e36cdb9366a138a7b61bb00 · Score: 0.430 · Duration: 623s · Steps: 15

Diagnosis
Agent reached the correct filter state at step 10 (Bluetooth + Wireless + On Sale + price < $50, 27 results) but then accidentally added an 'Open-Box' condition filter, reducing results to 1. It spent the remaining time trying to remove the unwanted filter and timed out. It misdiagnosed the issue as cookie/session state rather than a URL parameter.
Evidence
Screenshot 10 showed correct state with 27 results. Final screenshot shows Open-Box filter active with only 1 result. Agent's final answer incorrectly claims extra filters come from 'session/cookie state' when Open-Box was clearly in the URL.
Recent Trajectory
navigate_page(bestbuy speakers) → take_snapshot → check(Bluetooth) → check(Wireless) → click(On Sale) → click(<$25) → click($25-$49.99) → [correct state, 27 results] → accidentally clicked Open-Box → take_snapshot → evaluate_script(try remove) → timeout
Next Check
Add state verification after filter application: compare active filter count against expected. If unexpected filters appear, remove them immediately via URL parameter manipulation rather than UI clicks.
Final Answer (truncated)
The 'Open-Box', 'JBL', and 'Refurbished' tags are still appearing in the filter bar - these seem to be coming from the site's session/cookie state rather than the URL.
Final screenshot for 87f4c512
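
The state-verification step in the Next Check could be sketched as a set comparison after each filter action, followed by a URL rewrite when something unexpected shows up. The filter labels below follow the trajectory. The query-parameter scheme (`filter=...` on an example.com URL) is purely hypothetical, since the report does not show how Best Buy encodes filters in its URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def diff_filters(expected, active):
    """Return (missing, unexpected) filter labels, each as a sorted list."""
    expected, active = set(expected), set(active)
    return sorted(expected - active), sorted(active - expected)


def drop_filters(url, unwanted, param="filter"):
    """Remove unwanted values of a repeated query parameter; keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not (key == param and value in unwanted)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


expected = {"Bluetooth", "Wireless", "On Sale", "Less than $25", "$25 - $49.99"}
active = expected | {"Open-Box"}  # state after the accidental click
missing, unexpected = diff_filters(expected, active)
print(missing, unexpected)

# Hypothetical URL scheme, for illustration only.
url = "https://example.com/speakers?filter=bluetooth&filter=on-sale&filter=open-box&sort=price"
clean_url = drop_filters(url, {"open-box"})
print(clean_url)
```

Running the comparison right after step 10 would have recorded the 27-result state as the known-good checkpoint, and the next snapshot would have flagged Open-Box as unexpected before further steps were spent on it.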

Cross-Cutting Observations

Bot Protection / Site Blocking (2 of 4 failures)
The Macy's task (816851ff, TIMEOUT) and the Reddit task (3c1ffc3f, COMPLETED but not passed) both failed primarily because of anti-bot measures. The agent lacks a reliable strategy for detecting and escaping bot-protection loops early.
Accidental Filter Regression (1 of 4 failures)
The Best Buy task (87f4c512) reached a correct state at step 10 but then accidentally applied an unwanted filter. No state-verification checkpoint exists to detect regressions.
URL Guessing Before UI (1 of 4 failures)
The Healthgrades task (905cb530) wasted 4-5 steps on URL parameter guessing that all returned 404. A UI-first approach would have been more efficient.
Zero Tool Errors
All 157 tool calls across the run succeeded at the tool level (the Errors column is 0 for every task). Failures are entirely due to site-level blocking, agent logic errors, and timeouts. The tooling infrastructure is stable.