GUI Agent Benchmark harness for reproducible failure analysis
This project provides a deterministic browser benchmark that records capture, trace, and taxonomy artifacts for diagnosing where GUI agents fail. It is aimed at researchers and developers building autonomous UI agents, and offers a reproducible environment for isolating failures in UI primitives such as dropdowns, tables, and modals. Unlike generic leaderboards, it preserves the full evidence chain, which enables deeper failure analysis.
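To make the idea of an evidence chain concrete, here is a minimal Python sketch of what a per-failure record and a per-primitive summary might look like. It is an illustration only: the `FailureRecord` class, its field names, the file paths, and the taxonomy labels are all hypothetical and are not taken from the project's actual schema or API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FailureRecord:
    """One failed episode with its evidence (hypothetical schema, not the project's)."""
    task_id: str
    primitive: str                                   # e.g. "dropdown", "table", "modal"
    capture_path: str                                # hypothetical path to a saved capture
    trace: list[str] = field(default_factory=list)   # ordered agent actions
    taxonomy_label: str = "unclassified"             # hypothetical failure category


def failures_by_primitive(records):
    """Count failures per UI primitive to see where an agent breaks most often."""
    return Counter(r.primitive for r in records)


# Made-up sample data, for illustration only.
records = [
    FailureRecord("t1", "dropdown", "captures/t1.png",
                  ["click #menu", "type 'US'"], "wrong_target"),
    FailureRecord("t2", "modal", "captures/t2.png",
                  ["click #save"], "blocked_by_overlay"),
    FailureRecord("t3", "dropdown", "captures/t3.png",
                  ["click #menu"], "option_not_found"),
]

summary = failures_by_primitive(records)
print(summary.most_common())  # [('dropdown', 2), ('modal', 1)]
```

Keeping the capture, trace, and label together in one record is what would let a reader go from an aggregate count back to the specific episode behind it, which is the kind of analysis a score-only leaderboard cannot support.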
View on GitHub → Raidriar7170/gui-agent-benchmark