hatchmoment. scored by care · not by stars

gui-agent-benchmark

GUI Agent Benchmark harness for reproducible failure analysis

This project provides a deterministic browser benchmark that records capture, trace, and taxonomy artifacts to diagnose where GUI agents fail. It is aimed at researchers and developers building autonomous UI agents, and gives them a reproducible environment for isolating failures in UI primitives such as dropdowns, tables, and modals. Unlike generic leaderboards, it preserves the full evidence chain, which enables deeper failure analysis.
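As a rough illustration of what an evidence chain could look like, the sketch below links one failed step to its capture, trace, and taxonomy label, then tallies failures per UI primitive. Every name here (`FailureRecord`, `failures_by_primitive`, the field names, file paths, and taxonomy labels) is hypothetical and chosen for illustration; none of it is taken from the project's actual schema or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRecord:
    """One failed step plus pointers to its evidence (hypothetical schema)."""
    task_id: str
    primitive: str       # UI primitive involved, e.g. "dropdown", "table", "modal"
    capture_path: str    # assumed: screenshot or DOM snapshot at the failing step
    trace_path: str      # assumed: ordered log of the agent's actions
    taxonomy_label: str  # assumed: failure category assigned during analysis


def failures_by_primitive(records):
    """Count how many recorded failures involve each UI primitive."""
    return Counter(record.primitive for record in records)


# Illustrative records only; these are not real benchmark results.
records = [
    FailureRecord("task-001", "dropdown", "captures/task-001.png",
                  "traces/task-001.json", "wrong_option_selected"),
    FailureRecord("task-002", "modal", "captures/task-002.png",
                  "traces/task-002.json", "dismissed_too_early"),
    FailureRecord("task-003", "dropdown", "captures/task-003.png",
                  "traces/task-003.json", "option_not_found"),
]

for primitive, count in sorted(failures_by_primitive(records).items()):
    print(f"{primitive}: {count}")
```

Keeping the capture and trace paths on the same record as the taxonomy label means each label can be traced back to the evidence that produced it, which is the property an aggregate leaderboard score loses.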

Raidriar7170/gui-agent-benchmark