ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks
Compiled by KHAO Editorial, aggregated from 2 sources.
Summary
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in the publisher's suite. Average turn counts vary nearly 3x across models, and longer trajectories do not translate into higher accuracy: GPT-5.5 (xhigh) averages 31 turns per task and scores 46%, while Gemini 3.1 Pro Preview averages 83 turns and scores 30%. GLM-5.1 (Reasoning) leads open-weights models at 40%, effectively tied with Gemini 3.5 Flash (high). The benchmark comprises 59 SRE tasks: 40 public tasks and 19 brand-new, held-out tasks.
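The "nearly 3x" spread and the task total can be checked directly from the figures quoted in the summary. The sketch below is illustrative only: the variable names are assumptions, and it uses just the two turn counts the summary reports, so the spread across all evaluated models may differ.

```python
# Figures quoted in the summary; dictionary keys and variable names are illustrative.
reported = {
    "GPT-5.5 (xhigh)": {"avg_turns": 31, "score_pct": 46},
    "Gemini 3.1 Pro Preview": {"avg_turns": 83, "score_pct": 30},
}

# Ratio between the longest and shortest average trajectory among the quoted models.
turns = [m["avg_turns"] for m in reported.values()]
turn_ratio = max(turns) / min(turns)

# Task split reported for the benchmark.
public_tasks, held_out_tasks = 40, 19
total_tasks = public_tasks + held_out_tasks

print(f"turn ratio: {turn_ratio:.2f}x")
print(f"total tasks: {total_tasks}")
```

On these two data points, the model with the longer trajectories uses roughly 2.7 times as many turns while scoring 16 percentage points lower.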