Gemini · Claude · GPT · Agentic AI · AI Reasoning · DeepSeek · Hugging Face

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks

Wed, May 27 · 5:20 PM UTC

Compiled by KHAO Editorial, aggregated from 2 sources.


Summary

Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. Every frontier model scores below 50%, which makes ITBench-AA SRE one of the least saturated agentic benchmarks in the suite it belongs to. Average turn counts vary nearly 3x across models, and longer trajectories do not translate into higher accuracy: GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. GLM-5.1 (Reasoning) leads open-weights models at 40%, effectively tied with Gemini 3.5 Flash (high). The benchmark comprises 59 SRE tasks in total: 40 public tasks and 19 brand-new, held-out tasks.
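As a quick sanity check on the arithmetic in the summary, the short Python sketch below recomputes the turn-count ratio and the held-out share of tasks. It uses only the figures quoted above. The variable and function names are illustrative assumptions, not part of ITBench-AA, and the ratio covers only the two models whose turn counts are quoted, so it may not match the spread across the full leaderboard.

```python
# Figures copied from the summary; names here are illustrative only.
avg_turns = {"GPT-5.5 (xhigh)": 31, "Gemini 3.1 Pro Preview": 83}
accuracy = {"GPT-5.5 (xhigh)": 0.46, "Gemini 3.1 Pro Preview": 0.30}


def turn_ratio(turns):
    """Ratio of the largest to the smallest average turn count."""
    return max(turns.values()) / min(turns.values())


def held_out_share(public, held_out):
    """Fraction of all tasks that are held out."""
    return held_out / (public + held_out)


ratio = turn_ratio(avg_turns)
share = held_out_share(public=40, held_out=19)

print(f"turn ratio: {ratio:.2f}x")        # turn ratio: 2.68x
print(f"held-out share: {share:.1%}")     # held-out share: 32.2%

# The model with more turns per task is the less accurate of the two.
most_turns = max(avg_turns, key=avg_turns.get)
print(most_turns, accuracy[most_turns])   # Gemini 3.1 Pro Preview 0.3
```

The 2.68x ratio between the two quoted models is consistent with the "nearly 3x" figure, and roughly a third of the 59 tasks are held out.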
