Microsoft researchers find AI models and agents can't handle long-running tasks

Mon, May 11 · 8:50 PM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

An intern who failed this much would be shown the door.

Key facts

The stronger models (Gemini 3.1 Pro, Claude 4.6, GPT 5.4) aren’t better at avoiding small errors; instead, they delay critical failures to later rounds and experience them in fewer interactions
The researchers also note that LLMs have been getting better, pointing to OpenAI's GPT model family, whose benchmark performance rose from 14.7 percent to 71.5 percent over 16 months
To be considered "ready" for a given work domain, the researchers set the bar at 98 percent or higher after 20 interactions
The study found that "catastrophic corruption," meaning a benchmark score of 80 percent or less, occurred in more than 80 percent of model/domain combinations
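
The two thresholds above can be read as a simple classification rule. The following is a minimal sketch of that rule in Python, not the researchers' code: the function name `classify` and the middle label "degraded" are hypothetical, and it assumes the score is the benchmark percentage measured after 20 interactions.

```python
# Thresholds as reported in the key facts above.
READY_THRESHOLD = 98.0          # "ready": 98 percent or higher after 20 interactions
CATASTROPHIC_THRESHOLD = 80.0   # "catastrophic corruption": 80 percent or less


def classify(score_percent: float) -> str:
    """Label a benchmark score (0-100) using the reported thresholds.

    Assumption: score_percent is the score after 20 interactions.
    The "degraded" label for in-between scores is hypothetical; the
    article does not name that band.
    """
    if score_percent >= READY_THRESHOLD:
        return "ready"
    if score_percent <= CATASTROPHIC_THRESHOLD:
        return "catastrophic corruption"
    return "degraded"


if __name__ == "__main__":
    # Illustrative scores only, not results from the study.
    for score in (99.0, 90.0, 60.0):
        print(score, classify(score))
```

Under this reading, a score exactly at a boundary counts toward the named category: 98 is "ready" and 80 is "catastrophic corruption".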

Summary

Companies exploring automated workflows would be well advised to keep their AI agents on a short leash. Microsoft researchers have found that even the priciest frontier models introduce errors in long workflows, the very use case for which AI software has been pitched. Anthropic, for example, says, "Claude Cowork handles tasks autonomously." Redmond promotes similar usage, touting Microsoft 365 Copilot's ability to "Tackle complex, multistep research across your work data and the web." Philippe Laban, Tobias Schnabel, and Jennifer Neville from Microsoft Research set out to study what happens when large language models (LLMs) are asked to complete multistep tasks.

Read full article at The Register →
