KIneAngst
🟡 Partially justified

Microsoft study: AI agents corrupt up to 25 percent of document content in longer workflows

What it really says

Microsoft Research published a systematic study using the DELEGATE-52 benchmark that demonstrates how unreliable current AI agents are in multi-step workflows. The benchmark simulates workflows over 20 consecutive interactions across 52 professional domains - from software development to crystallography to music notation.

The results are sobering: frontier AI models lose up to 25 percent of document content over 20 delegated work steps, and average degradation across all tested models reaches 50 percent. Catastrophic corruption (a benchmark score of 80 percent or less) occurred in over 80 percent of all model-domain combinations. The researchers define 98 percent fidelity or higher as the threshold for professional reliability; only one domain - Python code - consistently cleared this bar across most tested models. The best-performing model, Google Gemini 3.1 Pro, was reliable enough for only 11 of 52 domains.

Particularly concerning: providing tools (file access, code execution) made performance worse, not better. The four tested GPT models (5.4, 5.2, 5.1, and 4.1) scored 6 percentage points worse on average when operated agentically with tools. The researchers also distinguish two types of corruption: weaker models corrupt through deletion, which is noticeable; frontier models corrupt through plausible-looking changes that pass review, which is more dangerous.
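The coverage does not break the figures down per step, but the headline numbers are easy to relate under one simplifying assumption (not from the paper): that each delegated step independently preserves a fixed fraction of the document. The sketch below shows how tight the per-step error budget becomes under that assumption, and how a seemingly small per-step loss compounds toward the reported 25 percent over 20 steps.

```python
# Hedged illustration, NOT the study's methodology: assumes each of the
# 20 hand-offs independently preserves a fixed fraction of the document.

STEPS = 20        # consecutive interactions in the benchmark
THRESHOLD = 0.98  # the study's professional-reliability bar

def cumulative_fidelity(per_step: float, steps: int = STEPS) -> float:
    """Fraction of content surviving `steps` hand-offs."""
    return per_step ** steps

# Per-step fidelity needed to stay at or above 98% after 20 steps:
required = THRESHOLD ** (1 / STEPS)  # roughly a 0.1% loss budget per step

print(f"required per-step fidelity: {required:.5f}")   # ~0.99899
# A model losing just 1.5% per step ends up near the 25% headline loss:
print(f"after 20 steps at 98.5%: {cumulative_fidelity(0.985):.3f}")  # ~0.739
```

Under this (admittedly crude) independence assumption, the 98 percent bar allows only about one part per thousand of error per step, which helps explain why almost no model-domain combination clears it.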

Our assessment

This study provides an important reality check on current enthusiasm about AI agents. If even the best models lose or falsify a quarter of document content over 20 consecutive work steps, they are simply not reliable enough for most professional applications. The researchers' comparison is apt: an intern who corrupted a quarter of a document would be shown the door. The finding that agentic tools worsen rather than improve performance directly contradicts the marketing of many AI companies positioning their agents as 'autonomous workers.' For the AI anxiety debate, this has both a reassuring and a concerning dimension: reassuring that AI agents clearly cannot replace human knowledge workers anytime soon; concerning that companies may nonetheless push for automation while accepting quality losses that only become apparent later - especially since frontier models produce errors that look plausible.

Relevance for Germany

The study is relevant for Germany for several reasons. First, many German companies are currently piloting AI agents for document-intensive processes such as contract review, report generation, and compliance documentation; the DELEGATE-52 results counsel caution before automating critical workflows. Second, the study supports the position of works councils and unions demanding human oversight of AI processes: if even GPT-5.4 corrupts documents, qualified humans are needed for quality control. Third, the study refutes the narrative that AI agents can replace entire departments in the short term. German companies under pressure to follow US firms in cutting staff now have a fact-based counterargument: the technology is not mature enough for autonomous delegation in most professional domains.

Fact check

Core figures come from the DELEGATE-52 benchmark paper by Microsoft Research, consistently reported by multiple independent technology outlets. The 25 percent corruption rate refers to frontier models after 20 interaction steps. The average degradation of 50 percent covers all tested models. The tested models (GPT-5.4, 5.2, 5.1, 4.1 and Gemini 3.1 Pro) and 52 domains are consistently reported. The 6 percentage point deterioration with agentic tools is specifically documented for the GPT model family. Limitation: the benchmark tests 20 consecutive interactions - in practice, many workflows will be shorter. Additionally, tests used a standardized, basic agentic framework, not optimized enterprise products. Actual error rates in production systems with additional guardrails could be lower.

Source

  • The Register 11.05.2026: Microsoft researchers find AI models and agents can't handle long-running tasks (theregister.com/ai-ml/2026/05/11/)
  • WinBuzzer 13.05.2026: Microsoft Research Finds AI Agents Still Corrupt Work Documents (winbuzzer.com/2026/05/13/)
  • ResultSense 12.05.2026: Microsoft Research: frontier AI fails 25% on long workflows (resultsense.com/news/2026-05-12/)
  • NeuralWired 28.04.2026: AI Agent Document Corruption: 25% Rate Confirmed (neuralwired.com/2026/04/28/)
  • FlyingPenguin 05.2026: Microsoft on AI: Delegation Corrupts Data and You (flyingpenguin.com)
Tags: AI agents, study, AI capabilities, autonomy, companies, security