Agent Benchmarking - Search News

10d

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

The victory of GPT-5.5 aligns with recent third-party analysis suggesting that OpenAI's models are currently superior at ...

1mon

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing single-model systems from Anthropic and OpenAI by using more than 100 specialized AI ...

InfoWorld

Researchers reveal flaws in AI agent benchmarking

As agents using artificial intelligence have wormed their way into the mainstream for everything from customer service to fixing software code, it’s increasingly important to determine which are the ...

Why Most AI Agents Fail When It Matters

As organizations rush to deploy autonomous systems, success increasingly depends on governance, workflow design and ...

24d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and benchmark leakage.

Neowin

Microsoft reveals Windows Agent Arena to benchmark generative AI agents

The use of generative AI and large language models to automate and simplify tasks for people who work with PCs continues to grow. However, there's also a need to see how well AI can work to accomplish ...

Tech Times

Google Ad Manager Launches Ask Ad Manager, Gemini-Powered Publisher AI in Beta

Google Ad Manager AI agent Ask Ad Manager launches in beta this month, using Gemini and retrieval-augmented generation over ...

Accounting Today

Gusto releases AI agents for accounting firm development

Gusto's newest release is focused on the accounting practice management arena: six new AI agents designed specifically for ...
