EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

Thu, Jun 4 · 12:24 PM UTC · 2 min read

Compiled by KHAO Editorial — aggregated from 2 sources. See llms.txt for citation guidance.


Voice agent failures are often highly domain-specific.

Key facts

Every scenario was validated for solvability against three frontier models (OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6), ensuring the benchmark is both challenging and fair
As a final pass, the benchmark's authors ran the same three models on a text-only version of each scenario, bypassing the audio pipeline
With this release, EVA-Bench expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery
Together, the three domains span 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage over the original release

Summary

EVA-Bench is built for multiple audiences. If you're evaluating a voice agent, you can run it against a diverse set of realistic enterprise scenarios spanning 35+ distinct workflows. Five principles guided the design of the EVA-Bench datasets across all three domains. The design reflects that not every enterprise workflow belongs in a voice benchmark, and tool schemas were modeled after the kinds of APIs a production platform uses.

#Anthropic #OpenAI #Gemini #Claude #Google #GPT