EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
Compiled by KHAO Editorial — aggregated from 2 sources. See llms.txt for citation guidance.
Voice agent failures are often highly domain-specific.
Key facts
- Every scenario was validated for solvability against three frontier models (OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6), ensuring the benchmark is both challenging and fair
- As a final pass, the same three models were run on a text-only version of each scenario, bypassing the audio pipeline
- Together, the three domains span 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage over the original release
- With this release, EVA-Bench expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery
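The text-only solvability pass described in the key facts can be pictured with a minimal Python sketch. Everything named here is hypothetical and not taken from the EVA-Bench release: the `Scenario` fields, the `Solver` callable, and in particular the pass criterion (a scenario counts as solvable if at least `min_passes` models reach the expected final state) are assumptions made for illustration, since the sources do not state how solvability was scored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    """Hypothetical scenario record; field names are illustrative only."""
    scenario_id: str
    domain: str
    expected_final_state: Dict[str, str]


# Assumption: a solver runs one model on the text-only scenario and
# returns the final state it reached (no audio pipeline involved).
Solver = Callable[[Scenario], Dict[str, str]]


def solvability_report(
    scenarios: List[Scenario], solvers: Dict[str, Solver]
) -> Dict[str, List[str]]:
    """Map each scenario id to the names of the models that solved it."""
    report: Dict[str, List[str]] = {}
    for scenario in scenarios:
        report[scenario.scenario_id] = [
            name
            for name, solve in solvers.items()
            if solve(scenario) == scenario.expected_final_state
        ]
    return report


def unsolved(report: Dict[str, List[str]], min_passes: int = 1) -> List[str]:
    """Return scenario ids solved by fewer than `min_passes` models.

    The threshold is an assumed criterion, not the benchmark's own rule.
    """
    return [sid for sid, passed in report.items() if len(passed) < min_passes]
```

Under these assumptions, any scenario returned by `unsolved` would be a candidate for revision before it enters the benchmark, which is one way to read "challenging and fair": hard for a voice agent, but demonstrably completable when audio is taken out of the loop.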
Summary
EVA-Bench is built for multiple audiences. If you're evaluating a voice agent, you can run it against a diverse set of realistic enterprise scenarios spanning 35+ distinct workflows. Five principles guided the design of the EVA-Bench datasets across all three domains. Not every enterprise workflow belongs in a voice benchmark. Tool schemas were modeled after the kinds of APIs a production platform uses.