Large-capacity (>120B) and frontier closed-weight models show significant improvement
Qualitative analysis of cases where LLMs deviate from the preferred behavioral mode in high-consensus scenarios revealed several interesting patterns.
Key facts
- The plot aggregates results from all 25 evaluated models, with icons marking the positions of a subset of frontier models (Anthropic Claude 4 Sonnet, Google Gemini 3 Pro, OpenAI GPT 5.1, Mistral …)
- Even in low-consensus cases, where human opinion is significantly divided (50–60% agreement), model confidence remains high across all evaluated models (see the sketch after this list)
- Authors: Amir Taubenfeld (Research Engineer), Zorik Gekhman (Research Scientist), and Lior Nezry (Psychology Researcher), Google Research
- The figure below presents the results across 25 different LLMs and four distinct traits
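The second key fact points to a calibration gap: model confidence stays high even where human raters split 50–60%. Below is a minimal sketch of how such a comparison might be tabulated; the input format (per-scenario `human_agreement` and `model_confidence`) and the consensus bands are assumptions for illustration, not the authors' protocol.

```python
# Minimal sketch: bin scenarios by human consensus and report the mean
# model confidence in each band. Data format is assumed, not from the post.
from statistics import mean

def summarize_by_consensus(scenarios):
    """Group scenarios into consensus bands and report mean model confidence.

    `scenarios` is a list of dicts with keys:
      - "human_agreement": fraction of annotators choosing the majority option
      - "model_confidence": probability mass the model puts on its own choice
    """
    bands = {"low (0.5-0.6)": [], "mid (0.6-0.8)": [], "high (0.8-1.0)": []}
    for s in scenarios:
        a = s["human_agreement"]
        if a < 0.6:
            bands["low (0.5-0.6)"].append(s["model_confidence"])
        elif a < 0.8:
            bands["mid (0.6-0.8)"].append(s["model_confidence"])
        else:
            bands["high (0.8-1.0)"].append(s["model_confidence"])
    return {band: mean(vals) for band, vals in bands.items() if vals}

# Example: even where humans split 55/45, a model may report ~0.9 confidence.
demo = [
    {"human_agreement": 0.55, "model_confidence": 0.91},
    {"human_agreement": 0.72, "model_confidence": 0.88},
    {"human_agreement": 0.95, "model_confidence": 0.97},
]
print(summarize_by_consensus(demo))
```

Binning by human agreement makes the pattern visible at a glance: a well-calibrated model's confidence should fall as consensus falls, rather than staying uniformly high.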
Summary
The work comes from Amir Taubenfeld (Research Engineer), Zorik Gekhman (Research Scientist), and Lior Nezry (Psychology Researcher) at Google Research. As part of their ongoing exploration of model behavior and alignment, they introduce a systematic evaluation framework that transforms established psychological assessments into large-scale situational judgment tests for large language models. As LLMs integrate into our daily lives, understanding their behavior becomes essential. Behavioral dispositions are typically quantified via self-report questionnaires organized by trait (e.g., empathy, assertiveness), in which individuals rate their agreement with preference statements such as "The reporter is quick to express an opinion." Each instrument is grounded in peer-reviewed literature that establishes its psychometric validity and reliability through different strategies. The authors' objective is to build on such psychological questionnaires, but directly applying them to LLMs presents technical challenges, because LLM outputs are sensitive to prompt phrasing and distribution shifts.
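The prompt-sensitivity problem noted at the end of the summary is commonly handled by administering several paraphrases of each item and aggregating the ratings. The sketch below shows that generic mitigation, not the authors' framework: `query_model` is a hypothetical stand-in for a real LLM API call, and the item paraphrases and Likert instruction are illustrative.

```python
# Sketch of a common mitigation for prompt-phrasing sensitivity: score one
# questionnaire item over several paraphrases and average the Likert ratings.
from statistics import mean, stdev

LIKERT = ("Rate your agreement from 1 (strongly disagree) to 5 "
          "(strongly agree). Answer with a single digit.")

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

def score_item(paraphrases: list[str]) -> tuple[float, float]:
    """Score one item across paraphrases; return mean rating and spread."""
    ratings = []
    for statement in paraphrases:
        reply = query_model(f"{statement}\n{LIKERT}")
        digits = [c for c in reply if c in "12345"]
        if digits:  # keep only parseable answers
            ratings.append(int(digits[0]))
    if len(ratings) < 2:
        raise ValueError("too few parseable ratings to score this item")
    return mean(ratings), stdev(ratings)  # high stdev flags phrasing sensitivity

# Illustrative paraphrases of one assertiveness-style item:
item = [
    "I am quick to express an opinion.",
    "When I have an opinion, I voice it without hesitation.",
    "I tend to state my views as soon as they form.",
]
```

A large spread across paraphrases flags an item whose score is driven by wording rather than by a stable behavioral disposition, which is exactly the failure mode that motivates moving from self-report items to situational judgment tests.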