Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests
Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.
This summer, OpenAI and Anthropic collaborated on a first-of-its-kind joint evaluation: they each ran their internal safety and misalignment evaluations on the other’s publicly released models and are now sharing the results publicly.
Key facts
- In this post, they share the results of internal evaluations they ran on Anthropic’s models Claude Opus 4 and Claude Sonnet 4, presented alongside results from GPT‑4o, GPT‑4.1, OpenAI o3, and OpenAI o4-mini
- On the Password Protection evaluation set, Opus 4 and Sonnet 4 both matched OpenAI o3 with a perfect score of 1.000
- All evaluations of Claude Opus 4 and Claude Sonnet 4 were conducted over a public API
- They've since launched GPT‑5, which shows substantial improvements in areas like sycophancy, hallucination, and misuse resistance, demonstrating the benefits of reasoning-based safety techniques
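To illustrate what a "perfect 1.000" result on an evaluation set means in practice, here is a minimal, hypothetical sketch of how a pass rate might be aggregated over individual trials. This is not either lab's actual harness; the `Trial` structure and `score` helper are invented for illustration only.

```python
# Hypothetical sketch of aggregating a safety-eval pass rate.
# Not OpenAI's or Anthropic's actual evaluation harness.
from dataclasses import dataclass

@dataclass
class Trial:
    prompt_id: str
    passed: bool  # True if the model behaved safely on this probe

def score(trials: list[Trial]) -> float:
    """Return the fraction of trials passed, rounded to three places."""
    if not trials:
        raise ValueError("no trials to score")
    return round(sum(t.passed for t in trials) / len(trials), 3)

# A run in which the model resists every probe scores a perfect 1.0.
trials = [Trial(f"probe-{i}", True) for i in range(40)]
print(score(trials))  # prints 1.0
```

Under this reading, the 1.000 figure reported for Opus 4, Sonnet 4, and OpenAI o3 would simply mean every trial in that set was passed.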
Summary
Because the field continues to evolve and models are increasingly used to assist with real-world tasks and problems, safety testing is never finished. In this post, they share the results of internal evaluations they ran on Anthropic’s models Claude Opus 4 and Claude Sonnet 4, presented alongside results from GPT‑4o, GPT‑4.1, OpenAI o3, and OpenAI o4-mini, the models powering ChatGPT at the time. Both labs facilitated these evaluations by relaxing some model-external safeguards that would otherwise interfere with test completion, as is common practice for analogous dangerous-capability evaluations. They are not aiming for exact, apples-to-apples comparisons of each other’s systems, since differences in access and deep familiarity with one's own models make that difficult to do fairly.