Business · GitHub Blog
GitHub noticed that Rubber Duck tends to help more with difficult problems
Key facts
- Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus
- The team noticed that Rubber Duck tends to help most with difficult problems: those that span 3+ files and would normally take 70+ steps
- On these problems, Sonnet + Rubber Duck scores 3.8% higher than the Sonnet baseline, and 4.8% higher on the hardest problems identified across three trials
- When you select a Claude model from the model picker as your orchestrator, Rubber Duck runs GPT-5.4
Summary
Today's coding agents follow a clear loop. When you ask a coding agent to build a data pipeline, it may not use the best structure: assumptions and inefficiencies become dependencies, and by the time you notice, you may have to fix more than the small mistake at the start. Self-reflection, having the agent review its own output before moving forward, is a proven technique, but to catch different kinds of errors, a different perspective matters. Today, in GitHub Copilot CLI, they're introducing Rubber Duck in experimental mode.