According to independent benchmarks from Artificial Analysis, Opus 4.6 has defeated GPT-5.2 in the GDPval-AA (General Domain Performance - Agent Analysis) assessment by a massive margin.
Here is the deep dive into the model that just made your current AI stack look obsolete.
1. The New King of Knowledge Work
The most critical benchmark for enterprise is not "creative writing"—it's GDPval-AA. This measures a model's ability to perform actual white-collar tasks in Finance, Law, and STEM.
The results are shocking:
- vs GPT-5.2: +144 Elo
- vs Opus 4.5: +190 Elo
In layman's terms: Opus 4.6 beats GPT-5.2 in 7 out of 10 complex tasks.
The "Agentic" Trifecta
Opus 4.6 swept the board in three agent-critical categories:
- Agent Coding (Terminal-Bench 2.0): Highest score ever recorded.
- Multidisciplinary Reasoning (Humanity's Last Exam): Solved problems requiring broad domain knowledge.
- Agent Search (BrowseComp): 86.8% success rate when using "Agent Teams" (more on that below).
2. 1M Context Without "Context Rot"
We've had "long context" before, but it usually came with a catch: the longer the context, the dumber the model got ("Context Rot"). Sonnet 4.5, for example, only scored 18.5% on the grueling "MRCR v2 8-Needle" test.
Opus 4.6 scored 76%. That is a 4x improvement in retrieval accuracy over a million tokens. Anthropic claims it can now track details across hundreds of PDFs without "hallucinating" or losing the plot.
3. Product Updates: Agent Teams & The Office
Anthropic didn't just ship a model; they shipped an ecosystem.
Claude Code: Agent Teams
This is a game changer for developers. You can now spawn multiple sub-agents in parallel.
- Agent A: Writes the backend API.
- Agent B: Writes the React frontend.
- Agent C: Reviews the code for security flaws.
They coordinate autonomously. You just watch (or use
tmuxto intervene).
Claude in Excel & PowerPoint
The "Research Preview" is out.
- Excel: Can infer schema from messy CSVs and apply conditional formatting autonomously.
- PowerPoint: Reads your brand guidelines (font, color, logo) and generates a compliant deck from a data dump.
4. API Updates for Developers
For those of us building on Wisdom Gate, the API just got smarter:
- Adaptive Thinking: No more guessing. The model decides when to "think" (High Effort) and when to fast-track (Low Effort).
- Context Compaction (Beta): When your conversation hits the limit, Opus 4.6 automatically summarizes old context to free up space, keeping the "memory" alive without crashing the window.
- 128K Output: Finally. You can generate entire codebases or long-form reports in a single request.
5. Security & Pricing
Despite the power jump, Opus 4.6 is safer. It has the lowest "Over-Refusal" rate of any Claude model (meaning it won't lecture you when you ask it to do valid work).
Pricing:
- Base: $25 / 1M input tokens.
- Long Context (>200K): $37.50 / 1M input tokens.
Available Now on Wisdom Gate
We are rolling out Claude Opus 4.6 access immediately.
Update your OPENAI_BASE_URL to point to Wisdom Gate, and start building with the new King.
Model ID: claude-3-opus-20260229 (Alias: claude-opus-4.6)