Claude Opus 4.8: Reaching New Heights in AI Performance and Safety
Just six weeks after releasing its predecessor, Anthropic has unveiled Claude Opus 4.8, the latest iteration of its flagship artificial intelligence model. The rapid update brings substantial improvements in coding capability, safety, and overall performance while keeping the previous pricing structure.
“The release of Opus 4.8 underscores Anthropic’s commitment to swift innovation, delivering a more powerful tool to developers and enterprises without an increased cost burden. It’s a significant move in the competitive AI race,” notes a leading AI analyst.
The new model demonstrates notable progress across various benchmarks, solidifying its position at the forefront of large language models (LLMs).
Key Performance Benchmarks
- On SWE-bench Pro, which measures an AI’s ability to solve complex, multi-language software engineering problems, Opus 4.8 achieved 69.2%, an improvement from 64.3% for Opus 4.7. This outperforms OpenAI’s GPT-5.5 (58.6%) and Google’s Gemini 3.1 Pro (54.2%).
- In “Humanity’s Last Exam,” assessing expert-level questions across academic disciplines, Opus 4.8 scored 49.8% without tools and 57.9% with them, ahead of all three rivals.
- On OSWorld-Verified, which tests real-world computer use tasks, Opus 4.8 reached 83.4%, narrowly ahead of Opus 4.7’s 82.8%.
- The only area where Opus 4.8 didn’t take the top spot was Terminal-Bench 2.1 for command-line tasks, where GPT-5.5 leads at 78.2%, with Opus 4.8 scoring 74.6%.
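To make the size of these gaps easier to compare, the short sketch below computes percentage-point margins from the scores quoted in the list above. The dictionary layout and the `margin` helper are illustrative assumptions, and only figures that appear in this article are included.

```python
# Scores (percent) exactly as quoted in the benchmark list above.
# Models not mentioned for a given benchmark are simply left out.
scores = {
    "SWE-bench Pro": {
        "Opus 4.8": 69.2, "Opus 4.7": 64.3,
        "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2,
    },
    "OSWorld-Verified": {"Opus 4.8": 83.4, "Opus 4.7": 82.8},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "GPT-5.5": 78.2},
}


def margin(benchmark: str, other: str, baseline: str = "Opus 4.8") -> float:
    """Percentage-point gap: baseline score minus the other model's score."""
    row = scores[benchmark]
    return round(row[baseline] - row[other], 1)


for bench, row in scores.items():
    for model in row:
        if model != "Opus 4.8":
            print(f"{bench}: Opus 4.8 vs {model}: {margin(bench, model):+.1f} points")
```

A negative margin, as on Terminal-Bench 2.1, means the other model scored higher.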
Effort Control and Dynamic Workflows
Anthropic has introduced a new “effort control” feature, allowing users to dictate how deeply the model “thinks” about a problem. Options include “High” (the default), “Extra” (for harder problems), and “Max” (for the deepest compute), alongside “Low” and “Medium” for token savings. This flexibility allows for optimization of both accuracy and cost.
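As a rough illustration of how a team might reason about these levels, the sketch below models them as an ordered enumeration with a simple selection rule. The article does not say how effort is set in the API or in any product, so the `Effort` class, the `pick_effort` function, and the selection policy are all hypothetical; only the level names and the "High" default come from the text above.

```python
from enum import IntEnum


class Effort(IntEnum):
    """Effort levels named in the article, ordered from least to most compute.

    The numeric ordering is an assumption made for illustration.
    """
    LOW = 1
    MEDIUM = 2
    HIGH = 3    # described as the default
    EXTRA = 4   # described as suited to harder problems
    MAX = 5     # described as the deepest compute


def pick_effort(hard_problem: bool = False, save_tokens: bool = False) -> Effort:
    """Hypothetical policy: escalate for hard problems, step down to save tokens."""
    if hard_problem:
        return Effort.EXTRA
    if save_tokens:
        return Effort.MEDIUM
    return Effort.HIGH


print(pick_effort().name)                   # default case
print(pick_effort(hard_problem=True).name)  # harder problem
print(pick_effort(save_tokens=True).name)   # cost-sensitive case
```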
Additionally, dynamic workflows are now shipping in Claude Code as a research preview. The feature lets Claude write its own orchestration scripts, spin up parallel subagents, verify their outputs, and report back. This expands automation options for enterprise users, though it consumes significantly more tokens.
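The fan-out, verify, and report pattern described here can be sketched in a few lines. This is not Claude Code's implementation, which the article does not detail; `run_subagent`, `verify`, and `orchestrate` are hypothetical stand-ins, and the "subagent" is a local stub rather than a model call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> dict:
    """Stand-in for a subagent. A real one would call a model; this stub
    just upper-cases the task text so the example runs offline."""
    return {"task": task, "output": task.upper()}


def verify(result: dict) -> bool:
    """Placeholder check: accept any result with non-empty output."""
    return bool(result["output"])


def orchestrate(tasks: list[str], max_workers: int = 4) -> dict:
    """Run the tasks in parallel, verify each result, and build a report."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subagent, tasks))  # preserves input order
    return {
        "verified": [r for r in results if verify(r)],
        "failed": [r["task"] for r in results if not verify(r)],
    }


report = orchestrate(["lint", "", "run tests"])
print(len(report["verified"]), "verified;", len(report["failed"]), "failed")
```

Each parallel subagent in a real workflow would be its own model session, which is one plausible reason the feature uses more tokens.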
AI Safety and Alignment
Anthropic’s alignment team highlights that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” Deception rates and misuse-cooperation rates came in substantially lower than those of Opus 4.7 and comparable to those of Claude Mythos Preview, Anthropic’s most locked-down model.
Opus 4.8 is also four times less likely than 4.7 to let bugs in its own code slide past without flagging them. These safety enhancements are crucial for AI adoption in regulated industries and legal work, where reliability is paramount.
“The integration of enhanced safety protocols and reduced deception rates in Opus 4.8 sets a new standard for responsible AI development, particularly for applications in sensitive sectors,” comments an AI ethics expert.
Pricing and Competitive Landscape
Despite the substantial improvements, Anthropic is keeping Opus 4.8 at $5 per million input tokens and $25 per million output tokens. Fast mode costs $10 per million input tokens and $50 per million output tokens, which Anthropic states is three times cheaper than its previous fast mode.
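For a sense of what these rates mean per request, the sketch below turns the quoted prices into a cost estimate. The `PRICES` table and `request_cost` function are illustrative names, and the token counts in the example are made up for demonstration.

```python
# USD per million tokens, (input, output), as quoted in the article.
PRICES = {
    "standard": (5.0, 25.0),
    "fast": (10.0, 50.0),
}


def request_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Estimated cost in USD for one request at the quoted per-million rates."""
    in_price, out_price = PRICES[mode]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Example with assumed token counts: 200k input, 50k output.
print(request_cost(200_000, 50_000))          # standard mode
print(request_cost(200_000, 50_000, "fast"))  # fast mode
```

At these rates, fast mode costs exactly twice as much as standard mode for the same token counts.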
However, Opus remains expensive next to Chinese competitors such as DeepSeek V4 Pro and Xiaomi MiMo V2.5 Pro, which charge far less (for example, $0.435 per million input tokens). The price difference can be as much as 57 times per output token.
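The arithmetic behind the multiplier can be checked directly from the quoted figures. The article gives only one competitor price ($0.435 per million, described as an input price) and does not say exactly which prices were divided to get 57, so the pairing below is an assumption: dividing Opus's $25 output price by $0.435 gives roughly 57, while dividing its $5 input price by the same figure gives roughly 11.

```python
OPUS_INPUT = 5.0          # USD per million input tokens (from the article)
OPUS_OUTPUT = 25.0        # USD per million output tokens (from the article)
COMPETITOR_PRICE = 0.435  # USD per million tokens, the only competitor figure quoted


def price_ratio(opus_price: float, other_price: float) -> float:
    """How many times more expensive the Opus price is, rounded to one decimal."""
    return round(opus_price / other_price, 1)


print(price_ratio(OPUS_OUTPUT, COMPETITOR_PRICE))  # close to the article's "57 times"
print(price_ratio(OPUS_INPUT, COMPETITOR_PRICE))   # like-for-like on the input side
```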
Nevertheless, Anthropic justifies its pricing with superior quality and safety, which are critical in production environments where the risk of a model quietly cooperating with bad inputs is unacceptable.
Real-world Coding Test: Zombie Games
To assess real-world coding prowess, Opus 4.8 was asked to build a 3D zombie game alongside GPT-5.5 and DeepSeek V4 Pro. GPT-5.5 finished fastest, but its game was incomplete. DeepSeek V4 Pro was second fastest and delivered a complete game with solid mechanics.
Opus 4.8 took roughly three times as long as GPT-5.5 but delivered the best splash screen, zombie designs, game mechanics, and sound effects. While it was the slowest, the output quality was superior. Yet, given the cost gap, this might not be enough to justify its use over DeepSeek for certain tasks.
Frequently Asked Questions (FAQ)
- What’s new in Claude Opus 4.8?
  Claude Opus 4.8 features enhanced coding capabilities, improved safety (reduced deception and better bug detection), new effort control features, dynamic workflows, and superior performance across various benchmarks.
- Has the price of Claude Opus 4.8 changed?
  No, the pricing for Claude Opus 4.8 remains the same at $5 per million input tokens and $25 per million output tokens. However, its fast mode is now three times cheaper than previous versions.
- How does Claude Opus 4.8 compare to competitors like GPT-5.5?
  Claude Opus 4.8 outperforms GPT-5.5 and Google Gemini 3.1 Pro in key benchmarks like SWE-bench Pro and “Humanity’s Last Exam.” While it may be slower in some tasks, its output quality is often higher, especially in complex coding scenarios.
