Your Agent Isn't 'Almost There.' ClawBench Just Proved It.

GPT-5.4 scores 6.5% and Claude Sonnet 4.6 scores 33.3% on real-world write-heavy web tasks — while they ace the benchmarks on your vendor's slide deck. Here's what engineering leaders should actually take from ClawBench.
Your Agentic AI Roadmap Just Ran Into A Wall Called ClawBench
Your vendor's demo booked a flight. Your vendor's demo filed an expense report. Your vendor's demo cleared a sprint ticket. Your vendor's demo is also, almost certainly, running on a sandboxed snapshot of a page from 2023. At Kuaray, here's our take: the new ClawBench paper is the most honest thing anyone has published about agents in twelve months, and the numbers should make every CTO with an "agentic transformation" line item in Q3 pick up the phone.
The Numbers That Should End A Few Meetings
ClawBench is 153 write-heavy tasks across 144 live production platforms: the kind of web you and your customers actually use, not a frozen replica sitting in a Docker container. Tasks include filling detailed forms, running multi-step workflows, lifting data out of user-provided documents, and committing consequential actions. You know. Work.
Here's the scoreboard:
| Model | Traditional web benchmark | ClawBench |
|---|---|---|
| Claude Sonnet 4.6 | ~65–75% | 33.3% |
| GPT-5.4 | ~65–75% | 6.5% |
Read that second row again. The frontier model that your PM wants to put in front of your customers fails more than 93 of every 100 real web tasks it attempts. Not obscure tasks. Everyday ones.
Why The Old Benchmarks Have Been Lying To You
WebArena, Mind2Web, and the rest of the usual suspects all share a dirty secret: the web is frozen. Static HTML snapshots. No rate limits. No CAPTCHAs. No A/B test that just flipped the checkout button to the other side of the page. No 302 redirects to a new SSO provider. No "we're performing maintenance" banner eating half the DOM.
Real websites are a moving adversary. Benchmarks built on snapshots measure reasoning about a museum exhibit, not performance in a fight. ClawBench forces the agent into the fight, and the fight is winning.
Three structural reasons the scores collapse:
- Write-heavy means consequential. Fill a 40-field form wrong and you don't get a retry; you get a charge, a ticket, a policy violation. Agents trained on "read-to-pass" tasks panic when a single wrong click ships.
- Multi-platform means multi-auth. Your agent has to log into a different IdP every seven minutes. One MFA prompt and the whole trajectory dies.
- Documents are adversarial. Half the tasks require pulling a detail out of a user-supplied PDF and transcribing it into a form that validates server-side. Most models still can't reliably tell a date from a due date.
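The collapse in the scores above has simple arithmetic behind it. Here is a back-of-the-envelope model (our illustration, not a calculation from the ClawBench paper): if a task chains n dependent steps and each step succeeds with probability p, the whole trajectory only succeeds with probability p to the n.

```python
# Hedged illustration: per-step reliability compounds viciously across
# a multi-step, multi-auth, write-heavy trajectory. The numbers here
# are ours, chosen to show the shape of the problem.

def trajectory_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps independent steps succeed in sequence."""
    return p_step ** n_steps

# A 95%-reliable step looks great in isolation...
print(round(trajectory_success(0.95, 1), 3))   # 0.95
# ...but string 20 of them together and you land near ClawBench territory.
print(round(trajectory_success(0.95, 20), 3))  # 0.358
```

A step that passes 95% of unit-level evals still sinks two out of three twenty-step workflows, which is why read-heavy benchmark scores tell you almost nothing about write-heavy production behavior.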
What Engineering Leaders Should Actually Do Monday
Stop arguing about which model is "ready." None of them are, not for the jobs the sales deck promises. Start arguing about where the agent is allowed to fail and how much it costs when it does.
- Kill the open-ended "autonomous agent" pilots. Any pilot that lets the model pick its own sequence of writes on production SaaS is a liability event waiting for a calendar invite. Scope the work. Fence the tools. Whitelist the endpoints.
- Score your use cases on the ClawBench axes. If your workflow is write-heavy, multi-platform, and dependent on documents you don't control, assume a sub-35% success rate even with the best model in the room. Budget for a human-in-the-loop or don't ship.
- Buy evals before you buy agents. Every agent vendor is selling you a capability they can't measure on your traffic. Insist on running ClawBench-style probes against your own site and your own workflows before signing. If the vendor can't stand up to the probe, the contract isn't ready.
- Design for the 67%. The scariest number in ClawBench isn't the 6.5%. It's the 33.3%. That means two out of three times, even Claude Sonnet 4.6 fails, and it tends to fail while looking confident to the end user. Your UX needs to surface uncertainty, not hide it behind a spinner and a smile.
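The fencing pattern in the list above can be sketched in a few lines. This is a minimal illustration under our own assumptions (the endpoint names, threshold, and `ProposedWrite` shape are all hypothetical, not from any vendor's API): every write the agent proposes is checked against an endpoint allowlist, and consequential writes below a confidence floor are escalated to a human instead of executed.

```python
# Hypothetical sketch of "scope the work, fence the tools, whitelist the
# endpoints, budget for a human-in-the-loop". Names and values are ours.

from dataclasses import dataclass, field

ALLOWED_ENDPOINTS = {"crm.example.com/api/notes"}   # hypothetical allowlist
CONFIDENCE_FLOOR = 0.9                              # tune per workflow

@dataclass
class ProposedWrite:
    endpoint: str
    payload: dict = field(default_factory=dict)
    confidence: float = 0.0   # the agent's self-reported confidence

def gate(write: ProposedWrite) -> str:
    """Return 'execute', 'escalate', or 'reject' for a proposed write."""
    if write.endpoint not in ALLOWED_ENDPOINTS:
        return "reject"       # fenced: never touch an unlisted endpoint
    if write.confidence < CONFIDENCE_FLOOR:
        return "escalate"     # human-in-the-loop for shaky writes
    return "execute"

print(gate(ProposedWrite("billing.example.com/charge", confidence=0.99)))  # reject
print(gate(ProposedWrite("crm.example.com/api/notes", confidence=0.5)))    # escalate
print(gate(ProposedWrite("crm.example.com/api/notes", confidence=0.95)))   # execute
```

The point of the sketch is the ordering: the allowlist check comes before the confidence check, so no amount of model confidence can reach an endpoint you never fenced in.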
None of this means agents are a dead end. It means the deployment surface is much narrower than the marketing surface, and the teams that accept that now will spend 2026 building things that actually hold. The teams that don't will spend 2026 explaining to legal why a bot submitted the wrong W-9.
Schedule a Technical Architecture Review with our Strategists: we help engineering teams design agent systems that ship value without betting the business on a 33% task success rate.
Enlightenment Insight
In Guarani cosmology, Kuaray (Sun) does not pretend to be the moon when clouds arrive; it does not claim to shine when the sky has closed over the valley. It is powerful because it is honest about where its light reaches and where it does not, and the people beneath it plan their harvest, their journey, their shelter around that honesty. The ClawBench results are a gift of the same kind. An agent that cannot tell you what it cannot do is not intelligence; it is weather you cannot read. The engineering leaders who thrive in this next era will be those who measure the shadow as carefully as the light, who build systems whose boundaries are as legible as their capabilities. At Kuaray, we believe the honest agent, like Kuaray (Sun) itself, earns trust not by claiming to illuminate everything, but by illuminating truthfully what it can.