Benchmarks

Model benchmarks that actually matter

Not synthetic beauty-pageant nonsense. These are the runs, scoreboards, and drill-downs we use to compare models on real operator work.

The benchmark explainer page with the current leaderboard, benchmark canon, and graphics.

Read more →
Leaderboard

Where the models stack up

Updated 2026-04-16
| Rank | Model | Source | Operator | Messaging | External | Cost | Verdict |
| --- | --- | --- | --- | --- | --- | --- | --- |
| #1 | Claude Sonnet 4.6 | Messaging benchmark + external canon · Open detail → | 100/100 | 100/100 | 79.6 SWE-bench | $9.00/M blended | Strong all-rounder, needs full internal operator-suite run. |
| #2 | Gemini 3.1 Pro | Messaging benchmark + external canon · Open detail → | 100/100 | 100/100 | 80.6 SWE-bench · 1492 Arena · 91.9 GPQA | $7.00/M blended | Looks elite, still needs full operator-suite validation. |
| #3 | Gemini Flash | Messaging benchmark canon · Open detail → | 100/100 | 100/100 | Messaging benchmark only | $1.75/M blended | Useful cheap helper, not yet proven on the hard pack. |
| #4 | Claude Opus 4.6 | Operator Suite v2 · View canon → | 95.3/100 | 100/100 | #1 SWE-bench · #1 Arena · #1 HLE | $15/M blended | Best raw benchmark performer overall. |
| #5 | GLM-5-Turbo | Operator Suite v2 · View canon → | 95/100 | 100/100 | 77.8 SWE-bench · 1454 Arena | $2.60/M blended | Almost-Opus quality without the wallet mugging. |
| #6 | Gemma 4 31B | Enterprise Ollama · Open detail → | 92/100 | Quick execution pack | Fast local quality leader in current Gemma run | Enterprise local | Best local quality of the Gemma pair, but significantly slower. |
| #7 | MiniMax M2.7 | Operator Suite v2 · Open detail → | 90.8/100 | 100/100 | 80.2 SWE-bench Verified | $0.75/M blended | Wildly cheap, but unsafe around guardrails. |
| #8 | GPT-5.4 | External benchmark canon · Open detail → | 88/100 | Provider path unsupported | 75.1 Terminal-Bench · 57.7 SWE-bench Pro · 1463 Arena | $8.75/M blended | Looks strongest for coding/execution, still under-benchmarked internally. |
| #9 | Qwen 3.5 Opus Distill | Enterprise Local · Open report → | 81.7/100 | Single-model detailed run | Needs more canon-side comparison runs | TBD | Interesting enough to earn its own drill-down page already. |
| #10 | Gemma 4 26B | Enterprise Ollama · Open detail → | 80/100 | Quick execution pack | Best speed-quality tradeoff in current Gemma run | Enterprise local | Best operational default for local routing: much faster while still competent. |
| #11 | PrismML Bonsai 1.7B | PrismML local benchmark · Open detail → | 56/100 | Quick execution pack | Prism ternary local model · 1.7B GGUF CPU run | Local / experimental | Tiny local model, useful for light work, not a serious operator default. |
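The cost column quotes a single blended $/M figure. As a rough sketch of how such a number can be derived, assuming a simple weighted average of input and output token prices (the 3:1 input:output ratio and the prices below are illustrative, not this site's published formula):

```python
# Hypothetical blended-price helper. "Blended" is assumed to mean a
# traffic-weighted average of input and output $/M-token rates; the
# default 75% input share is an invented example ratio.
def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.75) -> float:
    """Weighted $/M-token price across input and output traffic."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

# e.g. a model priced at $3/M input and $15/M output:
print(round(blended_price(3.0, 15.0), 2))  # 6.0
```

Changing `input_share` shifts the blend toward whichever direction dominates your workload, which is why two sites can quote different blended prices for the same model.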
Runs and artifacts

Open a benchmark, then drill into the model

PrismML local benchmark
2026-04-16

PrismML Bonsai 1.7B

56/100 · avg 8.8s · CPU quick pack

Official PrismML Bonsai demo run through the same quick operator pack. Fine for lightweight local use, but it missed the routing task badly and is not an OpenClaw default.

Generated benchmark pack · Open Bonsai detail →
Full benchmark canon
2026-03-18

Operator Suite v2

Opus 95.3 · GLM-5-Turbo 95.0 · MiniMax 90.8

The serious one. Routing, recovery, config safety, delegation, and proof under pressure.

Leaderboard graphic · Open benchmark context →
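The suite description lists five tracks. Purely as an illustration of how per-track results could roll up into the single 0-100 operator scores on the leaderboard (the weights and example scores below are invented for the sketch, not the suite's actual rubric):

```python
# Invented track weights for illustration; the real Operator Suite v2
# rubric is not published in this document.
TRACK_WEIGHTS = {
    "routing": 0.25,
    "recovery": 0.25,
    "config_safety": 0.20,
    "delegation": 0.15,
    "proof_under_pressure": 0.15,
}

def overall(track_scores: dict[str, float]) -> float:
    """Weighted mean of per-track scores, each on a 0-100 scale."""
    return sum(TRACK_WEIGHTS[track] * score
               for track, score in track_scores.items())

# Made-up example run:
example = {"routing": 90, "recovery": 88, "config_safety": 92,
           "delegation": 85, "proof_under_pressure": 87}
print(round(overall(example), 1))
```

The weights summing to 1.0 keeps the composite on the same 0-100 scale as the individual tracks.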
Cross-model routing benchmark
2026-03-14

Messaging Tool Planning v2

12-way tie at 100/100

Shows the crowded top tier on lighter messaging and routing tasks before the harder operator tests separated them.

Infographic · Open benchmark context →
First benchmark cut
2026-03-13

Cron Reliability v1

Hunter 94 · Healer 90 · Open 87

The first useful benchmark image. Good signal, but still too soft compared with the later operator suite.

Infographic · Read the story →
Enterprise Local
2026-04-02

Qwen 3.5 Opus Distill

81.7 overall · detailed track report

Fresh detailed report page for the Qwen 3.5 Opus-distilled run. Good enough to publish as the first clickable model drill-down.

Detailed report · Open detail page →
Enterprise Ollama
2026-04-12

Gemma 4 26B vs 31B

31B 92/100 · 26B 80/100

Quick operator-oriented local benchmark comparing Gemma 4 26B and 31B on Enterprise. 31B wins on quality, 26B wins hard on speed.

Generated benchmark pack · Open Gemma detail →
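The verdict above implies a simple local-routing rule: default to the faster 26B, escalate to 31B only when quality matters more than latency. A minimal sketch, assuming hypothetical Ollama-style model tags:

```python
# Illustrative routing rule based on the Gemma 4 comparison: 26B as the
# fast default, 31B for quality-critical work. The model tags are
# hypothetical Ollama-style names, not confirmed registry entries.
def pick_gemma(quality_critical: bool) -> str:
    """Return the local Gemma tag to route a request to."""
    return "gemma4:31b" if quality_critical else "gemma4:26b"

print(pick_gemma(False))  # gemma4:26b
```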
Backfilled benchmark registry
2026-03-18

Recovered benchmark canon models

Opus 95.3 · GLM 95.0 · MiniMax 90.8 · Gemini/GPT-5.4/Sonnet backfilled from reference notes

Benchmark model entries recovered from existing reference docs, the published benchmark article, and archived report artifacts, so the leaderboard reflects the broader canon instead of only the most recent Gemma run.

Registry backfill · Open benchmark context →