# Recommendations
Date: 2026-03-15
Scope: Consolidated benchmark results for model success rate, runtime, and cost.
This page summarizes data consolidated from the individual benchmark runs.
## At a Glance
- Top performers: opus-4.6 and sonnet-4.6
- Most expensive: opus-4.6
- Cheapest: deepseek-r1-reasoner and deepseek-v3.2-chat
- Fastest: gpt-5.3-codex
- Slowest: deepseek-r1-reasoner
## Rankings
| Benchmark Place | Model | Price Tier (cheapest / most expensive) | Speed Tier (fastest / slowest) |
|---|---|---|---|
| 1st | opus-4.6 | Most Expensive | Average |
| 1st | sonnet-4.6 | Expensive | Average |
| 2nd | deepseek-r1-reasoner | Cheapest | Slowest |
| 2nd | gemini-3.1-pro-preview | Average | Average |
| 3rd | deepseek-v3.2-chat | Cheapest | Slow |
| 3rd | gpt-5.4 | Average | Average |
| 3rd | haiku-4.5 | Cheap | Fast |
| 4th | qwen-next-80B-instruct | Cheap | Fast |
| 5th | qwen-next-80B-thinking | Cheap | Average |
| 6th | gpt-5.3-codex | Cheap | Fastest |
## Notes
### Context Requirements
gpt-5.3-codex and qwen-next-80B-thinking tended to ask for additional context instead of proceeding with the investigation. This reduced their benchmark performance.
### Instruction Handling
opus-4.6 occasionally ignored explicit instructions when it believed another approach would be better. For example, it sometimes pulled unrelated runbooks despite being told not to.
### Literal Interpretation
sonnet-4.6 sometimes followed instructions too literally. For example, when told to "look at all logs", it would report that it had looked at the logs but not explain what it found.
## Benchmark Methodology
| Category | Rule | Explanation |
|---|---|---|
| Benchmark Place | Based on success rate ranking | Models are ranked by number of successful tests. Equal scores share the same placement. For example, two models with 16/16 both receive 1st place. |
| Price Tier | Cheap: ≤ $0.06 | Models with an average cost per run of $0.06 or less. |
| Price Tier | Average: $0.07 – $0.15 | Models with an average cost per run between $0.07 and $0.15. |
| Price Tier | Expensive: ≥ $0.16 | Models with an average cost per run of $0.16 or more. |
| Speed Tier | Fast: < 30 seconds | Models with an average runtime under 30 seconds. |
| Speed Tier | Average: 30 – 60 seconds | Models with an average runtime between 30 and 60 seconds. |
| Speed Tier | Slow: > 60 seconds | Models with an average runtime over 60 seconds. |
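The tier boundaries and the shared-placement rule above can be expressed directly in code. The following is a minimal sketch, assuming hypothetical per-model averages (`successes`, `avg_cost`, `avg_runtime`); it illustrates the rules in this table rather than the harness's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    successes: int      # number of successful tests
    avg_cost: float     # average cost per run, in USD
    avg_runtime: float  # average runtime per run, in seconds


def price_tier(avg_cost: float) -> str:
    # Cheap: <= $0.06, Average: $0.07 - $0.15, Expensive: >= $0.16
    if avg_cost <= 0.06:
        return "Cheap"
    if avg_cost <= 0.15:
        return "Average"
    return "Expensive"


def speed_tier(avg_runtime: float) -> str:
    # Fast: < 30 s, Average: 30 - 60 s, Slow: > 60 s
    if avg_runtime < 30:
        return "Fast"
    if avg_runtime <= 60:
        return "Average"
    return "Slow"


def placements(results: list[ModelResult]) -> dict[str, int]:
    # Dense ranking: equal success counts share a placement, so two 16/16
    # models are both 1st and the next distinct score is 2nd.
    ordered = sorted(results, key=lambda r: r.successes, reverse=True)
    places: dict[str, int] = {}
    place = 0
    last_score = None
    for r in ordered:
        if r.successes != last_score:
            place += 1
            last_score = r.successes
        places[r.name] = place
    return places
```

For example, `price_tier(0.05)` returns `Cheap` and `speed_tier(45)` returns `Average`, matching the boundaries in the table.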
### Special Labels Used
| Label | Meaning |
|---|---|
| Cheapest | Model(s) with the lowest average cost across the benchmark. |
| Most Expensive | Model with the highest average cost across the benchmark. |
| Fastest | Model with the lowest average runtime. |
| Slowest | Model with the highest average runtime. |
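Continuing the sketch above (and reusing the hypothetical `ModelResult` records), these labels are simply the extremes of the same per-model averages; this is an illustration of the definitions, not the benchmark's actual code.

```python
def special_labels(results: list[ModelResult]) -> dict[str, list[str]]:
    # Cheapest / Most Expensive / Fastest / Slowest are the minima and maxima
    # of the per-model averages; ties (e.g. two cheapest models) are all listed.
    lowest_cost = min(r.avg_cost for r in results)
    highest_cost = max(r.avg_cost for r in results)
    lowest_runtime = min(r.avg_runtime for r in results)
    highest_runtime = max(r.avg_runtime for r in results)
    return {
        "Cheapest": [r.name for r in results if r.avg_cost == lowest_cost],
        "Most Expensive": [r.name for r in results if r.avg_cost == highest_cost],
        "Fastest": [r.name for r in results if r.avg_runtime == lowest_runtime],
        "Slowest": [r.name for r in results if r.avg_runtime == highest_runtime],
    }
```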