Recommendations

Date: 2026-03-15
Scope: Consolidated benchmark results for model success rate, runtime, and cost.

This page summarizes data consolidated from multiple benchmark runs.

At a Glance

  • Top performers: opus-4.6 and sonnet-4.6
  • Most expensive: opus-4.6
  • Cheapest: deepseek-r1-reasoner and deepseek-v3.2-chat
  • Fastest: gpt-5.3-codex
  • Slowest: deepseek-r1-reasoner

Rankings

| Benchmark Place | Model | Price Tier | Speed Tier |
| --- | --- | --- | --- |
| 1st | opus-4.6 | Most Expensive | Average |
| 1st | sonnet-4.6 | Expensive | Average |
| 2nd | deepseek-r1-reasoner | Cheapest | Slowest |
| 2nd | gemini-3.1-pro-preview | Average | Average |
| 3rd | deepseek-v3.2-chat | Cheapest | Slow |
| 3rd | gpt-5.4 | Average | Average |
| 3rd | haiku-4.5 | Cheap | Fast |
| 4th | qwen-next-80B-instruct | Cheap | Fast |
| 5th | qwen-next-80B-thinking | Cheap | Average |
| 6th | gpt-5.3-codex | Cheap | Fastest |

Notes

Context Requirements

gpt-5.3-codex and qwen-next-80B-thinking tended to ask for additional context instead of proceeding with the investigation. This reduced their benchmark performance.

Instruction Handling

opus-4.6 occasionally ignored explicit instructions when it believed another approach would be better. For example, it sometimes pulled unrelated runbooks despite being told not to.

Literal Interpretation

sonnet-4.6 sometimes followed instructions too literally. For example, when told to "look at all logs", it would confirm that it had looked at the logs but report nothing about what it found.

Benchmark Methodology

| Category | Rule | Explanation |
| --- | --- | --- |
| Benchmark Place | Based on success-rate ranking | Models are ranked by number of successful tests. Equal scores share the same placement; for example, two models with 16/16 both receive 1st place. |
| Price Tier | Cheap: ≤ $0.06 | Models with an average cost per run of $0.06 or less. |
| | Average: $0.07 – $0.15 | Models with moderate cost per run. |
| | Expensive: ≥ $0.16 | Models with higher cost per run. |
| Speed Tier | Fast: < 30 seconds average runtime | Models that completed benchmark tasks quickly. |
| | Average: 30–60 seconds | Models with mid-range runtime. |
| | Slow: > 60 seconds | Models with slower average runtime. |
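
For concreteness, the placement and tier rules above can be expressed directly in code. The following is a minimal sketch, assuming a hypothetical `Result` record and illustrative example values; none of the field names or numbers come from the actual benchmark data.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    successes: int       # number of successful tests (e.g. out of 16)
    avg_cost_usd: float  # average cost per run, in dollars
    avg_runtime_s: float # average runtime, in seconds

def price_tier(cost: float) -> str:
    # Thresholds from the methodology table above (costs rounded to cents).
    if cost <= 0.06:
        return "Cheap"
    if cost <= 0.15:
        return "Average"
    return "Expensive"

def speed_tier(runtime: float) -> str:
    # Thresholds from the methodology table above.
    if runtime < 30:
        return "Fast"
    if runtime <= 60:
        return "Average"
    return "Slow"

def placements(results: list[Result]) -> dict[str, int]:
    # Dense ranking: equal success counts share a placement,
    # e.g. two models at 16/16 both take 1st.
    scores = sorted({r.successes for r in results}, reverse=True)
    place = {score: i + 1 for i, score in enumerate(scores)}
    return {r.model: place[r.successes] for r in results}

# Hypothetical example data, not the benchmark's actual numbers.
results = [
    Result("model-a", 16, 0.22, 45.0),
    Result("model-b", 16, 0.05, 75.0),
    Result("model-c", 14, 0.10, 25.0),
]
ranks = placements(results)
for r in results:
    print(f"{ranks[r.model]} {r.model}: "
          f"{price_tier(r.avg_cost_usd)}, {speed_tier(r.avg_runtime_s)}")
# -> 1 model-a: Expensive, Average
#    1 model-b: Cheap, Slow
#    2 model-c: Average, Fast
```

Note that dense ranking (rather than standard competition ranking) is what lets two 16/16 models share 1st place while the next-best score still takes 2nd.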

Special Labels Used

| Label | Meaning |
| --- | --- |
| Cheapest | Model(s) with the lowest average cost across the benchmark. |
| Most Expensive | Model with the highest average cost across the benchmark. |
| Fastest | Model with the lowest average runtime. |
| Slowest | Model with the highest average runtime. |