Frontier 5: Claude, GPT, Gemini, DeepSeek (n=5)
Generated: 2026-03-14 20:45 UTC
Total Duration: 2h 39m 25s
Iterations: 5
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-v3.2-chat | 77 | 23 | 0 | 100 | 🟡 77% (77/100) |
| gemini-3.1-pro-preview | 70 | 30 | 0 | 100 | 🟡 70% (70/100) |
| gpt-5.4 | 79 | 21 | 0 | 100 | 🟡 79% (79/100) |
| opus-4.6 | 89 | 11 | 0 | 100 | 🟡 89% (89/100) |
| sonnet-4.6 | 89 | 11 | 0 | 100 | 🟡 89% (89/100) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| deepseek-v3.2-chat | 100 | $0.02 | $0.00 | $0.06 | $1.75 |
| gemini-3.1-pro-preview | 83 | $0.12 | $0.03 | $0.50 | $9.92 |
| gpt-5.4 | 100 | $0.11 | $0.02 | $0.30 | $11.19 |
| opus-4.6 | 99 | $0.37 | $0.12 | $1.51 | $36.85 |
| sonnet-4.6 | 100 | $0.21 | $0.07 | $0.65 | $21.14 |
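
The accuracy and cost tables can be read together as a rough cost per passing run. The snippet below is a minimal sketch of that arithmetic, not part of the benchmark tooling: the dictionaries are copied by hand from the two tables above, and the helper name `cost_per_pass` is hypothetical.

```python
# Pass counts from the accuracy table and total cost (USD) from the cost table.
passes = {
    "deepseek-v3.2-chat": 77,
    "gemini-3.1-pro-preview": 70,
    "gpt-5.4": 79,
    "opus-4.6": 89,
    "sonnet-4.6": 89,
}
total_cost_usd = {
    "deepseek-v3.2-chat": 1.75,
    "gemini-3.1-pro-preview": 9.92,
    "gpt-5.4": 11.19,
    "opus-4.6": 36.85,
    "sonnet-4.6": 21.14,
}


def cost_per_pass(model: str) -> float:
    """Total reported cost divided by the number of passing runs (hypothetical helper)."""
    return total_cost_usd[model] / passes[model]


if __name__ == "__main__":
    # Print models from cheapest to most expensive per passing run.
    for model in sorted(passes, key=cost_per_pass):
        print(f"{model}: ${cost_per_pass(model):.3f} per passing run")
```

Treat the result as approximate: the cost table lists 83 tests for gemini-3.1-pro-preview and 99 for opus-4.6, so the reported totals for those models may not cover all 100 runs.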
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-v3.2-chat | 159.8 | 9.7 | 380.7 | 158.8 | 279.0 |
| gemini-3.1-pro-preview | 82.9 | 12.4 | 624.2 | 32.9 | 330.4 |
| gpt-5.4 | 35.0 | 5.1 | 72.7 | 35.6 | 56.0 |
| opus-4.6 | 51.9 | 3.8 | 289.3 | 44.8 | 127.5 |
| sonnet-4.6 | 41.3 | 3.5 | 150.7 | 40.4 | 75.7 |
Performance by Tag
Success rate by test category and model:
| Tag | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.4 | opus-4.6 | sonnet-4.6 | Warnings |
|---|---|---|---|---|---|---|
| benchmark | 🟡 70% (21/30) | 🟡 67% (20/30) | 🟡 50% (15/30) | 🟡 83% (25/30) | 🟡 83% (25/30) | |
| context_window | 🟢 100% (10/10) | 🟡 60% (6/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | |
| counting | 🟢 100% (10/10) | 🟡 80% (8/10) | 🟡 90% (9/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | |
| datetime | 🟡 67% (10/15) | 🟡 73% (11/15) | 🟢 100% (15/15) | 🟢 100% (15/15) | 🟢 100% (15/15) | |
| easy | 🟡 78% (35/45) | 🟡 82% (37/45) | 🟡 89% (40/45) | 🟡 93% (42/45) | 🟡 91% (41/45) | |
| grafana-dashboard | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| hard | 🟡 60% (6/10) | 🟢 100% (10/10) | 🟡 50% (5/10) | 🟡 80% (8/10) | 🟡 90% (9/10) | |
| kubernetes | 🟡 84% (42/50) | 🟡 82% (41/50) | 🟡 80% (40/50) | 🟡 88% (44/50) | 🟡 84% (42/50) | |
| logs | 🟡 52% (13/25) | 🟡 56% (14/25) | 🟡 44% (11/25) | 🟡 68% (17/25) | 🟡 64% (16/25) | |
| loki | 🟡 20% (2/10) | 🟡 30% (3/10) | 🟡 10% (1/10) | 🟡 40% (4/10) | 🟡 20% (2/10) | |
| medium | 🟡 78% (31/40) | 🟡 50% (20/40) | 🟡 75% (30/40) | 🟡 85% (34/40) | 🟡 85% (34/40) | |
| metrics | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| network | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| one-test | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 47% (7/15) | 🟡 53% (8/15) | 🟡 40% (6/15) | 🟡 60% (9/15) | 🟡 47% (7/15) | |
| question-answer | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| regression | 🟡 80% (56/70) | 🟡 71% (50/70) | 🟡 91% (64/70) | 🟡 91% (64/70) | 🟡 91% (64/70) | |
| runbooks | 🟡 90% (9/10) | 🟡 30% (3/10) | 🟡 50% (5/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | |
| transparency | 🟡 73% (11/15) | 🔴 0% (0/15) | 🟢 100% (15/15) | 🟡 80% (12/15) | 🟡 87% (13/15) | |
| Overall | 🟡 77% (77/100) | 🟡 70% (70/100) | 🟡 79% (79/100) | 🟡 89% (89/100) | 🟡 89% (89/100) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
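
The three pass-rate icons in the legend above can be expressed as a small function. This is a sketch under stated assumptions rather than the report generator's actual code: `status_icon` and `format_cell` are hypothetical names, percentages are assumed to be rounded to the nearest integer, and the failure icons (🔧, ⚠️, ⏱️, ⏭️) are not modeled because they depend on error information that is not a pass count.

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the legend's pass-rate icon."""
    if total <= 0:
        raise ValueError("total must be positive")
    if passed == total:
        return "🟢"  # passing 100% (stable)
    if passed == 0:
        return "🔴"  # passing 0% (failing)
    return "🟡"  # passing 1-99%


def format_cell(passed: int, total: int) -> str:
    """Render a summary cell in the style used by the tables in this report."""
    rate = round(100 * passed / total)  # rounding rule is an assumption
    return f"{status_icon(passed, total)} {rate}% ({passed}/{total})"


if __name__ == "__main__":
    print(format_cell(77, 100))
    print(format_cell(5, 5))
    print(format_cell(0, 15))
```

Under this rule a model at 89% is still shown as 🟡, which matches the summary rows: only a fully passing eval or tag is marked 🟢.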
| Eval ID | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.4 | opus-4.6 | sonnet-4.6 |
|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 100a_loki_historical_logs 🔗 | 🟡 | 🟡 | 🔴 | 🟡 | 🟡 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| 108_logs_nearby_lines 🔗 | 🟡 | 🟢 | 🔴 | 🟡 | 🟡 |
| 111_pod_names_contain_service 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 12_job_crashing 🔗 | 🟡 | 🟢 | 🟡 | 🟢 | 🟢 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 19_detect_missing_app_details 🔗 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 | 🟡 | 🟡 | 🟢 | 🟢 |
| 24_misconfigured_pvc 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 43_current_datetime_from_prompt 🔗 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 |
| 61_exact_match_counting 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 65_health_check_followup 🔗 | 🟡 | 🔴 | 🟢 | 🟡 | 🟡 |
| 73a_time_window_anomaly 🔗 | 🟢 | 🟡 | 🟢 | 🟢 | 🟢 |
| 73b_time_window_anomaly 🔗 | 🟢 | 🟡 | 🟢 | 🟢 | 🟢 |
| 86_configmap_like_but_secret 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 90_runbook_basic_selection 🔗 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 |
| 96_no_matching_runbook 🔗 | 🟡 | 🟡 | 🔴 | 🟢 | 🟢 |
| SUMMARY | 🟡 77% (77/100) | 🟡 70% (70/100) | 🟡 79% (79/100) | 🟡 89% (89/100) | 🟡 89% (89/100) |
Detailed Raw Results
| Eval ID | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.4 | opus-4.6 | sonnet-4.6 |
|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (5/5) / ⏱️ 152.3s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 29.2s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 34.3s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 37.0s / 💰 $0.28 | 🟢 100% (5/5) / ⏱️ 37.6s / 💰 $0.18 |
| 100a_loki_historical_logs 🔗 | 🟡 20% (1/5) / ⏱️ 175.1s / 💰 $0.02 | 🟡 20% (1/5) / ⏱️ 48.7s / 💰 $0.16 | 🔴 0% (0/5) / ⏱️ 49.1s / 💰 $0.16 | 🟡 40% (2/5) / ⏱️ 115.4s / 💰 $0.66 | 🟡 20% (1/5) / ⏱️ 61.6s / 💰 $0.28 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 20% (1/5) / ⏱️ 247.2s / 💰 $0.03 | 🟡 40% (2/5) / ⏱️ 194.9s / 💰 $0.21 | 🟡 20% (1/5) / ⏱️ 51.4s / 💰 $0.14 | 🟡 40% (2/5) / ⏱️ 166.7s / 💰 $0.77 | 🟡 20% (1/5) / ⏱️ 74.7s / 💰 $0.31 |
| 108_logs_nearby_lines 🔗 | 🟡 20% (1/5) / ⏱️ 231.5s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 35.3s / 💰 $0.16 | 🔴 0% (0/5) / ⏱️ 40.0s / 💰 $0.13 | 🟡 60% (3/5) / ⏱️ 50.6s / 💰 $0.48 | 🟡 80% (4/5) / ⏱️ 46.6s / 💰 $0.26 |
| 111_pod_names_contain_service 🔗 | 🟢 100% (5/5) / ⏱️ 131.2s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 27.3s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 30.3s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 34.9s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 39.1s / 💰 $0.18 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 100% (5/5) / ⏱️ 143.5s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 22.7s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 30.4s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 30.8s / 💰 $0.24 | 🟢 100% (5/5) / ⏱️ 29.7s / 💰 $0.15 |
| 12_job_crashing 🔗 | 🟡 80% (4/5) / ⏱️ 191.0s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 40.7s / 💰 $0.12 | 🟡 80% (4/5) / ⏱️ 35.7s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 38.4s / 💰 $0.28 | 🟢 100% (5/5) / ⏱️ 41.5s / 💰 $0.19 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 100% (5/5) / ⏱️ 137.7s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 36.0s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 34.4s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 45.8s / 💰 $0.41 | 🟢 100% (5/5) / ⏱️ 40.9s / 💰 $0.23 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (5/5) / ⏱️ 84.8s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 19.0s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 18.5s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 20.5s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 16.7s / 💰 $0.11 |
| 19_detect_missing_app_details 🔗 | 🟢 100% (5/5) / ⏱️ 227.5s / 💰 $0.03 | 🔴 0% (0/5) / ⏱️ 288.8s | 🟢 100% (5/5) / ⏱️ 42.2s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 45.3s / 💰 $0.30 | 🟢 100% (5/5) / ⏱️ 44.7s / 💰 $0.21 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 100% (5/5) / ⏱️ 100.3s / 💰 $0.01 | 🟡 60% (3/5) / ⏱️ 81.5s / 💰 $0.10 | 🟡 80% (4/5) / ⏱️ 23.5s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 25.9s / 💰 $0.22 | 🟢 100% (5/5) / ⏱️ 25.5s / 💰 $0.14 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (5/5) / ⏱️ 131.1s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 33.3s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 34.3s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 44.4s / 💰 $0.31 | 🟢 100% (5/5) / ⏱️ 42.5s / 💰 $0.20 |
| 43_current_datetime_from_prompt 🔗 | 🔴 0% (0/5) / ⏱️ 11.6s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 15.3s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 6.9s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 5.4s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 4.7s / 💰 $0.07 |
| 61_exact_match_counting 🔗 | 🟢 100% (5/5) / ⏱️ 65.2s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 15.6s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 12.1s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 18.3s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 10.2s / 💰 $0.09 |
| 65_health_check_followup 🔗 | 🟡 20% (1/5) / ⏱️ 210.7s / 💰 $0.02 | 🔴 0% (0/5) / ⏱️ 309.4s | 🟢 100% (5/5) / ⏱️ 43.3s / 💰 $0.13 | 🟡 40% (2/5) / ⏱️ 63.9s / 💰 $0.38 | 🟡 60% (3/5) / ⏱️ 45.3s / 💰 $0.22 |
| 73a_time_window_anomaly 🔗 | 🟢 100% (5/5) / ⏱️ 136.7s / 💰 $0.01 | 🟡 80% (4/5) / ⏱️ 37.5s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 38.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 50.2s / 💰 $0.34 | 🟢 100% (5/5) / ⏱️ 41.6s / 💰 $0.20 |
| 73b_time_window_anomaly 🔗 | 🟢 100% (5/5) / ⏱️ 184.0s / 💰 $0.02 | 🟡 40% (2/5) / ⏱️ 36.1s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 37.5s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 47.9s / 💰 $0.37 | 🟢 100% (5/5) / ⏱️ 39.4s / 💰 $0.19 |
| 86_configmap_like_but_secret 🔗 | 🟢 100% (5/5) / ⏱️ 149.0s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 28.3s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 37.0s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 38.9s / 💰 $0.29 | 🟢 100% (5/5) / ⏱️ 43.1s / 💰 $0.21 |
| 90_runbook_basic_selection 🔗 | 🟢 100% (5/5) / ⏱️ 289.0s / 💰 $0.05 | 🔴 0% (0/5) / ⏱️ 320.8s | 🟢 100% (5/5) / ⏱️ 48.9s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 94.7s / 💰 $0.84 | 🟢 100% (5/5) / ⏱️ 80.2s / 💰 $0.47 |
| 96_no_matching_runbook 🔗 | 🟡 80% (4/5) / ⏱️ 196.5s / 💰 $0.02 | 🟡 60% (3/5) / ⏱️ 37.4s / 💰 $0.16 | 🔴 0% (0/5) / ⏱️ 51.1s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 63.6s / 💰 $0.45 | 🟢 100% (5/5) / ⏱️ 60.7s / 💰 $0.33 |
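
Each cell in the detailed table packs a pass count, a latency, and (usually) a cost into one string. For readers who want to post-process the table, the following is a minimal parsing sketch. It assumes the three parts are separated by " / ", that counts are written as `(passed/total)`, and that the cost part may be absent, as it is in some gemini-3.1-pro-preview cells; `CellResult` and `parse_cell` are hypothetical names, not part of HolmesGPT.

```python
import re
from typing import NamedTuple, Optional


class CellResult(NamedTuple):
    passed: int
    total: int
    seconds: float
    cost_usd: Optional[float]  # None when the cell reports no cost


def parse_cell(cell: str) -> CellResult:
    """Parse one detailed-results cell into its numeric parts."""
    # Split on " / " (with spaces) so the "5/5" inside the parentheses stays intact.
    parts = [part.strip() for part in cell.split(" / ")]
    if len(parts) < 2:
        raise ValueError(f"unexpected cell format: {cell!r}")

    counts = re.search(r"\((\d+)/(\d+)\)", parts[0])
    latency = re.search(r"(\d+(?:\.\d+)?)s$", parts[1])
    if counts is None or latency is None:
        raise ValueError(f"unexpected cell format: {cell!r}")

    cost = re.search(r"\$(\d+(?:\.\d+)?)", parts[2]) if len(parts) > 2 else None
    return CellResult(
        passed=int(counts.group(1)),
        total=int(counts.group(2)),
        seconds=float(latency.group(1)),
        cost_usd=float(cost.group(1)) if cost else None,
    )


if __name__ == "__main__":
    print(parse_cell("🟢 100% (5/5) / ⏱️ 152.3s / 💰 $0.02"))
    print(parse_cell("🔴 0% (0/5) / ⏱️ 288.8s"))
```

Returning `None` for a missing cost, rather than 0.0, keeps cells without cost data from pulling an average cost downward.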
Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-23093326433.