Investigation Quality Benchmark Overview
Generated: January 16, 2026
Models Analyzed: 5
Total Benchmark Runs: 79
1. Quality Rankings
Rankings are based on the average overall investigation score.

| Rank | Model | Avg Score | Min | Max | Evals | Verdict |
|---|---|---|---|---|---|---|
| 1 | gpt-5.2 | 80.6 | 55.0 | 92.0 | 15 | Excellent |
| 2 | gpt-4.1-nano | 57.9 | 45.0 | 75.0 | 15 | Below Average |
| 3 | gpt-4.1 | 57.1 | 25.0 | 85.0 | 15 | Below Average |
| 4 | gpt-4o-mini | 52.5 | 20.0 | 78.0 | 15 | Below Average |
| 5 | gpt-4o | 44.4 | 25.0 | 75.0 | 17 | Poor |
2. Cost Analysis
| Model | $/Run | $/100 Runs | $/1000 Runs | Score/$ | Cost Rank |
|---|---|---|---|---|---|
| gpt-4.1-nano | $0.0076 | $0.76 | $7.58 | 7645.5 | #1 |
| gpt-4o-mini | $0.0252 | $2.52 | $25.25 | 2080.6 | #2 |
| gpt-4o | $0.1253 | $12.53 | $125.30 | 354.4 | #3 |
| gpt-4.1 | $0.3035 | $30.35 | $303.53 | 188.0 | #4 |
| gpt-5.2 | $0.4392 | $43.92 | $439.21 | 183.5 | #5 |
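
The Score/$ column appears to be the average overall score from Section 1 divided by the per-run cost, and the $/100 and $/1000 columns are linear projections of $/Run. The following is a minimal sketch under that assumption; the function names are illustrative and not part of the benchmark tooling. Because the inputs are the rounded values shown in the tables, results can differ slightly from the published column, which was presumably computed from unrounded costs.

```python
def score_per_dollar(avg_score: float, cost_per_run: float) -> float:
    """Average quality score obtained per dollar spent (assumed definition)."""
    if cost_per_run <= 0:
        raise ValueError("cost_per_run must be positive")
    return avg_score / cost_per_run


def projected_cost(cost_per_run: float, runs: int) -> float:
    """Linear cost projection for a given number of runs."""
    return cost_per_run * runs


# Rounded (avg score, $/run) pairs copied from Sections 1 and 2.
MODELS = {
    "gpt-4.1-nano": (57.9, 0.0076),
    "gpt-4o-mini": (52.5, 0.0252),
    "gpt-4o": (44.4, 0.1253),
    "gpt-4.1": (57.1, 0.3035),
    "gpt-5.2": (80.6, 0.4392),
}

if __name__ == "__main__":
    for name, (score, cost) in MODELS.items():
        print(
            f"{name:13s} score/$={score_per_dollar(score, cost):8.1f}  "
            f"$/1000 runs={projected_cost(cost, 1000):7.2f}"
        )
```

Note that gpt-4.1-nano's ratio is most sensitive to rounding because its per-run cost is so small: a change in the fourth decimal place shifts Score/$ by several tens of points.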
3. Performance Metrics
| Model | Runs | Success% | Avg Time(s) | Avg Tokens | Tools/Run | Iterations |
|---|---|---|---|---|---|---|
| gpt-5.2 | 15 | 100.0% | 96.5 | 173,443 | 5.9 | 5.5 |
| gpt-4.1-nano | 15 | 100.0% | 17.3 | 75,025 | 3.5 | 4.2 |
| gpt-4.1 | 15 | 100.0% | 31.4 | 150,851 | 4.8 | 4.5 |
| gpt-4o-mini | 15 | 100.0% | 42.0 | 166,825 | 6.0 | 6.1 |
| gpt-4o | 19 | 94.7% | 23.5 | 49,604 | 1.7 | 2.5 |
4. Evaluation Criteria Breakdown
| Model | Final Response | Investigation | Reasoning Quality | Tool Usage |
|---|---|---|---|---|
| gpt-5.2 | 82.0 | 80.7 | 83.3 | 76.7 |
| gpt-4.1-nano | 62.3 | 55.7 | 61.0 | 53.0 |
| gpt-4.1 | 62.0 | 53.7 | 57.7 | 46.0 |
| gpt-4o-mini | 54.7 | 49.0 | 52.0 | 41.0 |
| gpt-4o | 53.7 | 38.8 | 46.8 | 30.0 |
5. Efficiency Analysis
| Model | Score/1K Tokens | Score/Minute | Tokens/Tool | Cost Efficiency |
|---|---|---|---|---|
| gpt-5.2 | 0.46 | 50.1 | 29397 | 183.5 |
| gpt-4.1-nano | 0.77 | 201.4 | 21436 | 7645.5 |
| gpt-4.1 | 0.38 | 108.9 | 31427 | 188.0 |
| gpt-4o-mini | 0.31 | 75.0 | 27804 | 2080.6 |
| gpt-4o | 0.90 | 113.2 | 29179 | 354.4 |
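
The efficiency columns can be re-derived from the averages in Sections 1 and 3. The sketch below assumes Score/1K Tokens is the average score divided by thousands of tokens, Score/Minute is the average score divided by run time in minutes, and Tokens/Tool is average tokens divided by tools per run. The `RunStats` type and `efficiency` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    avg_score: float      # average overall score (Section 1)
    avg_time_s: float     # average run time in seconds (Section 3)
    avg_tokens: float     # average tokens per run (Section 3)
    tools_per_run: float  # average tool calls per run (Section 3)


def efficiency(stats: RunStats) -> dict:
    """Derive the efficiency ratios under the assumed definitions."""
    return {
        "score_per_1k_tokens": stats.avg_score / (stats.avg_tokens / 1000),
        "score_per_minute": stats.avg_score / (stats.avg_time_s / 60),
        "tokens_per_tool": stats.avg_tokens / stats.tools_per_run,
    }


if __name__ == "__main__":
    gpt_5_2 = RunStats(avg_score=80.6, avg_time_s=96.5,
                       avg_tokens=173_443, tools_per_run=5.9)
    for metric, value in efficiency(gpt_5_2).items():
        print(f"{metric}: {value:.2f}")
```

With the rounded inputs for gpt-5.2 these ratios agree with the table row to the precision shown. Other rows may be off in the last digit because the published figures were likely computed before rounding.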
6. Qualitative Analysis
gpt-5.2
Strengths
- High quality investigation output
- Thorough tool utilization
- Strong in reasoning quality (83.3, its highest criterion score)
Weaknesses
- Inconsistent quality - high variance
- High cost per investigation
- Slow response time
- Poor quality-to-cost ratio
gpt-4.1-nano
Strengths
- High quality investigation output
- Very cost-effective
- Fast response time
- Thorough tool utilization
- Token-efficient reasoning
- Excellent quality-to-cost ratio
Weaknesses
- Weak in tool usage (Tool Usage score 53.0, its lowest criterion)
gpt-4.1
Strengths
- High quality investigation output
- Thorough tool utilization
Weaknesses
- Inconsistent quality - high variance
- High cost per investigation
- Verbose - uses many tokens for results
- Poor quality-to-cost ratio
- Weak in tool usage (Tool Usage score 46.0, its lowest criterion)
gpt-4o-mini
Strengths
- High quality investigation output
- Very cost-effective
- Thorough tool utilization
- Excellent quality-to-cost ratio
Weaknesses
- Inconsistent quality - high variance
- Slow response time
- Verbose - uses many tokens for results
- Weak in tool usage (Tool Usage score 41.0, its lowest criterion)
gpt-4o
Strengths
- Token-efficient reasoning
Weaknesses
- Inconsistent quality - high variance
- Poor quality-to-cost ratio
- Weak in tool usage (Tool Usage score 30.0, its lowest criterion)
7. Recommendations
Primary Recommendation: gpt-4.1-nano
Weighted scoring: 40% quality, 30% cost-effectiveness, 20% reliability, 10% speed
Weighted Rankings
| Model | Score | Rating |
|---|---|---|
| gpt-4.1-nano | 81.4 | ████████████████ |
| gpt-4o-mini | 54.8 | ██████████ |
| gpt-5.2 | 53.0 | ██████████ |
| gpt-4.1 | 50.3 | ██████████ |
| gpt-4o | 45.7 | █████████ |
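
The report gives the weights (40% quality, 30% cost-effectiveness, 20% reliability, 10% speed) but not how each component is scaled to a common range. The sketch below illustrates one possible weighted-sum scheme, assuming quality is the average score, reliability is the success percentage, and cost and speed are min-max scaled so that the cheapest and fastest model gets 100. These normalization choices are assumptions, so the output does not reproduce the published scores, and the ordering below first place may differ.

```python
WEIGHTS = {"quality": 0.40, "cost": 0.30, "reliability": 0.20, "speed": 0.10}

# (avg score, $/run, success %, avg time in seconds), rounded table values.
MODELS = {
    "gpt-5.2": (80.6, 0.4392, 100.0, 96.5),
    "gpt-4.1-nano": (57.9, 0.0076, 100.0, 17.3),
    "gpt-4.1": (57.1, 0.3035, 100.0, 31.4),
    "gpt-4o-mini": (52.5, 0.0252, 100.0, 42.0),
    "gpt-4o": (44.4, 0.1253, 94.7, 23.5),
}


def lower_is_better(value, values):
    """Min-max scale to 0..100 so the smallest value scores 100 (assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 100.0
    return 100.0 * (hi - value) / (hi - lo)


def weighted_scores(models, weights=WEIGHTS):
    """Combine the four components into one weighted score per model."""
    costs = [m[1] for m in models.values()]
    times = [m[3] for m in models.values()]
    scores = {}
    for name, (quality, cost, success, time_s) in models.items():
        parts = {
            "quality": quality,
            "cost": lower_is_better(cost, costs),
            "reliability": success,
            "speed": lower_is_better(time_s, times),
        }
        scores[name] = sum(weights[k] * parts[k] for k in weights)
    return scores


if __name__ == "__main__":
    ranked = sorted(weighted_scores(MODELS).items(), key=lambda kv: -kv[1])
    for name, score in ranked:
        print(f"{name:13s} {score:5.1f}")
```

Under these assumptions gpt-4.1-nano still ranks first, which matches the primary recommendation, but the exact figures depend heavily on how cost and speed are normalized.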
Use Case Specific Recommendations
Production (High Volume)
Recommended Model: gpt-4.1-nano
Best balance of quality and cost for high-volume usage.
Critical Investigations
Recommended Model: gpt-5.2
Highest quality output for critical incident investigations.
Budget Constrained
Recommended Model: gpt-4.1-nano
Lowest cost per investigation while maintaining acceptable quality.
Real-time Response
Recommended Model: gpt-4.1-nano
Fastest average response time (17.3 s) of the models tested, for time-sensitive alerts.
Quick Comparison Matrix
| Model | Quality | Cost | Speed | Reliability | Recommended For |
|---|---|---|---|---|---|
| gpt-5.2 | ★★★★☆ | ★☆☆☆☆ | ★★★★☆ | ★★★★★ | Best Quality |
| gpt-4.1-nano | ★★☆☆☆ | ★★★★★ | ★★★★★ | ★★★★★ | PRIMARY CHOICE |
| gpt-4.1 | ★★☆☆☆ | ★☆☆☆☆ | ★★★★★ | ★★★★★ | - |
| gpt-4o-mini | ★★☆☆☆ | ★★★★☆ | ★★★★★ | ★★★★★ | - |
| gpt-4o | ★★☆☆☆ | ★☆☆☆☆ | ★★★★★ | ★★★★☆ | - |

