Investigation Quality Benchmark Overview

Generated: January 16, 2026
Models Analyzed: 5
Total Benchmark Runs: 79

1. Quality Rankings

Rankings based on average overall investigation score.
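| Rank | Model | Avg Score | Min | Max | Evals | Verdict |
|---|---|---|---|---|---|---|
| 1 | gpt-5.2 | 80.6 | 55.0 | 92.0 | 15 | Excellent |
| 2 | gpt-4.1-nano | 57.9 | 45.0 | 75.0 | 15 | Below Average |
| 3 | gpt-4.1 | 57.1 | 25.0 | 85.0 | 15 | Below Average |
| 4 | gpt-4o-mini | 52.5 | 20.0 | 78.0 | 15 | Below Average |
| 5 | gpt-4o | 44.4 | 25.0 | 75.0 | 17 | Poor |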

2. Cost Analysis

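| Model | $/Run | $/100 Runs | $/1000 Runs | Score/$ | Cost Rank |
|---|---|---|---|---|---|
| gpt-4.1-nano | $0.0076 | $0.76 | $7.58 | 7645.5 | #1 |
| gpt-4o-mini | $0.0252 | $2.52 | $25.25 | 2080.6 | #2 |
| gpt-4o | $0.1253 | $12.53 | $125.30 | 354.4 | #3 |
| gpt-4.1 | $0.3035 | $30.35 | $303.53 | 188.0 | #4 |
| gpt-5.2 | $0.4392 | $43.92 | $439.21 | 183.5 | #5 |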
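
The Score/$ column appears to be the average score divided by the cost of a single run. The report does not state this definition, so the sketch below treats it as an assumption and recomputes the column from the rounded figures in the tables above. Because the inputs are rounded, the results land close to, but not exactly on, the published values.

```python
# Recompute Score/$ from the rounded table values.
# Assumption (not stated in the report): Score/$ = average score / cost per run.

avg_score = {
    "gpt-4.1-nano": 57.9,
    "gpt-4o-mini": 52.5,
    "gpt-4o": 44.4,
    "gpt-4.1": 57.1,
    "gpt-5.2": 80.6,
}

# Dollars per 1000 runs; this column carries more digits than $/Run.
cost_per_1000_runs = {
    "gpt-4.1-nano": 7.58,
    "gpt-4o-mini": 25.25,
    "gpt-4o": 125.30,
    "gpt-4.1": 303.53,
    "gpt-5.2": 439.21,
}


def score_per_dollar(model: str) -> float:
    """Average score points obtained per dollar spent on one run."""
    cost_per_run = cost_per_1000_runs[model] / 1000
    return avg_score[model] / cost_per_run


# Print models from most to least cost-effective.
for model in sorted(avg_score, key=score_per_dollar, reverse=True):
    print(f"{model:<13} {score_per_dollar(model):8.1f}")
```

Sorting by this ratio gives the same order as the Cost Rank column, which is also the order of ascending cost per run.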

3. Performance Metrics

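| Model | Runs | Success % | Avg Time (s) | Avg Tokens | Tools/Run | Iterations |
|---|---|---|---|---|---|---|
| gpt-5.2 | 15 | 100.0% | 96.5 | 173,443 | 5.9 | 5.5 |
| gpt-4.1-nano | 15 | 100.0% | 17.3 | 75,025 | 3.5 | 4.2 |
| gpt-4.1 | 15 | 100.0% | 31.4 | 150,851 | 4.8 | 4.5 |
| gpt-4o-mini | 15 | 100.0% | 42.0 | 166,825 | 6.0 | 6.1 |
| gpt-4o | 19 | 94.7% | 23.5 | 49,604 | 1.7 | 2.5 |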

4. Evaluation Criteria Breakdown

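| Model | Final Response | Investigation | Reasoning Quality | Tool Usage |
|---|---|---|---|---|
| gpt-5.2 | 82.0 | 80.7 | 83.3 | 76.7 |
| gpt-4.1-nano | 62.3 | 55.7 | 61.0 | 53.0 |
| gpt-4.1 | 62.0 | 53.7 | 57.7 | 46.0 |
| gpt-4o-mini | 54.7 | 49.0 | 52.0 | 41.0 |
| gpt-4o | 53.7 | 38.8 | 46.8 | 30.0 |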

5. Efficiency Analysis

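| Model | Score/1K Tokens | Score/Minute | Tokens/Tool | Cost Efficiency (Score/$) |
|---|---|---|---|---|
| gpt-5.2 | 0.46 | 50.1 | 29397 | 183.5 |
| gpt-4.1-nano | 0.77 | 201.4 | 21436 | 7645.5 |
| gpt-4.1 | 0.38 | 108.9 | 31427 | 188.0 |
| gpt-4o-mini | 0.31 | 75.0 | 27804 | 2080.6 |
| gpt-4o | 0.90 | 113.2 | 29179 | 354.4 |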
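
The efficiency columns look like simple ratios of figures from the Quality Rankings and Performance Metrics tables. The sketch below recomputes three of them under that assumption. The definitions in the comments are inferred from the numbers rather than stated in the report, and the `ModelStats` type and `efficiency` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    avg_score: float      # Avg Score (Quality Rankings)
    avg_time_s: float     # Avg Time (s) (Performance Metrics)
    avg_tokens: int       # Avg Tokens (Performance Metrics)
    tools_per_run: float  # Tools/Run (Performance Metrics)


stats = {
    "gpt-5.2": ModelStats(80.6, 96.5, 173_443, 5.9),
    "gpt-4.1-nano": ModelStats(57.9, 17.3, 75_025, 3.5),
    "gpt-4.1": ModelStats(57.1, 31.4, 150_851, 4.8),
    "gpt-4o-mini": ModelStats(52.5, 42.0, 166_825, 6.0),
    "gpt-4o": ModelStats(44.4, 23.5, 49_604, 1.7),
}


def efficiency(s: ModelStats) -> dict[str, float]:
    # Assumed definitions, inferred from the tables:
    #   Score/1K Tokens = avg score / (avg tokens / 1000)
    #   Score/Minute    = avg score / (avg time in seconds / 60)
    #   Tokens/Tool     = avg tokens / tool calls per run
    return {
        "score_per_1k_tokens": s.avg_score / (s.avg_tokens / 1000),
        "score_per_minute": s.avg_score / (s.avg_time_s / 60),
        "tokens_per_tool": s.avg_tokens / s.tools_per_run,
    }


for name, s in stats.items():
    e = efficiency(s)
    print(
        f"{name:<13} {e['score_per_1k_tokens']:.2f} "
        f"{e['score_per_minute']:7.1f} {e['tokens_per_tool']:8.0f}"
    )
```

Small differences from the published Score/Minute values are expected, since the inputs here are already rounded.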

6. Qualitative Analysis

gpt-5.2

Strengths

  • High quality investigation output
  • Thorough tool utilization
  • Strong in reasoning quality

Weaknesses

  • Inconsistent quality - high variance
  • High cost per investigation
  • Slow response time
  • Poor quality-to-cost ratio

gpt-4.1-nano

Strengths

  • High quality investigation output
  • Very cost-effective
  • Fast response time
  • Thorough tool utilization
  • Token-efficient reasoning
  • Excellent quality-to-cost ratio

Weaknesses

  • Weak in tool usage

gpt-4.1

Strengths

  • High quality investigation output
  • Thorough tool utilization

Weaknesses

  • Inconsistent quality - high variance
  • High cost per investigation
  • Verbose - uses many tokens for results
  • Poor quality-to-cost ratio
  • Weak in tool usage

gpt-4o-mini

Strengths

  • High quality investigation output
  • Very cost-effective
  • Thorough tool utilization
  • Excellent quality-to-cost ratio

Weaknesses

  • Inconsistent quality - high variance
  • Slow response time
  • Verbose - uses many tokens for results
  • Weak in tool usage

gpt-4o

Strengths

  • Token-efficient reasoning

Weaknesses

  • Inconsistent quality - high variance
  • Poor quality-to-cost ratio
  • Weak in tool usage

7. Recommendations

Primary Recommendation: gpt-4.1-nano

Weighted scoring: 40% quality, 30% cost-effectiveness, 20% reliability, 10% speed

Weighted Rankings

| Model | Score | Rating |
|---|---|---|
| gpt-4.1-nano | 81.4 | ████████████████ |
| gpt-4o-mini | 54.8 | ██████████ |
| gpt-5.2 | 53.0 | ██████████ |
| gpt-4.1 | 50.3 | ██████████ |
| gpt-4o | 45.7 | █████████ |
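
The weighted scoring (40% quality, 30% cost-effectiveness, 20% reliability, 10% speed) can be sketched as a weighted sum of four components on a 0 to 100 scale. The report does not say how each component is normalized, so every normalization below is an assumption made for illustration, and the output is not expected to reproduce the Score column exactly.

```python
# Weights as stated in the report.
WEIGHTS = {"quality": 0.40, "cost": 0.30, "reliability": 0.20, "speed": 0.10}

# (avg score, Score/$, success %, avg time in seconds), from the tables above.
models = {
    "gpt-5.2": (80.6, 183.5, 100.0, 96.5),
    "gpt-4.1-nano": (57.9, 7645.5, 100.0, 17.3),
    "gpt-4.1": (57.1, 188.0, 100.0, 31.4),
    "gpt-4o-mini": (52.5, 2080.6, 100.0, 42.0),
    "gpt-4o": (44.4, 354.4, 94.7, 23.5),
}

best_value = max(m[1] for m in models.values())  # highest Score/$
fastest = min(m[3] for m in models.values())     # shortest average time


def weighted_score(name: str) -> float:
    score, value, success, time_s = models[name]
    # Hypothetical normalizations, each mapped to 0..100:
    components = {
        "quality": score,                    # average score used as-is
        "cost": 100 * value / best_value,    # relative to the best Score/$
        "reliability": success,              # success rate in percent
        "speed": 100 * fastest / time_s,     # relative to the fastest model
    }
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)


for name in sorted(models, key=weighted_score, reverse=True):
    print(f"{name:<13} {weighted_score(name):5.1f}")
```

Under these assumed normalizations gpt-4.1-nano still ranks first, because it sets the reference point for both the cost and the speed components.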

Use Case Specific Recommendations

  • Recommended Model: gpt-4.1-nano. Best balance of quality and cost for high-volume usage.
  • Recommended Model: gpt-5.2. Highest quality output for critical incident investigations.
  • Recommended Model: gpt-4.1-nano. Lowest cost per investigation while maintaining acceptable quality.
  • Recommended Model: gpt-4.1-nano. Fastest response time among high-quality models for time-sensitive alerts.

Quick Comparison Matrix

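| Model | Quality | Cost | Speed | Reliability | Recommended For |
|---|---|---|---|---|---|
| gpt-5.2 | ★★★★☆ | ★☆☆☆☆ | ★★★★☆ | ★★★★★ | Best Quality |
| gpt-4.1-nano | ★★☆☆☆ | ★★★★★ | ★★★★★ | ★★★★★ | PRIMARY CHOICE |
| gpt-4.1 | ★★☆☆☆ | ★☆☆☆☆ | ★★★★★ | ★★★★★ | - |
| gpt-4o-mini | ★★☆☆☆ | ★★★★☆ | ★★★★★ | ★★★★★ | - |
| gpt-4o | ★★☆☆☆ | ★☆☆☆☆ | ★★★★★ | ★★★★☆ | - |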