10 OpenAI Models Through Quick Benchmarks: The Model Isn't Always as Smart as Its Price

I ran 30 trials per configuration across GPT-4, GPT-5, and o-series models using three reasoning problems. gpt-5-nano on minimal scored 4.4%. o1 scored lower than gpt-4o.

When choosing an OpenAI model, most people go by the name. o-series for reasoning, GPT-5 for cutting-edge capability, mini for lightweight and cheap. After running a structured experiment, I found the names don't always match the reality.

A few examples:

  • gpt-5-nano on the minimal setting scored 4.4% accuracy. That's 4 correct answers out of 90 data points (1–2 per 30 attempts), worse than random guessing (33%).
  • o1 scored 47.8%. A reasoning-specialized model, lower than gpt-4o at 66.7%.
  • gpt-5-mini (high) takes 7.72 seconds to reach 100% accuracy. o4-mini reaches the same 100% in 2.45 seconds, so the high setting buys nothing here.

This post is the record of that experiment.


Why I Ran This

Benchmarks on official leaderboards mostly test knowledge retrieval. What I actually wanted to know was different: does raising reasoning effort meaningfully improve accuracy, and are o-series models actually better than GPT-5 for reasoning tasks?

So I designed three problem types that language models are known to struggle with, and ran each of the 19 model configurations 30 times per problem.


Experiment Setup

Three Problems

Each problem requires actual reasoning, not pattern matching.

  1. Character counting: "How many times does the letter r appear in the word 'strawberry'?" (Answer: 3) — Targets the structural limitation of token-level text processing.

  2. Decimal comparison: "Which is larger, 9.11 or 9.9?" (Answer: 9.9) — Models that follow the pattern "11 > 9" get this wrong. Real numeric understanding is required.

  3. Simple algebra: "Solve for x: 8.9 = x + 8.11" (Answer: 0.79) — Basic arithmetic, but decimal handling causes occasional failures.
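The expected answers can be derived mechanically rather than typed in by hand. The following is a minimal sketch using only the Python standard library. The names `EXPECTED` and `is_correct` are illustrative and not taken from the original harness, and the grader assumes the model was asked to reply with the bare answer.

```python
from decimal import Decimal, InvalidOperation

# Ground truth for the three problems, computed instead of hard-coded.
EXPECTED = {
    "char_count": str("strawberry".count("r")),
    "decimal_compare": str(max(Decimal("9.11"), Decimal("9.9"))),
    "algebra": str(Decimal("8.9") - Decimal("8.11")),
}


def is_correct(problem: str, reply: str) -> bool:
    """Compare a bare numeric reply to the expected answer (hypothetical grader)."""
    cleaned = reply.strip().rstrip(".")  # tolerate whitespace and a trailing period
    try:
        return Decimal(cleaned) == Decimal(EXPECTED[problem])
    except InvalidOperation:
        # Anything that does not parse as a number counts as wrong.
        return False


print(EXPECTED)
print(is_correct("algebra", "-0.21"))  # the sign-error style answer is rejected
```

Using `Decimal` instead of `float` keeps 8.9 - 8.11 exact, which matters when the test itself is about decimal handling.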

Models Tested (19 configurations)

  • GPT-4 series: gpt-4o, gpt-4.1, gpt-4.1-mini
  • GPT-5 series: gpt-5, gpt-5-mini, gpt-5-nano × 4 reasoning effort levels (minimal / low / medium / high)
  • o-series: o1, o3, o3-mini, o4-mini

Each configuration ran 30 trials on each of the 3 problems, for 90 data points per configuration.
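The configuration count follows directly from the model list above. As a quick check, here is a sketch that enumerates the configurations. The variable names are illustrative, and `None` marks models where reasoning effort was not varied.

```python
from itertools import product

GPT4 = ["gpt-4o", "gpt-4.1", "gpt-4.1-mini"]
GPT5 = ["gpt-5", "gpt-5-mini", "gpt-5-nano"]
EFFORTS = ["minimal", "low", "medium", "high"]
O_SERIES = ["o1", "o3", "o3-mini", "o4-mini"]

# (model, effort) pairs; only the GPT-5 models are crossed with effort levels.
CONFIGS = (
    [(model, None) for model in GPT4]
    + list(product(GPT5, EFFORTS))
    + [(model, None) for model in O_SERIES]
)

PROBLEMS = 3
TRIALS_PER_PROBLEM = 30

print(len(CONFIGS))                   # 3 + 3*4 + 4 = 19
print(PROBLEMS * TRIALS_PER_PROBLEM)  # 90 data points per configuration
```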

For models that support it, I also used logprobs from the Chat Completions API to quantify model uncertainty. If a model chose "3" with 72.9% probability and "4" with 26.8%, the probability-weighted average of ~3.27 captures how uncertain the response actually was.

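from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request log probabilities for the 10 most likely tokens at each position
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What is the meaning of life? Answer with a single integer between 1 and 5."}],
    logprobs=True,
    top_logprobs=10
)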
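To turn a response like this into a probability distribution, the alternatives for the first generated token can be read from `choices[0].logprobs.content[0].top_logprobs`. The sketch below assumes the answer fits in a single token. The helper name `first_token_distribution` is hypothetical, and a stand-in object replaces the real response so the example runs without network access.

```python
import math
from types import SimpleNamespace


def first_token_distribution(response):
    """Return {token: probability} for the first generated token."""
    first = response.choices[0].logprobs.content[0]
    return {alt.token: math.exp(alt.logprob) for alt in first.top_logprobs}


# Stand-in with the same attribute layout as a Chat Completions response.
# The probabilities are the 72.9% / 26.8% example from the text above.
fake_response = SimpleNamespace(choices=[SimpleNamespace(logprobs=SimpleNamespace(content=[
    SimpleNamespace(token="3", logprob=math.log(0.729), top_logprobs=[
        SimpleNamespace(token="3", logprob=math.log(0.729)),
        SimpleNamespace(token="4", logprob=math.log(0.268)),
    ]),
]))])

dist = first_token_distribution(fake_response)
print(dist)
```

Passing the real `response` object instead of `fake_response` should work the same way, since only attribute access is used.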

minimal is a trap

The most surprising results came from the minimal reasoning effort setting in the GPT-5 series.

Model        Accuracy (minimal)   Accuracy (low)
gpt-5        66.7%                86.7%
gpt-5-mini   33.3%                98.9%
gpt-5-nano   4.4%                 98.9%

gpt-5-nano on minimal answered correctly only 4 times in 90 data points, or 1–2 times per 30 trials. That's worse than random (33%). Switching the same model to low brought accuracy to 98.9%. I didn't expect that cliff to be so steep.

Takeaway: There's no reason to use GPT-5 models on minimal. gpt-5-nano (minimal) is effectively unusable.


o1 and o3 underperform their names

The o-series is positioned as reasoning-specialized. The results told a different story.

Model     Accuracy   Response time
o1        47.8%      3.45s
o3        61.1%      2.67s
o3-mini   92.2%      2.78s
o4-mini   100%       2.45s

o1 at 47.8% was the most surprising number in the entire experiment. It scored lower than both gpt-4o (66.7%) and gpt-4.1 (66.7%). o3 at 61.1% wasn't much better.

The lightweight versions were a completely different story. o3-mini hit 92.2% and o4-mini hit 100% at 2.45 seconds. The best performer in the o-series was the smallest, cheapest model.

One caveat: these three problems specifically target token-level perception and decimal arithmetic — areas where the o1/o3 architecture may not have a structural advantage. Results on knowledge-heavy or code reasoning tasks could differ.

Takeaway: If you're using the o-series, o4-mini is the clear choice. o1 and o3 aren't worth the cost.


The value winner is gpt-5-nano (low)

Looking at accuracy and response time together, a few configurations stand out.

Model              Accuracy   Response time   Notes
gpt-5-nano (low)   98.9%      1.72s           Best value
o4-mini            100%       2.45s           Best accuracy
gpt-5-mini (low)   98.9%      3.25s           Balanced
gpt-4o             66.7%      0.67s           Fastest

gpt-5-nano (low) delivers 98.9% accuracy in 1.72 seconds, at nano pricing. For most production workloads, this configuration is the best starting point.

If you need 100% accuracy, use o4-mini. If sub-second response is the priority, use gpt-4o.

gpt-5-mini (high) takes 7.72 seconds and ties o4-mini at 100% accuracy. It's hard to construct a scenario where that trade-off makes sense.
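One way to make "stands out" precise is to keep only the configurations that no other configuration beats on both accuracy and response time (a Pareto front). The sketch below uses the numbers reported in this post. The function names are illustrative, and this is a reading of the tables, not the selection method used in the original analysis.

```python
# (accuracy in %, response time in seconds), copied from the tables in this post.
RESULTS = {
    "gpt-4o": (66.7, 0.67),
    "gpt-5-nano (low)": (98.9, 1.72),
    "gpt-5-mini (low)": (98.9, 3.25),
    "gpt-5-mini (high)": (100.0, 7.72),
    "o1": (47.8, 3.45),
    "o3": (61.1, 2.67),
    "o3-mini": (92.2, 2.78),
    "o4-mini": (100.0, 2.45),
}


def dominates(a, b):
    """True if a is at least as accurate and as fast as b, and not identical to it."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b


def pareto_front(results):
    """Names of configurations that no other configuration dominates."""
    return sorted(
        name
        for name, score in results.items()
        if not any(dominates(other, score) for other in results.values())
    )


print(pareto_front(RESULTS))
# ['gpt-4o', 'gpt-5-nano (low)', 'o4-mini']
```

The three survivors are the fastest, best-value, and most accurate picks named above. gpt-5-mini (high), o1, o3, and o3-mini are each dominated by o4-mini.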


Results by Problem Type

Decimal comparison (9.11 vs 9.9)

The most discriminating problem. Many models followed the pattern "11 > 9" and got it wrong.

  • GPT-4 series: 0–30% accuracy across the board
  • o1, o3: 30–40%
  • o4-mini, o3-mini: 100%
  • gpt-5-mini/nano (low or higher): 95–100%

The entire GPT-4 family largely failed this problem. If decimal reasoning is core to your task, GPT-4 is not the right choice.

Character counting (r in 'strawberry')

Tests the token-level processing limitation directly.

  • o4-mini, o3-mini: 100%
  • GPT-5 (low or higher): 90–100%
  • gpt-5-nano (minimal): 1 correct answer out of 30

Simple algebra (8.9 = x + 8.11)

The easiest of the three. Most models solved it. Notable failures: gpt-5-nano (minimal) was unstable, and o1 occasionally returned -0.21 due to a sign error.


Practical Recommendations

Model selection by use case

Real-time conversational applications: gpt-4o (0.67s, 67%)

  • Response speed is the top priority

General production workloads: gpt-5-nano (low) (1.72s, 99%), best value

  • High accuracy at nano pricing
  • Best starting point for most tasks

Tasks requiring high accuracy: o4-mini (2.45s, 100%)

  • Reasoning-heavy tasks, code debugging, math

Cost-optimized batch processing: gpt-5-nano (low) (1.72s, 99%)

  • Most efficient at scale

GPT-5 reasoning effort guide

minimal → avoid (nano drops to 4.4%)
low     → recommended for most cases (98.9% on mini and nano, 86.7% on gpt-5)
medium  → when accuracy matters more than speed
high    → o4-mini is faster with the same accuracy

Configurations to avoid

  • gpt-5-nano (minimal): effectively unusable (4.4%)
  • gpt-5-mini (high): 7.72s with no accuracy advantage over o4-mini
  • o1, o3: poor cost-performance ratio

Technical Details

Responses API parameters

reasoning.effort (varied only for the GPT-5 models in this experiment)

  • "minimal": minimal reasoning (fast but inaccurate)
  • "low": low reasoning (recommended)
  • "medium": medium reasoning (high accuracy)
  • "high": maximum reasoning (top accuracy, slow)

text.verbosity

  • "low": concise answer
  • "medium": standard answer
  • "high": detailed answer
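The two parameters above are passed to the Responses API as nested objects. The sketch below only assembles the request arguments, so it runs without an API key. `build_request` is a hypothetical helper, and the allowed values are the ones listed in this section.

```python
ALLOWED_EFFORT = {"minimal", "low", "medium", "high"}
ALLOWED_VERBOSITY = {"low", "medium", "high"}


def build_request(model: str, prompt: str, effort: str = "low", verbosity: str = "low") -> dict:
    """Assemble keyword arguments for client.responses.create (hypothetical helper)."""
    if effort not in ALLOWED_EFFORT:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    if verbosity not in ALLOWED_VERBOSITY:
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},
        "text": {"verbosity": verbosity},
    }


kwargs = build_request("gpt-5-nano", "Which is larger, 9.11 or 9.9?")
print(kwargs)

# With the official Python SDK installed and OPENAI_API_KEY set, the call would be:
#   from openai import OpenAI
#   response = OpenAI().responses.create(**kwargs)
#   print(response.output_text)
```

Defaulting `effort` to "low" mirrors the recommendation in this post. Pass "minimal" explicitly to reproduce the accuracy drop.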

Using logprobs for uncertainty quantification

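import math

# Example inputs: candidate answers and their log probabilities (read from top_logprobs)
values = [3, 4]
logprobs = [math.log(0.729), math.log(0.268)]

probs = [math.exp(logprob) for logprob in logprobs]

# Weighted average captures model uncertainty (about 3.27 for these inputs)
weighted_avg = sum(val * prob for val, prob in zip(values, probs)) / sum(probs)

# Entropy measures confidence (lower means more confident)
entropy = -sum(p * math.log(p) for p in probs if p > 0)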

Experiment conditions

  • 30 repeated trials per configuration
  • 3 problems × 30 trials = 90 data points per configuration
  • Checkpoint feature for reproducibility
  • Model versions as of October 2025
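The post does not describe how the checkpoint feature works, so the following is only one possible design. Each reply is appended to a JSON file, and a rerun resumes where the previous run stopped. `run_trials` and `stub_ask` are hypothetical names, and the stub stands in for a real API call.

```python
import json
import os
import tempfile


def run_trials(ask, config, problem, n_trials, path):
    """Call ask(config, problem) until n_trials replies exist, resuming from a JSON checkpoint."""
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)
    replies = done.setdefault(f"{config}|{problem}", [])
    while len(replies) < n_trials:
        replies.append(ask(config, problem))
        # Write to a temporary file and rename it, so an interrupted run
        # never leaves a half-written checkpoint behind.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, path)
    return replies


# Offline demonstration: the stub records how many times it was called.
calls = []


def stub_ask(config, problem):
    calls.append((config, problem))
    return "3"


checkpoint = os.path.join(tempfile.mkdtemp(), "results.json")
first = run_trials(stub_ask, "gpt-5-nano (low)", "char_count", 30, checkpoint)
second = run_trials(stub_ask, "gpt-5-nano (low)", "char_count", 30, checkpoint)
print(len(first), len(second), len(calls))
```

The second call finds all 30 replies in the checkpoint and makes no further calls, which is what lets an interrupted experiment resume without repeating trials.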


Names, pricing, and marketing positioning are starting points, not answers. The only reliable method is testing on your actual task. That's what this experiment was.

Model versions as of October 2025. OpenAI updates models continuously — check the official docs for the latest.