Quick Benchmarks of 10 OpenAI Models: The Model Isn't Always as Smart as Its Price Suggests
When choosing an OpenAI model, most people go by the name. o-series for reasoning, GPT-5 for cutting-edge capability, mini for lightweight and cheap. After running a structured experiment, I found the names don't always match the reality.
A few examples:
- gpt-5-nano on the minimal setting scored 4.4% accuracy. That's 4 correct answers out of 90 trials, or 1–2 per 30 attempts on a problem, which is worse than random guessing (33%).
- o1 scored 47.8%. That puts a reasoning-specialized model below gpt-4o at 66.7%.
- gpt-5-mini (high) takes 7.72 seconds. o4-mini hits 100% accuracy in 2.45 seconds. There's no reason to use high.
This post is the record of that experiment.
Why I Ran This
Benchmarks on official leaderboards mostly test knowledge retrieval. What I actually wanted to know was different: does raising reasoning effort meaningfully improve accuracy, and are o-series models actually better than GPT-5 for reasoning tasks?
So I designed three problem types that language models are known to struggle with, and ran every one of the 19 model configurations 30 times per problem.
Experiment Setup
Three Problems
Each problem requires actual reasoning, not pattern matching.
Character counting: "How many times does the letter r appear in the word 'strawberry'?" (Answer: 3) — Targets the structural limitation of token-level text processing.
Decimal comparison: "Which is larger, 9.11 or 9.9?" (Answer: 9.9) — Models that follow the pattern "11 > 9" get this wrong. Real numeric understanding is required.
Simple algebra: "Solve for x: 8.9 = x + 8.11" (Answer: 0.79) — Basic arithmetic, but decimal handling causes occasional failures.
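The expected answers above can be checked mechanically. The following is a minimal sketch rather than part of the original experiment; the variable names are illustrative, and `naive_larger` is only an assumed model of the "11 > 9" pattern described above.

```python
from decimal import Decimal

# Ground truth for the three prompts, computed rather than hard-coded.
r_count = "strawberry".count("r")               # character counting
larger = max(Decimal("9.11"), Decimal("9.9"))   # decimal comparison
x = Decimal("8.9") - Decimal("8.11")            # solves 8.9 = x + 8.11

# The pattern-matching trap: comparing the fractional parts as integers
# (11 vs 9) picks the wrong number.
naive_larger = max(["9.11", "9.9"], key=lambda s: int(s.split(".")[1]))

print(r_count, larger, x, naive_larger)
```

Using `Decimal` keeps the arithmetic exact, so the algebra answer comes out as 0.79 without binary floating-point noise.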
Models Tested (19 configurations)
- GPT-4 series: gpt-4o, gpt-4.1, gpt-4.1-mini
- GPT-5 series: gpt-5, gpt-5-mini, gpt-5-nano × 4 reasoning effort levels (minimal / low / medium / high)
- o-series: o1, o3, o3-mini, o4-mini
Each configuration ran 30 trials on each of the 3 problems, for 90 data points per configuration.
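The trial loop can be sketched as below. This is an illustration under stated assumptions, not the original harness: `run_config`, the `ask` callable, and grading by exact string match are all hypothetical, and the stub stands in for a real API call.

```python
from typing import Callable

# Prompts and expected answers as given in this post
PROBLEMS = {
    "How many times does the letter r appear in the word 'strawberry'?": "3",
    "Which is larger, 9.11 or 9.9?": "9.9",
    "Solve for x: 8.9 = x + 8.11": "0.79",
}

def run_config(ask: Callable[[str], str], trials: int = 30) -> float:
    """Ask every problem `trials` times and return the fraction answered correctly."""
    correct = 0
    for prompt, expected in PROBLEMS.items():
        for _ in range(trials):
            # Assumption: exact string match after trimming whitespace
            correct += ask(prompt).strip() == expected
    return correct / (trials * len(PROBLEMS))

# Stand-in for a real model: always answers "3", so only the counting problem is right
always_three = lambda prompt: "3"
print(f"{run_config(always_three):.1%}")  # 33.3%
```

With 30 trials and 3 problems the denominator is 90, so each correct answer moves a configuration's accuracy by about 1.1 percentage points.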
For models supporting it, I also used logprobs from the Chat Completion API to quantify model uncertainty. If a model chose "3" with 72.9% probability and "4" with 26.8%, the weighted average of ~3.27 captures how uncertain the response actually was.
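As a worked version of that example, here is a minimal sketch of the weighted average. The `weighted_answer` helper and the `(token, logprob)` tuple format are assumptions for illustration; in a real Chat Completions response the pairs would come from the `token` and `logprob` fields of each `top_logprobs` entry.

```python
import math

def weighted_answer(top: list[tuple[str, float]]) -> float:
    """Probability-weighted mean of the numeric candidate tokens."""
    # Keep only tokens that parse as integers; convert log-probabilities to probabilities
    pairs = [(int(tok), math.exp(lp)) for tok, lp in top if tok.strip().isdecimal()]
    total = sum(p for _, p in pairs)
    # Renormalize, because the listed candidates rarely sum to exactly 1
    return sum(v * p for v, p in pairs) / total

# The example from the text: "3" at 72.9%, "4" at 26.8%
top = [("3", math.log(0.729)), ("4", math.log(0.268))]
print(round(weighted_answer(top), 2))  # 3.27
```

A value close to an integer means the model was nearly certain; a value in between, like 3.27, shows the probability mass was split.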
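```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What is the meaning of life? Answer with a single integer between 1 and 5."}],
    logprobs=True,    # return log-probabilities for the sampled tokens
    top_logprobs=10,  # plus the 10 most likely alternatives per position
)
```
minimal is a trap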
The most surprising results came from the minimal reasoning effort setting in the GPT-5 series.
| Model | minimal accuracy | low accuracy |
|---|---|---|
| gpt-5 | 66.7% | 86.7% |
| gpt-5-mini | 33.3% | 98.9% |
| gpt-5-nano | 4.4% | 98.9% |
gpt-5-nano on minimal answered correctly only 4 times out of 90 (4.4%), about 1–2 per 30-trial problem. That's worse than random (33%). Switching the same model to low brought accuracy to 98.9%. I didn't expect that cliff to be so steep.
Takeaway: There's no reason to use GPT-5 models on minimal. gpt-5-nano (minimal) is effectively unusable.
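Moving from minimal to low is a one-parameter change. The sketch below assumes the Responses API's `reasoning` parameter with an `effort` field; `build_request` and `EFFORTS` are hypothetical helpers, and the actual call is left commented out because it needs the `openai` package and an API key.

```python
EFFORTS = ("minimal", "low", "medium", "high")

def build_request(model: str, prompt: str, effort: str = "low") -> dict:
    """Build keyword arguments for a Responses API call with a given reasoning effort."""
    if effort not in EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {"model": model, "input": prompt, "reasoning": {"effort": effort}}

request = build_request("gpt-5-nano", "Which is larger, 9.11 or 9.9?", effort="low")
print(request)

# Sending the request (not run here):
# from openai import OpenAI
# response = OpenAI().responses.create(**request)
# print(response.output_text)
```

Defaulting `effort` to "low" in a wrapper like this is one way to avoid landing on minimal by accident.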
o1 and o3 underperform their names
The o-series is positioned as reasoning-specialized. The results told a different story.
| Model | Accuracy | Response time |
|---|---|---|
| o1 | 47.8% | 3.45s |
| o3 | 61.1% | 2.67s |
| o3-mini | 92.2% | 2.78s |
| o4-mini | 100% | 2.45s |
o1 at 47.8% was the most surprising number in the entire experiment. It scored well below both gpt-4o and gpt-4.1 (66.7% each). o3 at 61.1% wasn't much better.
The lightweight versions were a completely different story. o3-mini hit 92.2% and o4-mini hit 100% at 2.45 seconds. The best performer in the o-series was the smallest, cheapest model.
One caveat: these three problems specifically target token-level perception and decimal arithmetic — areas where the o1/o3 architecture may not have a structural advantage. Results on knowledge-heavy or code reasoning tasks could differ.
Takeaway: If you're using the o-series, o4-mini is the clear choice. o1 and o3 aren't worth the cost.
The value winner is gpt-5-nano (low)
Looking at accuracy and response time together, a few configurations stand out.
| Model | Accuracy | Response time | Notes |
|---|---|---|---|
| gpt-5-nano (low) | 98.9% | 1.72s | Best value |
| o4-mini | 100% | 2.45s | Best accuracy |
| gpt-5-mini (low) | 98.9% | 3.25s | Balanced |
| gpt-4o | 66.7% | 0.67s | Fastest |
gpt-5-nano (low) delivers 98.9% accuracy in 1.72 seconds, at nano pricing. For most production workloads, this configuration is the best starting point.
If you need 100% accuracy, use o4-mini. If sub-second response is the priority, use gpt-4o.
gpt-5-mini (high) takes 7.72 seconds and ties o4-mini at 100% accuracy. It's hard to construct a scenario where that trade-off makes sense.
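The selection logic in this section amounts to "take the fastest configuration that clears your accuracy bar". Here is a minimal sketch using only the figures quoted in this post; `RESULTS` and `fastest_meeting` are illustrative names, not part of the original tooling.

```python
# (accuracy, response time in seconds) copied from the figures in this post
RESULTS = {
    "gpt-5-nano (low)": (0.989, 1.72),
    "o4-mini": (1.0, 2.45),
    "gpt-5-mini (low)": (0.989, 3.25),
    "gpt-5-mini (high)": (1.0, 7.72),
    "gpt-4o": (0.667, 0.67),
}

def fastest_meeting(min_accuracy: float) -> str:
    """Return the fastest configuration whose accuracy is at least min_accuracy."""
    ok = {name: secs for name, (acc, secs) in RESULTS.items() if acc >= min_accuracy}
    # min() raises ValueError if no configuration clears the bar
    return min(ok, key=ok.get)

print(fastest_meeting(0.95))  # gpt-5-nano (low)
print(fastest_meeting(1.0))   # o4-mini
print(fastest_meeting(0.5))   # gpt-4o
```

gpt-5-mini (high) never wins under this rule: whenever it qualifies, o4-mini qualifies too and is faster.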
Results by Problem Type
Decimal comparison (9.11 vs 9.9)
The most discriminating problem. Many models followed the pattern "11 > 9" and got it wrong.
- GPT-4 series: 0–30% accuracy across the board
- o1, o3: 30–40%
- o4-mini, o3-mini: 100%
- gpt-5-mini/nano (low or higher): 95–100%
The entire GPT-4 family largely failed this problem. If decimal reasoning is core to your task, GPT-4 is not the right choice.
Character counting (r in 'strawberry')
Tests the token-level processing limitation directly.
- o4-mini, o3-mini: 100%
- GPT-5 (low or higher): 90–100%
- gpt-5-nano (minimal): 1 correct answer out of 30
Simple algebra (8.9 = x + 8.11)
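The easiest of the three. Most models solved it. Notable failures: gpt-5-nano (minimal) was unstable, and o1 occasionally returned -0.21. A plain sign flip would give -0.79, so this looks more like the same decimal-handling slip seen in the comparison problem.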
Practical Recommendations
Model selection by use case
Real-time conversational applications → gpt-4o (0.67s, 67%)
- Response speed is the top priority
General production workloads → gpt-5-nano (low) (1.72s, 99%) — best value
- High accuracy at nano pricing
- Best starting point for most tasks
Tasks requiring high accuracy → o4-mini (2.45s, 100%)
- Reasoning-heavy tasks, code debugging, math
Cost-optimized batch processing → gpt-5-nano (low) (1.72s, 99%)
- Most efficient at scale
GPT-5 reasoning effort guide
minimal → avoid (nano drops to 4.4%)
low → recommended for most cases (98–99% accuracy)
medium → when accuracy matters more than speed
high → o4-mini is faster with the same accuracy
Configurations to avoid
- gpt-5-nano (minimal): effectively unusable (4.4%)
- gpt-5-mini (high): 7.72s with no accuracy advantage over o4-mini
- o1, o3: poor cost-performance ratio
Technical Details
Responses API parameters
reasoning.effort (GPT-5 only)
- "minimal": minimal reasoning (fast but inaccurate)
- "low": low reasoning (recommended)
- "medium": medium reasoning (high accuracy)
- "high": maximum reasoning (top accuracy, slow)
text.verbosity
- "low": concise answer
- "medium": standard answer
- "high": detailed answer
Using logprobs for uncertainty quantification
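```python
import math

# Example inputs: candidate answers and their log-probabilities (the 72.9% / 26.8% case)
values = [3, 4]
logprobs = [math.log(0.729), math.log(0.268)]
probs = [math.exp(logprob) for logprob in logprobs]
# Weighted average captures model uncertainty
weighted_avg = sum(val * prob for val, prob in zip(values, probs)) / sum(probs)
# Entropy measures confidence (lower means more confident)
entropy = -sum(p * math.log(p) for p in probs if p > 0)
```
Experiment conditions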
- 30 repeated trials per configuration
- 3 problems × 30 trials = 90 data points per model
- Checkpoint feature for reproducibility
- Model versions as of April 14, 2025
Names, pricing, and marketing positioning are starting points, not answers. The only reliable method is testing on your actual task. That's what this experiment was.
Model versions as of October 2025. OpenAI updates models continuously — check the official docs for the latest.