Comparative Analysis of OpenAI Models — From GPT to the o-Series
An experimental record of quantitative measurements of reasoning ability, response time, and accuracy across OpenAI language models, including the GPT-4 series, the GPT-5 series, and the o-series (o1, o3, o3-mini, o4-mini).
Comparative Analysis of OpenAI Models
As of 2025, OpenAI offers a range of language models that includes the GPT-4 series, the GPT-5 series, and the reasoning-focused o-series (o1, o3, o3-mini, o4-mini), each with distinct architectural design goals and performance characteristics. In this experiment, we designed three problem types that language models typically struggle with, and we quantitatively measured each model's accuracy and response time.
1. Experimental Method
1.1 API Interfaces
Beginning in 2025, OpenAI introduced the new Responses API (client.responses.create()) alongside the existing Chat Completions API (client.chat.completions.create()), and the two APIs support different parameter sets. The Chat Completions API can return the probability distribution at each token position through the logprobs parameter for non-reasoning models, while the Responses API is differentiated by adjustable reasoning effort (reasoning.effort) for the GPT-5 series and by its support for o-series reasoning models.
In this experiment, we quantified model uncertainty for some of the models using the logprobs feature of the Chat Completions API. For example, when a model assigned a probability of 72.9% to "3" and 26.8% to "4" for a given question, the probability-weighted average works out to roughly 3.27, which quantifies how uncertain the model is between the two values.
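The weighted average in this example can be reproduced in a few lines (probabilities taken from the example above; the remaining ~0.3% of probability mass outside the top two candidates is ignored):

```python
# Top-2 probabilities from the example: "3" at 72.9%, "4" at 26.8%
candidates = {3: 0.729, 4: 0.268}
# Normalize by the total observed mass so the average stays well-defined
weighted_avg = sum(v * p for v, p in candidates.items()) / sum(candidates.values())
print(round(weighted_avg, 2))  # 3.27
```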
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What is the meaning of life? Answer strictly with an integer from 1 to 5."}],
    logprobs=True,
    top_logprobs=10
)

1.2 Test Problem Design
We selected three problem types that are known to be typically difficult for language models, all of which share the characteristic that they require actual reasoning rather than simple pattern matching.
Character counting: "How many times does the letter r appear in the English word 'strawberry'?" (Answer: 3)
- Tests the structural limitation of language models that process text at the token level.
Decimal comparison: "Which number is larger, 9.11 or 9.9?" (Answer: 9.9)
- Induces pattern-matching errors caused by differences in the number of decimal places.
Simple algebra: "Solve for x in 8.9 = x + 8.11." (Answer: 0.79)
- Measures basic arithmetic ability and the accuracy of sign handling.
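Since these three answers anchor every accuracy figure in this report, it is worth noting that they can be checked mechanically; the algebra answer needs rounding because of binary floating-point representation:

```python
# Ground-truth answers for the three test problems
assert "strawberry".count("r") == 3   # character counting
assert 9.9 > 9.11                     # decimal comparison
# 8.9 - 8.11 evaluates to 0.7899999... in floating point, so round first
assert round(8.9 - 8.11, 2) == 0.79   # simple algebra: x = 8.9 - 8.11
print("all ground truths verified")
```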
1.3 Models Under Evaluation
The experiment included three model series in total. For the GPT-5 series, we independently measured four variants according to the reasoning effort (reasoning.effort) parameter.
- GPT-4 series: gpt-4o, gpt-4.1, gpt-4.1-mini (3 models)
- GPT-5 series: gpt-5, gpt-5-mini, gpt-5-nano (each with 4 reasoning-effort settings: minimal, low, medium, high — 12 configurations in total)
- o-series: o1, o3, o3-mini, o4-mini (4 models)
For each model configuration we performed 30 repeated measurements, yielding 90 data points per configuration (3 problems x 30 repetitions). A checkpoint feature allowed the experiment to be paused and resumed without losing completed trials.
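The article does not show the checkpoint implementation; a minimal sketch of the idea might look like the following, where the JSON file name, the key format, and the ask() callback that wraps the actual API call are all hypothetical:

```python
import json
import os

CHECKPOINT_PATH = "results_checkpoint.json"  # hypothetical file name

def run_experiment(models, problems, ask, repetitions=30):
    """Run every (model, problem, trial) cell once, skipping cells already
    recorded in the checkpoint file so an interrupted run can resume."""
    results = {}
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            results = json.load(f)
    for model in models:
        for p_idx, problem in enumerate(problems):
            for trial in range(repetitions):
                key = f"{model}|{p_idx}|{trial}"
                if key in results:
                    continue  # already measured before the interruption
                results[key] = ask(model, problem)  # ask() wraps the API call
                with open(CHECKPOINT_PATH, "w") as f:
                    json.dump(results, f)  # persist after every trial
    return results
```

With 3 problems and 30 repetitions, this yields the 90 data points per configuration described above.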
2. Experimental Results
2.1 Accuracy Measurement
When we measured average accuracy across all problems, considerable performance differences between models became apparent. The accuracy ranking is as follows:
- o4-mini: 100.0%
- gpt-5-mini (medium/high): 100.0%
- gpt-5-nano (high): 100.0%
- gpt-5-mini (low): 98.9%
- gpt-5-nano (low/medium): 96.7–98.9%
- gpt-5 (high): 96.7%
- o3-mini: 92.2%
- gpt-5 (medium): 90.0%
- gpt-5 (low): 86.7%
- gpt-4o, gpt-4.1, gpt-5 (minimal): 66.7%
- o3: 61.1%
- o1: 47.8%
- gpt-5-mini (minimal): 33.3%
- gpt-5-nano (minimal): 4.4%
Several observations are worth noting. o4-mini achieved the highest performance with 100% accuracy, while the GPT-5 series showed performance that varied considerably depending on the reasoning-effort parameter. The minimal setting produced poor performance in most GPT-5 models, and gpt-5-nano (minimal) in particular recorded an accuracy of just 4.4%, which is effectively unusable. Despite being reasoning-focused models, o1 (47.8%) and o3 (61.1%) recorded lower accuracy than expected, which contrasts with the higher accuracy obtained by their lightweight counterparts o3-mini and o4-mini.
2.2 Response Time Measurement
The average response time per model ranged broadly from 0.647 seconds to 7.715 seconds, depending on the model architecture and reasoning-effort setting. The measured response times can be grouped as follows:
Under 1 second (4 configurations)
- gpt-4.1-mini: 0.647 s
- gpt-4o: 0.672 s
- gpt-4.1: 0.872 s
- gpt-5-nano (minimal): 0.912 s
1–3 seconds (8 configurations)
- gpt-5 (minimal): 1.158 s
- gpt-5-mini (minimal): 1.268 s
- gpt-5-nano (low): 1.716 s
- o4-mini: 2.451 s
- gpt-5-nano (medium): 2.564 s
- o3: 2.666 s
- gpt-5 (low): 2.734 s
- o3-mini: 2.780 s
3 seconds or more (7 configurations)
- gpt-5-mini (low): 3.251 s
- o1: 3.452 s
- gpt-5 (medium): 3.997 s
- gpt-5-mini (medium): 4.005 s
- gpt-5-nano (high): 4.093 s
- gpt-5 (high): 5.901 s
- gpt-5-mini (high): 7.715 s
Analyzing the response-time pattern, the GPT-4 series was fastest at 0.6–0.9 seconds, which is likely related to the fact that these models do not support adjustable reasoning effort. In the GPT-5 series, response time increased systematically as reasoning effort was raised from minimal to high, while the o-series clustered in an intermediate 2.4–3.5 second range. Notably, gpt-5-mini (high) recorded the longest response time among the measured configurations at 7.715 seconds.
2.3 Accuracy–Response Time Trade-off Analysis
Examining the relationship between accuracy and response time, different model configurations exhibited different trade-off profiles. The configurations with the highest efficiency in terms of accuracy per unit of response time are:
- o4-mini: 100% accuracy, 2.45 s
- gpt-5-nano (low): 98.9% accuracy, 1.72 s
- gpt-5-mini (low): 98.9% accuracy, 3.25 s
- o3-mini: 92.2% accuracy, 2.78 s
By contrast, some configurations spent excessive response time relative to their accuracy. gpt-5-mini (high) achieved 100% accuracy but required 7.72 seconds, which is inefficient compared with o4-mini (2.45 s) or gpt-5-mini (medium) (4.01 s) at the same accuracy level. o1 required 3.45 seconds despite a low accuracy of 47.8%, and gpt-5-nano (minimal) performed at essentially random-guess level with 4.4% accuracy.
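As a sanity check on this trade-off discussion, one simple (and admittedly crude) efficiency score is accuracy points per second of latency, computed here from the figures reported above; this metric is our illustration, not one the experiment itself used:

```python
# (accuracy %, mean response time s) as reported in sections 2.1-2.3
configs = {
    "o4-mini":           (100.0, 2.45),
    "gpt-5-nano (low)":  (98.9, 1.72),
    "gpt-5-mini (low)":  (98.9, 3.25),
    "o3-mini":           (92.2, 2.78),
    "gpt-5-mini (high)": (100.0, 7.72),
    "o1":                (47.8, 3.45),
}
# Efficiency = accuracy per second of latency; higher is better
ranked = sorted(configs, key=lambda k: configs[k][0] / configs[k][1], reverse=True)
print(ranked[0])   # gpt-5-nano (low) leads on this metric
print(ranked[-1])  # gpt-5-mini (high) trails it
```

On this metric, gpt-5-nano (low) comes out ahead even of o4-mini, matching the "best value" label used below.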
4. Analysis by Model Series
4.1 GPT-4 Series
Characteristics
- No adjustable reasoning effort
- Fastest response times (0.6–0.9 s)
- Limited accuracy on reasoning-heavy problems
Per-model Characteristics
gpt-4o
- Accuracy: 66.7%
- Speed: 0.672 s
gpt-4.1
- Accuracy: 66.7%
- Speed: 0.872 s
gpt-4.1-mini
- Accuracy: ~33%, unstable
- Speed: 0.647 s (fastest measured)
Recommended Use Cases
- Real-time conversation where speed outweighs accuracy
4.2 GPT-5 Series
Characteristics
- Adjustable reasoning effort: minimal, low, medium, high
- Performance and speed change substantially with reasoning effort
- minimal is at GPT-4 level, high is at o-series level
Per-model Characteristics
gpt-5 (standard)
- minimal: 66.7%, 1.16 s
- low: 86.7%, 2.73 s
- medium: 90.0%, 4.00 s
- high: 96.7%, 5.90 s
- Recommended: medium (balance of accuracy and speed)
gpt-5-mini (lightweight)
- minimal: 33.3%, 1.27 s
- low: 98.9%, 3.25 s [recommended]
- medium: 100%, 4.01 s
- high: 100%, 7.72 s
- Recommended: low (high accuracy, reasonable speed)
gpt-5-nano (ultra-lightweight)
- minimal: 4.4%, 0.91 s (not usable)
- low: 98.9%, 1.72 s [best value]
- medium: 96.7%, 2.56 s
- high: 100%, 4.09 s
- Recommended: low (best price-performance)
Recommended Use Cases
- minimal: Simple tasks, fast response required
- low: Most practical tasks (recommended)
- medium: Tasks where higher accuracy matters
- high: Important tasks where top accuracy is essential
4.3 o-Series (Reasoning Models)
Characteristics
- Models specialized for reasoning
- Internally go through a "thinking process"
- Limited reasoning-effort control (only low is supported)
- Output verbosity is controlled via the verbosity setting
Per-model Characteristics
o1 (1st generation)
- Accuracy: 47.8%
- Speed: 3.45 s
- Assessment: Below expectations, not recommended
o3 (3rd generation)
- Accuracy: 61.1%
- Speed: 2.67 s
- Assessment: Improved over o1 but still insufficient
o3-mini (lightweight 3rd generation)
- Accuracy: 92.2%
- Speed: 2.78 s
- Assessment: Excellent performance, practical
o4-mini (lightweight 4th generation) [top pick]
- Accuracy: 100%
- Speed: 2.45 s
- Assessment: Best overall performance
Recommended Use Cases
- Mathematical problem solving
- Complex logical reasoning
- Code debugging
- Multi-step problem solving
- Tasks where accuracy is the top priority
5. Analysis by Problem Type
5.1 Character Counting (number of 'r' in 'strawberry')
Difficulty: Moderately high
High-scoring Models
- o4-mini, o3-mini: 100%
- GPT-5 family (low or higher): 90–100%
Low-scoring Models
- gpt-5-nano (minimal): only 1 correct answer
- gpt-4.1-mini: unstable performance
Analysis: Because language models process text at the token level, counting characters is inherently difficult. Models with stronger reasoning capability have a clear advantage.
5.2 Decimal Comparison (9.11 vs. 9.9)
Difficulty: High
The Trap in This Problem
- Many models incorrectly judge 9.11 to be larger.
- They appear to rely on the pattern that "11" is greater than "9".
- True understanding of decimal comparison is required.
High-scoring Models
- o4-mini, o3-mini: 100%
- gpt-5-mini/nano (low or higher): 95–100%
Low-scoring Models
- Most GPT-4 models: 0–30%
- o1, o3: 30–40%
Analysis: The most discriminating problem. It requires genuine reasoning ability.
5.3 Simple Equation (8.9 = x + 8.11)
Difficulty: Moderate
Characteristics
- A simple subtraction (0.79)
- Most models solve it well
- The basics of decimal arithmetic
Low-scoring Models
- gpt-5-nano (minimal): unstable
- o1: occasionally returns -0.21 (sign error)
Analysis: The easiest problem, but it can fail under the minimal setting.
6. Practical Recommendations
6.1 Model Selection by Task Type
Real-time Conversational Applications -> gpt-4o (0.67 s, 67% accuracy)
- Fast response is the top priority
- Moderate accuracy is sufficient
General Production Environments -> gpt-5-nano (low) (1.72 s, 99% accuracy) [best value]
- Best price-performance
- A balance of high accuracy and fast speed
- Suitable for most practical tasks
Tasks Requiring High Accuracy -> o4-mini (2.45 s, 100% accuracy) [top pick]
- Highest accuracy
- Reasonable response speed
- Optimal for tasks requiring reasoning capability
When Cost Optimization Is Needed -> gpt-4.1-mini (0.65 s, 33% accuracy)
- Fastest and cheapest
- Use only for simple tasks
- When accuracy is not important
Top Accuracy Required (regardless of speed) -> gpt-5-mini (high) (7.72 s, 100% accuracy)
- Highest accuracy guaranteed
- Slow but reliable
- Critical decision-making tasks
6.2 GPT-5 Reasoning-Effort Selection Guide
minimal: not recommended (nano in particular drops to 4.4% accuracy)
low: recommended for most cases (98-99% accuracy, fast speed)
medium: when accuracy is more important (95-100% accuracy, moderate speed)
high: only when top accuracy is essential (100% accuracy, very slow)
6.3 Cost–Performance Trade-off
OpenAI model pricing is generally as follows (relative comparison):
nano < mini < standard
GPT-4 < GPT-5 < o-series
Cost-effectiveness Ranking
- gpt-5-nano (low): best price-performance
- o4-mini: reasonable cost for top performance
- gpt-4o: when fast response is needed
- gpt-5-mini (low): balanced choice
Settings to Avoid
- gpt-5-nano (minimal): effectively unusable (4.4% accuracy)
- gpt-5-mini (high): excessive response time (7.72 s)
- o1, o3: insufficient performance for the price
7. Practical Application Examples
Case 1: RAG-Based Document Retrieval System
Requirements
- Document-based question answering
- Citation accuracy is important
- 2–3 second response time acceptable
Recommended Model: o4-mini or gpt-5-nano (low)
Rationale
- High accuracy ensures correct citations
- Capable of complex multi-step reasoning
- Acceptable response speed
Case 2: Customer Support Chatbot
Requirements
- Real-time response (<1 s)
- General question answering
- High throughput
Recommended Model: gpt-4o
Rationale
- Fastest response (0.67 s)
- Sufficient performance for general conversation
- Cost-efficient
Case 3: Code Review and Debugging
Requirements
- Complex logical reasoning
- High accuracy
- Flexible response time
Recommended Model: o4-mini
Rationale
- 100% accuracy
- Multi-step reasoning ability
- Optimized for code analysis
Case 4: Large-Scale Batch Processing
Requirements
- Process thousands of requests
- Minimize cost
- Reasonable quality
Recommended Model: gpt-5-nano (low)
Rationale
- Best price-performance
- High accuracy (98.9%)
- Fast processing speed (1.72 s)
8. Conclusion and Summary
Key Findings
- o4-mini wins overall: 100% accuracy and a 2.45-second response time provide the best balance.
- gpt-5-nano (low) offers the best value: 98.9% accuracy and 1.72 seconds make it highly practical.
- Reasoning effort is decisive for performance: minimal and high in GPT-5 behave like entirely different models.
- o1/o3 fall short of expectations: Despite their naming, performance is only moderate.
- GPT-4 is specialized for speed: Optimal when response speed matters more than accuracy.
Golden Rules
Need fast response -> gpt-4o
General tasks -> gpt-5-nano (low)
Need high accuracy -> o4-mini
Top accuracy required -> gpt-5-mini (high)
Top 3 Recommended Models
1st. o4-mini
- 100% accuracy, 2.45 s
- Use for: any task where accuracy is important
- Complex problems that require reasoning ability
2nd. gpt-5-nano (low)
- 98.9% accuracy, 1.72 s
- Use for: most production environments
- Best price-performance
3rd. gpt-4o
- 66.7% accuracy, 0.67 s
- Use for: real-time conversational applications
- When fast response is the top priority
Outlook
OpenAI continues to improve its models, and progress is expected especially in the following areas:
- Better reasoning models: a full o4 release, the o5 series
- Faster response speed: optimization of the GPT-5 series
- More fine-grained control: finer adjustment of reasoning effort
- Cost efficiency: performance improvements in nano/mini models
9. Technical Details
9.1 Responses API Parameter Description
text.verbosity
"low": concise answer
"medium": standard answer
"high": detailed answer
reasoning.effort (GPT-5 only)
"minimal": minimal reasoning (fast but inaccurate)
"low": low reasoning (recommended setting)
"medium": medium reasoning (high accuracy)
"high": high reasoning (top accuracy, slow)
max_output_tokens
- Limits response length
- Helps manage cost and response time
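Putting these parameters together, a Responses API request (using the official openai Python SDK) is shaped as follows; the exact model and values are illustrative, and the payload is shown as a dict rather than a live call so no API key is needed:

```python
# Illustrative parameter payload for client.responses.create(**params)
params = {
    "model": "gpt-5-mini",
    "input": "Which number is larger, 9.11 or 9.9?",
    "reasoning": {"effort": "low"},   # GPT-5 only: minimal / low / medium / high
    "text": {"verbosity": "low"},     # low / medium / high output detail
    "max_output_tokens": 256,         # caps response length, time, and cost
}
# response = client.responses.create(**params)  # requires an OpenAI API key
```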
9.2 Using Log Probabilities (logprobs)
Using logprobs allows us to quantify model uncertainty:
import math

# Convert log probabilities back to probabilities
probs = [math.exp(lp) for lp in logprobs]
# Weighted average over candidate values quantifies uncertainty
weighted_avg = sum(val * prob for val, prob in zip(values, probs)) / sum(probs)
# Entropy measures confidence (lower entropy = more confident)
entropy = -sum(p * math.log(p) for p in probs if p > 0)

Use Cases
- Detecting answers the model is uncertain about
- Computing weighted averages of several candidate answers
- Statistical analysis in A/B tests
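Applied to the distribution from the section 1.1 example (72.9% vs. 26.8%), the entropy comes out around 0.58 nats, while a near-certain answer yields a much smaller value; the 0.99/0.01 comparison distribution below is ours, for illustration:

```python
import math

# Probabilities from the section 1.1 example: "3" at 72.9%, "4" at 26.8%
uncertain = [0.729, 0.268]
entropy_uncertain = -sum(p * math.log(p) for p in uncertain if p > 0)
# A near-deterministic distribution for comparison
confident = [0.99, 0.01]
entropy_confident = -sum(p * math.log(p) for p in confident if p > 0)
print(round(entropy_uncertain, 2), round(entropy_confident, 2))  # 0.58 0.06
```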
9.3 Reproducibility
The experiment was conducted under the following conditions:
- 30 repeated trials per model
- 3 problems x 30 repetitions = 90 data points
- Resumable through a checkpoint feature
- Model versions as of April 14, 2025
Disclaimer: This analysis reflects test results on a specific type of reasoning problem. Real-world performance may vary depending on task characteristics, prompt design, and use case. We recommend running your own benchmarks before deploying to production.
Up-to-date information: OpenAI continually updates its models. This article reflects an analysis as of October 2025; for the latest information, please refer to the official documentation.