Research October 16, 2025 13 min read

AI assisted

Comparative Analysis of OpenAI Models — From GPT to the o-Series

An experimental record of quantitative performance measurements of reasoning ability, response time, and accuracy across OpenAI language models including GPT-4, GPT-5, o1, o3, and o4.

#OpenAI #GPT #model comparison #AI #language models #reasoning models #performance analysis

As of 2025, OpenAI offers the GPT-4 series, the GPT-5 series, and the reasoning-focused o-series (o1, o3, o3-mini, o4-mini), each of which appears to have distinct design objectives and performance characteristics. To measure reasoning ability, we designed three problems of a kind that language models are known to struggle with, then measured the accuracy and response time of each model on them.

1. Experimental Method

1.1 API Interfaces

Beginning in 2025, OpenAI introduced a new Responses API (client.responses.create()) alongside the existing Chat Completions API (client.chat.completions.create()), and the two APIs support different parameter sets. For non-reasoning models, the Chat Completions API can return the log probabilities of the most likely candidate tokens at each position through the logprobs parameter. The Responses API, in turn, exposes the reasoning-effort setting (reasoning.effort) for the GPT-5 series and supports the o-series reasoning models.

For some of the models, we quantified uncertainty using the logprobs feature of the Chat Completions API. For example, if a model answers "3" with a probability of 72.9% and "4" with a probability of 26.8%, the probability-weighted average is (3 x 0.729 + 4 x 0.268) / (0.729 + 0.268), or roughly 3.27. A value that sits between the two answers shows how strongly the model is split between them.

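from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask for a single integer so that the answer fits in one output token
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What is the meaning of life? You must answer with an integer between 1 and 5."}],
    logprobs=True,    # return log probabilities for the generated tokens
    top_logprobs=10,  # plus the 10 most likely candidates at each position
)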
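The weighted average described above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the helper name `expected_rating` and the sample values are assumptions, and the input is a list of (token, logprob) pairs shaped like the entries of `response.choices[0].logprobs.content[0].top_logprobs`.

```python
import math

def expected_rating(top_logprobs, allowed=range(1, 6)):
    """Probability-weighted mean over the integer answers among the candidates.

    top_logprobs: iterable of (token, logprob) pairs, for example
    [(t.token, t.logprob) for t in response.choices[0].logprobs.content[0].top_logprobs].
    Candidates that are not integers in `allowed` are ignored.
    """
    weighted_sum = 0.0
    total_prob = 0.0
    for token, logprob in top_logprobs:
        text = token.strip()
        if text.isdecimal() and int(text) in allowed:
            prob = math.exp(logprob)
            weighted_sum += int(text) * prob
            total_prob += prob
    if total_prob == 0.0:
        raise ValueError("no valid integer answer among the candidate tokens")
    # Renormalize so that the ignored candidates do not bias the mean.
    return weighted_sum / total_prob

# Sample values mirroring the example in the text: "3" at 72.9%, "4" at 26.8%.
sample = [("3", math.log(0.729)), ("4", math.log(0.268)), ("Three", math.log(0.003))]
print(round(expected_rating(sample), 2))  # 3.27
```

The sketch assumes the answer is carried by the first output token, which is why the prompt asks for a bare integer.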

1.2 Test Problem Design

We selected three problem types that are known to be typically difficult for language models, all of which share the characteristic that they require actual reasoning rather than simple pattern matching.

Character counting: "How many times does the letter r appear in the English word 'strawberry'?" (Answer: 3)
- Tests the structural limitation of language models that process text at the token level.
Decimal comparison: "Which number is larger, 9.11 or 9.9?" (Answer: 9.9)
- Induces pattern-matching errors caused by differences in the number of decimal places.
Simple algebra: "Solve for x in 8.9 = x + 8.11." (Answer: 0.79)
- Measures basic arithmetic ability and the accuracy of sign handling.
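The article does not describe how answers were graded, so the following is only a sketch of one possible checker. It derives the three expected answers programmatically and compares a model's final answer numerically; the names `PROBLEMS` and `is_correct` are hypothetical.

```python
from decimal import Decimal, InvalidOperation

# Ground truth computed in code rather than typed in by hand.
PROBLEMS = {
    "char_count": str("strawberry".count("r")),                    # "3"
    "decimal_compare": str(max(Decimal("9.11"), Decimal("9.9"))),  # "9.9"
    "algebra": str(Decimal("8.9") - Decimal("8.11")),              # "0.79"
}

def is_correct(problem, answer):
    """Return True if the answer equals the ground truth as a decimal number."""
    try:
        return Decimal(answer.strip()) == Decimal(PROBLEMS[problem])
    except InvalidOperation:
        # Non-numeric text such as "three" is counted as incorrect.
        return False

print(PROBLEMS)
print(is_correct("algebra", "-0.21"))  # False
```

Decimal arithmetic is used instead of floats so that 8.9 - 8.11 is exactly 0.79 rather than a binary approximation.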

1.3 Models Under Evaluation

The experiment included three model series in total. For the GPT-5 series, we independently measured four variants according to the reasoning effort (reasoning.effort) parameter.

GPT-4 series: gpt-4o, gpt-4.1, gpt-4.1-mini (3 models)
GPT-5 series: gpt-5, gpt-5-mini, gpt-5-nano (each with 4 reasoning-effort settings: minimal, low, medium, high — 12 configurations in total)
o-series: o1, o3, o3-mini, o4-mini (4 models)

For each model configuration we performed 30 repeated measurements per problem, generating 90 data points per configuration (3 problems x 30 repetitions) and 1,710 data points across the 19 configurations. A checkpoint feature allowed the experiment to be paused and resumed without repeating completed trials.
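The setup above can be sketched as a configuration list plus a resumable loop. This is an assumed design, not the author's implementation: the JSON Lines checkpoint format, the `ask` callback, and all function names are hypothetical.

```python
import json
import os
from itertools import product

GPT4 = ["gpt-4o", "gpt-4.1", "gpt-4.1-mini"]
GPT5 = ["gpt-5", "gpt-5-mini", "gpt-5-nano"]
EFFORTS = ["minimal", "low", "medium", "high"]
O_SERIES = ["o1", "o3", "o3-mini", "o4-mini"]
PROBLEMS = ["char_count", "decimal_compare", "algebra"]
REPETITIONS = 30

def configurations():
    """All (model, effort) pairs; effort is None where it was not varied."""
    configs = [(model, None) for model in GPT4]
    configs += list(product(GPT5, EFFORTS))
    configs += [(model, None) for model in O_SERIES]
    return configs

def load_done(path):
    """Read finished trials from the checkpoint file, if it exists."""
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                done[record["key"]] = record["answer"]
    return done

def run_all(ask, path):
    """Run every trial once, skipping trials already stored in the checkpoint."""
    done = load_done(path)
    with open(path, "a") as f:
        for (model, effort), problem, rep in product(configurations(), PROBLEMS, range(REPETITIONS)):
            key = f"{model}|{effort}|{problem}|{rep}"
            if key in done:
                continue
            answer = ask(model, effort, problem)  # hypothetical callback that queries the model
            f.write(json.dumps({"key": key, "answer": answer}) + "\n")
            f.flush()  # persist each trial so an interrupted run loses nothing
            done[key] = answer
    return done

print(len(configurations()))  # 19
```

Appending one line per trial keeps the checkpoint cheap to write, and calling `run_all` again with the same path only executes the trials that are still missing.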

2. Experimental Results

2.1 Accuracy Measurement

When we measured average accuracy across all problems, considerable performance differences between models became apparent. The accuracy ranking is as follows:

o4-mini: 100.0%
gpt-5-mini (medium/high): 100.0%
gpt-5-nano (high): 100.0%
gpt-5-mini (low): 98.9%
gpt-5-nano (low): 98.9%
gpt-5-nano (medium): 96.7%
gpt-5 (high): 96.7%
o3-mini: 92.2%
gpt-5 (medium): 90.0%
gpt-5 (low): 86.7%
gpt-4o, gpt-4.1, gpt-5 (minimal): 66.7%

Several observations are worth noting. o4-mini reached 100% accuracy, as did gpt-5-mini (medium/high) and gpt-5-nano (high), while accuracy in the GPT-5 series varied considerably with the reasoning-effort parameter. The minimal setting performed poorly in all three GPT-5 models (66.7% for gpt-5, 33.3% for gpt-5-mini), and gpt-5-nano (minimal) in particular recorded just 4.4%, which is effectively unusable. Despite being reasoning-focused models, o1 (47.8%) and o3 (61.1%) scored lower than expected and well below their lightweight counterparts o3-mini (92.2%) and o4-mini (100%).
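Because each configuration was scored on 90 trials, every reported percentage should correspond to a whole number of correct answers. The following sketch recovers those counts from the figures reported above; the helper name `implied_correct` is an assumption, and the counts are inferred, not taken from raw data.

```python
TRIALS = 90  # 3 problems x 30 repetitions per configuration

# Accuracy figures as reported in this section.
reported = {
    "o4-mini": 100.0, "gpt-5-mini (low)": 98.9, "gpt-5 (high)": 96.7,
    "o3-mini": 92.2, "gpt-5 (medium)": 90.0, "gpt-5 (low)": 86.7,
    "gpt-4o": 66.7, "o3": 61.1, "o1": 47.8, "gpt-5-nano (minimal)": 4.4,
}

def implied_correct(percent, trials=TRIALS):
    """Count of correct trials whose rounded percentage matches, else None."""
    count = round(percent / 100 * trials)
    return count if round(count / trials * 100, 1) == percent else None

for name, percent in reported.items():
    print(f"{name}: {implied_correct(percent)}/{TRIALS}")
```

Read this way, 98.9% is 89 of 90 correct (a single miss), and the 4.4% of gpt-5-nano (minimal) is 4 of 90.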

2.2 Response Time Measurement

The average response time per model ranged broadly from 0.647 seconds to 7.715 seconds, depending on the model architecture and reasoning-effort setting. The measured response times can be grouped as follows:

Under 1 second (4 configurations)

gpt-4.1-mini: 0.647 s
gpt-4o: 0.672 s
gpt-4.1: 0.872 s
gpt-5-nano (minimal): 0.912 s

1–3 seconds (8 configurations)

gpt-5 (minimal): 1.158 s
gpt-5-mini (minimal): 1.268 s
gpt-5-nano (low): 1.716 s
o4-mini: 2.451 s
gpt-5-nano (medium): 2.564 s
o3: 2.666 s
gpt-5 (low): 2.734 s
o3-mini: 2.780 s

3 seconds or more (7 configurations)

gpt-5-mini (low): 3.251 s
o1: 3.452 s
gpt-5 (medium): 3.997 s
gpt-5-mini (medium): 4.005 s
gpt-5-nano (high): 4.093 s
gpt-5 (high): 5.901 s
gpt-5-mini (high): 7.715 s

The GPT-4 series was fastest, at 0.6–0.9 seconds, which is likely related to the fact that these models do not support adjustable reasoning effort. In the GPT-5 series, response time rose steadily as reasoning effort was raised from minimal to high (for gpt-5: 1.16 s, 2.73 s, 4.00 s, and 5.90 s), while the o-series fell into an intermediate range of roughly 2.5–3.5 seconds. gpt-5-mini (high) was the slowest configuration measured, at 7.715 seconds.
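The article does not say how response time was measured. One common approach, sketched here as an assumption, is to wrap each request in a monotonic wall-clock timer and average over the repetitions; the helper names are hypothetical, and a measurement of this kind includes network latency.

```python
import statistics
import time

def timed_call(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for one call, using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def mean_latency(fn, repetitions=30):
    """Average wall-clock time of `fn` over `repetitions` calls."""
    return statistics.mean(timed_call(fn)[1] for _ in range(repetitions))

# Stand-in for an API request; replace with a function that calls the model.
print(mean_latency(lambda: time.sleep(0.01), repetitions=3) > 0)  # True
```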

2.3 Accuracy–Response Time Trade-off Analysis

Different configurations show different trade-offs between accuracy and response time. The configurations that this experiment singles out as most efficient, combining high accuracy with a short response time, are:

o4-mini: 100% accuracy, 2.45 s
gpt-5-nano (low): 98.9% accuracy, 1.72 s
gpt-5-mini (low): 98.9% accuracy, 3.25 s
o3-mini: 92.2% accuracy, 2.78 s

By contrast, some configurations spent excessive response time relative to their accuracy. gpt-5-mini (high) achieved 100% accuracy but required 7.72 seconds, which is inefficient compared with o4-mini (2.45 s) or gpt-5-mini (medium) (4.01 s) at the same accuracy level. o1 required 3.45 seconds despite a low accuracy of 47.8%. gpt-5-nano (minimal) was fast (0.91 s) but answered correctly in only 4.4% of trials, which makes its speed irrelevant.
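One way to make this trade-off concrete is a Pareto frontier: the configurations for which no other configuration is both at least as accurate and faster. The sketch below applies that idea to the figures reported in this article; the frontier is an analysis added here for illustration, not part of the original experiment.

```python
# (configuration, accuracy in %, mean response time in s) as reported in this article.
RESULTS = [
    ("gpt-4.1-mini", 33.0, 0.647),  # accuracy is reported only as 33%
    ("gpt-4o", 66.7, 0.672),
    ("gpt-4.1", 66.7, 0.872),
    ("gpt-5-nano (minimal)", 4.4, 0.912),
    ("gpt-5 (minimal)", 66.7, 1.158),
    ("gpt-5-mini (minimal)", 33.3, 1.268),
    ("gpt-5-nano (low)", 98.9, 1.716),
    ("o4-mini", 100.0, 2.451),
    ("gpt-5-nano (medium)", 96.7, 2.564),
    ("o3", 61.1, 2.666),
    ("gpt-5 (low)", 86.7, 2.734),
    ("o3-mini", 92.2, 2.780),
    ("gpt-5-mini (low)", 98.9, 3.251),
    ("o1", 47.8, 3.452),
    ("gpt-5 (medium)", 90.0, 3.997),
    ("gpt-5-mini (medium)", 100.0, 4.005),
    ("gpt-5-nano (high)", 100.0, 4.093),
    ("gpt-5 (high)", 96.7, 5.901),
    ("gpt-5-mini (high)", 100.0, 7.715),
]

def pareto_frontier(results):
    """Names of configurations not dominated on accuracy (higher) and time (lower)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from fastest to slowest; on equal time, consider higher accuracy first.
    for name, accuracy, seconds in sorted(results, key=lambda r: (r[2], -r[1])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier

print(pareto_frontier(RESULTS))
# ['gpt-4.1-mini', 'gpt-4o', 'gpt-5-nano (low)', 'o4-mini']
```

On the reported numbers, every other configuration is matched or beaten on accuracy by a faster one of these four.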

3. Performance Analysis by Problem Type

3.1 Character-Counting Problem (number of 'r' in 'strawberry')

The character-counting problem is a task on which language models are known to struggle because of the structural property that they process text at the token level. In this experiment, o4-mini and o3-mini achieved 100% accuracy, while GPT-5 family models reached 90–100% accuracy at low effort or higher. By contrast, gpt-5-nano (minimal) provided the correct answer only once out of 30 attempts, and gpt-4.1-mini exhibited unstable performance. This suggests that models with stronger reasoning capability have a clear advantage in character-level analysis.

4.2 GPT-5 Series

Characteristics

Adjustable reasoning effort: minimal, low, medium, high
Performance and speed change substantially with reasoning effort
minimal is at GPT-4 level (gpt-5 at minimal matches gpt-4o at 66.7%); high approaches the best o-series result (96.7–100%, against 100% for o4-mini)

Per-model Characteristics

gpt-5 (standard)

minimal: 66.7%, 1.16 s
low: 86.7%, 2.73 s
medium: 90.0%, 4.00 s
high: 96.7%, 5.90 s
Recommended: medium (balance of accuracy and speed)

gpt-5-mini (lightweight)

minimal: 33.3%, 1.27 s
low: 98.9%, 3.25 s [recommended]
medium: 100%, 4.01 s
high: 100%, 7.72 s
Recommended: low (high accuracy, reasonable speed)

gpt-5-nano (ultra-lightweight)

minimal: 4.4%, 0.91 s (not usable)
low: 98.9%, 1.72 s [best value]
medium: 96.7%, 2.56 s
high: 100%, 4.09 s
Recommended: low (best price-performance)

Recommended Use Cases

minimal: Simple tasks, fast response required
low: Most practical tasks (recommended)
medium: Tasks where higher accuracy matters
high: Important tasks where top accuracy is essential

4.3 o-Series (Reasoning Models)

Characteristics

Models specialized for reasoning
Internally go through a "thinking process"
Limited reasoning-effort control (there is no minimal level, and each o-series model was measured in a single configuration here)
Output verbosity is controlled via the verbosity setting

Per-model Characteristics

o1 (1st generation)

Accuracy: 47.8%
Speed: 3.45 s
Assessment: Below expectations, not recommended

o3 (3rd generation)

Accuracy: 61.1%
Speed: 2.67 s
Assessment: Improved over o1 but still insufficient

o3-mini (lightweight 3rd generation)

Accuracy: 92.2%
Speed: 2.78 s
Assessment: Excellent performance, practical

o4-mini (lightweight 4th generation) [top pick]

Accuracy: 100%
Speed: 2.45 s
Assessment: Best overall performance

Recommended Use Cases

Mathematical problem solving
Complex logical reasoning
Code debugging
Multi-step problem solving
Tasks where accuracy is the top priority

5. Analysis by Problem Type

5.1 Character Counting (number of 'r' in 'strawberry')

Difficulty: Moderately high

High-scoring Models

o4-mini, o3-mini: 100%
GPT-5 family (low or higher): 90–100%

Low-scoring Models

gpt-5-nano (minimal): only 1 correct answer
gpt-4.1-mini: unstable performance

Analysis: Because language models process text at the token level, counting characters is inherently difficult. Models with stronger reasoning capability have a clear advantage.

5.2 Decimal Comparison (9.11 vs. 9.9)

Difficulty: High

The Trap in This Problem

Many models incorrectly judge 9.11 to be larger.
They appear to rely on the pattern that "11" is greater than "9".
True understanding of decimal comparison is required.

High-scoring Models

o4-mini, o3-mini: 100%
gpt-5-mini/nano (low or higher): 95–100%

Low-scoring Models

Most GPT-4 models: 0–30%
o1, o3: 30–40%

Analysis: The most discriminating problem. It requires genuine reasoning ability.
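The trap can be reproduced in a few lines. The sketch below illustrates the faulty pattern itself (comparing the digits after the dot as whole integers, as one would for version numbers); it makes no claim about how any model computes its answer, and both function names are hypothetical.

```python
from decimal import Decimal

def compare_like_versions(a, b):
    """Faulty pattern: compare the parts around the dot as whole integers."""
    return max(a, b, key=lambda s: tuple(int(part) for part in s.split(".")))

def compare_as_decimals(a, b):
    """Correct comparison: treat both strings as decimal numbers."""
    return max(a, b, key=Decimal)

print(compare_like_versions("9.11", "9.9"))  # 9.11 (wrong: 11 > 9)
print(compare_as_decimals("9.11", "9.9"))    # 9.9  (right: 0.90 > 0.11)
```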

5.3 Simple Equation (8.9 = x + 8.11)

Difficulty: Moderate

Characteristics

A simple subtraction (0.79)
Most models solve it well
The basics of decimal arithmetic

Low-scoring Models

gpt-5-nano (minimal): unstable
o1: occasionally returns -0.21 (sign error)

Analysis: The easiest problem, but it can fail under the minimal setting.

6. Practical Recommendations

6.1 Model Selection by Task Type

Real-time Conversational Applications -> gpt-4o (0.67 s, 67% accuracy)

Fast response is the top priority
Moderate accuracy is sufficient

General Production Environments -> gpt-5-nano (low) (1.72 s, 99% accuracy) [best value]

Best price-performance
A balance of high accuracy and fast speed
Suitable for most practical tasks

Tasks Requiring High Accuracy -> o4-mini (2.45 s, 100% accuracy) [top pick]

Highest accuracy
Reasonable response speed
Optimal for tasks requiring reasoning capability

When Cost Optimization Is Needed -> gpt-4.1-mini (0.65 s, 33% accuracy)

Fastest and cheapest
Use only for simple tasks
When accuracy is not important

Top Accuracy Required (regardless of speed) -> gpt-5-mini (high) (7.72 s, 100% accuracy)

Highest measured accuracy (100% on the test problems)
Slow but reliable
Critical decision-making tasks

6.2 GPT-5 Reasoning-Effort Selection Guide

minimal: not recommended (nano in particular drops to 4.4% accuracy)
low: recommended for most cases (98.9% accuracy for mini and nano, 86.7% for gpt-5; fast)
medium: when accuracy is more important (90-100% accuracy, moderate speed)
high: only when top accuracy is essential (96.7-100% accuracy, very slow)
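The spread behind each recommendation can be checked by aggregating the per-model figures from section 4.2 by effort level. This is a small sketch over the reported numbers; the names are illustrative.

```python
# Accuracy in % by model and reasoning effort, as reported in section 4.2.
GPT5_ACCURACY = {
    "gpt-5":      {"minimal": 66.7, "low": 86.7, "medium": 90.0, "high": 96.7},
    "gpt-5-mini": {"minimal": 33.3, "low": 98.9, "medium": 100.0, "high": 100.0},
    "gpt-5-nano": {"minimal": 4.4, "low": 98.9, "medium": 96.7, "high": 100.0},
}

def accuracy_range(effort):
    """Lowest and highest accuracy across the three GPT-5 models at one effort level."""
    values = [per_effort[effort] for per_effort in GPT5_ACCURACY.values()]
    return min(values), max(values)

for effort in ("minimal", "low", "medium", "high"):
    lowest, highest = accuracy_range(effort)
    print(f"{effort}: {lowest}-{highest}%")
```

The low setting spans 86.7% to 98.9% because the full-size gpt-5 benefits less from it than the mini and nano variants do.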

6.3 Cost–Performance Trade-off

Relative pricing is assumed to be roughly as follows (prices were not measured in this experiment):

nano < mini < standard
GPT-4 < GPT-5 < o-series

Cost-effectiveness Ranking

gpt-5-nano (low): best price-performance
o4-mini: reasonable cost for top performance
gpt-4o: when fast response is needed
gpt-5-mini (low): balanced choice

Settings to Avoid

gpt-5-nano (minimal): effectively unusable (4.4% accuracy)
gpt-5-mini (high): excessive response time (7.72 s)
o1, o3: insufficient performance for the price

7. Practical Application Examples

Case 1: RAG-Based Document Retrieval System

Requirements

Document-based question answering
Citation accuracy is important
2–3 second response time acceptable

Recommended Model: o4-mini or gpt-5-nano (low)

Rationale

High accuracy ensures correct citations
Capable of complex multi-step reasoning
Acceptable response speed

Case 2: Customer Support Chatbot

Requirements

Real-time response (<1 s)
General question answering
High throughput

Recommended Model: gpt-4o

Rationale

Fast response (0.67 s), second only to gpt-4.1-mini
Sufficient performance for general conversation
Cost-efficient

Case 3: Code Review and Debugging

Requirements

Complex logical reasoning
High accuracy
Flexible response time

Recommended Model: o4-mini

Rationale

100% accuracy on the three test problems
Multi-step reasoning ability
Reasoning-focused design is a plausible fit for code analysis (not tested here)

Case 4: Large-Scale Batch Processing

Requirements

Process thousands of requests
Minimize cost
Reasonable quality

Recommended Model: gpt-5-nano (low)

Rationale

Best price-performance
High accuracy (98.9%)
Fast processing speed (1.72 s)

8. Conclusion and Summary

Key Findings

o4-mini wins overall: 100% accuracy and a 2.45-second response time provide the best balance.
gpt-5-nano (low) offers the best value: 98.9% accuracy and 1.72 seconds make it highly practical.
Reasoning effort is decisive for performance: minimal and high in GPT-5 behave like entirely different models.
o1/o3 fall short of expectations: Despite their naming, performance is only moderate.
GPT-4 is specialized for speed: Optimal when response speed matters more than accuracy.

Golden Rules

Need fast response       -> gpt-4o
General tasks            -> gpt-5-nano (low)
Need high accuracy       -> o4-mini
Top accuracy required    -> gpt-5-mini (high)

Top 3 Recommended Models

1st. o4-mini

100% accuracy, 2.45 s
Use for: any task where accuracy is important
Complex problems that require reasoning ability

2nd. gpt-5-nano (low)

98.9% accuracy, 1.72 s
Use for: most production environments
Best price-performance

3rd. gpt-4o

66.7% accuracy, 0.67 s
Use for: real-time conversational applications
When fast response is the top priority

Outlook

OpenAI continues to improve its models, and progress is expected especially in the following areas:

Better reasoning models: a full o4 release, the o5 series
Faster response speed: optimization of the GPT-5 series
More fine-grained control: finer adjustment of reasoning effort
Cost efficiency: performance improvements in nano/mini models

9. Technical Details

9.1 Responses API Parameter Description

text.verbosity

"low": concise answer
"medium": standard answer
"high": detailed answer

reasoning.effort (the four levels below were used for the GPT-5 series)

"minimal": minimal reasoning (fast but inaccurate)
"low": low reasoning (recommended setting)
"medium": medium reasoning (high accuracy)
"high": high reasoning (top accuracy, slow)

max_output_tokens

Limits response length
Helps control cost and response time
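The three parameters above can be combined into a single request. The sketch below only builds the keyword arguments, so it runs without network access; the helper name `build_request` and the default values (including the 512-token limit) are assumptions chosen for illustration.

```python
def build_request(model, prompt, effort="low", verbosity="low", max_output_tokens=512):
    """Keyword arguments for client.responses.create(); defaults are illustrative."""
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},      # "minimal", "low", "medium", or "high"
        "text": {"verbosity": verbosity},     # "low", "medium", or "high"
        "max_output_tokens": max_output_tokens,
    }

kwargs = build_request("gpt-5-nano", "Which number is larger, 9.11 or 9.9?")
print(kwargs["reasoning"])  # {'effort': 'low'}

# With the official SDK and an API key, the call would look like:
#   from openai import OpenAI
#   response = OpenAI().responses.create(**kwargs)
#   print(response.output_text)
```

Keeping the request construction separate from the call makes it easy to log the exact settings of every trial alongside its result.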

9.2 Using Log Probabilities (logprobs)

Using logprobs allows us to quantify model uncertainty:

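import math

# Example inputs (illustrative): candidate answers and their log probabilities
values = [3, 4]
logprobs = [math.log(0.729), math.log(0.268)]

# Convert log probabilities to probabilities
probs = [math.exp(logprob) for logprob in logprobs]

# Expected answer: probability-weighted average of the candidate values
weighted_avg = sum(val * prob for val, prob in zip(values, probs)) / sum(probs)

# Uncertainty: Shannon entropy (in nats) of the renormalized distribution
total = sum(probs)
entropy = -sum((p / total) * math.log(p / total) for p in probs if p > 0)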

Use Cases

Detecting answers the model is uncertain about
Computing weighted averages of several candidate answers
Statistical analysis in A/B tests
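For the first use case, one possible approach is to flag an answer as uncertain when the entropy of its candidate distribution is high. The sketch below scales entropy to the range 0 to 1 so that a single threshold works for any number of candidates; the function names and the 0.5 threshold are assumptions, not values from the experiment.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of the renormalized probabilities, scaled to [0, 1]."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    dist = [p / total for p in probs if p > 0]
    if len(dist) < 2:
        return 0.0  # a single candidate carries no uncertainty
    entropy = -sum(p * math.log(p) for p in dist)
    return entropy / math.log(len(dist))  # divide by the maximum possible entropy

def is_uncertain(probs, threshold=0.5):
    """Flag answers whose candidate distribution is close to uniform."""
    return normalized_entropy(probs) > threshold

print(is_uncertain([0.729, 0.268]))  # True: the model is split between two answers
print(is_uncertain([0.99, 0.01]))    # False: one answer clearly dominates
```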

9.3 Reproducibility

The experiment was conducted under the following conditions:

30 repeated trials per model
3 problems x 30 repetitions = 90 data points
Resumable through a checkpoint feature
Model versions as of April 14, 2025

Disclaimer: This analysis reflects test results on a specific type of reasoning problem. Real-world performance may vary depending on task characteristics, prompt design, and use case. We recommend running your own benchmarks before deploying to production.

Up-to-date information: OpenAI continually updates its models. This article reflects an analysis as of October 2025; for the latest information, please refer to the official documentation.