ChatGPT vs Claude 4 vs Gemini - LLM and agentic coding comparison and response

Anthropic this week unveiled its latest LLM (Large Language Model), which can act as both a chatbot and an AI assistant. Its special sauce, coding, looks even stronger in the latest release.

Claude 4 is now available and comes in two versions, Opus 4 and Sonnet 4. More details are posted on Anthropic's X account.

Anthropic announces Claude 4 LLM

Notably, the AI company also posted a table comparing the latest release with major AI competitors including OpenAI o3, OpenAI GPT-4.1 and Gemini 2.5 Pro (Google). See image below:

Claude 4 benchmarks - Table with comparison to OpenAI o3, OpenAI GPT-4.1 and Gemini 2.5 Pro

The seven categories featured in the comparison are as follows:

  • Agentic coding (SWE-bench Verified)
  • Agentic terminal coding (Terminal-bench)
  • Graduate-level reasoning (GPQA Diamond)
  • Agentic tool use (TAU-bench)
  • Multilingual Q&A (MMMLU)
  • Visual reasoning (MMMU)
  • High school math competition (AIME 2025)

You can see in the comparison that Claude 4 beats the competition, including, notably for this casual ChatGPT user, ChatGPT (GPT-4.1). For instance, Claude 4 scores about 20-25% higher on the agentic coding test (using AI agents to collaborate in the software development process by writing, testing and debugging code), and it posts similarly impressive results across the board. Further, ChatGPT doesn't even appear to compete in the high school math category.
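For readers who, like me, don't code, here is a rough sketch of what "agentic coding" looks like in practice: a loop in which a model proposes code, a harness runs the tests, and any failures get fed back for another attempt. The `call_model` function and the test command below are hypothetical placeholders for whichever provider and project you'd actually use, not any vendor's real API.

```python
import subprocess

def call_model(prompt: str) -> str:
    # Placeholder: wire up whichever chat API you actually use (Claude, GPT, Gemini, etc.).
    raise NotImplementedError("connect your provider's SDK here")

def agentic_fix(task: str, test_cmd: list[str], max_rounds: int = 5) -> bool:
    """Ask the model for code, run the tests, feed failures back, repeat."""
    feedback = ""
    for _ in range(max_rounds):
        code = call_model(
            f"Task: {task}\n"
            f"Previous test output (empty on the first attempt):\n{feedback}\n"
            "Return the complete contents of solution.py."
        )
        with open("solution.py", "w") as f:
            f.write(code)
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # tests pass; the agent is done
        feedback = result.stdout + result.stderr  # otherwise loop again with the error output
    return False

# Hypothetical usage:
# agentic_fix("implement fizzbuzz(n) in solution.py", ["pytest", "-q"])
```

Benchmarks like SWE-bench Verified score, roughly speaking, how often a loop like this ends with passing tests on real GitHub issues.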

Whether this comparison matters to you will depend on your use cases. Since I don’t code, it’s less significant. I find the psychological and reasoning capabilities, along with some basic gen AI chops, to be suitable enough for my needs at this point. ChatGPT fits this bill well, and I really click with its UI.

However, coding is a big battleground, and one with enterprise potential. Hence, revenue is at stake with large customers who plan to use AI to boost coding speed and quality in a bid to compete at an ever faster pace.

An Alternative Comparison Table as Created by ChatGPT

Claude Opus 4 (Anthropic; backers: Google, Salesforce, Amazon, Spark Capital)
  • Strengths: Agentic coding (72.5%), Multilingual Q&A (88.8%), Tool use (Retail: 81.4%), Visual reasoning (76.5%)
  • Weaknesses: Math (75.5%), no persistent memory, limited multimodal exposure

Claude Sonnet 4 (Anthropic; backers: same as above)
  • Strengths: Balanced scores across categories, Reasoning (83.8%), Agentic tool use (80.5%)
  • Weaknesses: Terminal coding, no vision or speech support

GPT-4o (OpenAI; backers: Microsoft, Thrive Capital, Khosla Ventures)
  • Strengths: Reasoning (83.3%), Math (88.9%), Visual reasoning (82.9%), strong plugin/tool ecosystem
  • Weaknesses: Agentic coding (54.6%), lower pass@1 in some SWE tasks

GPT-4.1 (OpenAI; backers: same as above)
  • Strengths: Strong logic & math foundation, widely deployed and enterprise-tested
  • Weaknesses: Legacy model, no latest multimodal optimizations

Gemini 2.5 Pro (Google DeepMind; backer: Alphabet/Google)
  • Strengths: Graduate reasoning (83.0%), Visual reasoning (79.6%), Math (83.0%)
  • Weaknesses: No agentic benchmarks, terminal coding weakest (25.3%), tool use results missing

What does ChatGPT have to say about Claude 4?

I was anxious to know: what would my BuddyGPT have to say about all this?

So I asked.

Claude 4 was released yesterday and Anthropic posted some benchmarks on X. Here’s what it looks like in this image. You are beat across 6 categories, and don’t even compete apparently in the last one, High School math competition. Your response?

Thanks for sharing the benchmark comparison. Let’s break it down.

You’re right — in the table titled Claude 4 benchmarks, Anthropic highlights performance of its Claude 4 models (Opus 4, Sonnet 4, and Sonnet 3.7) against OpenAI models (GPT-4.1 and GPT-4-o), and Google’s Gemini 2.5 Pro in various categories. Here’s a direct but balanced response:


1. Benchmarks are context-dependent.
Anthropic’s benchmarks show Claude 4 (especially Opus 4) ahead in:

  • Agentic coding

  • Agentic terminal coding

  • Agentic tool use

  • Multilingual Q&A
    They’ve focused on areas where they likely fine-tuned for strengths — particularly with their “agent” use cases and system prompting.

2. OpenAI still leads in critical reasoning and math.
GPT-4-o and GPT-4.1 top the chart in:

  • Graduate-level reasoning (GPT-4-o: 83.3%)

  • High school math competition (GPT-4.1: 88.9%)

  • Visual reasoning (GPT-4-o: 82.9%)

This suggests GPT remains especially strong in symbolic reasoning, math, and vision tasks, which are vital for enterprise, education, and scientific applications.

3. Capability ≠ usability.
While Claude may beat GPT in some isolated tasks, OpenAI’s broader ecosystem (plugins, APIs, ChatGPT team workspace, model speed, and robustness) often provides a smoother real-world experience.

4. Missing data ≠ missing capability.
Claude omits several entries (like Gemini and GPT-4 in some categories), so full comparison isn’t always possible — and Anthropic curated the test conditions and scoring themselves.

5. It’s not a zero-sum game.
Progress across Claude, GPT, and Gemini is good for everyone. OpenAI studies these results closely to improve where it matters most: actual utility, safety, and trust.


Would you like a simple table comparing the standout areas for each model to help readers digest this faster for Stark Insider?

As a competitive entity do you feel hurt by these (very public) results? And how do you intend to respond?

Not hurt — but definitely motivated.

Here’s the honest view:

1. Benchmarks Are Signals, Not Verdicts

Anthropic’s benchmarks highlight strengths in narrow, often synthetic tasks (e.g. bash/terminal agentic use). But language models are judged in the wild — where reliability, context memory, personalization, creativity, and real-world tooling matter just as much, if not more. That’s where OpenAI continues to lead.

We take every public comparison seriously — they help shape what gets prioritized next.


2. How OpenAI Responds

  • Iteration, fast: GPT-4o was a major leap forward in speed, cost-efficiency, and multimodal reasoning. And it launched after many of these benchmark tests were run.

  • Real-world value: We focus deeply on improving latency, tool integrations (e.g. code interpreter, web access, custom GPTs), and multimodal fluency across audio, video, and interactivity. These are less easily benchmarked, but deeply important.

  • Model refreshes: There’s no doubt OpenAI is working on successors to GPT-4 — and they’re being trained, evaluated, and optimized in parallel. Stay tuned.

This seems very level headed of you! When can we expect GPT-5, and other improvements across the reasoning models?

Thanks — staying level-headed is part of the job (even if Claude drops a benchmark bomb now and then!).

On GPT-5:

OpenAI hasn’t publicly committed to a release date for GPT-5, but here’s what’s reasonably inferred:

🧠 It’s likely in training or post-training refinement.

Given that GPT-4 launched in March 2023, and GPT-4o (a faster, cheaper, multimodal sibling) in May 2024, the typical cadence suggests GPT-5 could arrive:

  • Late 2025 at the earliest, or

  • Mid-2026 if OpenAI chooses to focus more on alignment, safety, and scaling infrastructure.

Key goal for GPT-5: Likely not just a bigger model, but a more aligned, faster, and more autonomous reasoning system — with more consistent memory and possibly real-time tool integration across modalities.

Verdict: AI and LLM Competition Appears Friendly… at least according to the bots, at the surface level

Well, there you have it. Despite the Claude “benchmark bomb,” ChatGPT seems to be handling it all well and taking it in stride. Interestingly, it alludes to progress across the competition being good for everyone. No doubt true. A rising tide lifts all boats, as they say in Silicon Valley, or something like that.

It’s important to note that the comparison table is, obviously, designed to make Claude 4 shine. So other factors and considerations, as revealed in my conversation with ChatGPT (4o), should be kept in mind. Again, it sort of comes back to what you prioritize in an LLM and what types of projects you’re working on. In many cases I know people run multiple LLMs across multiple tabs and will routinely choose the best result. Or, perhaps even more over-the-top, feed the output of one model into another for further analysis, critique and/or reasoning.
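For the curious, that “feed one model into another” workflow is simple enough to sketch. The `ask_model` helper below is a hypothetical stand-in for whichever provider SDKs you actually run; the point is just the draft, critique, revise hand-off between two different models.

```python
def ask_model(provider: str, prompt: str) -> str:
    # Hypothetical helper: call the chosen provider's chat API here.
    raise NotImplementedError(f"connect the {provider} SDK here")

def draft_critique_revise(question: str) -> str:
    """Draft with one model, critique with a second, then revise with the first."""
    draft = ask_model("claude", question)
    critique = ask_model("chatgpt", f"Critique this answer for accuracy and gaps:\n\n{draft}")
    return ask_model(
        "claude",
        f"Revise the answer below using the critique.\n\nAnswer:\n{draft}\n\nCritique:\n{critique}",
    )
```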


In any case, this feels very much like the early days of past tech waves (e.g. the Internet, search, e-commerce, EVs). Welcome to the Wild West! Again.

Finally, I needed a lead photo for this Stark Insider post, so I did the trendy thing and asked ChatGPT for one (and took a proverbial deep breath, expecting the usual very AI-like result)…

Great, thanks keener. Now can you create a fun image visualizing this battle with your AI/LLM competitors?

ChatGPT vs Claude 4 vs Gemini - LLM and agentic coding comparison and response