
Evaluation of OpenAI's GPT-5 Model: Designed to Excel in Benchmarks Rather Than Capture Hearts

The AI powerhouse excels in coding and logical challenges yet struggles with creative tasks, hemmed in by restrictive safety measures, with competitors like Claude outperforming it in certain areas.



Last week, OpenAI unveiled its latest AI model, GPT-5. While it demonstrates remarkable proficiency in analytical and technical domains, such as coding, logical reasoning, mathematical problem-solving, and scientific analysis, early user reception has been split.

GPT-5's reasoning capabilities shine brightest on complex, multi-layered problems that require tracking numerous variables. In coding tasks, the model produces clean, functional code that usually works right out of the box.

According to early tests, however, GPT-5 struggles with creative, right-brain output. It can fail to retrieve specific pieces of information when prompted directly, and it feels limited in areas that call for distinctly human creativity, artistic intuition, and the subtle nuance that comes from lived experience.

In contrast, Claude Opus 4.1, priced at $15 per 1 million input tokens and $75 per 1 million output tokens, focuses more on precision coding and enterprise-grade development tasks. While it is neck and neck with GPT-5 for best-in-class coding, it offers less versatility in creative writing.
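The per-token prices quoted above translate directly into request cost. As a minimal sketch (rates from the article; the token counts and function name are illustrative, not from any vendor SDK):

```python
# Claude Opus 4.1 list prices cited above, in USD per 1M tokens
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 75.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single API call at these rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Example: a 20,000-token prompt with a 2,000-token reply
print(round(request_cost(20_000, 2_000), 2))  # 0.45
```

At these rates, output tokens cost five times as much as input tokens, so long generations dominate the bill even for modest prompts.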

In head-to-head comparisons with Claude Opus 4.1, though, the differences in creative writing are notable, and they favor GPT-5. Reviewers credit it with handling complex creative constraints, such as poetry and storytelling, and delivering work with stronger emotional impact and clearer imagery. It is described as the better writer, with greater nuance and style across formats such as rap battles, funny stories, haikus, and eulogies.

Claude Opus 4.1, by contrast, is tuned for sustained agentic workflows and enterprise-grade development rather than creative applications.

GPT-5 also offers broader versatility, with a very large 1-million-token context window that supports more expansive, contextually rich creative generation. It is also reported to hallucinate significantly less and answer more honestly, making its creative output more reliable and grounded, which benefits nuanced writing tasks.

Thousands of people have signed a petition demanding the return of the older GPT-4o model. Users aren't wrong to mourn the loss of GPT-4o, which managed to balance capability with character in a way that, at least for now, GPT-5 lacks.

On Polymarket, the odds that OpenAI would have the best AI model at the end of August cratered from 75% to 12% within hours of GPT-5's debut. Google overtook OpenAI, with an 80% chance of holding the top spot by the end of the month.

Despite the mixed reception, GPT-5 is referred to as the "smartest, fastest, most useful model yet" by OpenAI. Other researchers have concluded that GPT-5 is actually a great model for information retrieval.

Following the petition, OpenAI has promised to bring back the older GPT-4o model. Whether that will appease users longing for its return remains to be seen.

