
Evaluation of OpenAI's GPT-5 Model: Designed to Excel in Benchmarks Rather Than Capture Hearts

The AI powerhouse excels in coding and logical challenges yet struggles with creative tasks, hemmed in by restrictive safety measures, with competitors like Claude outperforming it in certain areas.



Last week, OpenAI unveiled its latest AI model, GPT-5. While it demonstrates remarkable proficiency in analytical and technical domains, such as coding, logical reasoning, mathematical problem-solving, and scientific analysis, early user reception has been split.

GPT-5's reasoning capabilities shine brightest on complex, multi-layered problems that require tracking numerous variables. In coding tasks, the model produces clean, functional code that usually works right out of the box.

According to early tests, however, GPT-5 struggles with creative, right-brain output. It can fail to retrieve specific pieces of information when prompted directly, and it feels limited in areas that call for distinctly human creativity, artistic intuition, and the subtle nuance that comes from lived experience.

In contrast, Claude Opus 4.1, priced at $15 per 1 million input tokens and $75 per 1 million output tokens, focuses more on precision coding and enterprise-grade development tasks. While it is neck and neck with GPT-5 for best-in-class coding, it offers less versatility in creative writing.
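The per-token prices quoted above translate directly into request cost. As a minimal sketch (rates from the article; the token counts and function name are illustrative, not from any vendor SDK):

```python
# Claude Opus 4.1 list prices cited above, in USD per 1M tokens
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 75.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single API call at these rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Example: a 20,000-token prompt with a 2,000-token reply
print(round(request_cost(20_000, 2_000), 2))  # 0.45
```

At these rates, output tokens cost five times as much as input tokens, so long generations dominate the bill even for modest prompts.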

In head-to-head comparisons with Claude Opus 4.1, though, the differences in creative writing are notable, and they favor GPT-5. Reviewers credit it with handling complex creative constraints, such as poetry and storytelling, and delivering work with stronger emotional impact and clearer imagery. It is described as the better writer, with greater nuance and style across formats such as rap battles, funny stories, haikus, and eulogies.

Claude Opus 4.1, by contrast, is tuned for sustained agentic workflows and enterprise-grade development rather than creative applications.

GPT-5 also offers broader versatility, with a very large 1-million-token context window that supports more expansive, contextually rich creative generation. It is also reported to hallucinate significantly less and answer more honestly, making its creative output more reliable and grounded, which benefits nuanced writing tasks.

Thousands of people have signed a petition demanding the return of the older GPT-4o model. Users aren't wrong to mourn the loss of GPT-4o, which managed to balance capability with character in a way that, at least for now, GPT-5 lacks.

On Polymarket, the odds that OpenAI would have the best AI model at the end of August cratered from 75% to 12% within hours of GPT-5's debut. Google overtook OpenAI, with an 80% chance of holding the top spot by the end of the month.

Despite the mixed reception, GPT-5 is referred to as the "smartest, fastest, most useful model yet" by OpenAI. Other researchers have concluded that GPT-5 is actually a great model for information retrieval.

Following the petition, OpenAI has promised to bring back the older GPT-4o model. Whether that will appease users longing for its return remains to be seen.

