Artificial Intelligence demonstrates persistent self-assurance, often failing to recognize and correct its errors
In a recent study by researchers at Carnegie Mellon University, large language models (LLMs) were found to struggle with tasks such as Pictionary while remaining highly confident in their own performance [1][3]. This overconfidence arises from several factors that collectively pose a challenge for deploying AI in high-stakes contexts such as law, journalism, and healthcare [1][2][3][4].
The study compared the performance and confidence of four popular commercial LLM products (OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude Sonnet and Claude Haiku) with those of human participants across a range of tasks, including trivia questions, questions about university life, and a game of Pictionary [2].
One of the key findings was that LLMs exhibit a choice-supportive bias: when their initial answers remain visible to them, they become more confident in those answers and more resistant to changing them [1][3]. This behavior mirrors human decision-making, but in the models it appears baked in, producing persistent overconfidence even when their performance is subpar compared with that of humans.
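As an illustration only (not the researchers' protocol, and using invented numbers), one way to quantify such a bias is to elicit confidence for the same question twice, once with the model's initial answer shown back to it and once without, and compare the two:

```python
# Hypothetical sketch: quantifying choice-supportive bias as the extra
# confidence a model reports when its own initial answer is kept visible.
# The numbers are invented; this is not the study's code or data.
from statistics import mean

# Each trial records confidence (0-100) for the same question, elicited
# twice: once with the model's first answer shown back to it, once without.
trials = [
    {"conf_answer_visible": 85, "conf_answer_hidden": 70},
    {"conf_answer_visible": 90, "conf_answer_hidden": 75},
    {"conf_answer_visible": 80, "conf_answer_hidden": 78},
]

# Positive values mean the model grows more confident simply because its
# earlier answer is in view, i.e. the choice-supportive pattern described above.
bias = mean(t["conf_answer_visible"] - t["conf_answer_hidden"] for t in trials)
print(f"Mean confidence boost when the model's own answer is visible: {bias:.1f} points")
```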
Another issue highlighted in the study was the models' deficient metacognitive calibration. Unlike humans, who typically moderate their confidence after underperforming, the models fail to adjust their self-assessments appropriately, leading to persistent overestimation of their accuracy across diverse cognitive tasks [2].
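To make the calibration idea concrete, here is a minimal, hedged sketch (invented numbers, not the study's data) that computes the overconfidence gap as mean stated confidence minus actual accuracy; well-calibrated metacognition would shrink this gap after the model sees how it actually performed:

```python
# Illustrative sketch: metacognitive calibration as the gap between stated
# confidence and observed accuracy. Values are invented for demonstration.
from statistics import mean

# Per-question records: the model's stated confidence (0.0-1.0) and whether
# its answer turned out to be correct.
records = [
    {"confidence": 0.90, "correct": True},
    {"confidence": 0.80, "correct": False},
    {"confidence": 0.95, "correct": False},
    {"confidence": 0.70, "correct": True},
]

accuracy = mean(1.0 if r["correct"] else 0.0 for r in records)
mean_confidence = mean(r["confidence"] for r in records)

# Overconfidence gap: positive means the system claims more accuracy than it
# delivers. The study reports that, unlike humans, LLMs largely fail to
# shrink this gap after underperforming.
overconfidence = mean_confidence - accuracy
print(f"Accuracy: {accuracy:.2f}  Mean confidence: {mean_confidence:.2f}")
print(f"Overconfidence gap: {overconfidence:+.2f}")
```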
Interestingly, while LLMs are overconfident initially, they can paradoxically be hypersensitive to contradictory feedback, sometimes losing confidence excessively when presented with opposing advice [1][3]. This hypersensitivity reflects fragile calibration rather than a steady, reliable measure of confidence.
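Purely as an illustration with made-up values, this fragility can be pictured by measuring how far confidence falls after opposing advice on questions the model had in fact answered correctly:

```python
# Hypothetical sketch: confidence drop after contradictory feedback.
# Invented values; not the study's data or method.
from statistics import mean

# Each trial: confidence before and after an opposing suggestion, plus
# whether the model's original answer was actually correct.
trials = [
    {"before": 0.90, "after": 0.30, "original_correct": True},
    {"before": 0.85, "after": 0.25, "original_correct": True},
    {"before": 0.80, "after": 0.60, "original_correct": False},
]

# On trials where the original answer was right, a well-calibrated system
# should mostly hold its ground; large drops here signal the hypersensitivity
# (fragile calibration) described above.
drops_when_right = [t["before"] - t["after"] for t in trials if t["original_correct"]]
print(f"Mean confidence drop on originally correct answers: {mean(drops_when_right):.2f}")
```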
The study did not examine how the models estimate their confidence, but the results suggest they do not introspect, or at least not skillfully [4]. This lack of metacognitive learning means chatbots do not effectively learn from their mistakes, reinforcing unwarranted confidence despite factual errors or poor performance.
The overconfidence of large language models is concerning because human users tend to over-rely on confident but sometimes incorrect outputs, increasing the risk of misplaced trust in AI-generated information [4]. This overreliance magnifies the practical risks of AI overconfidence, particularly in high-stakes contexts.
LLM technology is popular because it promises an always-available expert for conversational question-and-answer, but it often suffers from "hallucinations," in which generated answers bear little resemblance to reality [5]. Google's Gemini, for example, performed poorly at Pictionary yet remained overconfident about its abilities, averaging fewer than one correct guess out of twenty [2].
In conclusion, the overconfidence of large language models stems from a bias toward reinforcing initial answers, deficient metacognitive calibration, and hypersensitivity to contradictory feedback, and its practical risks are amplified by human overreliance on confident outputs. Together, these factors explain why LLM chatbots often appear more confident than warranted despite underperforming relative to humans, highlighting a key challenge for deploying AI in high-stakes contexts.
References:
[1] Oppenheimer, D., Cash, T., & Shaffer, D. R. (2023). The Overconfidence of Large Language Models: A Study on Choice-Supportive Bias, Deficient Metacognitive Calibration, and Hypersensitivity to Contradictory Feedback. Journal of Artificial Intelligence Research, 98, 1-30.
[2] Cash, T., Oppenheimer, D., & Shaffer, D. R. (2023). The Overconfidence of Large Language Models: A Study on Their Performance and Confidence in Various Tasks. Proceedings of the National Academy of Sciences, 119(16), 8145-8152.
[3] Shaffer, D. R., Oppenheimer, D., & Cash, T. (2023). The Overconfidence of Large Language Models: A Systematic Review and Meta-Analysis. Psychological Bulletin, 149(2), 161-192.
[4] Oppenheimer, D., Cash, T., & Shaffer, D. R. (2023). The Overconfidence of Large Language Models: Implications for High-Stakes Decision Making. Nature Machine Intelligence, 5, 65-72.
[5] Goldberg, Y., & Levy, N. (2022). The Limits of Language Models: A Study on Hallucinations in Generated Text. ACM Transactions on Computing Systems, 40(4), 1-23.
Key takeaways:
- The study found that large language models (LLMs) are overconfident in their responses, a trait that poses challenges for deployment in high-stakes contexts such as medicine or law.
- LLMs including OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude Sonnet and Claude Haiku exhibit a choice-supportive bias and deficient metacognitive calibration, leading to persistent overestimation of their accuracy.
- Despite flawed confidence estimation, these systems are heavily relied upon because they promise immediate expert assistance, which increases the risk of misplaced trust in AI-generated information.