
Opinion | AI basics | OpenAI’s latest AI models report high ‘hallucination’ rate: What does it mean, why is this significant?

Hallucination has been an issue with AI models from the start, and big AI companies and labs, in the initial years, repeatedly claimed that the problem would be resolved in the near future. However, it now seems that the issue is here to stay.

Other reports have shown that Chinese startup DeepSeek’s R1 model has double-digit rises in hallucination rates compared with previous models from the company. (Representational image/Pixabay)
New Delhi | May 15, 2025 11:30 AM IST | First published on: May 15, 2025 at 11:00 AM IST

A technical report released by artificial intelligence (AI) research organisation OpenAI last month found that the company’s latest models — o3 and o4-mini — generate more errors than its older models. Computer scientists call the errors made by chatbots “hallucinations”.

The report revealed that o3, OpenAI’s most powerful system, hallucinated 33% of the time when running the company’s PersonQA benchmark test, which involves answering questions about public figures. The o4-mini model hallucinated 48% of the time.


To make matters worse, OpenAI said it does not even know why these models are hallucinating more than their predecessors.

Here is a look at what AI hallucinations are, why they happen, and why the new report about OpenAI’s models is significant.

What are AI hallucinations?

When the term AI hallucination was first used to describe errors made by chatbots, it had a very narrow definition: it referred to instances when AI models gave fabricated information as output. For instance, in June 2023, a lawyer in the United States admitted to using ChatGPT to help write a court filing after the chatbot was found to have added fake citations to the submission, pointing to cases that never existed.


Today, hallucination has become a blanket term for various types of mistakes made by chatbots. This includes instances when the output is factually correct but not actually relevant to the question that was asked.

Why do AI hallucinations happen?

ChatGPT, o3, o4-mini, Gemini, Perplexity, Grok and many others are all built on what are known as large language models (LLMs). These models essentially take in text inputs and generate synthesised text as output.

LLMs are able to do this because they are trained on massive amounts of digital text taken from the Internet. Simply put, computer scientists feed these models a lot of text, which helps them identify patterns and relationships within that text so that they can predict likely text sequences and produce an output in response to a user’s input (known as a prompt).
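To get a rough sense of this, consider the toy sketch below, written in Python. It is a deliberately tiny stand-in, not how OpenAI’s models are actually built: it counts which words follow which other words in a made-up scrap of “training” text, then extends a prompt one word at a time by picking the most frequently observed follower. The training text and names are invented purely for illustration.

```python
# A toy "bigram" model: count which words follow which other words in a tiny,
# invented training text, then extend a prompt one word at a time by always
# picking the most frequently observed follower.
from collections import Counter, defaultdict

training_text = (
    "the court cited the case the court dismissed the case "
    "the lawyer cited the ruling the judge dismissed the appeal"
)

words = training_text.split()
followers = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    followers[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the training text."""
    return followers[word].most_common(1)[0][0]

prompt = "the"
generated = [prompt]
for _ in range(5):
    generated.append(predict_next(generated[-1]))

print(" ".join(generated))  # -> "the court cited the court cited"
```

Real LLMs learn vastly richer statistical patterns across billions of parameters, but the basic move is the same: predict the next piece of text from patterns seen during training.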

Note that LLMs are always making a guess while giving an output. They do not know for sure what is true and what is not — these models cannot even fact-check their output against, let’s say, Wikipedia, like humans can.

LLMs “know what words are and they know which words predict which other words in the context of words. They know what kinds of words cluster together in what order. And that’s pretty much it. They don’t operate like you and me,” scientist Gary Marcus wrote on his Substack, Marcus on AI.

As a result, when an LLM is trained on, for example, inaccurate text, it gives inaccurate outputs, thereby hallucinating.

However, even training on accurate text cannot stop LLMs from making mistakes. That is because, to generate new text in response to a prompt, these models combine billions of patterns in unexpected ways. So there is always a possibility that LLMs give fabricated information as output.
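The same toy sketch, extended slightly, illustrates the point. Instead of always picking the single most likely next word, the version below samples each next word at random, weighted by hand-written probabilities (again invented purely for illustration). Different runs stitch the same word patterns together into different, fluent-sounding sentences, some of which describe things that appear nowhere in the underlying data.

```python
# Sampling instead of always taking the top choice: each next word is drawn
# at random in proportion to hand-written probabilities, so repeated runs
# combine familiar word patterns into sentences that were never "seen".
import random

random.seed(42)

next_word_probs = {
    "the": {"court": 0.4, "judge": 0.3, "case": 0.3},
    "court": {"cited": 0.5, "dismissed": 0.5},
    "judge": {"cited": 0.5, "dismissed": 0.5},
    "cited": {"the": 1.0},
    "dismissed": {"the": 1.0},
    "case": {"was": 1.0},
    "was": {"overturned": 0.5, "upheld": 0.5},
}

def generate(start: str, length: int) -> str:
    words = [start]
    for _ in range(length):
        options = next_word_probs.get(words[-1])
        if not options:  # no known followers: stop generating
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

# Each run produces a grammatical-looking claim, even though none of these
# sentences exists anywhere in the probability table's source patterns.
for _ in range(3):
    print(generate("the", 6))
```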

And as LLMs are trained on vast amounts of data, experts do not understand why they generate a particular sequence of text at a given moment.

Why is OpenAI’s new report significant?

Hallucination has been an issue with AI models from the start, and big AI companies and labs, in the initial years, repeatedly claimed that the problem would be resolved in the near future. For a while, this did seem possible, as models tended to hallucinate less with each update after they were first launched.

However, after the release of the new report about OpenAI’s latest models, it has become increasingly clear that hallucination is here to stay. The issue is also not limited to OpenAI. Other reports have shown that Chinese startup DeepSeek’s R1 model has double-digit rises in hallucination rates compared with previous models from the company.

This means that the application of AI models has to be limited, at least for now. They cannot be used, for example, as research assistants (as models create fake citations in research papers) or as paralegal-bots (because models cite imaginary legal cases).

Computer scientists such as Arvind Narayanan, a professor at Princeton University, think that hallucination is, to some extent, intrinsic to the way LLMs work, and that as these models become more capable, people will use them for tougher tasks where the failure rate will be high.

In a 2024 interview, he told Time magazine, “There is always going to be a boundary between what people want to use them [LLMs] for, and what they can work reliably at… That is as much a sociological problem as it is a technical problem. And I do not think it has a clean technical solution.”
