Premium

AI learns to lie better when punished for deception, OpenAI study reveals

OpenAI researchers also said that making AI smarter won't help solve the issue.

OpenAI says its new study is yet to be peer reviewed. (Image Source: OpenAI)

Artificial intelligence has revolutionised the way we search for information, but the large language models powering them often hallucinate or churn out wrong responses. Now, a recent study published by researchers at OpenAI is now saying that punishing these AI models for deceptive or harmful actions doesn’t stop them from misbehaving but instead just hides their deceptive behaviour.

In a blog post, OpenAI said chain-of-thought [CoT] reasoning models “think” in natural language understandable by humans. Researchers at the company said that they gave an unreleased AI model a set of tasks which they could complete by taking shortcuts, cheating or outright lying. The researchers found out that the AI often engaged in “reward hacking”, which means it maximised its rewards by cheating.

Also Read | India ranks fourth in global traffic to DeepSeek with 43.36M monthly visits: Report

While developers can easily monitor and flag these deceptive outputs generated by AI, they said that the models’ “natural monitorability is very fragile.” Also, they noted that if you apply strong supervision directly to the chain-of-thought, AI models will then learn to hide their intent and continue with their lying and deceptive practices to get more rewards.

Story continues below this ad

This behaviour is fairly similar to humans, who often find and exploit certain loopholes like sharing online subscription services, claiming subsidies that aren’t for them and lying about their birthdays to get free cake. Since designing good reward structures that do not incentivise lying is pretty hard in both real and virtual worlds, making AI smarter won’t likely solve the issue.

However, some large language models that were trained with reinforcement learning such as OpenAI’s o3-mini might help monitor reward hacking. Also, we might be able to observe a large language model using another model and flag their misbehaviour.

Top Stories

Bangladesh Prime Minister Sheikh Hasina speaks during a press conference in Dhaka, Bangladesh, on Jan. 6, 2014. (AP Photo/Rajesh Kumar Singh, File)

Explained Sheikh Hasina sentenced to death: what happens now

Renowned for his courage and bravery to perform extremely adventurous and dangerous scenes without stunt doubles, it was Jayan's penchant for perfection that eventually led to his untimely demise.

Entertainment Malayalam's first action hero did over 100 films in 6 years; died performing dangerous stunt hanging from a helicopter

Akon pants pulled during Bengaluru concert

In both innings at Eden Gardens, India's Ravindra Jadeja fell leg before the wicket to Harmer. Both times, he was guilty of playing with the bat behind the pad. (Express Photo by Partha Paul)

Sports How Kolkata pitch exposed Indian batsmen’s weakness of skill and temperament

Opinion Key to Bihar’s victory script is the new grammar of Modi politics, governance

Live Blog

Loading Taboola...