Artificial intelligence has revolutionised the way we search for information, but the large language models powering them often hallucinate or churn out wrong responses. Now, a recent study published by researchers at OpenAI is now saying that punishing these AI models for deceptive or harmful actions doesn't stop them from misbehaving but instead just hides their deceptive behaviour. In a blog post, OpenAI said chain-of-thought [CoT] reasoning models “think” in natural language understandable by humans. Researchers at the company said that they gave an unreleased AI model a set of tasks which they could complete by taking shortcuts, cheating or outright lying. The researchers found out that the AI often engaged in “reward hacking”, which means it maximised its rewards by cheating. While developers can easily monitor and flag these deceptive outputs generated by AI, they said that the models’ “natural monitorability is very fragile.” Also, they noted that if you apply strong supervision directly to the chain-of-thought, AI models will then learn to hide their intent and continue with their lying and deceptive practices to get more rewards. This behaviour is fairly similar to humans, who often find and exploit certain loopholes like sharing online subscription services, claiming subsidies that aren’t for them and lying about their birthdays to get free cake. Since designing good reward structures that do not incentivise lying is pretty hard in both real and virtual worlds, making AI smarter won’t likely solve the issue. However, some large language models that were trained with reinforcement learning such as OpenAI’s o3-mini might help monitor reward hacking. Also, we might be able to observe a large language model using another model and flag their misbehaviour.