
Can AI truly reason? A new Apple study exposes the critical flaw in modern LLMs

The study challenges the prevailing narrative that AI models can think and reason their way to solving a mathematical problem.

The researchers essentially put LLMs such as GPT-4o through four different kinds of tests. (Image: Pixabay)

Companies like OpenAI and Google have strongly pivoted towards developing AI models with “reasoning” capabilities that can serve as advanced research tools capable, for instance, of not just solving complex mathematical problems but also “thinking” through them.

Now, a new study challenges the prevailing narrative that AI models genuinely possess human-level intelligence. “We found no evidence of formal reasoning in language models… Their behaviour is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10 per cent!” the research paper, authored by six AI researchers working at Apple, read.

The study is part of a larger body of research that has been quietly gaining momentum, arguing that the outputs generated by present-day LLMs are probabilistically determined, and not based on true understanding or reasoning.


What is the experiment?

According to the paper titled ‘GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models’, the researchers tested over 20 state-of-the-art LLMs, including open-source and closed models, such as GPT-4o-mini, GPT-4o, o1-mini, and o1-preview developed by OpenAI as well as Google’s Gemma2-9b-it, Meta’s Llama3-8b-instruct, Microsoft’s Phi-3, and Mistral AI’s Mathstral model.

The researchers essentially put these LLMs through four different kinds of tests.

To start with, they had the LLMs answer over 8,000 grade-school-level mathematical word problems that are part of an existing standardised test called GSM8K. This popular test set has often been used as a benchmark to gauge the reasoning capabilities of modern LLMs.

However, since GSM8K is such a popular test set, its questions and answers could already be part of the data used to train the AI models. To avoid this problem of “data contamination”, the researchers slightly modified the GSM8K test by changing the names and numbers in the mathematical word problems. This modified test template is called GSM-Symbolic.
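Conceptually, each GSM-Symbolic question is a template in which the proper names and numerical values are placeholders that can be re-sampled. Here is a minimal illustrative sketch of the idea in Python; the problem, names, and numbers below are our own stand-ins, not material from the Apple benchmark:

```python
import random

# Illustrative sketch of a GSM-Symbolic-style template (our own example, not
# code or data from the Apple study). The structure of the word problem stays
# fixed, while the name and numbers are re-sampled for every variant, so a
# model cannot rely on having memorised the exact GSM8K question.

TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def generate_variant():
    name = random.choice(["Sophie", "Oliver", "Mia", "Ravi"])
    x = random.randint(10, 99)
    y = random.randint(10, 99)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # the ground truth is recomputed for each sampled variant
    return question, answer

question, answer = generate_variant()
print(question, "->", answer)
```

A model that has genuinely learned the underlying arithmetic should be indifferent to which name or numbers appear; a model that has memorised the original question may not be.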


The researchers also generated new test templates by removing or adding one or two clauses in the mathematical word problems, thereby lowering or raising the difficulty level, as sketched below.

Modifying the difficulty level of GSM-Symbolic by modifying the number of clauses. (Image source: Apple study)
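To see why an extra clause makes a question harder, consider this short sketch (again our own example, not one of the study's problems): each added clause introduces one more quantity and one more arithmetic step the model must track.

```python
# Base problem (two quantities, one addition):
# "Sophie picks 24 apples. Her brother picks 12 apples. How many in total?"
base_answer = 24 + 12              # = 36

# Plus-one-clause variant (one extra quantity and one extra step):
# "... Their mother picks half as many apples as Sophie. How many in total?"
harder_answer = 24 + 12 + 24 // 2  # = 48

print(base_answer, harder_answer)
```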

Finally, the researchers created a new test template called GSM-NoOp, where they added “seemingly relevant but ultimately inconsequential statements” to the mathematical questions in the GSM-Symbolic test. “These additions do not affect the reasoning required to solve the problem,” as per the study.

Here’s an example of such a problem: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?”
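Worked out by hand, the clause about the five smaller kiwis is irrelevant: the question asks how many kiwis Oliver has, not how many are average-sized. The sketch below shows the straightforward arithmetic and, in the comments, the distractor-induced mistake of subtracting the five:

```python
# The "No-Op" clause about five smaller kiwis changes no quantity the
# question asks about, so the correct answer simply adds the three days.
friday = 44
saturday = 58
sunday = 2 * friday                 # "double the number he did on Friday"
total = friday + saturday + sunday
print(total)                        # 190; subtracting the irrelevant 5
                                    # (giving 185) is the kind of trap the
                                    # study describes models falling into
```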

What did the researchers find?

Depending on the AI model, accuracy on GSM-Symbolic (compared to GSM8K) dropped by anywhere between 0.3 and 9.2 per cent. The average performance of the LLMs declined across the board, according to the study.


The study also noted that changing only the numbers in the mathematical questions produced less accurate answers than changing only the names. Additionally, after modifying the clauses in the questions, the researchers found that the performance of LLMs decreased as the difficulty level of the questions increased.

The performance of all state-of-the-art models on GSM-Symbolic drops compared to GSM8K. (Image source: Apple study)

However, the most damning result was observed when the researchers inserted red herrings into the mathematical questions for the LLMs to solve.

“Adding seemingly relevant but ultimately inconsequential information to the logical reasoning of the problem led to substantial performance drops of up to 65 per cent across all state-of-the-art models,” the study found. Notably, the LLMs struggled to provide accurate answers even when they were asked to solve the same question containing irrelevant information multiple times.

GSM8K → GSM-NoOp accuracy drop (%). (Image source: Apple study)

“Overall, we find that models tend to convert statements to operations without truly understanding their meaning,” the researchers found.


“For instance, a common case we observe is that models interpret statements about “discount” as “multiplication”, regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough,” the study read.

What are the key takeaways?

The inaccurate answers point to the fragile reasoning capabilities of AI models. “The high variance in LLM performance on different versions of the same question, their substantial drop in performance with a minor increase in difficulty, and their sensitivity to inconsequential information indicate that their reasoning is fragile,” the study read.

A grade-school student with good math skills would have better reasoning than the AI models. “It is both striking and concerning that such performance variance exists when only changing proper names, as this level of variability would not be expected from a grade-school student with genuine mathematical understanding,” the study read.

AI models are capable of pattern recognition, not formal reasoning. “By adding seemingly relevant but ultimately irrelevant information to problems, we demonstrated substantial performance drops (up to 65%) across all state-of-the-art models. This reveals a critical flaw in the models’ ability to discern relevant information for problem-solving, likely because their reasoning is not formal in the common sense term and is mostly based on pattern matching,” it read.


Fine-tuning may not be enough. “LLMs struggle even when given multiple shots of the same question, indicating deeper challenges in problem-solving that cannot be resolved with few-shot prompting or fine-tuning on unseen distractions or variations of the same or different difficulty levels,” the researchers said.

More research is needed to assess the problem-solving skills of AI models. “Both GSM8K and GSM-Symbolic include relatively simple grade-school math questions, requiring only basic arithmetic operations at each step. Hence, the current limitations of these models are likely to be more pronounced in more challenging mathematical benchmarks,” as per the research paper.

