
Can AI models ‘see’ the world like humans? Google DeepMind may have found a way

Using this new technique, DeepMind researchers claimed they were also able to improve the performance of these models on a range of visual tasks.

To understand the differences in how humans and models perceive images, Google DeepMind conducted odd-one-out tests. (Image: Google DeepMind)

Google DeepMind researchers have developed a new technique to help AI vision models perceive and organise the visual world more like humans do.

This method can be used to better align AI vision models with human knowledge in order to address existing blind spots, such as their inability to detect connections between objects (e.g. a car and an airplane) that belong to different categories, Google DeepMind said in a blog post on Tuesday, November 11.

Using this new technique, the researchers claimed they were able not only to align AI vision models with human judgments but also to improve the models’ performance across a range of visual tasks, such as learning a new category from a single image (“few-shot learning”) or making reliable decisions even when the type of images being tested changed (a “distribution shift”).


The aligned AI vision models also attained a form of ‘human-like’ uncertainty, they said. The findings of the study have been published as a technical paper in the journal Nature.

By enabling AI systems to interpret visual information more like humans, Google DeepMind’s new research could make AI-powered facial recognition tools more accurate and less biased. This is crucial since such systems are increasingly used in security, law enforcement, and everyday applications. However, aligning AI vision models more closely with human vision can end up reinforcing our biases and blind spots such as the crow syndrome.

“Many existing vision models fail to capture the higher-level structure of human knowledge. This research presents a possible method for addressing this issue, and shows that models can be aligned better with human judgments and perform more reliably on various standard AI tasks,” Google said. “While more alignment work remains to be done, our work illustrates a step towards more robust and reliable AI systems,” it added.

Why AI models do not ‘see’ like humans

According to Google DeepMind, AI vision models produce representations by mapping images to points in a high-dimensional space such that similar items (like two sheep) are placed close together and different ones (a sheep and a cake) are placed far apart.
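This mapping of images to points in a shared space can be made concrete with a toy sketch. The embeddings below are illustrative three-dimensional vectors (real vision models produce vectors with hundreds or thousands of dimensions), compared with cosine similarity, a standard measure of how close two embeddings are:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (purely illustrative values).
sheep_a = [0.9, 0.1, 0.2]
sheep_b = [0.85, 0.15, 0.25]
cake = [0.1, 0.9, 0.7]

print(cosine_similarity(sheep_a, sheep_b))  # close to 1.0: similar items sit together
print(cosine_similarity(sheep_a, cake))     # much lower: dissimilar items sit apart
```

In this picture, "similar items placed close together" simply means their vectors have high cosine similarity, and "different items placed far apart" means low similarity.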


However, these models still fail to capture the commonalities between objects in different categories, such as a car and an airplane, both of which are large vehicles made primarily of metal.

In the past, cognitive scientists have sought to align AI models by training them on the THINGS dataset, which comprises millions of human odd-one-out judgements. However, this dataset contains too few images to directly fine-tune powerful AI vision models, according to the AI research lab.

Google DeepMind’s proposed three-step method

To understand the differences in how humans and models perceive images, Google DeepMind conducted odd-one-out tests, where both humans and AI models were made to pick out images that did not belong with the rest. “Interestingly, we found many cases where humans strongly agree on an answer, but the AI models get it wrong,” it said. In order to bridge this gap, the researchers carried out a three-step process.
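An odd-one-out judgment of this kind can be sketched in a few lines: given three items, pick the one whose embedding is least similar to the other two. The embeddings and labels below are made up for illustration:

```python
import math

def cos(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def odd_one_out(items):
    # items: dict mapping a label to its embedding vector.
    # The odd one out is the item with the lowest total similarity to the rest.
    labels = list(items)
    def total_sim(label):
        return sum(cos(items[label], items[other]) for other in labels if other != label)
    return min(labels, key=total_sim)

# Toy triplet: two vehicles and a fruit (illustrative vectors).
triplet = {
    "car":    [0.8, 0.6, 0.1],
    "plane":  [0.7, 0.7, 0.2],
    "banana": [0.1, 0.2, 0.9],
}
print(odd_one_out(triplet))  # banana
```

The mismatch the researchers found arises when a model's similarity structure makes this function return a different answer from the one most humans agree on.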

First, they used the THINGS dataset to fine-tune a pretrained AI vision model called SigLIP-SO400M. “By freezing the main model and carefully regularizing the adapter training, we created a teacher model that doesn’t forget its prior training,” Google said.
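The idea of freezing the main model while regularising a small adapter can be illustrated with a minimal sketch. Everything here is a stand-in: the "backbone" is a fixed linear map rather than SigLIP-SO400M, the data is a single toy example, and the regulariser simply pulls the adapter toward the identity so the original representations are not drifted away from:

```python
def backbone(x):
    # Frozen pretrained feature extractor (stand-in: a fixed linear map).
    # Its weights are never updated during adapter training.
    return [0.5 * x[0] + 0.2 * x[1], 0.1 * x[0] + 0.7 * x[1]]

# Trainable adapter: a 2x2 matrix initialised to the identity.
W = [[1.0, 0.0], [0.0, 1.0]]

def adapter(f):
    return [W[0][0] * f[0] + W[0][1] * f[1],
            W[1][0] * f[0] + W[1][1] * f[1]]

# One toy training example: nudge the adapted feature toward a target.
x, target = [1.0, 2.0], [0.8, 1.2]
lr, reg = 0.1, 0.01  # learning rate; strength of the pull toward identity

for _ in range(200):
    f = backbone(x)      # frozen: no gradient flows into backbone()
    out = adapter(f)
    err = [out[i] - target[i] for i in range(2)]
    for i in range(2):
        for j in range(2):
            identity = 1.0 if i == j else 0.0
            # Gradient of squared error, plus L2 pull toward the identity
            # so the adapter stays close to a "do nothing" transform.
            W[i][j] -= lr * (err[i] * f[j] + reg * (W[i][j] - identity))
```

After training, only `W` has changed; the backbone's features are intact, which is the sense in which the resulting teacher model "doesn't forget its prior training".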


This teacher model was then used to generate a massive new dataset called AligNet that comprises millions of human-like, odd-one-out decisions based on millions of images. The AligNet dataset was further used to fine-tune other AI vision models and align them with human-like image perception qualities.
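This teacher-to-dataset step is a form of pseudo-labelling: the aligned teacher answers odd-one-out questions at scale, and its answers become training data for student models. The sketch below uses a made-up rule in place of the real teacher (farthest-from-the-mean), random vectors in place of real image embeddings, and a tiny dataset in place of AligNet's millions of judgments:

```python
import random

random.seed(0)

def teacher_odd_one_out(triplet):
    # Stand-in for the human-aligned teacher model's judgment: here it just
    # picks the vector farthest (by squared distance) from the triplet mean.
    dim = len(triplet[0])
    mean = [sum(v[i] for v in triplet) / 3 for i in range(dim)]
    dist = lambda v: sum((v[i] - mean[i]) ** 2 for i in range(dim))
    return max(range(3), key=lambda k: dist(triplet[k]))

# Build a pseudo-labelled, AligNet-style dataset from random "embeddings".
dataset = []
for _ in range(1000):
    triplet = [[random.random() for _ in range(4)] for _ in range(3)]
    dataset.append((triplet, teacher_odd_one_out(triplet)))

print(len(dataset))  # 1000 (triplet, odd-one-out index) training pairs
```

Student vision models would then be fine-tuned on pairs like these, inheriting the teacher's human-like similarity structure without needing direct human labels for every image.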

The human-aligned AI vision models were tested on several tasks, such as arranging images based on their similarities. “In every case, our aligned models showed dramatically improved human-alignment, agreeing substantially more often with human judgments across a range of visual tasks,” Google DeepMind said.
