The biggest draw of video games is escapism, the fantasy of a world far removed from our immediate reality. Now, imagine if you had the ability to create your own world. Researchers at Google DeepMind have come up with something that will enable you to create your own fictional world, similar to the outlandish landscapes seen in high-octane games. Google DeepMind has just introduced Genie, a new model that can generate interactive video games from just a text or image prompt, and it does so without any prior training on game mechanics (the rules, elements, and processes that make up a game).

What is Genie?

According to the official Google DeepMind blog post, Genie is a foundation world model trained on videos sourced from the Internet. The model can “generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches.”

The research paper ‘Genie: Generative Interactive Environments’ states that Genie is the first generative interactive environment trained in an unsupervised manner from unlabelled internet videos.

When it comes to size, Genie stands at 11B parameters and consists of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Together, these components let Genie act in its generated environments on a frame-by-frame basis, despite being trained without action labels or any other domain-specific requirements.
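The research paper describes these components only at a high level. To make the division of labour more concrete, here is a minimal, purely illustrative sketch in PyTorch. Every class name, layer choice, and size below is an assumption made for this article (tiny embeddings instead of 11B parameters, a toy quantiser instead of DeepMind's spatiotemporal transformers); it shows how a video tokenizer, a latent action model, and a dynamics model could plug together, not how Genie is actually implemented.

```python
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Stand-in for the spatiotemporal video tokenizer: turns an RGB frame
    into a grid of discrete tokens via a tiny VQ-style codebook."""
    def __init__(self, vocab=1024, dim=64):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.codebook = nn.Embedding(vocab, dim)

    def forward(self, frame):                                  # frame: (B, 3, 64, 64)
        z = self.patchify(frame).flatten(2).transpose(1, 2)    # (B, 16, dim) patch embeddings
        dists = ((z.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        return dists.argmin(-1)                                # (B, 16) discrete token ids

class LatentActionModel(nn.Module):
    """Stand-in for the latent action model: infers a small discrete action
    between two consecutive frames, with no action labels required."""
    def __init__(self, vocab=1024, dim=64, num_actions=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(2 * dim, num_actions)

    def forward(self, prev_tokens, next_tokens):               # (B, 16) each
        pooled = torch.cat([self.embed(prev_tokens).mean(1),
                            self.embed(next_tokens).mean(1)], dim=-1)
        return self.head(pooled).argmax(-1)                    # (B,) latent action id

class DynamicsModel(nn.Module):
    """Stand-in for the autoregressive dynamics model: given the current
    frame's tokens and a latent action, predicts the next frame's tokens."""
    def __init__(self, vocab=1024, dim=64, num_actions=8):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab, dim)
        self.act_embed = nn.Embedding(num_actions, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, action):                         # (B, 16) and (B,)
        h = self.tok_embed(tokens) + self.act_embed(action).unsqueeze(1)
        return self.out(h).argmax(-1)                          # (B, 16) next-frame tokens

# Wiring the pieces together on two dummy video frames:
tok, lam, dyn = VideoTokenizer(), LatentActionModel(), DynamicsModel()
f0, f1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
t0, t1 = tok(f0), tok(f1)
a = lam(t0, t1)           # latent action inferred between the two frames
t1_pred = dyn(t0, a)      # dynamics model rolls the world forward one frame
```

The key design choice, as the blog post explains, is the latent action model: because actions are inferred rather than labelled, ordinary internet videos become usable training data.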
What does Genie do?

According to the research paper, Genie is a new kind of generative AI that enables anyone, even children, to dream up and step into generated worlds similar to human-designed simulated environments. Genie can be prompted to generate a diverse set of interactive and controllable environments even though it is trained on video-only data.

In simple terms, we have seen numerous generative AI models that produce creative content in language, images, and even video. Genie is a breakthrough because it makes playable environments from a single image prompt. Try to remember the scene in Harry Potter and the Philosopher's Stone when Harry and his friends enter Hogwarts Castle on their way to the Gryffindor common room. The young students see a wall full of paintings that comes to life, with each character moving within its frame in fine detail. Genie essentially brings still images to life, giving them a world of their own.

According to Google DeepMind, Genie can be prompted with images it has never seen, including real-world photographs and sketches, allowing people to interact with their imagined virtual worlds. This is what makes it a foundation world model.

When it comes to training, the research paper notes that the team focused on videos of 2D platformer games and robotics. The method itself, however, is general: it can work on any type of domain and is scalable to ever larger Internet datasets.

Why is it important?

The standout aspect of Genie is its ability to learn and reproduce controls for in-game characters exclusively from internet videos. This is noteworthy because internet videos carry no labels about which action is being performed, or even which part of the image should be controlled. “Genie learns not only which parts of an observation are generally controllable, but also infers diverse latent actions that are consistent across the generated environments. Note here how the same latent actions yield similar behaviors across different prompt images,” said the blog post.

According to Google DeepMind, the most distinct aspect of this model is that it allows you to create an entirely new interactive environment from a single image. This opens up many possibilities, especially new ways to create and step into virtual worlds. To demonstrate this, the researchers created an image using the text-to-image model Imagen 2 and then used it as a prompt to create virtual worlds. The same can be done with sketches. With Genie, anyone will be able to create their own entirely imagined virtual worlds. Moreover, the model’s ability to learn and develop new world models signals a significant leap towards general AI agents (an independent programme or entity that interacts with its environment by perceiving its surroundings via sensors).
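To picture the workflow described above (generate an image with Imagen 2, or draw a sketch, then step into it), here is a second toy sketch of the prompt-to-world loop. The tokenize and predict_next functions are hypothetical placeholders, not real Genie or DeepMind APIs; the point is only the loop itself: a single image becomes the first frame, and each latent action the player picks rolls the world forward one frame.

```python
import torch

# Hypothetical stand-ins for the tokenizer and dynamics model sketched earlier.
def tokenize(image):                        # image -> grid of discrete tokens
    return (image.mean(1).flatten(1) * 255).long() % 1024

def predict_next(tokens, action):           # one dynamics-model step (dummy update)
    return (tokens + action + 1) % 1024

prompt = torch.rand(1, 3, 64, 64)           # e.g. an Imagen 2 output, a photo, or a sketch
tokens = tokenize(prompt)                   # the single image becomes the first frame
for action in [0, 3, 3, 1]:                 # the player chooses a latent action each step
    tokens = predict_next(tokens, action)   # predict the next frame's tokens
    # a decoder would turn `tokens` back into pixels to render the playable frame
```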