OpenAI, the creator of the revolutionary chatbot ChatGPT, has unveiled a new generative artificial intelligence (GenAI) model that can convert a text prompt into video, an area of GenAI that has so far been fraught with inconsistencies. The model, called Sora, can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt, OpenAI said.

Sora is capable of creating "complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background," according to OpenAI's blog post. The company also claimed that the model can understand how objects "exist in the physical world", and "accurately interpret props and generate compelling characters that express vibrant emotions". However, OpenAI has cautioned that the model is far from perfect and may still struggle with more complex prompts. Before launching Sora to the general public, OpenAI will begin its outreach programme with security experts and policymakers to try and ensure that the system does not generate misinformation and hateful content, among other things.

"here is sora, our video generation model: today we are starting red-teaming and offering access to a limited number of creators. @_tim_brooks @billpeeb @model_mechanic are really incredible; amazing work by them and the team. remarkable moment." — Sam Altman (@sama) February 15, 2024

Why could Sora be a big deal?

While the generation of images and textual responses to prompts on GenAI platforms has become significantly better over the last few years, text-to-video had largely lagged, owing to the added complexity of analysing moving objects in three-dimensional space. While videos are also a series of images and could, therefore, be processed using some of the same parameters as text-to-image generators, they come with their own set of challenges.

"The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style," OpenAI said.

OpenAI posted multiple examples of Sora's work with its blog post as well as on the social media platform X. One example is a video that was created using the prompt: "Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes."

"Introducing Sora, our text-to-video model. Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. Prompt: 'Beautiful, snowy… pic.twitter.com/ruTEWn87vf" — OpenAI (@OpenAI) February 15, 2024

Other companies too have ventured into the text-to-video space. Google's Lumiere, which was announced last month, can create five-second videos from a given prompt, both text- and image-based. Other companies like Runway and Pika have also shown impressive text-to-video models of their own.

"Lumiere is a space-time diffusion research model that generates video from various inputs, including image-to-video. The model generates videos that start with the desired first frame & exhibit intricate coherent motion across the entire video duration → pic.twitter.com/CZCDDfpMAJ" — Google AI (@GoogleAI) February 13, 2024

Is Sora available for use by everybody?

Not yet.
The company has said that it will take some "safety steps" before making Sora available in OpenAI's products, and will work with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be "adversarially" testing the model. The company is also granting access to a number of visual artists, designers, and filmmakers to gather feedback on how to advance the model to be most helpful for creative professionals.

"We're also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product," OpenAI said.

The company says it will leverage the existing safety protocols in its products that use DALL·E 3, which are applicable to Sora as well. "Once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others.

"We've also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it's shown to the user," it said. (A minimal, illustrative sketch of such a prompt-screening step appears at the end of this article.)

The company will also engage with policymakers, educators and artists around the world to "understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it."

Are there any obvious shortcomings of the model?

OpenAI says that the current version of Sora has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. "The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory," it said.
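OpenAI has not published the classifiers it describes above, so the snippet below is only a minimal sketch of what a prompt-screening step could look like if built on OpenAI's publicly documented Moderation API. The screen_prompt helper and the way it gates a hypothetical video-generation call are assumptions for illustration, not OpenAI's actual Sora pipeline, and the Moderation API's categories (hate, harassment, violence, sexual content and so on) do not cover every policy area OpenAI mentions, such as celebrity likeness or third-party IP.

```python
# Illustrative sketch only: OpenAI has not released Sora's safety pipeline.
# This uses the publicly documented Moderation endpoint to reject prompts
# that a hypothetical video-generation service would refuse to process.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt passes moderation, False if it should be rejected."""
    result = client.moderations.create(input=prompt).results[0]
    if result.flagged:
        # The response lists which policy categories were triggered
        # (e.g. hate, harassment, violence, sexual content).
        hits = [name for name, hit in result.categories.model_dump().items() if hit]
        print(f"Prompt rejected; flagged categories: {hits}")
        return False
    return True


if __name__ == "__main__":
    prompt = "Beautiful, snowy Tokyo city is bustling..."
    if screen_prompt(prompt):
        # In a real product this is where the prompt would be handed
        # to the video model; here we only report the decision.
        print("Prompt accepted; it would be passed on to the video model.")
```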