OpenAI, the creator of the revolutionary chatbot ChatGPT, has unveiled a new generative artificial intelligence (GenAI) model that can convert a text prompt into video, an area of GenAI that has so far been fraught with inconsistencies. The model, called Sora, can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt, OpenAI said.

Sora is capable of creating "complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background," according to OpenAI's blog post. The company also claimed that the model can understand how objects "exist in the physical world", and "accurately interpret props and generate compelling characters that express vibrant emotions". However, OpenAI has cautioned that the model is far from perfect and may still struggle with more complex prompts. Before launching Sora to the general public, OpenAI will begin its outreach programme with security experts and policymakers to try and ensure that the system does not generate misinformation and hateful content, among other things.

"here is sora, our video generation model: today we are starting red-teaming and offering access to a limited number of creators. @_tim_brooks @billpeeb @model_mechanic are really incredible; amazing work by them and the team. remarkable moment." — Sam Altman (@sama) February 15, 2024

Why could Sora be a big deal?

While the generation of images and textual responses to prompts on GenAI platforms has become significantly better over the last few years, text-to-video had largely lagged, owing to the added complexity of analysing moving objects in three-dimensional space. While videos are also a series of images and could, therefore, be processed using some of the same parameters as text-to-image generators, they come with their own set of challenges.

"The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style," OpenAI said.

OpenAI posted multiple examples of Sora's work with its blog post as well as on the social media platform X. One example is a video that was created using the prompt: "Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes."

"Introducing Sora, our text-to-video model. Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. Prompt: 'Beautiful, snowy… pic.twitter.com/ruTEWn87vf" — OpenAI (@OpenAI) February 15, 2024

Other companies too have ventured into the text-to-video space. Google's Lumiere, which was announced last month, can create five-second videos from a given prompt, both text- and image-based. Other companies like Runway and Pika have also shown impressive text-to-video models of their own.

"Lumiere is a space-time diffusion research model that generates video from various inputs, including image-to-video. The model generates videos that start with the desired first frame & exhibit intricate coherent motion across the entire video duration → pic.twitter.com/CZCDDfpMAJ" — Google AI (@GoogleAI) February 13, 2024

Is Sora available for use by everybody?

Not yet.
The company has said that it will take some "safety steps" before making Sora available in OpenAI's products, and will work with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be "adversarially" testing the model. The company is also granting access to a number of visual artists, designers, and filmmakers to gather feedback on how to advance the model to be most helpful for creative professionals.

"We're also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product," OpenAI said.

The company says it will leverage the existing safety protocols in its products that use DALL·E 3, which are applicable to Sora as well. "Once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others.

"We've also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it's shown to the user," it said. (A minimal, illustrative sketch of such a prompt-screening step appears at the end of this article.)

The company will also engage with policymakers, educators and artists around the world to "understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it."

Are there any obvious shortcomings of the model?

OpenAI says that the current version of Sora has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. "The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory," it said.
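OpenAI has not published the classifiers it describes above, so the snippet below is only a minimal sketch of what a prompt-screening step could look like if built on OpenAI's publicly documented Moderation API. The screen_prompt helper and the way it gates a hypothetical video-generation call are assumptions for illustration, not OpenAI's actual Sora pipeline, and the Moderation API's categories (hate, harassment, violence, sexual content and so on) do not cover every policy area OpenAI mentions, such as celebrity likeness or third-party IP.

```python
# Illustrative sketch only: OpenAI has not released Sora's safety pipeline.
# This uses the publicly documented Moderation endpoint to reject prompts
# that a hypothetical video-generation service would refuse to process.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt passes moderation, False if it should be rejected."""
    result = client.moderations.create(input=prompt).results[0]
    if result.flagged:
        # The response lists which policy categories were triggered
        # (e.g. hate, harassment, violence, sexual content).
        hits = [name for name, hit in result.categories.model_dump().items() if hit]
        print(f"Prompt rejected; flagged categories: {hits}")
        return False
    return True


if __name__ == "__main__":
    prompt = "Beautiful, snowy Tokyo city is bustling..."
    if screen_prompt(prompt):
        # In a real product this is where the prompt would be handed
        # to the video model; here we only report the decision.
        print("Prompt accepted; it would be passed on to the video model.")
```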