A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]
Models
There are different models, including open source models. Chinese-language input[3] CogVideo is the earliest text-to-video model "of 9.4 billion parameters" to be developed, with its demo version of open source codes first presented on GitHub in 2022.[4] That year, Meta Platforms released a partial text-to-video model called "Make-A-Video",[5][6][7] and Google's Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with 3D U-Net.[8][9][10][11][12]
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation. [13] The VideoFusion model decomposes the diffusion process into two components: base noise and residual noise, which are shared across frames to ensure temporal coherence. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences. [14]
Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearances, and motion for controllable video synthesis of avatars.[15] In June 2024, Luma Labs launched its Dream Machine video tool.[16][17] That same month,[18]Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China, through its subsidiary, Faceu Technology.[19]
Alternative approaches to text-to-video models include[20] Google's Phenaki, Hour One, Colossyan,[21]Runway's Gen-3 Alpha,[22][23] and OpenAI's unreleased (as at August 2024) Sora,[24] available only to alpha testers.[25]
^Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.