Text-to-video model

A video generated using OpenAI's unreleased Sora text-to-video model, using the prompt:

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.^[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.^[2]

Models

Several text-to-video models exist, some of them open source. CogVideo, which accepts Chinese-language input,^[3] is the earliest text-to-video model "of 9.4 billion parameters" to be developed; a demo version with open-source code was first presented on GitHub in 2022.^[4] That year, Meta Platforms released a partial text-to-video model called "Make-A-Video",^[5]^[6]^[7] and Google Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with a 3D U-Net.^[8]^[9]^[10]^[11]^[12]

In March 2023, the research paper "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" presented a new approach to video generation.^[13] The VideoFusion model decomposes the per-frame noise of the diffusion process into two components: a base noise that is shared across frames, which promotes temporal coherence, and a residual noise that varies from frame to frame. By using a pre-trained image diffusion model as the base generator, the model efficiently generates high-quality, coherent videos. Fine-tuning the pre-trained model on video data addresses the domain gap between image and video data, improving the model's ability to produce realistic and consistent video sequences.^[14]
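
The noise decomposition can be illustrated with a short sketch. It is a toy example, not the VideoFusion implementation: it assumes the base noise is sampled once and shared by all frames, the residual noise is sampled separately for each frame, and a single mixing weight controls the balance between them. The names `decomposed_noise`, `lam`, and `correlation` are hypothetical and chosen only for this illustration.

```python
import math
import random


def decomposed_noise(num_frames, frame_size, lam, seed=0):
    """Sample per-frame noise as a mix of shared base noise and per-frame residual noise.

    lam (hypothetical name) is the share of variance given to the base noise:
    lam = 1 makes every frame's noise identical, lam = 0 makes frames independent.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    rng = random.Random(seed)
    # Base noise: sampled once and reused for every frame.
    base = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
    frames = []
    for _ in range(num_frames):
        # Residual noise: sampled independently for each frame.
        residual = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        # The square-root weights keep each noise value at unit variance.
        frames.append([
            math.sqrt(lam) * b + math.sqrt(1.0 - lam) * r
            for b, r in zip(base, residual)
        ])
    return frames


def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    frames = decomposed_noise(num_frames=2, frame_size=10_000, lam=0.8)
    # Noise in different frames is correlated by roughly lam.
    print(round(correlation(frames[0], frames[1]), 2))
```

Under these assumptions, the noise in any two frames has a correlation of about `lam`, which is one way to see how a shared component can tie frames together while the residual lets each frame differ.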

Matthias Niessner and Lourdes Agapito at the AI company Synthesia are developing 3D neural rendering techniques that synthesise realistic video using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.^[15] In June 2024, Luma Labs launched its Dream Machine video tool.^[16]^[17] That same month,^[18] Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary Faceu Technology.^[19]

Other text-to-video models and tools include^[20] Google's Phenaki, Hour One, Colossyan,^[21] Runway's Gen-3 Alpha,^[22]^[23] and OpenAI's Sora,^[24] which as of August 2024 was unreleased and available only to alpha testers.^[25]

Text-to-Video AI Models Comparison

References

^ Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.
^ Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (2024-05-06). "Video Diffusion Models: A Survey". arXiv:2405.03150 [cs.CV].
^ "Text-to-Video Generative AI Models: The Definitive List". AI Business. Retrieved 2024-08-19.
^ CogVideo, THUDM, 2022-10-12, retrieved 2022-10-12
^ Davies, Teli (2022-09-29). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 2022-10-12.
^ Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
^ "Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 2022-10-12.
^ "google: Google takes on Meta, introduces own video-generating AI". The Economic Times. 6 October 2022. Retrieved 2022-10-12.
^ Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
^ "Nuh-uh, Meta, we can do text-to-video AI, too, says Google". www.theregister.com. Retrieved 2022-10-12.
^ "Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
^ "Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
^ Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
^ "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". ar5iv. Retrieved 2024-08-30.
^ "Text to Speech for Videos". Retrieved 2023-10-17.
^ "Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race". VentureBeat. Retrieved 2024-08-16.
^ "Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video". Forbes. Retrieved 2024-08-16.
^ "What you need to know about Kling, the AI video generator rival to Sora that’s wowing creators". VentureBeat. Retrieved 2024-08-16.
^ "ByteDance joins OpenAI's Sora rivals with AI video app launch". Reuters. Retrieved 2024-08-16.
^ Text2Video-Zero, Picsart AI Research (PAIR), 2023-08-12, retrieved 2023-08-12
^ "Text-to-Video Generative AI Models: The Definitive List". AI Business. Retrieved 2024-08-16.
^ "Runway's Sora competitor Gen-3 Alpha now available". The Decoder. Retrieved 2024-08-16.
^ "Generative AI's Next Frontier Is Video". Bloomberg. Retrieved 2024-08-16.
^ "OpenAI teases 'Sora,' its new text-to-video AI model". NBC News. Retrieved 2024-08-16.
^ "Toys R Us creates first brand film to use OpenAI’s text-to-video tool". Marketing Dive. Retrieved 2024-08-16.
^ "Top AI Video Generation Models of 2024". Deepgram. Retrieved 2024-08-30.
^ "Runway Research | Gen-2: Generate novel videos with text, images or video clips". runwayml.com. Retrieved 2024-08-30.
^ Sharma, Shubham (2023-12-26). "Pika Labs' text-to-video AI platform opens to all: Here's how to use it". VentureBeat. Retrieved 2024-08-30.
^ "Runway Research | Introducing Gen-3 Alpha: A New Frontier for Video Generation". runwayml.com. Retrieved 2024-08-30.
^ "Sora | OpenAI". openai.com. Retrieved 2024-08-30.
