Innovate Futures @ Benji

Nvidia Cosmos In ComfyUI - AI Diffusion Model For Video Generation - Setup Tutorial

Added 2025-01-16 14:50:51 +0000 UTC

In this video, we dive into Nvidia Cosmos, the Text-to-World and Video-to-World diffusion models announced at CES 2025. Discover how to run Nvidia Cosmos in ComfyUI for Text-to-Video, Image-to-Video, and Video-to-Video workflows, and explore the future of AI video generation.

What You'll Learn:

Nvidia Cosmos Overview: Understand the capabilities of the 7B and 14B parameter models for Text-to-Video, Image-to-Video, and Video-to-Video generation.

ComfyUI Integration: Step-by-step guide on setting up Nvidia Cosmos in ComfyUI, including downloading the model weights as safetensors files and loading the workflow configurations.

Workflow Demos: See Text-to-Video and Image-to-Video workflows in action, with tips on optimizing settings like Torch Compile, EDM Sampling, and RES Multi-Step sampling.

Advanced Features: Explore how Nvidia Cosmos supports Video-to-Video by default, allowing video input and output without additional custom nodes.

Things you need (source: ComfyUI GitHub):

Text encoder and VAE:

https://huggingface.co/comfyanonymous/cosmos_1.0_text_encoder_and_VAE_ComfyUI/tree/main

oldt5_xxl_fp8_e4m3fn_scaled.safetensors -> ComfyUI/models/text_encoders

cosmos_cv8x8x8_1.0.safetensors -> ComfyUI/models/vae

Note: oldt5_xxl is not the same as the t5xxl used in Flux and other models. oldt5_xxl is t5xxl 1.0, while the one used in Flux and others is t5xxl 1.1.
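A quick way to confirm the two files landed in the right folders is a small check script. This is only a sketch: the function name find_missing and the "ComfyUI" root path are assumptions for illustration, while the file names and target folders are the ones listed above.

```python
from pathlib import Path

# Expected locations relative to the ComfyUI root, as listed above.
EXPECTED_FILES = {
    "text encoder": Path("models/text_encoders/oldt5_xxl_fp8_e4m3fn_scaled.safetensors"),
    "VAE": Path("models/vae/cosmos_cv8x8x8_1.0.safetensors"),
}


def find_missing(comfy_root):
    """Return the labels of expected files that are not present under comfy_root."""
    root = Path(comfy_root)
    return [label for label, rel in EXPECTED_FILES.items() if not (root / rel).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed install path; change it to match your setup.
    missing = find_missing("ComfyUI")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("Text encoder and VAE are in place")
```

Run it from the directory that contains your ComfyUI folder. An empty result means both files are where ComfyUI expects them.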

Diffusion models repackaged in safetensors:

https://huggingface.co/mcmonkey/cosmos-1.0/tree/main

Goes in: ComfyUI/models/diffusion_models

Note: "Text to World" means text to video, and "Video to World" means image/video to video.
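If you keep several Cosmos checkpoints side by side, the naming note above can be turned into a small lookup. This is a sketch under one assumption: that the downloaded file names still contain "Text2World" or "Video2World", as the official repository names do. The helper names workflow_for and list_models are made up for illustration.

```python
from pathlib import Path


def workflow_for(filename):
    """Map a Cosmos model file name to the workflow it is meant for."""
    name = filename.lower()
    if "text2world" in name:
        return "text to video"
    if "video2world" in name:
        return "image/video to video"
    # The name carries neither token, so make no guess.
    return "unknown"


def list_models(comfy_root):
    """List the .safetensors files in diffusion_models with their workflow type."""
    folder = Path(comfy_root) / "models" / "diffusion_models"
    return {p.name: workflow_for(p.name) for p in sorted(folder.glob("*.safetensors"))}


if __name__ == "__main__":
    # "ComfyUI" is an assumed install path; change it to match your setup.
    for model, workflow in list_models("ComfyUI").items():
        print(f"{model}: {workflow}")
```

The match is case-insensitive, and any file without either token is reported as "unknown" rather than guessed.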

If you want the original diffusion models in .pt format, the official links are:

https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Text2World

https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-14B-Text2World

https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Video2World

https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-14B-Video2World

Workflows

Text to Video

Image to Video

Future Updates:

Get a sneak peek at the latest ComfyUI updates, including support for Start Image and End Image in Image-to-Video workflows.

Resources:

https://huggingface.co/collections/nvidia/cosmos-6751e884dc10e013a0a0d8e6

https://gist.github.com/comfyanonymous/2f57adabe5a22b36a21ae024306daddb

https://github.com/comfyanonymous/ComfyUI/issues/6375

https://huggingface.co/comfyanonymous/cosmos_1.0_text_encoder_and_VAE_ComfyUI/tree/main/text_encoders

https://huggingface.co/comfyanonymous/cosmos_1.0_text_encoder_and_VAE_ComfyUI/tree/main/vae

https://huggingface.co/mcmonkey/cosmos-1.0/tree/main

https://huggingface.co/Kijai/Cosmos1_ComfyUI/tree/main

Why This Matters:

Cutting-Edge AI: Nvidia Cosmos is one of the newest AI video model families, covering text, image, and video inputs with 7B and 14B parameter variants.

Efficient Workflows: Learn how to integrate Nvidia Cosmos into ComfyUI for video generation, whether you're working with text, images, or videos.

Creative Potential: Unlock new possibilities in AI video creation with advanced features like Negative Prompts, Torch Compile, and RES Multi-Step sampling.

Perfect For:

AI video creators looking to explore the latest advancements in Text-to-Video, Image-to-Video, and Video-to-Video generation.

ComfyUI users interested in integrating Nvidia Cosmos into their workflows.

AI enthusiasts eager to stay ahead of the curve with cutting-edge AI video models.