Innovate Futures @ Benji

(ComfyUI Workflow) VACE Video Inpaint Mask Reference ver.full-20250407

Added 2025-04-07 15:22:44 +0000 UTC

In this post, we’ll walk through the workflow for video inpainting using Wan 2.1 and VACE in ComfyUI. This process lets creators replace or modify specific regions of a video while keeping them coherent with the surrounding content. The workflow is intricate but highly rewarding when executed correctly. Let’s break it down step by step.

Step 1: Setting Up Video Parameters

The first step involves configuring the basic video settings, such as width, height, and frame rate. These parameters need to match your source footage to ensure consistency throughout the inpainting process. Proper setup at this stage lays the foundation for smooth editing later on.
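As a rough illustration of matching the workflow settings to the source footage, here is a minimal Python sketch. The `VideoParams` class and the `snap_down` and `match_source` helpers are hypothetical names, not part of ComfyUI, and rounding the dimensions down to a multiple of 16 is an assumption about what the video model accepts, so check the requirements of the model you actually load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoParams:
    """Basic video settings: frame size in pixels and frame rate."""
    width: int
    height: int
    fps: float


def snap_down(value: int, multiple: int = 16) -> int:
    """Round value down to the nearest multiple (at least one multiple)."""
    return max(multiple, (value // multiple) * multiple)


def match_source(source: VideoParams, multiple: int = 16) -> VideoParams:
    """Copy the source settings, snapping width and height to the multiple.

    The multiple of 16 is an assumption, not a documented requirement here.
    The frame rate is kept unchanged so the output stays in sync with the source.
    """
    return VideoParams(
        width=snap_down(source.width, multiple),
        height=snap_down(source.height, multiple),
        fps=source.fps,
    )


source = VideoParams(width=1282, height=725, fps=30.0)
workflow = match_source(source)
print(workflow)
```

If the snapped size differs from the source, the footage would need to be resized or cropped to the new dimensions before masking, so the mask and the frames line up.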

Step 2: Crafting Text Prompts

Text prompts are essential for guiding the AI during inpainting. These prompts should briefly describe the action happening in the scene and what appears in the inpainting area. For example, you might write something like “a futuristic female fighter in a combat scene.” To make the prompts more effective, I use Ollama with a vision-language model to turn what is visible in an image into a clear text description. This helps me define details like the character’s outfit and appearance accurately.

Step 3: Selecting Reference Images

Reference images play a key role in guiding the AI to replicate styles and details. While it’s possible to use multiple reference images, this approach can be resource-intensive and complicated. For simplicity, I stick to one-mask inpainting, such as animating an anime character riding a horse or a cat riding a bike. This method works smoothly on most local machines without overloading VRAM.

Step 4: Segmentation with SAM2

Segmentation is where the magic begins. Using SAM2 (Segment Anything Model 2), I place green dots on the objects or characters I want to mask in the first frame of the video. SAM2 then automatically masks the selected region across all frames. If more precision is required, I add red dots to mark areas that shouldn’t be masked. Typically, three green dots per character are sufficient, though adjustments may be necessary depending on the complexity of the scene.
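Under the hood, the green and red dots correspond to point prompts. SAM-style models take a list of (x, y) coordinates plus a parallel list of labels, where 1 marks a point to include and 0 marks a point to exclude. The sketch below only builds those two lists; the `build_point_prompts` helper and the pixel coordinates are hypothetical, and how the ComfyUI nodes pass these values along is not shown here.

```python
def build_point_prompts(include_points, exclude_points):
    """Combine green (include) and red (exclude) dots into coords and labels.

    Label 1 = foreground point (green dot), label 0 = background point (red dot).
    """
    coords = [list(p) for p in include_points] + [list(p) for p in exclude_points]
    labels = [1] * len(include_points) + [0] * len(exclude_points)
    return coords, labels


# Hypothetical first-frame coordinates: three green dots on the character,
# one red dot on a nearby object that should stay out of the mask.
green = [(100, 200), (120, 260), (110, 320)]
red = [(300, 310)]

coords, labels = build_point_prompts(green, red)
print(coords)
print(labels)
```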

Step 5: Refining the Mask

Once the initial masking is done, I review the results and refine them as needed. Sometimes certain areas, like straps or accessories, are not fully captured. In such cases, I expand the mask slightly and adjust the blur radius to control how soft or sharp the mask edge is. This ensures the inpainting region is accurate and ready for processing.
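To make the two refinements concrete, here is a minimal pure-Python sketch of expanding a mask and then blurring its edge. It is not the implementation used by any ComfyUI node; `grow_mask` and `box_blur` are hypothetical helpers that use a square neighborhood and a simple box filter on a tiny mask stored as nested lists.

```python
def _window(size, center, radius):
    """Index range around center, clipped to [0, size)."""
    return range(max(0, center - radius), min(size, center + radius + 1))


def grow_mask(mask, radius):
    """Expand the mask: each pixel takes the maximum of its neighborhood."""
    h, w = len(mask), len(mask[0])
    return [
        [
            max(mask[yy][xx] for yy in _window(h, y, radius) for xx in _window(w, x, radius))
            for x in range(w)
        ]
        for y in range(h)
    ]


def box_blur(mask, radius):
    """Soften the mask edge: each pixel becomes the mean of its neighborhood."""
    h, w = len(mask), len(mask[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            values = [mask[yy][xx] for yy in _window(h, y, radius) for xx in _window(w, x, radius)]
            row.append(sum(values) / len(values))
        out.append(row)
    return out


# A 5x5 mask with a single masked pixel in the center.
mask = [[0.0] * 5 for _ in range(5)]
mask[2][2] = 1.0

grown = grow_mask(mask, radius=1)   # covers the missed pixels around the subject
soft = box_blur(grown, radius=1)    # feathers the edge of the grown mask

for row in soft:
    print(" ".join(f"{v:.2f}" for v in row))
```

A larger grow radius catches more of the missed straps or accessories, while a larger blur radius gives a softer transition between the inpainted region and the untouched footage.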

Step 6: Running Inference with Samplers

With the mask in place, it’s time to run inference using Sampler 1 and Sampler 2. Sampler 1 handles the inpainting and references the embedded image of the character. Tweaking settings like the CFG scale, step count, and seed allows for experimentation with different styles. For realistic footage, I avoid using Control LoRA since the quality is often good enough without refinement.
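One simple way to organize this kind of experimentation is to list the combinations you want to try before running them. The sketch below only enumerates settings; the `sampler_grid` helper and the values shown are hypothetical placeholders, not the settings used in the attached workflow.

```python
from itertools import product


def sampler_grid(cfg_values, step_values, seeds):
    """Return one settings dict per combination of CFG scale, steps, and seed."""
    return [
        {"cfg": cfg, "steps": steps, "seed": seed}
        for cfg, steps, seed in product(cfg_values, step_values, seeds)
    ]


# Placeholder values for illustration only.
runs = sampler_grid(cfg_values=[4.0, 6.0], step_values=[20, 30], seeds=[42])

for run in runs:
    print(run)
```

Keeping the seed fixed while changing one setting at a time makes it easier to see which change caused a difference in the result.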

Step 7: Enhancing Output with Tile LoRA

After generating the inpainted video, I use Tile LoRA to enhance the output. It sharpens the frames and enhances colors, resulting in smoother and more visually appealing results. If higher quality is desired, upscaling either the raw output or the Tile LoRA output can produce 4K-quality video for final production.

Challenges and Considerations

While this workflow is powerful, it has its challenges. Fast-motion scenes can be tricky because motion blur and rapid movement make masking difficult. Detailed costumes or accessories, like swords or patterned sleeves, may not always replicate perfectly. Simpler outfits, however, tend to follow the reference more closely and yield better results.

The workflow file is attached below. Have fun :)