(ComfyUI Workflow) VACE Video Inpaint Mask Reference ver.full-20250407
Added 2025-04-07 15:22:44 +0000 UTC
Video : https://youtu.be/3tu-sTY0k6M
Related Post: https://www.patreon.com/posts/126132064
In this post, we’ll walk through the workflow for video inpainting using WAN 2.1 and VACE in ComfyUI. This process lets creators manipulate videos by replacing or modifying specific regions while maintaining coherence with the surrounding content. The workflow is intricate but highly rewarding when executed correctly. Let’s break it down step by step.
Step 1: Setting Up Video Parameters
The first step involves configuring the basic video settings, such as width, height, and frame rate. These parameters need to match your source footage to ensure consistency throughout the inpainting process. Proper setup at this stage lays the foundation for smooth editing later on.
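As a rough illustration of matching parameters to the source footage, here is a minimal sketch that snaps a clip's dimensions and frame count to model-friendly values. It assumes that width and height should be multiples of 16 and that the frame count should have the form 4n + 1, which is a common constraint for video models in this family; check the actual requirements of the nodes you use. The function name `snap_video_params` is hypothetical and not part of the workflow.

```python
def snap_video_params(width, height, frame_count, multiple=16):
    """Round dimensions down to a multiple and frame count down to 4n + 1.

    Both constraints are assumptions for illustration; verify them
    against the nodes in your own workflow.
    """
    snapped_w = max(multiple, (width // multiple) * multiple)
    snapped_h = max(multiple, (height // multiple) * multiple)
    snapped_frames = max(1, ((frame_count - 1) // 4) * 4 + 1)
    return snapped_w, snapped_h, snapped_frames


# A 642x370 source clip with 100 frames
print(snap_video_params(642, 370, 100))  # prints (640, 368, 97)
```

Rounding down rather than up avoids asking for pixels or frames the source clip does not have.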
Step 2: Crafting Text Prompts
Text prompts are essential for guiding the AI during inpainting. These prompts should briefly describe the action happening in the scene and what appears in the inpainting area. For example, you might write something like “a futuristic female fighter in a combat scene.” To make the prompts more effective, I use Ollama with a vision-language model to convert visual descriptions into clear text. This helps me define details like the character’s outfit and appearance accurately.
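For anyone who wants to try the captioning idea outside of ComfyUI, the sketch below builds a request for a locally running Ollama server and asks a vision-language model to describe a frame. It assumes Ollama is listening on its default local port and that a vision model is installed; the model name `llava`, the instruction text, and the helper names `build_payload` and `describe_frame` are placeholders for illustration, not the exact setup used in this workflow.

```python
import base64
import json
import urllib.request

# Default local Ollama endpoint (assumption: Ollama runs on this machine)
OLLAMA_URL = "http://localhost:11434/api/generate"

DEFAULT_INSTRUCTION = (
    "Briefly describe the main character's outfit and appearance "
    "and the action happening in the scene."
)


def build_payload(image_bytes, model="llava", instruction=DEFAULT_INSTRUCTION):
    """Build the JSON body: the image is sent as a base64 string."""
    return {
        "model": model,  # placeholder model name
        "prompt": instruction,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete response
    }


def describe_frame(image_path, model="llava"):
    """Send one image to Ollama and return the generated description."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), model=model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

The returned text can then be trimmed down by hand into a short prompt like the example above.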
Step 3: Selecting Reference Images
Reference images play a key role in guiding the AI to replicate styles and details. While it’s possible to use multiple reference images, this approach can be resource-intensive and complicated. For simplicity, I stick to one-mask inpainting, such as animating an anime character riding a horse or a cat riding a bike. This method works smoothly on most local machines without overloading VRAM.
Step 4: Segmentation with SAM2
Segmentation is where the magic begins. Using SAM2 (Segment Anything Model 2), I place green dots on the objects or characters I want to mask in the first frame of the video. This segmentation tool automatically masks the selected region across all frames. If precision is required, I add red dots to indicate areas that shouldn’t be masked. Typically, three green dots per character are sufficient, though adjustments may be necessary depending on the complexity of the scene.
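Under the hood, point prompts like these are typically passed to the segmentation model as a list of coordinates plus a label per point, with 1 for "include" (green) and 0 for "exclude" (red). The sketch below only shows that data layout; the ComfyUI point editor builds it for you, and the coordinates and the helper name `build_point_prompts` are made up for illustration.

```python
def build_point_prompts(positive, negative=()):
    """Combine green (include) and red (exclude) points into coords and labels."""
    coords = [list(p) for p in positive] + [list(p) for p in negative]
    labels = [1] * len(positive) + [0] * len(negative)
    return coords, labels


# Three green dots on the character, one red dot on the background (hypothetical pixels)
coords, labels = build_point_prompts(
    positive=[(320, 120), (310, 260), (330, 400)],
    negative=[(500, 300)],
)
print(coords)
print(labels)  # prints [1, 1, 1, 0]
```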
Step 5: Refining the Mask
Once the initial masking is done, I review the results and refine them as needed. Sometimes certain areas, like straps or accessories, are not fully captured. In such cases, I expand the mask slightly and adjust the blur radius to control how soft or sharp the mask edge is. This ensures the inpainting region is accurate and ready for processing.
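To make the two refinements concrete, here is a tiny pure-Python sketch of what expanding and blurring a mask do. It is not the node implementation used in the workflow: it assumes a mask stored as rows of 0/1 values, uses a simple square dilation and a box blur, and the function names `grow_mask` and `box_blur` are hypothetical.

```python
def grow_mask(mask, pixels=1):
    """Expand a binary mask by `pixels` in every direction (square dilation)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - pixels), min(h, y + pixels + 1)):
                    for nx in range(max(0, x - pixels), min(w, x + pixels + 1)):
                        out[ny][nx] = 1
    return out


def box_blur(mask, radius=1):
    """Soften mask edges by averaging each pixel with its in-bounds neighbors."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(mask[ny][nx] for ny in ys for nx in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


# A single masked pixel in the middle of a 5x5 frame
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
grown = grow_mask(mask, pixels=1)   # becomes a 3x3 block
soft = box_blur(grown, radius=1)    # edges fade from 1.0 toward 0.0
```

A larger `pixels` value catches missed details such as straps; a larger `radius` gives a softer transition between the inpainted region and the original footage.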
Step 6: Running Inference with Samplers
With the mask in place, it’s time to run inference using Sampler 1 and Sampler 2. Sampler 1 handles the inpainting and references the embedded image of the character. Tweaking settings like the CFG value, step count, and seed allows for experimentation with different styles. For realistic footage, I avoid using Control Lora since the quality is often good enough without refinement.
Step 7: Enhancing Output with Tile Lora
After generating the inpainted video, I use Tile Lora to enhance the output. This tool sharpens the frames and enhances colors, resulting in smoother and more visually appealing results. If higher quality is desired, upscaling either the sampler output or the Tile Lora output can produce 4K-quality video for final production.
Challenges and Considerations
While this workflow is powerful, it has its challenges. Fast-motion scenes can be tricky due to motion blur and rapid movements, making masking difficult. Detailed costumes or accessories, like swords or patterned sleeves, may not always replicate perfectly. However, simpler outfits tend to follow more closely, yielding better results.
Attached file below, have fun :)
Comments
Cool workflow. One question though, for the life of me I can't get the reference image to actually propagate into the video. The thing being replaced is basically 100% from the prompt. If I put an image of a blonde woman as the reference and it gets masked out etc but put "woman" or "teddy bear" or whatever in the prompt, the outcome will be a completely random woman or a teddy bear, not the woman from the reference image. It's like the vace embeds are getting ignored.
Andrewb
2025-06-20 00:17:42 +0000 UTC
That is because of the RemoveBG node. You might be using the newer version, which does not add a white background after removing the background and only outputs an alpha channel. That causes this message in most cases. I posted a fix for this issue in a recent update here: https://www.patreon.com/posts/workflow-update-129553407
Benjamin Law
2025-05-23 16:11:07 +0000 UTC
"All of the VACE-related workflows you created throw the error: 'Given groups=1, weight of size [96, 3, 3, 3, 3], expected input[1, 4, 3, 370, 642] to have 3 channels, but got 4 channels instead.' How can I fix this?"
현민 안
2025-05-23 16:04:09 +0000 UTC