VideoMaMa
Mask-Guided Video Matting via Generative Prior

Korea University · Adobe Research · KAIST AI
arXiv 2026

VideoMaMa Demo

The results shown in this video were generated using the demo application available on Hugging Face and GitHub.

The video MP4 file is also available on GitHub.

VideoMaMa in-the-wild Results

[Video panels: Input RGB | Input Mask | Result]

Qualitative results of VideoMaMa on in-the-wild videos.
The input masks are generated with SAM2 by manually providing a point prompt on the first frame.
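
For readers who want to reproduce this setup, the guide masks can be obtained from the publicly available SAM2 video predictor. The snippet below is a rough sketch of that workflow; the config name, checkpoint path, frame folder, and click coordinates are placeholders, and the exact API may vary across SAM2 releases.

# Sketch: producing first-frame point-prompted guide masks with SAM2.
# Config, checkpoint, frame folder, and click coordinates are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",    # placeholder config
    "checkpoints/sam2.1_hiera_large.pt",     # placeholder checkpoint
)

with torch.inference_mode():
    state = predictor.init_state(video_path="my_video_frames/")  # folder of JPEG frames

    # One positive click on the target object in the first frame.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[480, 270]], dtype=np.float32),  # (x, y) placeholder click
        labels=np.array([1], dtype=np.int32),              # 1 = positive click
    )

    # Propagate the prompt through the clip; these binary masks are the
    # coarse guide masks that VideoMaMa refines into alpha mattes.
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0.0).squeeze().cpu().numpy()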

Abstract

Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research. All models and the MA-V dataset will be publicly released.
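
Conceptually, the pseudo-labeling pipeline described above reduces to running VideoMaMa over existing coarse segmentation tracks and storing the resulting alpha mattes. The sketch below only illustrates this flow; the model interface, helper functions, and file layout are hypothetical placeholders rather than the released code.

# Illustrative pseudo-labeling loop: coarse mask tracks -> per-frame alpha mattes.
# The model(frames, masks) interface and the file layout are hypothetical;
# only the overall flow (masks in, mattes out) reflects the pipeline above.
from pathlib import Path
import numpy as np
from PIL import Image

def load_frames(frame_dir: Path) -> np.ndarray:
    """Load a folder of RGB frames (00000.jpg, 00001.jpg, ...) as a (T, H, W, 3) array."""
    return np.stack([np.array(Image.open(p).convert("RGB"))
                     for p in sorted(frame_dir.glob("*.jpg"))])

def load_masks(mask_dir: Path) -> np.ndarray:
    """Load a folder of binary mask PNGs as a (T, H, W) array with values in {0, 1}."""
    return np.stack([np.array(Image.open(p).convert("L")) > 127
                     for p in sorted(mask_dir.glob("*.png"))]).astype(np.uint8)

def pseudo_label(model, frame_dir: str, mask_dir: str, out_dir: str) -> None:
    """Turn one coarse mask track into a pseudo-labeled alpha-matte track."""
    frames = load_frames(Path(frame_dir))
    masks = load_masks(Path(mask_dir))
    alphas = model(frames, masks)            # hypothetical call: (T, H, W) alphas in [0, 1]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for t, alpha in enumerate(alphas):
        Image.fromarray((alpha * 255).astype(np.uint8)).save(out / f"{t:05d}.png")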

Matting Anything in Video (MA-V) Dataset

[Interactive slider: Input Video, Mask Comparison, and Composition panels, each comparing SA-V and MA-V]

Comparison of SA-V and MA-V.
Prior video matting datasets contain at most hundreds of videos, predominantly human subjects captured in controlled environments or annotated manually. These datasets rely on artificial scene compositions that differ fundamentally from natural video footage, limiting model generalization to real-world scenarios. We introduce Matting Anything in Video (MA-V), created by applying VideoMaMa to the diverse mask annotations of SA-V. MA-V provides 50,541 videos captured in natural settings, nearly 50× more than existing real-video datasets.
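
The Composition view above is standard alpha compositing, C = αF + (1 − α)B. A minimal example follows; the file names are placeholders, and the observed frame is used as a stand-in for the true foreground layer, which is a common approximation.

# Minimal alpha compositing: C = alpha * F + (1 - alpha) * B.
# File names are placeholders; alpha is the per-pixel matte predicted for one frame.
import numpy as np
from PIL import Image

frame = np.array(Image.open("frame_00000.jpg").convert("RGB")).astype(np.float32) / 255.0
alpha = np.array(Image.open("alpha_00000.png").convert("L")).astype(np.float32) / 255.0
background = np.array(Image.open("new_background.jpg").convert("RGB")).astype(np.float32) / 255.0  # same size as frame

alpha = alpha[..., None]                               # (H, W, 1) so it broadcasts over RGB
composite = alpha * frame + (1.0 - alpha) * background
Image.fromarray((composite * 255).astype(np.uint8)).save("composite_00000.png")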

Qualitative Comparison

[Video panels — Input: Input RGB, Input Mask | All-frame mask-guided: MaGGIe, VideoMaMa (Ours) | First-frame mask-guided: MatAnyone, SAM2-Matte (Ours)]

Qualitative comparison of VideoMaMa and SAM2-Matte with MaGGIe and MatAnyone on in-the-wild videos.
The input masks are generated with SAM2 by manually providing a point prompt on the first frame.

VideoMaMa Architecture

Overview of the VideoMaMa architecture.
RGB frames and guide masks are processed through video diffusion U-Net layers to generate high-quality video mattes; semantic injection with DINO features is applied during training.
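
As a rough schematic of this conditioning path (not the actual model code: layer sizes, module names, and the DINO token dimension are illustrative), the toy module below concatenates noisy alpha latents, RGB latents, and mask latents along the channel axis and injects semantic tokens through cross-attention:

# Schematic of the conditioning path only: RGB latents and guide-mask latents are
# concatenated with the noisy alpha latents before the denoiser, and DINO-style
# semantic tokens are injected via cross-attention. All shapes are illustrative.
import torch
import torch.nn as nn

class ToyMaskConditionedDenoiser(nn.Module):
    def __init__(self, latent_ch=4, mask_ch=1, hidden=64, dino_dim=768):
        super().__init__()
        # Joint spatio-temporal encoder over [noisy alpha latent | RGB latent | mask latent].
        self.in_conv = nn.Conv3d(latent_ch * 2 + mask_ch, hidden, kernel_size=3, padding=1)
        # Cross-attention that reads semantic tokens (e.g. DINO features) during training.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.dino_proj = nn.Linear(dino_dim, hidden)
        self.out_conv = nn.Conv3d(hidden, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_alpha_lat, rgb_lat, mask_lat, dino_tokens=None):
        # All latents are (B, C, T, H, W); concatenate conditions along channels.
        x = torch.cat([noisy_alpha_lat, rgb_lat, mask_lat], dim=1)
        h = self.in_conv(x)
        if dino_tokens is not None:                       # semantic injection (training only)
            b, c, t, hh, ww = h.shape
            q = h.permute(0, 2, 3, 4, 1).reshape(b, t * hh * ww, c)
            kv = self.dino_proj(dino_tokens)              # (B, N_tokens, hidden)
            attn, _ = self.cross_attn(q, kv, kv)
            h = (q + attn).reshape(b, t, hh, ww, c).permute(0, 4, 1, 2, 3)
        return self.out_conv(h)

# Smoke test with tiny random tensors.
m = ToyMaskConditionedDenoiser()
out = m(torch.randn(1, 4, 4, 8, 8), torch.randn(1, 4, 4, 8, 8),
        torch.randn(1, 1, 4, 8, 8), torch.randn(1, 16, 768))
print(out.shape)  # torch.Size([1, 4, 4, 8, 8])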

Quantitative Results

[Quantitative results table]
All-frame mask-guided video matting comparison on the V-HIM60 and YouTubeMatte benchmarks. We compare VideoMaMa (Ours) against mask-guided matting methods: MaGGIe (video mask-guided) and MGM (image mask-guided). We evaluate two types of input masks: synthetically degraded masks, obtained by downsampling (8×, 32×) and polygonization at varying difficulty levels, and model-generated masks from SAM2. Lower values indicate better performance.
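
The synthetically degraded masks can be approximated with standard operations; the sketch below shows one plausible implementation of the downsampling and polygonization degradations, with parameters that may differ from the paper's exact protocol.

# Illustrative mask degradations for robustness-style evaluation:
# (1) downsample/upsample by a factor (e.g. 8x or 32x), (2) polygon approximation.
# Exact parameters of the benchmark protocol may differ.
import cv2
import numpy as np

def downsample_mask(mask: np.ndarray, factor: int) -> np.ndarray:
    """Coarsen a binary (H, W) mask by downsampling and re-upsampling with nearest neighbor."""
    h, w = mask.shape
    small = cv2.resize(mask.astype(np.uint8),
                       (max(1, w // factor), max(1, h // factor)),
                       interpolation=cv2.INTER_NEAREST)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

def polygonize_mask(mask: np.ndarray, eps_frac: float = 0.01) -> np.ndarray:
    """Replace each mask contour with a simplified polygon (larger eps_frac = coarser mask)."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    out = np.zeros_like(mask, dtype=np.uint8)
    for contour in contours:
        eps = eps_frac * cv2.arcLength(contour, True)
        poly = cv2.approxPolyDP(contour, eps, True)
        cv2.fillPoly(out, [poly], 1)
    return out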

Citation

If you use VideoMaMa in your work, please cite the paper as:

@article{
}