VideoMaMa Demo
The results shown in this video were generated using the demo application available on Hugging Face and GitHub.
VideoMaMa In-the-Wild Results
Qualitative results of VideoMaMa on in-the-wild videos.
The input masks are generated using SAM2 by manually providing a point prompt on the first frame.
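For reference, point-prompted masks like these can be produced with the SAM2 video predictor: place a single positive point on the first frame and propagate it through the clip. The sketch below follows the official SAM2 video-predictor example; the config, checkpoint, frame directory, and point coordinates are placeholders, and method names (e.g. add_new_points_or_box) may vary slightly across SAM2 releases.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config / checkpoint paths; use the ones shipped with your SAM2 install.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # video_path points to a directory of JPEG frames, as in the official example.
    state = predictor.init_state(video_path="path/to/frames")

    # One positive point on the first frame (coordinates are placeholders).
    points = np.array([[320, 240]], dtype=np.float32)
    labels = np.array([1], dtype=np.int32)
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1, points=points, labels=labels
    )

    # Propagate the prompt through the clip to obtain a coarse mask per frame.
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```

The resulting per-frame binary masks are what a mask-to-matte model such as VideoMaMa takes as its coarse guidance input.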
Abstract
Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which leverages pretrained video diffusion models to convert coarse segmentation masks into pixel-accurate alpha mattes. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune SAM2 on MA-V to obtain SAM2-Matte, which is more robust on in-the-wild videos than the same model trained on existing matting datasets. These findings underscore the importance of large-scale pseudo-labeled video matting data and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research. All models and the MA-V dataset will be publicly released.
Matting Anything in Video (MA-V) Dataset
Comparison of SA-V and MA-V.
Prior video matting datasets contain at most a few hundred videos, predominantly focusing on human subjects captured in controlled environments or annotated manually.
Their artificially composited scenes differ fundamentally from natural video footage, limiting model generalization to real-world scenarios.
We introduce Matting Anything in Video (MA-V), created by applying VideoMaMa to the diverse mask annotations of SA-V.
MA-V provides 50,541 videos captured in natural settings, making it nearly 50× larger than existing real-video matting datasets.
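At a high level, building MA-V amounts to running VideoMaMa over every SA-V video and mask track to turn coarse masks into alpha-matte pseudo-labels. The following is an illustrative sketch only: the `videomama` package, `VideoMaMa.from_pretrained`, the SA-V loading helpers, and the output layout are hypothetical placeholders rather than the released interface.

```python
from pathlib import Path

import numpy as np

# Hypothetical imports for illustration; the released VideoMaMa / SA-V tooling may differ.
from videomama import VideoMaMa                              # placeholder package
from sav_utils import load_video_frames, load_sav_masklets   # placeholder SA-V helpers

model = VideoMaMa.from_pretrained("videomama-base")  # placeholder checkpoint name

sav_root = Path("SA-V")   # directory with SA-V videos and mask annotations
mav_root = Path("MA-V")   # output directory for alpha-matte pseudo-labels

for video_dir in sorted(sav_root.iterdir()):
    frames = load_video_frames(video_dir)                # list of HxWx3 uint8 frames
    for obj_id, masks in load_sav_masklets(video_dir):   # coarse per-object mask track
        # Mask-to-matte: refine the coarse masks into per-frame alpha mattes.
        alphas = model.predict(frames=frames, masks=masks)
        out_dir = mav_root / video_dir.name / f"obj_{obj_id:03d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        for t, alpha in enumerate(alphas):
            np.save(out_dir / f"{t:05d}.npy", np.asarray(alpha, dtype=np.float32))
```

Because the pipeline only needs videos plus segmentation masks, the same loop can in principle be applied to any mask-annotated video collection, which is what makes the pseudo-labeling scalable.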
Qualitative Comparison
Qualitative comparison of VideoMaMa and SAM2-Matte with MaGGIe and MatAnyone on in-the-wild videos.
The input masks are generated using SAM2 by manually providing a point prompt on the first frame.
VideoMaMa Architecture
Quantitative Results
Citation
If you use VideoMaMa in your work, please cite the paper as:
@article{
}