MUG-VOS: Multi-Granularity Video Object Segmentation

1Korea University, 2KAIST, 3Samsung Electronics
*Equal Contribution, Co-Corresponding Author
AAAI 2025

MUG-VOS contains masks at multiple granularities, from coarse to fine segments.


Abstract

Current benchmarks for video segmentation aim to segment and track objects; however, they annotate only salient objects (i.e., foreground instances). As a result, despite their impressive architectural designs, previous works have struggled to adapt to real-world scenarios. We believe that the research community needs a new video segmentation dataset aimed at tracking multi-granularity segmentation targets in a video scene. In this work, we build a new video segmentation dataset annotated with both salient and non-salient masks. To this end, we propose a large-scale, densely annotated Multi-Granularity Video Object Segmentation (MUG-VOS) dataset that includes mask annotations of various types and granularities. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. As a result, the Memory-based Mask Propagation Model (MMPM) trained on the MUG-VOS dataset achieves the best performance on the MUG-VOS dataset among existing video object segmentation methods and Segment Anything Model (SAM)-based video segmentation methods.

Comparison

Teaser image

Visualization of (top) MUG-VOS masks annotated by our data collection pipeline alongside the ground-truth masks of YouTube-VOS, DAVIS, and UVO, and (bottom) the MUG-VOS dataset. MUG-VOS masks cover various types and granularities of objects, parts, stuff, and backgrounds, including those not covered by existing datasets.

Architecture

We introduce the MMPM model, which generates masks based on previous results. Starting from an initial mask that indicates the target object, the MMPM model consistently tracks and segments the target throughout the entire video. The sequential memory stores low-resolution features and is updated at every selected frame, while the temporal memory retains high-resolution features from previous frames, capturing a variety of information gathered from multiple frames.
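The two memories described above can be illustrated with a small bookkeeping sketch. This is not the MMPM implementation: the class name `TwoMemoryBank`, the `update_every` interval that decides which frames count as "selected", and the fixed-size window `temporal_size` are all hypothetical choices made for illustration, and the features are stand-in values rather than real network outputs.

```python
from collections import deque
from typing import Any, Deque, List


class TwoMemoryBank:
    """Hypothetical container for a sequential and a temporal memory."""

    def __init__(self, update_every: int = 5, temporal_size: int = 3) -> None:
        # Assumption: a frame is "selected" when its index is a multiple
        # of update_every. The page does not state the selection rule.
        self.update_every = update_every
        # Sequential memory: low-resolution features from selected frames.
        self.sequential: List[Any] = []
        # Temporal memory: high-resolution features from recent frames.
        # Assumption: only the last temporal_size frames are kept.
        self.temporal: Deque[Any] = deque(maxlen=temporal_size)

    def update(self, frame_idx: int, low_res_feat: Any, high_res_feat: Any) -> None:
        """Record the features of one processed frame."""
        if frame_idx % self.update_every == 0:
            self.sequential.append(low_res_feat)
        self.temporal.append(high_res_feat)


# Usage: process 12 frames with placeholder features.
bank = TwoMemoryBank(update_every=5, temporal_size=3)
for t in range(12):
    bank.update(t, low_res_feat=f"low{t}", high_res_feat=f"high{t}")

print(bank.sequential)      # ['low0', 'low5', 'low10']
print(list(bank.temporal))  # ['high9', 'high10', 'high11']
```

In this sketch the sequential memory grows slowly and spans the whole video, while the temporal memory holds only the most recent frames; how the actual model reads from either memory is not covered here.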

Method image

Qualitative Results

Qualitative comparison between DEVA, XMem, and MMPM.

DEVA XMem MMPM

Quantitative Results

Quantitative evaluation on the MUG-VOS test set. “SAM” corresponds to an architecture that uses the SAM (Kirillov et al. 2023) encoder and decoder.

Quantitative results image

MUG-VOS Dataset and Benchmark

The MUG-VOS Test dataset, part of our newly introduced MUG-VOS benchmark, is a Multi-Granularity Video Object Segmentation dataset that surpasses existing datasets in mask diversity and density. Unlike other datasets in the table, such as DAVIS and YouTube-VOS, which primarily focus on salient objects and contain few mask tracks per frame (1.63 and 2.6 masks per frame, respectively), MUG-VOS Test includes an average of 29.6 masks per frame. While BURST and VIPSeg target universal segmentation and panoptic segmentation tasks, MUG-VOS provides greater mask density (0.663) and covers a wide range of granularities, including non-salient objects and background clutter. With 59K high-quality masks across 1,999 annotated frames, generated and refined through a semi-automatic pipeline using SAM and human verification, MUG-VOS Test establishes a unique benchmark for evaluating models on diverse and challenging video segmentation tasks.
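The per-frame figures above can be sanity-checked with a short calculation. The sketch below divides the rounded mask count by the number of annotated frames, and also shows one possible way to compute a density value on a toy frame. Reading "mask density" as the fraction of pixels covered by at least one mask is an assumption; the text does not define the metric, and the helper names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def masks_per_frame(total_masks: int, num_frames: int) -> float:
    """Average number of masks per annotated frame."""
    return total_masks / num_frames


def mask_density(masks: Iterable[Set[Pixel]], height: int, width: int) -> float:
    """Fraction of pixels covered by at least one mask (assumed definition)."""
    covered: Set[Pixel] = set()
    for mask in masks:
        covered |= mask
    return len(covered) / (height * width)


# "59K" is a rounded count, so the result only approximates the reported 29.6.
avg = masks_per_frame(59_000, 1_999)
print(round(avg, 1))  # 29.5

# Toy 2x2 frame with two overlapping masks covering 3 of 4 pixels.
toy_masks = [{(0, 0), (0, 1)}, {(0, 1), (1, 1)}]
print(mask_density(toy_masks, height=2, width=2))  # 0.75
```

The small gap between 29.5 and the reported 29.6 is consistent with the exact mask count being slightly above 59,000.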

BibTeX

@misc{lim2024multigranularityvideoobjectsegmentation,
      title={Multi-Granularity Video Object Segmentation}, 
      author={Sangbeom Lim and Seongchan Kim and Seungjun An and Seokju Cho and Paul Hongsuck Seo and Seungryong Kim},
      year={2024},
      eprint={2412.01471},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.01471}, 
}