MUG-VOS: Multi-Granularity Video Object Segmentation

1Korea University, 2KAIST, 3Samsung Electronics
*Equal Contribution, Co-Corresponding Author
arXiv

MUG-VOS contains masks of multiple granularities, from coarse to fine segments.


Abstract

Current benchmarks for video segmentation aim to segment and track objects; however, their annotations are limited to salient objects (i.e., foreground instances). As a result, despite impressive architectural designs, previous works struggle to adapt to real-world scenarios that require tracking arbitrary regions. We believe the research community needs a new video segmentation dataset aimed at tracking multi-granularity segmentation targets in video scenes. In this work, we build a new video segmentation dataset annotated with both salient and non-salient masks. To this end, we propose the large-scale, densely annotated Multi-Granularity Video Object Segmentation (MUG-VOS) dataset, which includes various types and granularities of mask annotations. We automatically collected a training set that supports tracking both salient and non-salient objects, and we curated a human-annotated test set for reliable evaluation. Our Memory-based Mask Propagation Model (MMPM), trained on MUG-VOS, achieves the best performance on the MUG-VOS test set among existing video object segmentation methods and Segment Anything Model (SAM)-based video segmentation methods.

Comparison

Teaser image

Visualization of (top) MUG-VOS masks annotated by our data collection pipeline alongside the ground-truth masks of YouTube-VOS, DAVIS, and UVO, and (bottom) the MUG-VOS dataset. MUG-VOS masks include various types and granularities of objects, parts, stuff, and backgrounds, even regions not covered by existing datasets.

Architecture

We introduce MMPM, a model that generates masks conditioned on its previous predictions. Starting from an initial mask that indicates the target object, MMPM consistently tracks and segments the target throughout the entire video. Sequential memory stores low-resolution features and is updated at every selected frame, while temporal memory retains high-resolution features from previous frames, capturing diverse information gathered across multiple frames. A minimal sketch of this propagation loop is given below the figure.

Method image
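To make the memory-based propagation idea concrete, here is a minimal, hypothetical sketch of a two-bank propagation loop written in PyTorch-style Python. The names (MemoryBank, encode, decode, the update interval, and the mask-weighted values) are placeholders for illustration under our own assumptions; this is not the released MMPM implementation, whose exact memory update rules are described in the paper.

import torch
import torch.nn.functional as F

class MemoryBank:
    """Illustrative FIFO store of (key, value) feature pairs."""
    def __init__(self, max_size):
        self.max_size = max_size
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.max_size:  # evict the oldest entry
            self.keys.pop(0)
            self.values.pop(0)

    def read(self, query):
        # Attention-style readout: softmax over key similarity, weighted sum of values.
        keys = torch.cat(self.keys)      # (N, C)
        values = torch.cat(self.values)  # (N, C)
        attn = F.softmax(query @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return attn @ values             # (HW, C)

def propagate(frames, init_mask, encode, decode, update_every=5):
    """Propagate init_mask through `frames`; `encode` and `decode` are
    user-supplied stand-ins for the image encoder and mask decoder."""
    temporal_mem = MemoryBank(max_size=16)   # high-resolution features from earlier frames
    sequential_mem = MemoryBank(max_size=1)  # low-resolution features, refreshed at selected frames
    mask, outputs = init_mask, []            # init_mask assumed flattened to feature resolution (HW,)
    for t, frame in enumerate(frames):
        feat = encode(frame)                                      # (HW, C) frame features
        readout = temporal_mem.read(feat) if temporal_mem.keys else feat
        mask = decode(feat, readout, mask)                        # predict the current mask
        outputs.append(mask)
        if t % update_every == 0:                                 # memorize selected frames only
            value = feat * mask.reshape(-1, 1)                    # mask-weighted features as values
            temporal_mem.write(feat, value)                       # high-resolution entry
            sequential_mem.write(feat[::4], value[::4])           # crude low-resolution (subsampled) entry
    return outputs

The split into a small, frequently refreshed bank and a larger bank of past frames mirrors the sequential/temporal description above; the subsampling and the fixed update interval are simplifications chosen only to keep the sketch short.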

Qualitative Results

Qualitative comparison between DEVA, XMem, and MMPM.


Quantitative Results

Quantitative evaluation on the MUG-VOS test set. "SAM" corresponds to an architecture that uses the SAM (Kirillov et al. 2023) encoder and decoder.

Quantitative results image

MUG-VOS Dataset and Benchmark

The MUG-VOS test set, part of our newly introduced MUG-VOS benchmark, is a Multi-Granularity Video Object Segmentation dataset that surpasses existing datasets in mask diversity and density. Unlike datasets such as DAVIS and YouTube-VOS, which primarily focus on salient objects and contain few mask tracks per frame (1.63 and 2.6 masks per frame, respectively), MUG-VOS Test includes an average of 29.6 masks per frame. While BURST and VIPSeg target universal and panoptic segmentation, MUG-VOS provides greater mask density (0.663) and covers a wide range of granularities, including non-salient objects and background clutter. With 59K high-quality masks across 1,999 annotated frames, generated and refined through a semi-automatic pipeline using SAM and human verification, MUG-VOS Test establishes a unique benchmark for evaluating models on diverse and challenging video segmentation tasks.
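For intuition only, below is a hedged sketch of what a SAM-based, semi-automatic collection step could look like. It uses the public segment-anything API (SamAutomaticMaskGenerator) for per-frame mask proposals and links proposals across frames with a simple greedy IoU match; the IoU threshold, the association rule, and the "verified" flag are our own placeholders, not the actual MUG-VOS pipeline, which is detailed in the paper.

import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def collect_tracks(frames, checkpoint="sam_vit_h_4b8939.pth", iou_thresh=0.5):
    """Generate multi-granularity mask proposals per frame and link them greedily by IoU.
    `frames` is a list of RGB uint8 arrays; the threshold is illustrative, not the paper's."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    generator = SamAutomaticMaskGenerator(sam)

    tracks = []  # each track: {"masks": {frame_idx: bool array}, "verified": bool}
    for t, frame in enumerate(frames):
        proposals = [m["segmentation"] for m in generator.generate(frame)]
        for seg in proposals:
            # Greedy association against the most recent mask of each existing track.
            best, best_iou = None, iou_thresh
            for track in tracks:
                last = track["masks"][max(track["masks"])]
                score = iou(seg, last)
                if score > best_iou:
                    best, best_iou = track, score
            if best is not None:
                best["masks"][t] = seg
            else:
                tracks.append({"masks": {t: seg}, "verified": False})

    # Human verification happens after automatic collection; here it is only a flag
    # to be set by an annotator who reviews each track.
    return tracks

Matching raw per-frame proposals by IoU is a deliberately crude stand-in for mask propagation across frames; it is meant only to show how SAM proposals and human verification slot into a collection loop.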

BibTeX

@misc{lim2024multigranularityvideoobjectsegmentation,
      title={Multi-Granularity Video Object Segmentation}, 
      author={Sangbeom Lim and Seongchan Kim and Seungjun An and Seokju Cho and Paul Hongsuck Seo and Seungryong Kim},
      year={2024},
      eprint={2412.01471},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.01471}, 
}