Referring Video Object Segmentation via Language-aligned Track Selection

¹Korea University, ²KAIST
arXiv 2024

SOLA: Selection by Object Language Alignment

SOLA demonstrates exceptional temporal consistency and precise vision-language alignment,
even when the expression describes complex motions.


Example referring expressions from the teaser videos:
"The two vehicles parked on the side of the road"
"moving from right to left"
"It enthusiastically chases and pounces on the wand"
"baby tiger without moving position"

Abstract

Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video tracking, which is crucial for accurate video segmentation: the resulting inconsistent tracking disrupts vision-language alignment and leads to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. For track generation, we leverage a vision foundation model, the Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a lightweight yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences, enabling precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-YouTube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur.

Motivation

Motivation figures

Previous methods train track generation and vision-language alignment simultaneously, whereas our approach trains only the latter. As a result, prior work often generates inconsistent mask tracks, which in turn limits performance, as exemplified in (c), while our method produces more consistent outputs.

Architecture


Overall pipeline of the proposed SOLA framework

Our core idea is to reformulate referring video object segmentation as two sub-problems: track generation and track selection. We first generate candidate mask tracks with the Segment Anything Model 2 (SAM2), ensuring consistent and clean mask tracks. Our lightweight language-aligned track selection module then efficiently selects the referred mask tracks through motion modeling and object-language alignment. During inference, we leverage the visual grounding model Grounding DINO for efficient candidate track generation.
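The sketch below illustrates this two-stage flow in PyTorch. It is a minimal, illustrative stand-in rather than the released implementation: track_feats and text_feat are random tensors standing in for per-frame track embeddings (e.g., derived from SAM2 mask tracks) and a pooled expression embedding, and the mean-pool-plus-cosine-similarity selector is a hypothetical simplification of SOLA's learned selection module.

import torch
import torch.nn.functional as F

# Assumed shapes: N candidate tracks, T frames, D embedding dimension.
N, T, D = 8, 16, 256

# Stand-ins for stage 1 (track generation) outputs and the text encoder:
# a real system would compute these from the video and the expression.
track_feats = torch.randn(N, T, D)  # one feature per candidate track per frame
text_feat = torch.randn(D)          # pooled embedding of the referring expression

def select_track(track_feats, text_feat):
    """Score each candidate track against the expression and pick the best.

    Temporal mean pooling summarizes each track's appearance over time into
    a single vector; cosine similarity then aligns tracks with the text.
    SOLA's actual selection module is learned and also models motion.
    """
    pooled = track_feats.mean(dim=1)                                  # (N, D)
    scores = F.cosine_similarity(pooled, text_feat.unsqueeze(0), -1)  # (N,)
    return scores.argmax().item(), scores

best, scores = select_track(track_feats, text_feat)
print(f"selected track {best}")

Because the tracker is frozen, only the small selection module needs training; swapping in stronger track generators (or, at inference, Grounding DINO prompts for candidate generation) leaves this selection step unchanged.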

Qualitative Results

Quantitative Results

The best-performing results are presented in bold, while the second-best results are underlined.
Main Quantitative Results
Quantitative comparison on MeViS.
Zero-shot Quantitative Results
Zero-shot quantitative comparison on Ref-YouTube-VOS and Ref-DAVIS.

BibTeX

@misc{kim2024referringvideoobjectsegmentation,
      title={Referring Video Object Segmentation via Language-aligned Track Selection}, 
      author={Seongchan Kim and Woojeong Jin and Sangbeom Lim and Heeji Yoon and Hyunwook Choi and Seungryong Kim},
      year={2024},
      eprint={2412.01136},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.01136}, 
}