InterRVOS: Interaction-aware Referring Video Object Segmentation

KAIST AI
arXiv 2025

We introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that
extends standard RVOS by requiring the model to segment the actor and target objects separately.


Q: Please segment 'Man holding a baby'

A: Sure, it's Actor and Target

Abstract

Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred objects (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine-grained understanding of object relationships, as many video events are defined by such relationships rather than individual objects. To support this task, we propose a new evaluation protocol that separately evaluates actor and target segmentation, enabling more accurate assessment of the model's ability to distinguish and segment actor and target roles. We also present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects. Furthermore, we develop ReVIOSa, an MLLM-based architecture that introduces interaction-aware special tokens and leverages an attention mask loss to enhance role-specific segmentation. Extensive experiments show that ReVIOSa not only outperforms existing baselines on our proposed InterRVOS-127K evaluation set, but also achieves strong performance on standard RVOS benchmarks.

Dataset Statistics

Dataset | Pub. & Year | Videos | Objects | Expressions | Obj/Video | Actor-Target Interaction
A2D Sentence | CVPR 2018 | 3,782 | 4,825 | 6,656 | 1.28 | -
J-HMDB Sentence | CVPR 2018 | 928 | 928 | 928 | 1.00 | -
Ref-DAVIS | ACCV 2018 | 90 | 205 | 1,544 | 2.27 | -
Ref-Youtube-VOS | ECCV 2020 | 3,978 | 7,451 | 15,009 | 1.86 | -
MeViS | ICCV 2023 | 2,006 | 8,171 | 28,570 | 4.28 | -
ReVOS | ECCV 2024 | 1,042 | 5,535 | 35,074 | 5.31 | -
Ref-SAV | CVPRW 2025 | 37,311 | 72,509 | 72,509 | 1.94 | -
InterRVOS-127K (Ours) | - | 8,738 | 35,247 | 127,236 | 4.03 | 17,604

InterRVOS-127K offers the largest number of referring expressions and a high object-per-video ratio, enabling richer and more diverse visual grounding across complex scenes compared to existing benchmarks. Unlike existing datasets, InterRVOS-127K also provides interaction-aware referring expressions that explicitly distinguish between actor and target roles, enabling fine-grained understanding of visual interactions.

Dataset Annotation Pipeline


Our proposed automatic data annotation pipeline constructs referring expressions for single-object, multi-object, and interaction scenarios in four stages, which extract object appearance and motion, detect inter-object interactions, and generate detailed expressions grounded in both visual properties and interaction context.
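The sketch below outlines how a pipeline of this kind could be organized. It is an illustrative sketch under assumed interfaces; all function and field names are hypothetical and do not correspond to the actual InterRVOS-127K tooling.

```python
# Illustrative sketch of a four-stage annotation pipeline; every name here is
# hypothetical and does not reflect the actual InterRVOS-127K annotation code.
from dataclasses import dataclass

@dataclass
class ObjectRecord:
    object_id: int
    appearance: str = ""   # e.g. "a man in a red jacket"
    motion: str = ""       # e.g. "walking to the left"

@dataclass
class Interaction:
    actor_id: int
    target_id: int
    relation: str           # e.g. "holding", "throwing"

def describe_appearance(video, masks):
    """Stage 1 (assumed): caption each masked object's appearance."""
    return {oid: ObjectRecord(oid, appearance=f"object {oid}") for oid in masks}

def describe_motion(video, masks, records):
    """Stage 2 (assumed): summarize each object's motion over the clip."""
    for oid in masks:
        records[oid].motion = "moving across the scene"
    return records

def detect_interactions(video, masks):
    """Stage 3 (assumed): detect directed actor -> target relations."""
    ids = list(masks)
    return [Interaction(a, t, "interacting with") for a in ids for t in ids if a != t]

def generate_expressions(records, interactions):
    """Stage 4 (assumed): compose single-object and interaction expressions."""
    single = [f"{r.appearance} {r.motion}" for r in records.values()]
    inter = [f"{records[i.actor_id].appearance} {i.relation} "
             f"{records[i.target_id].appearance}" for i in interactions]
    return single + inter

def annotate(video, masks):
    records = describe_motion(video, masks, describe_appearance(video, masks))
    return generate_expressions(records, detect_interactions(video, masks))
```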

Dataset Samples


Our dataset includes a wide range of referring expressions, covering challenging cases such as multi-object references and motion-only descriptions, as well as a diverse spectrum of expression granularity, from simple class-level descriptions to fine-grained appearance-based references. In addition to conventional referring expressions, InterRVOS-127K explicitly incorporates interaction-focused expressions that distinguish between actor and target roles. The examples also show multiple objects within a single video and highlight the relationships between them, confirming that our dataset effectively captures object-level interactions in complex visual scenes.
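For concreteness, an interaction-aware annotation can be thought of as one expression paired with two role-specific mask references. The record below is purely hypothetical and does not reflect the actual InterRVOS-127K file format.

```python
# Hypothetical shape of an interaction-aware annotation entry (illustrative only;
# the actual InterRVOS-127K format is not documented on this page).
example_record = {
    "video_id": "0001",
    "expression": "a man holding a baby",
    "type": "interaction",
    "actor_object_id": 3,     # object whose masks serve as the actor ground truth
    "target_object_id": 7,    # object whose masks serve as the target ground truth
}
```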

Architecture: ReVIOSa


As InterRVOS emphasizes a detailed understanding of object interactions and diverse motion dynamics, we propose ReVIOSa (Referring Video Interaction-aware Object Segmentation), a novel architecture tailored for this purpose. Unlike the standard RVOS setting, which typically segments only the actor object referred to in the expression, InterRVOS requires comprehensive reasoning over the described interaction, explicitly identifying the roles of both the actor and the target and segmenting them accordingly. To address these challenges, ReVIOSa utilizes interaction-aware special tokens and leverages an attention mask loss (AML) to enable accurate disambiguation of actor and target roles and to capture inter-object dynamics.
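As a rough illustration of these two ingredients (not the paper's implementation; tensor shapes, names, and the exact loss form are assumptions), the sketch below adds actor/target special tokens and penalizes the gap between each role token's attention map and the corresponding downsampled ground-truth mask.

```python
import torch
import torch.nn.functional as F

# Hypothetical role-specific special tokens added to the MLLM vocabulary.
SPECIAL_TOKENS = ["[ACTOR]", "[TARGET]"]

def attention_mask_loss(attn_maps, gt_masks):
    """Assumed form of an attention mask loss: push each role token's attention
    over the visual tokens toward that role's (downsampled) ground-truth mask.

    attn_maps: (num_roles, H, W) attention weights in [0, 1]
    gt_masks:  (num_roles, H_gt, W_gt) binary ground-truth masks
    """
    gt = F.interpolate(gt_masks.float().unsqueeze(1),
                       size=attn_maps.shape[-2:], mode="nearest").squeeze(1)
    return F.binary_cross_entropy(attn_maps.clamp(1e-6, 1 - 1e-6), gt)

# Dummy example: two roles (actor, target) attending over a 16x16 token grid.
attn = torch.rand(2, 16, 16)
masks = torch.rand(2, 64, 64) > 0.5
loss = attention_mask_loss(attn, masks)
```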

Qualitative Results


Q: Please segment 'Man helping a child'

A: Sure, it's Actor and Target


Quantitative Results

The best-performing results are presented in bold, while the second-best results are underlined.
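As context for these tables, the evaluation scores actor and target predictions separately. Below is a minimal, illustrative sketch of per-role region similarity (J, i.e., mask IoU); it is not the official evaluation code, which also reports the boundary measure F.

```python
import numpy as np

def region_similarity(pred, gt):
    """Jaccard index (J) between predicted and ground-truth binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def per_role_scores(preds, gts):
    """Average J separately over actor masks and target masks (simplified)."""
    return {role: float(np.mean([region_similarity(p, g)
                                 for p, g in zip(preds[role], gts[role])]))
            for role in ("actor", "target")}
```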

BibTeX


@misc{jin2025interrvosinteractionawarereferringvideo,
    title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
    author={Woojeong Jin and Seongchan Kim and Seungryong Kim},
    year={2025},
    eprint={2506.02356},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.02356},
}