InterRVOS: Interaction-aware Referring Video Object Segmentation

KAIST AI
arXiv 2025

We introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that
extends standard RVOS by requiring the model to segment the actor and target objects separately.


Q: Please segment 'Man holding a baby'

A: Sure, it's Actor and Target

Abstract

Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred objects (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine-grained understanding of object relationships, as many video events are defined by such relationships rather than individual objects. To support this task, we propose a new evaluation protocol that separately evaluates actor and target segmentation, enabling more accurate assessment of the model's ability to distinguish and segment actor and target roles. We also present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects. Furthermore, we develop ReVIOSa, an MLLM-based architecture that introduces interaction-aware special tokens and leverages an attention mask loss to enhance role-specific segmentation. Extensive experiments show that ReVIOSa not only outperforms existing baselines on our proposed InterRVOS-127K evaluation set, but also achieves strong performance on standard RVOS benchmarks.

Dataset Statistics

Dataset | Pub. & Year | Videos | Objects | Expressions | Obj/Video | Actor-Target Interaction
A2D Sentence | CVPR 2018 | 3,782 | 4,825 | 6,656 | 1.28 | -
J-HMDB Sentence | CVPR 2018 | 928 | 928 | 928 | 1.00 | -
Ref-DAVIS | ACCV 2018 | 90 | 205 | 1,544 | 2.27 | -
Ref-Youtube-VOS | ECCV 2020 | 3,978 | 7,451 | 15,009 | 1.86 | -
MeViS | ICCV 2023 | 2,006 | 8,171 | 28,570 | 4.28 | -
ReVOS | ECCV 2024 | 1,042 | 5,535 | 35,074 | 5.31 | -
Ref-SAV | CVPRW 2025 | 37,311 | 72,509 | 72,509 | 1.94 | -
InterRVOS-127K (Ours) | - | 8,738 | 35,247 | 127,236 | 4.03 | 17,604

InterRVOS-127K offers the largest number of referring expressions and a high object-per-video ratio, enabling richer and more diverse visual grounding across complex scenes compared to existing benchmarks. Unlike existing datasets, InterRVOS-127K also provides interaction-aware referring expressions that explicitly distinguish between actor and target roles, enabling fine-grained understanding of visual interactions.

Dataset Annotation Pipeline


Our proposed automatic data annotation pipeline constructs referring expressions for single-object, multi-object, and interaction scenarios in four stages, which extract object appearance and motion, detect inter-object interactions, and generate detailed expressions grounded in both visual properties and interaction context.
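The sketch below outlines how a pipeline of this kind could be organized. It is an illustrative sketch under assumed interfaces; all function and field names are hypothetical and do not correspond to the actual InterRVOS-127K tooling.

```python
# Illustrative sketch of a four-stage annotation pipeline; every name here is
# hypothetical and does not reflect the actual InterRVOS-127K annotation code.
from dataclasses import dataclass

@dataclass
class ObjectRecord:
    object_id: int
    appearance: str = ""   # e.g. "a man in a red jacket"
    motion: str = ""       # e.g. "walking to the left"

@dataclass
class Interaction:
    actor_id: int
    target_id: int
    relation: str           # e.g. "holding", "throwing"

def describe_appearance(video, masks):
    """Stage 1 (assumed): caption each masked object's appearance."""
    return {oid: ObjectRecord(oid, appearance=f"object {oid}") for oid in masks}

def describe_motion(video, masks, records):
    """Stage 2 (assumed): summarize each object's motion over the clip."""
    for oid in masks:
        records[oid].motion = "moving across the scene"
    return records

def detect_interactions(video, masks):
    """Stage 3 (assumed): detect directed actor -> target relations."""
    ids = list(masks)
    return [Interaction(a, t, "interacting with") for a in ids for t in ids if a != t]

def generate_expressions(records, interactions):
    """Stage 4 (assumed): compose single-object and interaction expressions."""
    single = [f"{r.appearance} {r.motion}" for r in records.values()]
    inter = [f"{records[i.actor_id].appearance} {i.relation} "
             f"{records[i.target_id].appearance}" for i in interactions]
    return single + inter

def annotate(video, masks):
    records = describe_motion(video, masks, describe_appearance(video, masks))
    return generate_expressions(records, detect_interactions(video, masks))
```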

Dataset Samples


Our dataset includes a wide range of referring expressions, covering challenging cases such as multi-object references and motion-only descriptions, as well as a diverse spectrum of expression granularity, from simple class-level descriptions to fine-grained appearance-based references. In addition to conventional referring expressions, InterRVOS-127K explicitly incorporates interaction-focused expressions that distinguish between actor and target roles. The examples also show multiple objects within a single video and highlight the relationships between them, confirming that our dataset effectively captures object-level interactions in complex visual scenes.
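For concreteness, an interaction-aware annotation can be thought of as one expression paired with two role-specific mask references. The record below is purely hypothetical and does not reflect the actual InterRVOS-127K file format.

```python
# Hypothetical shape of an interaction-aware annotation entry (illustrative only;
# the actual InterRVOS-127K format is not documented on this page).
example_record = {
    "video_id": "0001",
    "expression": "a man holding a baby",
    "type": "interaction",
    "actor_object_id": 3,     # object whose masks serve as the actor ground truth
    "target_object_id": 7,    # object whose masks serve as the target ground truth
}
```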

Architecture: ReVIOSa


As InterRVOS emphasizes a detailed understanding of object interactions and diverse motion dynamics, we propose ReVIOSa (Referring Video Interaction-aware Object Segmentation), a novel architecture tailored for this purpose. Unlike the standard RVOS setting, which typically segments only the actor object referred to in the expression, InterRVOS requires comprehensive reasoning over the described interaction, explicitly identifying the roles of both the actor and the target and segmenting them accordingly. To address these challenges, ReVIOSa utilizes interaction-aware special tokens and leverages an attention mask loss (AML) to enable accurate disambiguation of actor and target roles and to capture inter-object dynamics.
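As a rough illustration of these two ingredients (not the paper's implementation; tensor shapes, names, and the exact loss form are assumptions), the sketch below adds actor/target special tokens and penalizes the gap between each role token's attention map and the corresponding downsampled ground-truth mask.

```python
import torch
import torch.nn.functional as F

# Hypothetical role-specific special tokens added to the MLLM vocabulary.
SPECIAL_TOKENS = ["[ACTOR]", "[TARGET]"]

def attention_mask_loss(attn_maps, gt_masks):
    """Assumed form of an attention mask loss: push each role token's attention
    over the visual tokens toward that role's (downsampled) ground-truth mask.

    attn_maps: (num_roles, H, W) attention weights in [0, 1]
    gt_masks:  (num_roles, H_gt, W_gt) binary ground-truth masks
    """
    gt = F.interpolate(gt_masks.float().unsqueeze(1),
                       size=attn_maps.shape[-2:], mode="nearest").squeeze(1)
    return F.binary_cross_entropy(attn_maps.clamp(1e-6, 1 - 1e-6), gt)

# Dummy example: two roles (actor, target) attending over a 16x16 token grid.
attn = torch.rand(2, 16, 16)
masks = torch.rand(2, 64, 64) > 0.5
loss = attention_mask_loss(attn, masks)
```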

Qualitative Results


Q: Please segment 'Man helping a child'

A: Sure, it's Actor and Target


Quantitative Results

The best-performing results are presented in bold, while the second-best results are underlined.
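As context for these tables, the evaluation scores actor and target predictions separately. Below is a minimal, illustrative sketch of per-role region similarity (J, i.e., mask IoU); it is not the official evaluation code, which also reports the boundary measure F.

```python
import numpy as np

def region_similarity(pred, gt):
    """Jaccard index (J) between predicted and ground-truth binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def per_role_scores(preds, gts):
    """Average J separately over actor masks and target masks (simplified)."""
    return {role: float(np.mean([region_similarity(p, g)
                                 for p, g in zip(preds[role], gts[role])]))
            for role in ("actor", "target")}
```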

BibTeX


@misc{jin2025interrvosinteractionawarereferringvideo,
    title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
    author={Woojeong Jin and Seongchan Kim and Seungryong Kim},
    year={2025},
    eprint={2506.02356},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.02356},
}