InterRVOS: Interaction-aware Referring Video Object Segmentation

KAIST AI
arXiv 2025

We introduce Interaction-aware Referring Video Object Segmentation, a new task that
requires segmenting both actor and target entities involved in an interaction.


Q: Please segment 'Man holding a baby'.

A: Sure, it's Actor and Target (the model returns a separate mask for each role).

Abstract

Referring video object segmentation aims to segment the object in a video that corresponds to a given natural language expression. While prior work has explored various referring scenarios, including motion-centric and multi-instance expressions, most approaches still focus on localizing a single target object in isolation. However, in comprehensive video understanding, an object's role is often defined by its interactions with other entities, which are largely overlooked in existing datasets and models. In this work, we introduce InterRVOS, a new task that requires segmenting both the actor and target entities involved in an interaction. Each interaction is described through a pair of complementary expressions from different semantic perspectives, enabling fine-grained modeling of inter-object relationships. To tackle this task, we propose InterRVOS-8K, a large-scale, automatically constructed dataset containing diverse interaction-aware expressions, including challenging cases such as motion-only multi-instance expressions. We also present ReVIOSa, a baseline architecture designed to handle actor-target segmentation from a single expression, achieving strong performance in both standard and interaction-focused settings. Furthermore, we introduce an actor-target-aware evaluation setting that enables a more targeted assessment of interaction understanding. Experimental results demonstrate that our approach outperforms prior methods in modeling complex object interactions for the referring video object segmentation task, establishing a strong foundation for future research in interaction-centric video understanding.

Dataset Statistics

| Dataset | Pub. & Year | Videos | Objects | Expressions | Obj./Video | Actor-Target Interactions |
|---|---|---|---|---|---|---|
| A2D Sentence | CVPR 2018 | 3,782 | 4,825 | 6,656 | 1.28 | - |
| J-HMDB Sentence | CVPR 2018 | 928 | 928 | 928 | 1.00 | - |
| Ref-DAVIS | ACCV 2018 | 90 | 205 | 1,544 | 2.27 | - |
| Ref-Youtube-VOS | ECCV 2020 | 3,978 | 7,451 | 15,009 | 1.86 | - |
| MeViS | ICCV 2023 | 2,006 | 8,171 | 28,570 | 4.28 | - |
| ReVOS | ECCV 2024 | 1,042 | 5,535 | 35,074 | 5.31 | - |
| Ref-SAV | CVPRW 2025 | 37,311 | 72,509 | 72,509 | 1.94 | - |
| InterRVOS-8K (Ours) | - | 8,738 | 35,247 | 127,314 | 4.03 | 17,682 |

Our newly proposed InterRVOS-8K offers the largest number of referring expressions and a high object-per-video ratio, enabling richer and more diverse visual grounding across complex scenes compared to existing benchmarks. Unlike existing datasets, InterRVOS-8K also provides interaction-aware referring expressions that explicitly distinguish between actor and target roles, enabling fine-grained understanding of visual interactions.
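To make the actor-target annotation format concrete, the sketch below shows how a single interaction could be stored: one pair of complementary expressions tied to two object ids. The field names (video_id, actor_id, actor_expression, and so on) are illustrative assumptions, not the released InterRVOS-8K schema.

```python
# Hypothetical interaction-aware annotation entry. Field names are
# assumptions for illustration, not the released InterRVOS-8K schema.
interaction_entry = {
    "video_id": "0001",
    "actor_id": 3,     # id of the acting object
    "target_id": 7,    # id of the object being acted upon
    # Complementary expressions describing the same interaction
    # from the two semantic perspectives:
    "actor_expression": "the man holding a baby",
    "target_expression": "the baby being held by the man",
}
```

Each expression grounds to its own object's per-frame masks, while the pair jointly describes one of the 17,682 annotated interactions.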

Dataset Annotation Pipeline


Our automatic annotation pipeline constructs referring expressions for single-object, multi-object, and interaction scenarios in four stages: it extracts object appearance and motion, detects inter-object interactions, and generates detailed expressions grounded in both visual properties and interaction context.
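As a rough illustration of how these stages could fit together, the sketch below organizes the pipeline into four calls: appearance description, motion description, interaction detection, and expression generation. The function names and signatures (describe_appearance, detect_interactions, compose, and the captioner object itself) are our assumptions, not the released annotation code.

```python
# Minimal sketch of a four-stage automatic annotation pipeline.
# All function names and signatures are hypothetical.

def annotate_video(video, object_masks, captioner):
    # Stage 1: per-object appearance descriptions
    appearance = {oid: captioner.describe_appearance(video, mask)
                  for oid, mask in object_masks.items()}

    # Stage 2: per-object motion descriptions over the clip
    motion = {oid: captioner.describe_motion(video, mask)
              for oid, mask in object_masks.items()}

    # Stage 3: detect (actor, target, relation) triples between objects
    interactions = captioner.detect_interactions(video, object_masks)

    # Stage 4: generate expressions grounded in appearance, motion,
    # and interaction context, one per semantic perspective
    entries = []
    for actor, target, relation in interactions:
        entries.append({
            "actor_id": actor,
            "target_id": target,
            "actor_expression": captioner.compose(
                appearance[actor], motion[actor], relation, role="actor"),
            "target_expression": captioner.compose(
                appearance[target], motion[target], relation, role="target"),
        })
    return entries
```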

Dataset Samples


Our dataset includes a wide range of referring expressions, covering challenging cases such as multi-object references and motion-only descriptions, as well as a diverse spectrum of expression granularity, from simple class-level descriptions to fine-grained appearance-based references. In addition to conventional referring expressions, InterRVOS-8K explicitly incorporates interaction-focused expressions that distinguish between actor and target roles. The examples also demonstrate the presence of multiple objects within a single video and highlight the relationships between them, confirming that our dataset effectively captures object-level interactions in complex visual scenes.

Baseline Approach: ReVIOSa


As our proposed dataset InterRVOS-8K emphasizes a detailed understanding of object interactions and diverse motion dynamics, we present ReVIOSa (Referring Video Interaction-aware Object Segmentation), a baseline architecture tailored for this purpose. Unlike prior RVOS approaches that typically segment only the actor referred to in the language expression, ReVIOSa is designed to jointly reason about and segment both the actor object and the target object, especially in cases involving unidirectional interactions. This enables a richer interpretation of referring expressions by explicitly modeling the relational context between interacting entities.
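This page does not spell out the architecture, but a common way to realize joint actor-target prediction in recent text-to-mask models is to let the language model emit two dedicated mask tokens whose embeddings are decoded against video features. The head below is a minimal sketch under that assumption; it illustrates the joint-prediction interface, not the actual ReVIOSa implementation.

```python
import torch
import torch.nn as nn

class ActorTargetHead(nn.Module):
    """Hypothetical head decoding actor and target mask tokens into masks."""

    def __init__(self, hidden_dim=256):
        super().__init__()
        # Shared projection from token embedding to mask-query space
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, video_feats, act_token, tgt_token):
        # video_feats: (T, C, H, W) frame features
        # act_token, tgt_token: (C,) embeddings of the two mask tokens
        queries = torch.stack([self.proj(act_token), self.proj(tgt_token)])  # (2, C)
        # Dot-product mask prediction per frame -> (2, T, H, W) logits
        masks = torch.einsum("qc,tchw->qthw", queries, video_feats)
        return masks[0], masks[1]  # actor logits, target logits
```

With such a head, a single expression like 'man holding a baby' yields both masks in one forward pass, matching the task definition above.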

Qualitative Results


Q: Please segment 'Man helping a child'.

A: Sure, it's Actor and Target (one mask is predicted for each role).

Qualitative results on InterRVOS-8K.

Quantitative Results

The best-performing results are presented in bold, while the second-best results are underlined.
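RVOS benchmarks conventionally report region similarity J (mask IoU) and boundary accuracy F, averaged as J&F. An actor-target aware evaluation would plausibly score the actor and target predictions against their respective ground-truth masks and average the two; the snippet below sketches that scheme for the J term only, as our assumption of the setup rather than the official protocol.

```python
import numpy as np

def region_j(pred, gt):
    """IoU between boolean mask arrays of shape (T, H, W)."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both empty counts as a correct prediction
    return np.logical_and(pred, gt).sum() / union

def actor_target_j(pred_actor, gt_actor, pred_target, gt_target):
    # Score each role separately, then average for the interaction
    j_actor = region_j(pred_actor, gt_actor)
    j_target = region_j(pred_target, gt_target)
    return (j_actor + j_target) / 2.0
```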

BibTeX


@misc{jin2025interrvosinteractionawarereferringvideo,
    title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
    author={Woojeong Jin and Seongchan Kim and Seungryong Kim},
    year={2025},
    eprint={2506.02356},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.02356},
}