We introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that extends standard RVOS by requiring the model to segment the actor and target objects separately.
Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches segment only the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, yet standard RVOS segments only the actor (A), neglecting the involved target object (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task focused on modeling interactions: it requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This formulation enables fine-grained understanding of object relationships, since many video events are defined by such relationships rather than by individual objects. To support this task, we propose a new evaluation protocol that scores actor and target segmentation separately, enabling a more accurate assessment of a model's ability to distinguish and segment the two roles. We also present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects. Furthermore, we develop ReVIOSa, an MLLM-based architecture that introduces interaction-aware special tokens and leverages an attention mask loss to enhance role-specific segmentation. Extensive experiments show that ReVIOSa not only outperforms existing baselines on our proposed InterRVOS-127K evaluation set, but also achieves strong performance on standard RVOS benchmarks.
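To make the role-separated evaluation concrete, below is a minimal sketch that scores actor and target masks independently (per-frame IoU, i.e. region similarity J) and reports both values plus their mean. The function names, the use of J alone, and the averaging are our assumptions for illustration, not the exact protocol of the paper; standard RVOS evaluation typically also includes contour accuracy (F), omitted here for brevity.

```python
import numpy as np

def mask_iou(pred, gt):
    """Region similarity (J): IoU between a binary predicted mask and its ground truth."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0  # treat two empty masks as a match

def evaluate_roles(actor_preds, actor_gts, target_preds, target_gts):
    """Score actor and target masks independently across frames, then report both."""
    j_actor = float(np.mean([mask_iou(p, g) for p, g in zip(actor_preds, actor_gts)]))
    j_target = float(np.mean([mask_iou(p, g) for p, g in zip(target_preds, target_gts)]))
    return {"J_actor": j_actor, "J_target": j_target, "J_mean": (j_actor + j_target) / 2.0}
```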
| Dataset | Pub. & Year | Videos | Objects | Expressions | Obj./Video | Actor-Target Interactions |
|---|---|---|---|---|---|---|
| A2D Sentence | CVPR 2018 | 3,782 | 4,825 | 6,656 | 1.28 | - |
| J-HMDB Sentence | CVPR 2018 | 928 | 928 | 928 | 1.00 | - |
| Ref-DAVIS | ACCV 2018 | 90 | 205 | 1,544 | 2.27 | - |
| Ref-Youtube-VOS | ECCV 2020 | 3,978 | 7,451 | 15,009 | 1.86 | - |
| MeViS | ICCV 2023 | 2,006 | 8,171 | 28,570 | 4.28 | - |
| ReVOS | ECCV 2024 | 1,042 | 5,535 | 35,074 | 5.31 | - |
| Ref-SAV | CVPRW 2025 | 37,311 | 72,509 | 72,509 | 1.94 | - |
| InterRVOS-127K (Ours) | - | 8,738 | 35,247 | 127,236 | 4.03 | 17,604 |
InterRVOS-127K offers the largest number of referring expressions among these benchmarks and a high object-per-video ratio, enabling richer and more diverse visual grounding across complex scenes. Unlike existing datasets, it also provides interaction-aware referring expressions that explicitly distinguish between actor and target roles, enabling fine-grained understanding of visual interactions.
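For illustration only, an interaction expression in such a dataset could be stored roughly as below, with a single expression linked to distinct actor and target mask tracks. The field names and structure are hypothetical and do not reflect the released InterRVOS-127K annotation format.

```python
# Hypothetical annotation record for an interaction expression (field names are
# illustrative; they do not reflect the actual InterRVOS-127K release format).
interaction_annotation = {
    "video_id": "0001",
    "expression": "a man throwing a ball",
    "type": "interaction",            # vs. a plain referring expression
    "actor_object_ids": [3],          # mask track(s) for the acting object ("a man")
    "target_object_ids": [7],         # mask track(s) for the acted-upon object ("a ball")
}
```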
Q: Please segment 'Man helping a child'.
A: Sure, it's Actor and Target.
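As a rough sketch of how interaction-aware special tokens could be turned into role-specific predictions, the snippet below locates hypothetical [ACTOR] and [TARGET] tokens in the model's text response; in an MLLM pipeline, the hidden states at these positions would then be routed to a mask decoder. The token names and parsing logic are assumptions, not ReVIOSa's actual implementation.

```python
import re

# Hypothetical role tokens; the actual special-token vocabulary may differ.
ROLE_TOKENS = {"[ACTOR]": "actor", "[TARGET]": "target"}

def parse_role_tokens(response: str) -> dict:
    """Map each role found in the response to the character offset of its token.

    In a full pipeline, the decoder hidden state at each such position would be
    fed to a mask head to produce that role's segmentation masks.
    """
    roles = {}
    for token, role in ROLE_TOKENS.items():
        match = re.search(re.escape(token), response)
        if match:
            roles[role] = match.start()
    return roles

print(parse_role_tokens("Sure, it's [ACTOR] and [TARGET]"))
# -> {'actor': 11, 'target': 23}
```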
@misc{jin2025interrvosinteractionawarereferringvideo,
  title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
  author={Woojeong Jin and Seongchan Kim and Seungryong Kim},
  year={2025},
  eprint={2506.02356},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.02356},
}