arXiv 2026
Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free pipelines for this task follow a common pattern: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the result. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. We propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent by generating mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning candidates guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance across multiple benchmarks, with consistent results across diverse MLLM backbones.
Given a video and a natural language query, our pipeline operates in two phases:
For referring queries (a), SAM3 generates tracks using the extracted concept. For reasoning queries (b), the video is provided alongside the expression. SAM3 processes all frames, enabling detection even when the target appears sparsely (e.g., "hand" in two frames).
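The two query paths above can be sketched as a single dispatch function. This is a minimal illustration, not the actual implementation: `extract_concept`, `sam3_track`, and `is_reasoning` are hypothetical stand-ins for the MLLM call, the SAM3 tracker, and the query classifier.

```python
def generate_tracks(video, query, extract_concept, sam3_track, is_reasoning):
    """Phase-1 sketch: candidate mask-track generation over all frames.

    Every callable is a hypothetical stand-in: `extract_concept` wraps
    the MLLM, `sam3_track` wraps SAM3, and `is_reasoning` distinguishes
    reasoning queries from plain referring ones.
    """
    if is_reasoning(query):
        # Reasoning query: the video is provided alongside the expression.
        concept = extract_concept(query, video=video)
    else:
        # Referring query: the concept is extracted from the text alone.
        concept = extract_concept(query, video=None)
    # SAM3 processes all frames, so even a target that appears sparsely
    # (e.g. a hand visible in only two frames) still yields a track.
    return sam3_track(video, concept)
```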
Between candidate mask-track generation and iterative spatio-temporal pruning, an appearance tool can optionally be invoked. When appearance-level evidence is needed, the agent generates a brief description for each candidate object to supplement information obscured by visual prompting.
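As a rough sketch of this optional step (the `mllm_describe` call and the `best_crop` field are hypothetical stand-ins, not the paper's interface):

```python
def describe_candidates(tracks, mllm_describe):
    """Appearance-tool sketch: one brief textual description per candidate,
    supplementing information that visual prompting may obscure.

    `mllm_describe` stands in for an MLLM captioning call; `best_crop`
    stands in for the candidate's best-visible frame crop.
    """
    return {cid: mllm_describe(track["best_crop"])
            for cid, track in tracks.items()}
```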
The MLLM classifies each candidate as Accept, Reject, or Uncertain, iteratively narrowing the spatio-temporal scope by focusing on frames with uncertain candidates while excluding rejected ones. Starting with candidates [0, 2, 3, 5, 6], the agent prunes [0, 6] and examines [2, 3, 5] until convergence.
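The iterative loop above can be sketched as follows, assuming a `classify` callable that stands in for the MLLM's Accept/Reject/Uncertain verdict (the convergence criterion and round limit here are illustrative, not the paper's exact settings):

```python
def prune_candidates(candidates, classify, max_rounds=10):
    """Iteratively narrow the spatio-temporal scope.

    `candidates` maps track id -> frame indices where the track exists
    (SAM3's temporal existence information).  `classify(cid, focus)` is a
    hypothetical MLLM call returning "accept", "reject", or "uncertain".
    """
    active = dict(candidates)
    accepted = []
    for _ in range(max_rounds):
        # Focus on frames where the still-uncertain candidates appear,
        # excluding frames contributed only by rejected candidates.
        focus = sorted({f for frames in active.values() for f in frames})
        labels = {cid: classify(cid, focus) for cid in active}
        accepted += [cid for cid, lab in labels.items() if lab == "accept"]
        active = {cid: frames for cid, frames in active.items()
                  if labels[cid] == "uncertain"}
        if not active:  # converged: no uncertain candidates remain
            break
    return accepted
```

Run on the example from the text, candidates [0, 2, 3, 5, 6] shrink to [2, 3, 5] after the first round prunes [0, 6], and the loop repeats until no candidate is labeled uncertain.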
We evaluate AgentRVOS on multiple RVOS benchmarks: MeViS, ReVOS, and ReasonVOS.
AgentRVOS achieves state-of-the-art performance across all benchmarks with different MLLM backbones.
| Method | MLLM | MeViS J | MeViS F | MeViS J&F | ReVOS J | ReVOS F | ReVOS J&F | ReasonVOS J | ReasonVOS F | ReasonVOS J&F |
|---|---|---|---|---|---|---|---|---|---|---|
| *Training-based Methods* | | | | | | | | | | |
| VideoLISA | LLaVA-3.8B | 41.3 | 47.6 | 44.4 | - | - | - | 45.1 | 49.9 | 47.5 |
| VISA | Chat-UniVi-13B | 41.8 | 47.1 | 44.5 | 48.8 | 52.9 | 50.9 | - | - | - |
| HyperSeg | Mipha-3B | - | - | - | 53.1 | 58.4 | 55.7 | - | - | - |
| InstructSeg | Mipha-3B | - | - | - | 52.0 | 56.9 | 54.5 | - | - | - |
| GLUS | LISA-7B | 48.5 | 54.2 | 51.3 | 52.4 | 57.3 | 54.9 | 47.5 | 52.4 | 49.9 |
| ViLLa | InternVideo2-6B | 46.5 | 52.3 | 49.4 | 54.9 | 59.1 | 57.0 | - | - | - |
| Sa2VA | InternVL2-8B | - | - | 46.9 | - | - | 57.6 | - | - | - |
| Sa2VA | InternVL3-14B | - | - | - | - | - | 60.7 | - | - | - |
| RGA3 | Qwen2.5-VL-7B | 47.4 | 52.8 | 50.1 | 55.9 | 60.0 | 58.0 | 51.3 | 56.0 | 53.6 |
| VRS-HQ | Chat-UniVi-13B | 48.0 | 53.7 | 50.9 | 57.6 | 62.5 | 60.0 | - | - | - |
| VideoSeg-R1 | Qwen2.5-VL-7B | 52.7 | 57.8 | 55.3 | 58.2 | 64.0 | 61.1 | - | - | - |
| *Training-free Methods* | | | | | | | | | | |
| AL-Ref-SAM2 | GPT-4 | 39.5 | 46.2 | 42.8 | - | - | - | - | - | - |
| CoT-RVS | Gemma3-12B | 40.3 | 48.1 | 44.2 | 43.4 | 50.9 | 47.1 | 47.5 | 54.0 | 50.7 |
| CoT-RVS † | Qwen3-VL-8B-T | 37.7 | 43.9 | 40.8 | 51.4 | 57.3 | 54.3 | 52.5 | 58.9 | 55.7 |
| CoT-RVS | GPT-4o | 48.7 | 55.7 | 52.2 | 52.8 | 59.0 | 55.9 | 62.4 | 68.7 | 65.5 |
| AgentRVOS (Ours) | Qwen3-VL-8B-T | 59.2 | 64.5 | 61.9 | 57.7 | 61.9 | 59.8 | 65.5 | 71.8 | 68.6 |
| AgentRVOS (Ours) | Qwen3-VL-32B-T | 65.3 | 70.0 | 67.7 | 60.4 | 64.7 | 62.5 | 67.3 | 73.4 | 70.4 |
| AgentRVOS (Ours) | GPT-5 | 70.4 | 75.7 | 73.1 | 64.1 | 68.4 | 66.3 | 73.1 | 78.0 | 75.5 |
Best results are in bold, second-best are underlined.
† denotes reproduced results.
We visualize the predicted mask tracks on MeViS, ReVOS, and ReasonVOS.
If you find our work useful, please consider citing our paper.
@misc{jin2026agentrvosreasoningobjecttracks,
title = {AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation},
author = {Woojeong Jin and Jaeho Lee and Heeseong Shin and Seungho Jang and Junhwan Heo and Seungryong Kim},
year = {2026},
eprint = {2603.23489},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2603.23489},
}