AgentRVOS: Reasoning Over Object Tracks for Zero-Shot Referring Video Object Segmentation

arXiv 2026

KAIST AI
*: Equal contribution †: Corresponding Author
Teaser Figure
TL;DR

We propose AgentRVOS, a training-free, agentic RVOS pipeline that first obtains full-video object tracks with SAM3 and then lets an MLLM identify the referred target through query-grounded reasoning over object-level evidence, achieving state-of-the-art performance across multiple benchmarks.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free pipelines for this task follow a common pattern: an MLLM selects keyframes and grounds the referred object within those frames, and a video segmentation model then propagates the result. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. We propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent by generating mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning candidates guided by SAM3’s temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance across multiple benchmarks, with consistent results across diverse MLLM backbones.

AgentRVOS

Method Overview

Given a video and a natural language query, our pipeline operates in two phases:

  • Candidate Mask Track Generation — the MLLM extracts concepts from the query, which SAM3 uses to produce temporally consistent candidate mask tracks through an iterative process.
  • Iterative Spatio-temporal Pruning — the MLLM reasons over candidates, classifying each as Accepted, Rejected, or Uncertain, while progressively narrowing the spatio-temporal scope until convergence.
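The two phases can be read as a simple data flow. The sketch below is a minimal illustration under stated assumptions: `MaskTrack`, `run_two_phase`, and the three toy callables are hypothetical names standing in for the MLLM and SAM3, not part of the AgentRVOS code or any real API, and the iterative refinement inside each phase is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MaskTrack:
    """One candidate object: an id plus the frames in which it exists."""
    track_id: int
    frames: List[int]


def run_two_phase(
    query: str,
    num_frames: int,
    extract_concept: Callable[[str], str],                   # stand-in for the MLLM
    generate_tracks: Callable[[str, int], List[MaskTrack]],  # stand-in for SAM3
    prune: Callable[[str, List[MaskTrack]], List[MaskTrack]],
) -> List[MaskTrack]:
    # Phase 1: the query becomes a concept, and the concept becomes
    # candidate mask tracks covering the whole video.
    concept = extract_concept(query)
    candidates = generate_tracks(concept, num_frames)
    # Phase 2: reason over the candidates and keep only the referred target(s).
    return prune(query, candidates)


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_extract_concept(query: str) -> str:
    return query.split()[-1]  # naive choice: last word of the query


def toy_generate_tracks(concept: str, num_frames: int) -> List[MaskTrack]:
    return [MaskTrack(0, list(range(num_frames))), MaskTrack(1, [3, 4])]


def toy_prune(query: str, candidates: List[MaskTrack]) -> List[MaskTrack]:
    return [t for t in candidates if len(t.frames) < 5]  # keep short-lived tracks


result = run_two_phase("the briefly visible hand", 10,
                       toy_extract_concept, toy_generate_tracks, toy_prune)
print([t.track_id for t in result])
```

The point of the structure is that pruning receives complete tracks, so the reasoning step never has to pick frames before object-level evidence exists.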

1. Candidate Mask Track Generation

For referring queries (a), SAM3 generates tracks using the extracted concept. For reasoning queries (b), the video is provided alongside the expression. SAM3 processes every frame, so the target is detected even when it appears only sparsely (e.g., a "hand" visible in just two frames).
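To illustrate why full-frame processing matters for sparse targets, the toy sketch below compares uniform keyframe sampling with per-frame existence flags. The helper names, the 100-frame video, and the frame indices are illustrative assumptions, not details taken from AgentRVOS or SAM3.

```python
from typing import List


def existence_frames(present: List[bool]) -> List[int]:
    """Frames in which a track exists, given one flag per frame."""
    return [i for i, flag in enumerate(present) if flag]


def uniform_keyframes(num_frames: int, k: int) -> List[int]:
    """Pick k roughly evenly spaced frame indices (a simple keyframe heuristic)."""
    step = num_frames / k
    return [int(i * step) for i in range(k)]


num_frames = 100
# Hypothetical target (e.g. a hand) visible only in frames 40 and 41.
present = [i in (40, 41) for i in range(num_frames)]

target_frames = existence_frames(present)
keyframes = uniform_keyframes(num_frames, 8)
# True when no sampled keyframe overlaps the frames containing the target.
missed = not set(keyframes) & set(target_frames)

print(target_frames)
print(keyframes)
print(missed)
```

In this toy setting the eight sampled keyframes skip both frames in which the target is visible, while the per-frame existence list recovers them.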

2. Appearance Tool (Optional)

Between candidate mask track generation and iterative spatio-temporal pruning, the appearance tool can be optionally invoked. When appearance-level evidence is needed, the agent generates brief descriptions for each candidate object to supplement information obscured by visual prompting.

3. Iterative Spatio-temporal Pruning

The MLLM classifies each candidate as Accepted, Rejected, or Uncertain, iteratively narrowing the spatio-temporal scope by focusing on frames with uncertain candidates while excluding rejected ones. For example, starting with candidates [0, 2, 3, 5, 6], the agent rejects [0, 6] and keeps examining [2, 3, 5] until convergence.
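A minimal sketch of such a loop, using the candidate ids from the example, might look as follows. Everything here is an assumption made for illustration: the `classify` callable stands in for the MLLM, the per-track frame sets stand in for SAM3's temporal existence information, the stopping rule and `max_rounds` are one plausible choice, and the final labels in the toy classifier are invented.

```python
from typing import Callable, Dict, List, Set, Tuple

ACCEPTED, REJECTED, UNCERTAIN = "Accepted", "Rejected", "Uncertain"
Classifier = Callable[[List[int], List[int], int], Dict[int, str]]


def iterative_prune(tracks: Dict[int, Set[int]], classify: Classifier,
                    max_rounds: int = 5) -> Tuple[List[int], List[int]]:
    """Return (accepted ids, ids still uncertain when the loop stops)."""
    accepted: List[int] = []
    active = sorted(tracks)
    scope = sorted(set().union(*tracks.values()))
    for round_idx in range(max_rounds):
        if not active:
            break
        labels = classify(active, scope, round_idx)
        accepted += [c for c in active if labels[c] == ACCEPTED]
        uncertain = [c for c in active if labels[c] == UNCERTAIN]
        if uncertain == active:
            break  # nothing was resolved, so treat this as convergence
        active = uncertain
        # Narrow the temporal scope to frames where uncertain candidates exist.
        scope = sorted(set().union(*(tracks[c] for c in active)))
    return accepted, active


# Toy data: candidate id -> frames in which that track exists (made up).
tracks = {0: {0, 1, 2}, 2: {10, 11, 12}, 3: {11, 12, 13},
          5: {12, 13, 14}, 6: {30, 31}}
seen_scopes: List[List[int]] = []


def toy_classify(active: List[int], scope: List[int], round_idx: int) -> Dict[int, str]:
    seen_scopes.append(scope)
    if round_idx == 0:  # first pass: reject 0 and 6, stay unsure about the rest
        return {c: REJECTED if c in (0, 6) else UNCERTAIN for c in active}
    return {c: ACCEPTED if c == 3 else REJECTED for c in active}


accepted, unresolved = iterative_prune(tracks, toy_classify)
print(accepted, unresolved, seen_scopes[1])
```

After the first round the scope shrinks from every frame that contains any candidate to only the frames in which candidates 2, 3, and 5 exist, which is the narrowing behavior described above.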

Experiments

Quantitative Results

We evaluate AgentRVOS on multiple RVOS benchmarks: MeViS, ReVOS, and ReasonVOS.
AgentRVOS achieves state-of-the-art performance across all benchmarks with different MLLM backbones.

Method            MLLM               MeViS             ReVOS             ReasonVOS
                                     J     F     J&F   J     F     J&F   J     F     J&F
Training-based Methods
VideoLISA         LLaVA-3.8B         41.3  47.6  44.4  -     -     -     45.1  49.9  47.5
VISA              Chat-UniVi-13B     41.8  47.1  44.5  48.8  52.9  50.9  -     -     -
HyperSeg          Mipha-3B           -     -     -     53.1  58.4  55.7  -     -     -
InstructSeg       Mipha-3B           -     -     -     52.0  56.9  54.5  -     -     -
GLUS              LISA-7B            48.5  54.2  51.3  52.4  57.3  54.9  47.5  52.4  49.9
ViLLa             InternVideo2-6B    46.5  52.3  49.4  54.9  59.1  57.0  -     -     -
Sa2VA             InternVL2-8B       -     -     46.9  -     -     57.6  -     -     -
Sa2VA             InternVL3-14B      -     -     -     -     -     60.7  -     -     -
RGA3              Qwen2.5-VL-7B      47.4  52.8  50.1  55.9  60.0  58.0  51.3  56.0  53.6
VRS-HQ            Chat-UniVi-13B     48.0  53.7  50.9  57.6  62.5  60.0  -     -     -
VideoSeg-R1       Qwen2.5-VL-7B      52.7  57.8  55.3  58.2  64.0  61.1  -     -     -
Training-free Methods
AL-Ref-SAM2       GPT-4              39.5  46.2  42.8  -     -     -     -     -     -
CoT-RVS           Gemma3-12B         40.3  48.1  44.2  43.4  50.9  47.1  47.5  54.0  50.7
CoT-RVS †         Qwen3-VL-8B-T      37.7  43.9  40.8  51.4  57.3  54.3  52.5  58.9  55.7
CoT-RVS           GPT-4o             48.7  55.7  52.2  52.8  59.0  55.9  62.4  68.7  65.5
AgentRVOS (Ours)  Qwen3-VL-8B-T      59.2  64.5  61.9  57.7  61.9  59.8  65.5  71.8  68.6
AgentRVOS (Ours)  Qwen3-VL-32B-T     65.3  70.0  67.7  60.4  64.7  62.5  67.3  73.4  70.4
AgentRVOS (Ours)  GPT-5              70.4  75.7  73.1  64.1  68.4  66.3  73.1  78.0  75.5

Best results are in bold, second-best are underlined.
† denotes reproduced results.
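For readers unfamiliar with the column names, the sketch below assumes the definitions commonly used in video object segmentation benchmarks: J is region similarity (intersection over union of predicted and ground-truth masks), F is contour accuracy (a boundary F-measure, not implemented here), and J&F is their mean. The page itself does not spell these out, so treat the code as an illustration of that assumed convention; the function names and toy masks are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J for one frame: intersection over union of two binary masks."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F as the arithmetic mean of J and F."""
    return (j + f) / 2.0


# Toy 2x3 masks: 2 shared foreground pixels, 4 pixels covered in total.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j_frame = region_similarity(pred, gt)

# Consistency check against one table row (CoT-RVS with GPT-4o on MeViS).
mevis_jf = j_and_f(48.7, 55.7)
print(j_frame, round(mevis_jf, 1))
```

Reported scores are typically averaged over frames and videos before rounding, so recomputing J&F from rounded J and F values can differ from a table entry by about 0.1.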

Qualitative Results

We visualize the predicted mask tracks on MeViS, ReVOS, and ReasonVOS.

Video panels: Original, Prediction

Citation

If you find our work useful, please consider citing our paper.


@misc{jin2026agentrvosreasoningobjecttracks,
    title = {AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation},
    author = {Woojeong Jin and Jaeho Lee and Heeseong Shin and Seungho Jang and Junhwan Heo and Seungryong Kim},
    year = {2026},
    eprint = {2603.23489},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV},
    url = {https://arxiv.org/abs/2603.23489},
}