AgentRVOS: Reasoning Over Object Tracks for Zero-Shot Referring Video Object Segmentation

arXiv 2026

Woojeong Jin^* Jaeho Lee^* Heeseong Shin Seungho Jang Junhwan Heo Seungryong Kim^†

KAIST AI

*: Equal contribution †: Corresponding Author

TL;DR

We propose AgentRVOS, a training-free, agentic RVOS pipeline that first obtains full-video object tracks with SAM3 and then lets an MLLM identify the referred target through query-grounded reasoning over object-level evidence, achieving state-of-the-art performance across multiple benchmarks.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free pipelines for this task follow a common pattern: an MLLM selects keyframes and grounds the referred object within those frames, and a video segmentation model then propagates the result. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. We propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent by generating mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning candidates under the guidance of SAM3’s temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance across multiple benchmarks, with consistent results across diverse MLLM backbones.

AgentRVOS

Given a video and a natural language query, our pipeline operates in two phases:

Candidate Mask Track Generation: the MLLM extracts concepts from the query, which SAM3 uses to produce temporally consistent candidate mask tracks through an iterative process.
Iterative Spatio-temporal Pruning: the MLLM reasons over the candidates, classifying each as Accept, Reject, or Uncertain, while progressively narrowing the spatio-temporal scope until convergence.

1. Candidate Mask Track Generation

For referring queries (a), SAM3 generates tracks using the concept extracted from the query. For reasoning queries (b), the video is provided alongside the expression. Because SAM3 processes all frames, the target can be detected even when it appears only sparsely (e.g., a "hand" visible in just two frames).
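To make the idea of full-video tracks with sparse appearances concrete, the sketch below shows one way a candidate mask track and its temporal existence could be represented. The `MaskTrack` class, its fields, and the toy masks are hypothetical illustrations; they are not SAM3's actual output format or API.

```python
from dataclasses import dataclass, field

# A binary mask as nested lists of 0/1 (toy stand-in for a real mask array).
Mask = list[list[int]]


@dataclass
class MaskTrack:
    """Hypothetical container for one candidate object track."""
    track_id: int
    # frame index -> mask; frames where the object is absent are simply missing
    masks: dict[int, Mask] = field(default_factory=dict)

    def existence(self, num_frames: int) -> list[bool]:
        """Per-frame flag: True where the track has a non-empty mask."""
        return [
            t in self.masks and any(any(row) for row in self.masks[t])
            for t in range(num_frames)
        ]

    def frames_present(self, num_frames: int) -> list[int]:
        """Indices of frames in which the object is visible."""
        return [t for t, present in enumerate(self.existence(num_frames)) if present]


# A "hand" that is visible in only two frames of a 10-frame video.
hand = MaskTrack(track_id=0, masks={3: [[0, 1], [1, 1]], 7: [[1, 0], [0, 0]]})
print(hand.frames_present(10))
```

Because every frame is covered by the track, the frames in which the object exists can be read directly from the track rather than guessed through keyframe selection.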

2. Appearance Tool (Optional)

Between candidate mask track generation and iterative spatio-temporal pruning, the appearance tool can be optionally invoked. When appearance-level evidence is needed, the agent generates brief descriptions for each candidate object to supplement information obscured by visual prompting.

3. Iterative Spatio-temporal Pruning

The MLLM classifies each candidate as Accept, Reject, or Uncertain, and iteratively narrows the spatio-temporal scope by focusing on the frames that contain uncertain candidates while excluding rejected ones. In the illustrated example, starting from candidates [0, 2, 3, 5, 6], the agent prunes [0, 6] and keeps examining [2, 3, 5] until convergence.
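The pruning loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `prune` function, the `classify` callback (a stand-in for the MLLM call), the stopping rule, and the toy track data are all hypothetical. In particular, the final decision in the toy classifier (accepting candidate 3) is invented purely so the example terminates.

```python
from typing import Callable

ACCEPT, REJECT, UNCERTAIN = "accept", "reject", "uncertain"

# classify(active_ids, frames_in_scope) -> {track_id: label}; stands in for the MLLM.
Classifier = Callable[[list[int], list[int]], dict[int, str]]


def prune(tracks: dict[int, set[int]], classify: Classifier, max_rounds: int = 5):
    """tracks maps each candidate id to the frames where it exists."""
    accepted: list[int] = []
    active = sorted(tracks)
    for _ in range(max_rounds):
        if not active:
            break
        # Temporal scope: only frames where some still-active candidate exists.
        frames = sorted(set().union(*(tracks[c] for c in active)))
        labels = classify(active, frames)
        accepted += [c for c in active if labels[c] == ACCEPT]
        remaining = [c for c in active if labels[c] == UNCERTAIN]
        if remaining == active:  # nothing resolved this round: treat as converged
            break
        active = remaining  # rejected and accepted candidates leave the scope
    return sorted(accepted), active


def make_toy_classifier() -> Classifier:
    """Scripted stand-in for the MLLM, mirroring the example in the text."""
    calls = {"n": 0}

    def classify(active: list[int], frames: list[int]) -> dict[int, str]:
        calls["n"] += 1
        if calls["n"] == 1:  # round 1: reject 0 and 6, stay unsure about the rest
            return {c: REJECT if c in (0, 6) else UNCERTAIN for c in active}
        return {c: ACCEPT if c == 3 else REJECT for c in active}  # invented outcome

    return classify


tracks = {0: {0, 1}, 2: {4, 5, 6}, 3: {5, 6, 7}, 5: {6, 7}, 6: {12, 13}}
accepted, unresolved = prune(tracks, make_toy_classifier())
print(accepted, unresolved)
```

In the second round the scope shrinks to frames 4 through 7, since the rejected candidates 0 and 6 no longer contribute their frames.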

Experiments

Quantitative Results

We evaluate AgentRVOS on three RVOS benchmarks: MeViS, ReVOS, and ReasonVOS.
With GPT-5, AgentRVOS achieves the best J&F on all three benchmarks (73.1 on MeViS, 66.3 on ReVOS, and 75.5 on ReasonVOS), and it outperforms prior training-free methods with every MLLM backbone tested.

Method	MLLM	MeViS J	MeViS F	MeViS J&F	ReVOS J	ReVOS F	ReVOS J&F	ReasonVOS J	ReasonVOS F	ReasonVOS J&F

Training-based Methods
VideoLISA	LLaVA-3.8B	41.3	47.6	44.4	-	-	-	45.1	49.9	47.5
VISA	Chat-UniVi-13B	41.8	47.1	44.5	48.8	52.9	50.9	-	-	-
HyperSeg	Mipha-3B	-	-	-	53.1	58.4	55.7	-	-	-
InstructSeg	Mipha-3B	-	-	-	52.0	56.9	54.5	-	-	-
GLUS	LISA-7B	48.5	54.2	51.3	52.4	57.3	54.9	47.5	52.4	49.9
ViLLa	InternVideo2-6B	46.5	52.3	49.4	54.9	59.1	57.0	-	-	-
Sa2VA	InternVL2-8B	-	-	46.9	-	-	57.6	-	-	-
Sa2VA	InternVL3-14B	-	-	-	-	-	60.7	-	-	-
RGA3	Qwen2.5-VL-7B	47.4	52.8	50.1	55.9	60.0	58.0	51.3	56.0	53.6
VRS-HQ	Chat-UniVi-13B	48.0	53.7	50.9	57.6	62.5	60.0	-	-	-
VideoSeg-R1	Qwen2.5-VL-7B	52.7	57.8	55.3	58.2	64.0	61.1	-	-	-

Training-free Methods
AL-Ref-SAM2	GPT-4	39.5	46.2	42.8	-	-	-	-	-	-
CoT-RVS	Gemma3-12B	40.3	48.1	44.2	43.4	50.9	47.1	47.5	54.0	50.7
CoT-RVS †	Qwen3-VL-8B-T	37.7	43.9	40.8	51.4	57.3	54.3	52.5	58.9	55.7
CoT-RVS	GPT-4o	48.7	55.7	52.2	52.8	59.0	55.9	62.4	68.7	65.5

AgentRVOS (Ours)	Qwen3-VL-8B-T	59.2	64.5	61.9	57.7	61.9	59.8	65.5	71.8	68.6
AgentRVOS (Ours)	Qwen3-VL-32B-T	65.3	70.0	67.7	60.4	64.7	62.5	67.3	73.4	70.4
AgentRVOS (Ours)	GPT-5	70.4	75.7	73.1	64.1	68.4	66.3	73.1	78.0	75.5

Best results are in bold, second-best are underlined.
† denotes reproduced results.
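For readers unfamiliar with the column names, the sketch below illustrates the metrics under the standard DAVIS-style definitions, which this page does not spell out: J is the region similarity (intersection over union of predicted and ground-truth masks), F is the contour accuracy, and J&F is their mean. The helper names and toy masks are illustrative assumptions. Computing F requires boundary matching and is omitted here.

```python
# A binary mask as nested lists of 0/1 (toy stand-in for a real mask array).
Mask = list[list[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection over union of two same-sized binary masks."""
    pairs = [(p, g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
j = region_similarity(pred, gt)  # 1 overlapping pixel out of 3 in the union
print(round(j, 3))

# Averaging the rounded table entries for AgentRVOS with Qwen3-VL-8B-T on MeViS.
print(round(j_and_f(59.2, 64.5), 2))
```

Averaging the rounded J and F entries can differ from the reported J&F by about 0.1, since the reported values are presumably computed before rounding.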

Qualitative Results

We visualize the predicted mask tracks on MeViS, ReVOS, and ReasonVOS.

[Videos: original frames shown alongside the predicted mask tracks]

Citation

If you find our work useful, please consider citing our paper.


@misc{jin2026agentrvosreasoningobjecttracks,
    title = {AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation},
    author = {Woojeong Jin and Jaeho Lee and Heeseong Shin and Seungho Jang and Junhwan Heo and Seungryong Kim},
    year = {2026},
    eprint = {2603.23489},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV},
    url = {https://arxiv.org/abs/2603.23489},
}