Introducing MV-TAP
MV-TAP (Tracking Any Point in Multi-view Videos) is a robust point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos.
Motivation
Why should we consider using multi-view information when strong 2D point trackers already exist?
Despite their strong capabilities, existing point tracking methods have been explored only in single-view videos.
Because projecting a 3D scene onto a single 2D view inherently introduces geometric ambiguities, such as frequent occlusions, erratic apparent motion, and depth uncertainty, single-view point trackers struggle in these situations.
Consequently, applying these trackers independently to each viewpoint fails to exploit the multi-view cues needed to construct reliable point trajectories.
How is this different from existing multi-view tasks?
Most existing multi-view methods are designed for static scenes, assume rigid geometry, or require geometric priors that are unavailable in casual, in-the-wild videos.
Although a prior approach targets multi-view 3D point tracking in the world coordinate system, it relies on external depth inputs and incurs reprojection errors when its 3D points are mapped back to 2D pixel space.
To address these challenges, we present MV-TAP, a framework that builds a holistic understanding of dynamic multi-view scenes by aggregating information across views and timesteps through camera encoding and cross-view attention.
How does the attention in MV-TAP fundamentally differ from the attention mechanisms used in single-view point tracking and multi-view matching?
Unlike existing methods, MV-TAP introduces a joint attention mechanism that models temporal and cross-view interactions simultaneously. This attention module lets the model exchange information between viewpoints, overcoming view-dependent ambiguities. Single-view point trackers use attention primarily for temporal consistency and ignore cross-view cues, while multi-view matching enforces only cross-view consistency and fails to maintain temporal coherence. MV-TAP combines temporal and cross-view alignment, keeping the tracked point consistent across time and viewpoints.
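To make this concrete, below is a minimal PyTorch sketch of such a joint attention layer, assuming track tokens shaped batch × views × time × points × channels; the module name `JointViewTimeAttention` and all shapes are illustrative assumptions rather than the actual MV-TAP implementation.

```python
import torch
import torch.nn as nn

class JointViewTimeAttention(nn.Module):
    """Illustrative sketch: attend over all (view, time) positions jointly.

    Track tokens are shaped (B, V, T, N, C): batch, views, timesteps,
    tracked points, channels. Flattening views and time into one sequence
    per point lets a single attention pass exchange information across
    both axes at once."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, V, T, N, C = tokens.shape
        # One sequence of V*T tokens per (batch, point): temporal and
        # cross-view interactions are modeled in the same attention call.
        x = tokens.permute(0, 3, 1, 2, 4).reshape(B * N, V * T, C)
        x = self.norm(x)
        out, _ = self.attn(x, x, x)
        out = out.reshape(B, N, V, T, C).permute(0, 2, 3, 1, 4)
        return tokens + out  # residual connection


# Example: 2 views, 8 frames, 16 tracked points, 128-dim tokens.
tokens = torch.randn(1, 2, 8, 16, 128)
print(JointViewTimeAttention(128)(tokens).shape)  # torch.Size([1, 2, 8, 16, 128])
```

In contrast, a purely temporal attention would operate on sequences of length T within each view, and a purely cross-view attention on sequences of length V within each frame; neither sees both axes at once.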
What is the core goal of the model?
Motivated by the above, we introduce multi-view point tracking in 2D camera space.
Our goal is to leverage multi-view information to enhance tracking performance while maintaining the strong spatio-temporal consistency established by 2D trackers.
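Concretely, under our reading of the task (the notation and input/output convention below are our own illustration, not taken verbatim from the paper), a multi-view 2D point tracker maps $V$ synchronized videos, their camera parameters, and a set of query points to a 2D trajectory and a visibility estimate in every view's pixel space:

$$
f_\theta\Big(\{I^v\}_{v=1}^{V},\ \{\mathrm{cam}^v\}_{v=1}^{V},\ \{\mathbf{q}_i\}_{i=1}^{N}\Big)
\;=\;
\Big\{\hat{\mathbf{p}}^{\,v}_{i,t}\in\mathbb{R}^{2},\ \hat{o}^{\,v}_{i,t}\in[0,1]\Big\}_{v\le V,\ i\le N,\ t\le T},
$$

where $\hat{\mathbf{p}}^{\,v}_{i,t}$ is the predicted pixel position of query point $i$ at frame $t$ in view $v$ and $\hat{o}^{\,v}_{i,t}$ its predicted visibility. The key difference from single-view tracking is that all views are predicted jointly rather than independently.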
Method
Our approach integrates a strong 2D tracking backbone with additional modules designed to leverage multi-view information. Specifically, we introduce a camera encoding module to inject geometric information and a cross-view attention module to aggregate complementary cues across viewpoints. This combination allows our model to achieve robust spatio-temporal consistency across multiple views.
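To make the two modules concrete, here is a minimal sketch of one plausible camera encoding: flattened intrinsics and extrinsics passed through a small MLP to produce a per-view embedding that is added to the track tokens before cross-view attention. The parameterization, shapes, and the name `CameraEncoding` are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CameraEncoding(nn.Module):
    """Illustrative sketch: embed each view's camera parameters and add the
    per-view embedding to that view's track tokens, so the subsequent
    cross-view attention knows which camera each token comes from."""

    def __init__(self, dim: int):
        super().__init__()
        # 9 intrinsic + 12 extrinsic values per camera (flattened K and [R|t]).
        self.mlp = nn.Sequential(nn.Linear(9 + 12, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens, intrinsics, extrinsics):
        # tokens: (B, V, T, N, C); intrinsics: (B, V, 3, 3); extrinsics: (B, V, 3, 4)
        B, V = intrinsics.shape[:2]
        cams = torch.cat([intrinsics.reshape(B, V, 9),
                          extrinsics.reshape(B, V, 12)], dim=-1)
        cam_emb = self.mlp(cams)                      # (B, V, C)
        return tokens + cam_emb[:, :, None, None, :]  # broadcast over T and N
```

Combined with a joint view-time attention like the sketch above, this is one way geometric information and complementary cross-view cues can be injected into a 2D tracking backbone.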
Qualitative Results
Multi-view results on DexYCB dataset
Multi-view results on Panoptic Studio dataset
Multi-view results on Harmony4D dataset
Comparison on DexYCB dataset
Comparison on Panoptic Studio dataset
Quantitative Results
Consistently strong across diverse multi-view point tracking scenarios
We compare our approach with recent state-of-the-art point trackers on the DexYCB, Panoptic Studio, Kubric, and Harmony4D datasets. Compared to the baselines, MV-TAP achieves superior performance, demonstrating its ability to leverage multi-view information.
| Method | Target | Space | Depth | DexYCB | | | Panoptic Studio | | | Harmony4D | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| Single-view input | | | | | | | | | | | | |
| TAPIR | 2D | Camera | | 29.6 | 43.9 | 66.4 | 22.1 | 39.3 | 60.0 | 27.6 | 53.1 | 60.0 |
| CoTracker2 | 2D | Camera | | 37.5 | 62.5 | 69.4 | 33.3 | 59.1 | 64.4 | 37.2 | 71.9 | 55.7 |
| LocoTrack | 2D | Camera | | 38.7 | 55.8 | 74.1 | 34.9 | 56.1 | 67.5 | 40.8 | 72.0 | 64.1 |
| CoTracker3 | 2D | Camera | | 41.5 | 59.6 | 76.4 | 39.6 | 61.4 | 72.3 | 41.4 | 73.5 | 63.2 |
| SpatialTracker | 3D | Camera | | 23.2 | 43.3 | 61.8 | 19.7 | 40.5 | 59.6 | 25.4 | 54.4 | 58.1 |
| Multi-view input | | | | | | | | | | | | |
| CoTracker3 w/ Flat. | 2D | Camera | | 2.7 | 7.1 | 35.7 | 1.0 | 12.7 | 38.8 | 2.1 | 20.7 | 46.4 |
| CoTracker3 w/ Tri. | 2D | Camera | | 39.2 | 57.1 | 76.4 | 37.9 | 59.5 | 72.3 | 39.2 | 70.4 | 63.2 |
| MVTracker | 3D | World | | - | 32.6 | - | - | 62.4 | - | - | 13.3 | - |
| MV-TAP (Ours) | 2D | Camera | | 44.2 | 61.9 | 78.3 | 40.3 | 62.8 | 73.1 | 42.6 | 74.9 | 65.8 |
'Target' denotes the dimension of the predicted trajectory. 'Space' specifies the coordinate domain: 'Camera' and 'World' denote pixel space and world space, respectively. 'Depth' indicates whether depth input is required.
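For reference, the reported metrics follow the usual TAP-Vid naming: $<\delta^{x}_{avg}$ is position accuracy averaged over pixel thresholds, OA is visibility-classification accuracy, and AJ (Average Jaccard) combines localization and visibility. Below is a minimal NumPy sketch of the first two, assuming the standard TAP-Vid definitions and thresholds of 1, 2, 4, 8, and 16 pixels, evaluated per view in camera space; the benchmark's exact evaluation code may differ.

```python
import numpy as np

def position_accuracy(pred_xy, gt_xy, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    """<delta^x_avg sketch: fraction of visible ground-truth points whose
    prediction lies within each pixel threshold, averaged over thresholds.
    pred_xy, gt_xy: (T, N, 2) pixel coordinates; gt_visible: (T, N) bool."""
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)   # (T, N) pixel error
    return float(np.mean([(err[gt_visible] < d).mean() for d in thresholds]))

def occlusion_accuracy(pred_visible, gt_visible):
    """OA sketch: binary accuracy of the predicted visibility flags."""
    return float((pred_visible == gt_visible).mean())
```

The $<\delta^{x}_{occ}$ metric reported in the occlusion study below is the same position accuracy restricted to in-frame points that are occluded in the ground truth.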
Ablation Studies
Ablation on the number of views
We compare MV-TAP with the baselines under a varying number of input views. While performance generally improves with more views, the baselines exhibit only marginal gains. In contrast, MV-TAP improves consistently and by a significantly larger margin, highlighting its superior ability to leverage multi-view information.
| Method | 2 views | | | 4 views | | | 6 views | | | 8 views | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 37.5 | 56.4 | 77.8 | 38.9 | 56.6 | 75.1 | 41.0 | 58.9 | 75.8 | 41.5 | 59.6 | 76.4 |
| CoTracker3 w/ Flat. | 9.5 | 19.5 | 54.3 | 4.4 | 8.4 | 44.5 | 3.2 | 8.3 | 39.1 | 2.7 | 7.1 | 35.7 |
| CoTracker3 w/ Tri. | 37.1 | 55.6 | 77.8 | 37.8 | 55.3 | 75.1 | 38.7 | 56.4 | 75.8 | 39.2 | 57.1 | 76.4 |
| MVTracker | - | 35.8 | - | - | 31.8 | - | - | 32.8 | - | - | 32.6 | - |
| MV-TAP (Ours) | 39.2 | 56.8 | 76.8 | 40.3 | 57.7 | 75.2 | 43.3 | 60.7 | 76.9 | 44.2 | 61.9 | 78.3 |
Can multi-view resolve occlusion ambiguity?
We also evaluate position accuracy on in-frame occluded points. Our model remains robust under occlusion, indicating that it effectively utilizes multi-view cues. $<\delta^{x}_{occ}$ denotes the position accuracy computed only on in-frame occluded points.
| Method | DexYCB | | Panoptic Studio | | Harmony4D | |
|---|---|---|---|---|---|---|
| | $<\delta^{x}_{avg}$ | $<\delta^{x}_{occ}$ | $<\delta^{x}_{avg}$ | $<\delta^{x}_{occ}$ | $<\delta^{x}_{avg}$ | $<\delta^{x}_{occ}$ |
| CoTracker3 | 59.6 | 33.9 | 61.4 | 46.2 | 73.5 | 58.4 |
| CoTracker3 w/ Flat. | 7.1 | 1.9 | 12.7 | 6.3 | 20.7 | 15.0 |
| CoTracker3 w/ Tri. | 57.1 | 34.8 | 59.5 | 47.7 | 70.4 | 59.1 |
| MVTracker | 32.6 | 16.0 | 62.4 | 61.2 | 13.3 | 8.9 |
| MV-TAP (Ours) | 61.9 | 38.4 | 62.8 | 48.7 | 74.9 | 60.3 |
Effect of additional training
Although MV-TAP and the CoTracker3 baseline are initialized from the same pretrained model, MV-TAP attains consistently higher performance across all metrics. This shows that its gains stem primarily from the architectural design rather than merely from extended training.
| Method | DexYCB | | | Panoptic Studio | | |
|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 41.8 | 59.0 | 73.8 | 39.6 | 61.6 | 71.8 |
| MV-TAP (Ours) | 44.2 | 61.9 | 78.3 | 40.3 | 62.8 | 73.1 |
Comparison on varying numbers of query points
We measure tracking performance under varying numbers of query points. Our model consistently outperforms the baselines across both sparse and dense settings.
| Method | 50 Points | | | 100 Points | | | 300 Points | | | 500 Points | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 42.0 | 59.9 | 74.6 | 41.9 | 60.1 | 77.2 | 41.5 | 59.6 | 76.4 | 41.5 | 59.8 | 76.7 |
| CoTracker3 w/ Flat. | 2.7 | 7.4 | 34.4 | 2.5 | 6.8 | 35.6 | 2.7 | 7.1 | 35.7 | 2.6 | 7.0 | 35.7 |
| CoTracker3 w/ Tri. | 39.4 | 53.4 | 74.6 | 39.4 | 57.1 | 77.2 | 39.2 | 57.1 | 76.4 | 39.5 | 57.3 | 76.7 |
| MVTracker | - | 34.2 | - | - | 32.9 | - | - | 32.6 | - | - | 34.9 | - |
| MV-TAP (Ours) | 44.3 | 62.0 | 77.5 | 44.7 | 62.5 | 78.3 | 44.2 | 61.9 | 78.3 | 44.3 | 62.1 | 78.7 |
Ablation on model architecture
We present an ablation study on the components introduced for multi-view awareness. Combining the cross-view attention and the camera encoding yields the best results on both datasets, indicating that the two modules are complementary in leveraging multi-view information.
| Method | DexYCB | | | Panoptic Studio | | |
|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 41.5 | 59.6 | 76.4 | 39.6 | 61.4 | 72.3 |
| + View attn. | 43.6 | 61.5 | 77.4 | 38.6 | 61.6 | 69.4 |
| + Cam embed. | 42.2 | 60.6 | 78.0 | 39.9 | 60.9 | 73.0 |
| MV-TAP (Ours) | 44.2 | 61.9 | 78.3 | 40.3 | 62.8 | 73.1 |
Comparison under frequently occluded trajectories
We evaluate methods on trajectories with high occlusion frequency, measured by the visibility-transition rate (one plausible reading of this measure is sketched after the table). MV-TAP leverages cross-view cues to remain robust on frequently occluded points, improving AJ, $<\delta^{x}_{avg}$, and OA.
| Method | DexYCB | | | Panoptic Studio | | |
|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 26.2 | 43.4 | 66.6 | 37.4 | 60.6 | 69.0 |
| CoTracker3 w/ Flat. | 0.5 | 1.8 | 41.2 | 0.8 | 13.4 | 40.6 |
| CoTracker3 w/ Tri. | 26.0 | 43.6 | 66.6 | 36.6 | 59.4 | 69.0 |
| MVTracker | - | 7.9 | - | - | 59.4 | - |
| MV-TAP (Ours) | 29.7 | 47.3 | 70.5 | 38.0 | 61.9 | 69.9 |
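The visibility-transition rate used above to select frequently occluded trajectories is not defined in detail on this page; purely as an illustration, one plausible reading counts how often a trajectory's ground-truth visibility flips between consecutive frames:

```python
import numpy as np

def visibility_transition_rate(visible):
    """Fraction of consecutive frame pairs where a point's visibility flips.
    visible: (T, N) boolean visibility per frame and point; returns (N,)."""
    flips = visible[1:] != visible[:-1]   # (T-1, N) visibility transitions
    return flips.mean(axis=0)

# A trajectory that toggles between visible and occluded often gets a high
# rate and would be counted among the frequently occluded trajectories.
vis = np.array([[1, 1], [0, 1], [1, 1], [0, 1]], dtype=bool)
print(visibility_transition_rate(vis))  # [1. 0.]
```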
Conclusion
This work establishes multi-view 2D point tracking as a new and important task for advancing reliable spatio-temporal correspondence in dynamic, real-world scenes. By introducing MV-TAP, a model that aggregates cross-view information through camera encoding and cross-view attention, we demonstrate how multi-view inputs can overcome key limitations of monocular trackers, such as occlusion and motion ambiguity. Together with a large-scale synthetic dataset and a real-world evaluation dataset specifically designed for this task, our contributions provide both a principled formulation of the problem and a strong baseline method, paving the way for future research in robust multi-view point tracking.
Citation
If you use this work or find it helpful, please consider citing:
@article{koo2025mvtap,
title={MV-TAP: Tracking Any Point in Multi-View Videos},
author={Koo, Jahyeok and Kim, In{\`e}s Hyeonsu and Kim, Mungyeom and Park, Junghyun and Park, Seohyun and Kim, Jaeyeong and Yi, Jung and Cho, Seokju and Kim, Seungryong},
journal={arXiv preprint arXiv:2512.02006},
year={2025}
}