Introducing MV-TAP
MV-TAP (Tracking Any Point in Multi-view Videos) is a robust point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos.
Motivation
Why should we consider using multi-view information when strong 2D point trackers already exist?
Despite their strong capabilities, existing point tracking methods have been explored only in single-view videos.
Because projecting a 3D scene onto a single 2D view inherently introduces geometric ambiguities, such as frequent occlusions, erratic apparent motion, and depth uncertainty, single-view point trackers struggle in these situations.
Consequently, applying these trackers independently to each viewpoint fails to exploit the multi-view cues needed to construct reliable point trajectories.
How is this different from existing multi-view tasks?
Most existing multi-view methods are designed for static scenes, assume rigid geometry, or require geometric priors that are unavailable in casual, in-the-wild videos.
Although a prior approach targets multi-view 3D point tracking in the world coordinate system, it relies on external depth inputs and incurs reprojection errors when its 3D points are mapped back to 2D pixel space.
To address these challenges, we present MV-TAP, a framework that builds a holistic understanding of dynamic multi-view scenes by aggregating information across views and timesteps through camera encoding and cross-view attention.
How does the attention in MV-TAP fundamentally differ from the attention mechanisms used in single-view point tracking and multi-view matching?
Unlike existing methods, MV-TAP introduces a joint attention mechanism that models temporal and cross-view interactions simultaneously. This attention module lets the model exchange information between viewpoints, overcoming view-dependent ambiguities. Single-view point trackers use attention primarily for temporal consistency and ignore cross-view cues, while multi-view matching enforces only cross-view consistency and fails to maintain temporal coherence. MV-TAP combines temporal and cross-view alignment, keeping the tracked point consistent across time and viewpoints.
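To make this concrete, below is a minimal PyTorch sketch of such a joint attention layer, assuming track tokens shaped batch × views × time × points × channels; the module name `JointViewTimeAttention` and all shapes are illustrative assumptions rather than the actual MV-TAP implementation.

```python
import torch
import torch.nn as nn

class JointViewTimeAttention(nn.Module):
    """Illustrative sketch: attend over all (view, time) positions jointly.

    Track tokens are shaped (B, V, T, N, C): batch, views, timesteps,
    tracked points, channels. Flattening views and time into one sequence
    per point lets a single attention pass exchange information across
    both axes at once."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, V, T, N, C = tokens.shape
        # One sequence of V*T tokens per (batch, point): temporal and
        # cross-view interactions are modeled in the same attention call.
        x = tokens.permute(0, 3, 1, 2, 4).reshape(B * N, V * T, C)
        x = self.norm(x)
        out, _ = self.attn(x, x, x)
        out = out.reshape(B, N, V, T, C).permute(0, 2, 3, 1, 4)
        return tokens + out  # residual connection


# Example: 2 views, 8 frames, 16 tracked points, 128-dim tokens.
tokens = torch.randn(1, 2, 8, 16, 128)
print(JointViewTimeAttention(128)(tokens).shape)  # torch.Size([1, 2, 8, 16, 128])
```

In contrast, a purely temporal attention would operate on sequences of length T within each view, and a purely cross-view attention on sequences of length V within each frame; neither sees both axes at once.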
What is the core goal of the model?
Motivated by the above, we introduce multi-view point tracking in 2D camera space.
Our goal is to leverage multi-view information to enhance tracking performance while maintaining the strong spatio-temporal consistency established by 2D trackers.
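Concretely, under our reading of the task (the notation and input/output convention below are our own illustration, not taken verbatim from the paper), a multi-view 2D point tracker maps $V$ synchronized videos, their camera parameters, and a set of query points to a 2D trajectory and a visibility estimate in every view's pixel space:

$$
f_\theta\Big(\{I^v\}_{v=1}^{V},\ \{\mathrm{cam}^v\}_{v=1}^{V},\ \{\mathbf{q}_i\}_{i=1}^{N}\Big)
\;=\;
\Big\{\hat{\mathbf{p}}^{\,v}_{i,t}\in\mathbb{R}^{2},\ \hat{o}^{\,v}_{i,t}\in[0,1]\Big\}_{v\le V,\ i\le N,\ t\le T},
$$

where $\hat{\mathbf{p}}^{\,v}_{i,t}$ is the predicted pixel position of query point $i$ at frame $t$ in view $v$ and $\hat{o}^{\,v}_{i,t}$ its predicted visibility. The key difference from single-view tracking is that all views are predicted jointly rather than independently.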
Method
Our approach integrates a strong 2D tracking backbone with additional modules designed to leverage multi-view information. Specifically, we introduce a camera encoding module to inject geometric information and a cross-view attention module to aggregate complementary cues across viewpoints. This combination allows our model to achieve robust spatio-temporal consistency across multiple views.
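To make the two modules concrete, here is a minimal sketch of one plausible camera encoding: flattened intrinsics and extrinsics passed through a small MLP to produce a per-view embedding that is added to the track tokens before cross-view attention. The parameterization, shapes, and the name `CameraEncoding` are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CameraEncoding(nn.Module):
    """Illustrative sketch: embed each view's camera parameters and add the
    per-view embedding to that view's track tokens, so the subsequent
    cross-view attention knows which camera each token comes from."""

    def __init__(self, dim: int):
        super().__init__()
        # 9 intrinsic + 12 extrinsic values per camera (flattened K and [R|t]).
        self.mlp = nn.Sequential(nn.Linear(9 + 12, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens, intrinsics, extrinsics):
        # tokens: (B, V, T, N, C); intrinsics: (B, V, 3, 3); extrinsics: (B, V, 3, 4)
        B, V = intrinsics.shape[:2]
        cams = torch.cat([intrinsics.reshape(B, V, 9),
                          extrinsics.reshape(B, V, 12)], dim=-1)
        cam_emb = self.mlp(cams)                      # (B, V, C)
        return tokens + cam_emb[:, :, None, None, :]  # broadcast over T and N
```

Combined with a joint view-time attention like the sketch above, this is one way geometric information and complementary cross-view cues can be injected into a 2D tracking backbone.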
Qualitative Results
Multi-view results on DexYCB dataset
Multi-view results on Panoptic Studio dataset
Multi-view results on Harmony4D dataset
Comparison on DexYCB dataset
Comparison on Panoptic Studio dataset
Quantitative Results
Consistently strong across diverse multi-view point tracking scenarios
We compare our approach with recent state-of-the-art point trackers on the DexYCB, Panoptic Studio, Kubric, and Harmony4D datasets. Compared to the baselines, MV-TAP achieves superior performance, demonstrating its ability to leverage multi-view information.
| Method | Target | Space | Depth | DexYCB | | | Panoptic Studio | | | Harmony4D | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| Single-view input | | | | | | | | | | | | |
| TAPIR | 2D | Camera | | 29.6 | 43.9 | 66.4 | 22.1 | 39.3 | 60.0 | 27.6 | 53.1 | 60.0 |
| CoTracker2 | 2D | Camera | | 37.5 | 62.5 | 69.4 | 33.3 | 59.1 | 64.4 | 37.2 | 71.9 | 55.7 |
| LocoTrack | 2D | Camera | | 38.7 | 55.8 | 74.1 | 34.9 | 56.1 | 67.5 | 40.8 | 72.0 | 64.1 |
| CoTracker3 | 2D | Camera | | 41.5 | 59.6 | 76.4 | 39.6 | 61.4 | 72.3 | 41.4 | 73.5 | 63.2 |
| SpatialTracker | 3D | Camera | | 23.2 | 43.3 | 61.8 | 19.7 | 40.5 | 59.6 | 25.4 | 54.4 | 58.1 |
| Multi-view input | | | | | | | | | | | | |
| CoTracker3 w/ Flat. | 2D | Camera | | 2.7 | 7.1 | 35.7 | 1.0 | 12.7 | 38.8 | 2.1 | 20.7 | 46.4 |
| CoTracker3 w/ Tri. | 2D | Camera | | 39.2 | 57.1 | 76.4 | 37.9 | 59.5 | 72.3 | 39.2 | 70.4 | 63.2 |
| MVTracker | 3D | World | | - | 32.6 | - | - | 62.4 | - | - | 13.3 | - |
| MV-TAP (Ours) | 2D | Camera | | 44.2 | 61.9 | 78.3 | 40.3 | 62.8 | 73.1 | 42.6 | 74.9 | 65.8 |
'Target' denotes the dimension of the predicted trajectory. 'Space' specifies the coordinate domain: 'Camera' and 'World' denote pixel space and world space, respectively. 'Depth' indicates whether depth input is required.
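For reference, the reported metrics follow the usual TAP-Vid naming: $<\delta^{x}_{avg}$ is position accuracy averaged over pixel thresholds, OA is visibility-classification accuracy, and AJ (Average Jaccard) combines localization and visibility. Below is a minimal NumPy sketch of the first two, assuming the standard TAP-Vid definitions and thresholds of 1, 2, 4, 8, and 16 pixels, evaluated per view in camera space; the benchmark's exact evaluation code may differ.

```python
import numpy as np

def position_accuracy(pred_xy, gt_xy, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    """<delta^x_avg sketch: fraction of visible ground-truth points whose
    prediction lies within each pixel threshold, averaged over thresholds.
    pred_xy, gt_xy: (T, N, 2) pixel coordinates; gt_visible: (T, N) bool."""
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)   # (T, N) pixel error
    return float(np.mean([(err[gt_visible] < d).mean() for d in thresholds]))

def occlusion_accuracy(pred_visible, gt_visible):
    """OA sketch: binary accuracy of the predicted visibility flags."""
    return float((pred_visible == gt_visible).mean())
```

The $<\delta^{x}_{occ}$ metric reported in the occlusion study below is the same position accuracy restricted to in-frame points that are occluded in the ground truth.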
Ablation Studies
Ablation on the number of views
We compare MV-TAP with the baselines under a varying number of input views. While performance generally improves with more views, the baselines exhibit only marginal gains. In contrast, MV-TAP improves consistently and by a significantly larger margin, highlighting its superior ability to leverage multi-view information.
| Method | 2 views | | | 4 views | | | 6 views | | | 8 views | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 37.5 | 56.4 | 77.8 | 38.9 | 56.6 | 75.1 | 41.0 | 58.9 | 75.8 | 41.5 | 59.6 | 76.4 |
| CoTracker3 w/ Flat. | 9.5 | 19.5 | 54.3 | 4.4 | 8.4 | 44.5 | 3.2 | 8.3 | 39.1 | 2.7 | 7.1 | 35.7 |
| CoTracker3 w/ Tri. | 37.1 | 55.6 | 77.8 | 37.8 | 55.3 | 75.1 | 38.7 | 56.4 | 75.8 | 39.2 | 57.1 | 76.4 |
| MVTracker | - | 35.8 | - | - | 31.8 | - | - | 32.8 | - | - | 32.6 | - |
| MV-TAP (Ours) | 39.2 | 56.8 | 76.8 | 40.3 | 57.7 | 75.2 | 43.3 | 60.7 | 76.9 | 44.2 | 61.9 | 78.3 |
Can multi-view resolve occlusion ambiguity?
We also evaluate position accuracy on in-frame occluded points. Our model remains robust under occlusion, indicating that it effectively utilizes multi-view cues. $<\delta^{x}_{occ}$ denotes the position accuracy computed only on in-frame occluded points.
| Method | DexYCB | | Panoptic Studio | | Harmony4D | |
|---|---|---|---|---|---|---|
| | $<\delta^{x}_{avg}$ | $<\delta^{x}_{occ}$ | $<\delta^{x}_{avg}$ | $<\delta^{x}_{occ}$ | $<\delta^{x}_{avg}$ | $<\delta^{x}_{occ}$ |
| CoTracker3 | 59.6 | 33.9 | 61.4 | 46.2 | 73.5 | 58.4 |
| CoTracker3 w/ Flat. | 7.1 | 1.9 | 12.7 | 6.3 | 20.7 | 15.0 |
| CoTracker3 w/ Tri. | 57.1 | 34.8 | 59.5 | 47.7 | 70.4 | 59.1 |
| MVTracker | 32.6 | 16.0 | 62.4 | 61.2 | 13.3 | 8.9 |
| MV-TAP (Ours) | 61.9 | 38.4 | 62.8 | 48.7 | 74.9 | 60.3 |
Effect of additional training
Although MV-TAP and the CoTracker3 baseline are initialized from the same pretrained model, MV-TAP attains consistently higher performance across all metrics. This shows that its gains stem primarily from the architectural design rather than merely from extended training.
| Method | DexYCB | | | Panoptic Studio | | |
|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 41.8 | 59.0 | 73.8 | 39.6 | 61.6 | 71.8 |
| MV-TAP (Ours) | 44.2 | 61.9 | 78.3 | 40.3 | 62.8 | 73.1 |
Comparison on varying numbers of query points
We measure tracking performance under varying numbers of query points. Our model consistently outperforms the baselines across both sparse and dense settings.
| Method | 50 Points | | | 100 Points | | | 300 Points | | | 500 Points | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 42.0 | 59.9 | 74.6 | 41.9 | 60.1 | 77.2 | 41.5 | 59.6 | 76.4 | 41.5 | 59.8 | 76.7 |
| CoTracker3 w/ Flat. | 2.7 | 7.4 | 34.4 | 2.5 | 6.8 | 35.6 | 2.7 | 7.1 | 35.7 | 2.6 | 7.0 | 35.7 |
| CoTracker3 w/ Tri. | 39.4 | 53.4 | 74.6 | 39.4 | 57.1 | 77.2 | 39.2 | 57.1 | 76.4 | 39.5 | 57.3 | 76.7 |
| MVTracker | - | 34.2 | - | - | 32.9 | - | - | 32.6 | - | - | 34.9 | - |
| MV-TAP (Ours) | 44.3 | 62.0 | 77.5 | 44.7 | 62.5 | 78.3 | 44.2 | 61.9 | 78.3 | 44.3 | 62.1 | 78.7 |
Ablation on model architecture
We present an ablation study on the components introduced for multi-view awareness. Combining the cross-view attention and the camera encoding yields the best results on both datasets, indicating that the two modules are complementary in leveraging multi-view information.
| Method | DexYCB | | | Panoptic Studio | | |
|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 41.5 | 59.6 | 76.4 | 39.6 | 61.4 | 72.3 |
| + View attn. | 43.6 | 61.5 | 77.4 | 38.6 | 61.6 | 69.4 |
| + Cam embed. | 42.2 | 60.6 | 78.0 | 39.9 | 60.9 | 73.0 |
| MV-TAP (Ours) | 44.2 | 61.9 | 78.3 | 40.3 | 62.8 | 73.1 |
Comparison under frequently occluded trajectories
We evaluate methods on trajectories with high occlusion frequency, measured by the visibility-transition rate (one plausible reading of this measure is sketched after the table). MV-TAP leverages cross-view cues to remain robust on frequently occluded points, improving AJ, $<\delta^{x}_{avg}$, and OA.
| Method | DexYCB | | | Panoptic Studio | | |
|---|---|---|---|---|---|---|
| | AJ | $<\delta^{x}_{avg}$ | OA | AJ | $<\delta^{x}_{avg}$ | OA |
| CoTracker3 | 26.2 | 43.4 | 66.6 | 37.4 | 60.6 | 69.0 |
| CoTracker3 w/ Flat. | 0.5 | 1.8 | 41.2 | 0.8 | 13.4 | 40.6 |
| CoTracker3 w/ Tri. | 26.0 | 43.6 | 66.6 | 36.6 | 59.4 | 69.0 |
| MVTracker | - | 7.9 | - | - | 59.4 | - |
| MV-TAP (Ours) | 29.7 | 47.3 | 70.5 | 38.0 | 61.9 | 69.9 |
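The visibility-transition rate used above to select frequently occluded trajectories is not defined in detail on this page; purely as an illustration, one plausible reading counts how often a trajectory's ground-truth visibility flips between consecutive frames:

```python
import numpy as np

def visibility_transition_rate(visible):
    """Fraction of consecutive frame pairs where a point's visibility flips.
    visible: (T, N) boolean visibility per frame and point; returns (N,)."""
    flips = visible[1:] != visible[:-1]   # (T-1, N) visibility transitions
    return flips.mean(axis=0)

# A trajectory that toggles between visible and occluded often gets a high
# rate and would be counted among the frequently occluded trajectories.
vis = np.array([[1, 1], [0, 1], [1, 1], [0, 1]], dtype=bool)
print(visibility_transition_rate(vis))  # [1. 0.]
```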
Conclusion
This work establishes multi-view 2D point tracking as a new and important task for advancing reliable spatio-temporal correspondence in dynamic, real-world scenes. By introducing MV-TAP, a model that aggregates cross-view information through camera encoding and cross-view attention, we demonstrate how multi-view inputs can overcome key limitations of monocular trackers, such as occlusion and motion ambiguity. Together with a large-scale synthetic dataset and a real-world evaluation dataset specifically designed for this task, our contributions provide both a principled formulation of the problem and a strong baseline method, paving the way for future research in robust multi-view point tracking.
Citation
If you use this work or find it helpful, please consider citing:
@article{koo2025mvtap,
title={MV-TAP: Tracking Any Point in Multi-View Videos},
author={Koo, Jahyeok and Kim, In{\`e}s Hyeonsu and Kim, Mungyeom and Park, Junghyun and Park, Seohyun and Kim, Jaeyeong and Yi, Jung and Cho, Seokju and Kim, Seungryong},
journal={arXiv preprint arXiv:2512.02006},
year={2025}
}