KAIST AI × Sony AI

MVTrack4Gen
Multi-View Point Tracking as Geometric Supervision
for 4D Video Generation

JoungBin Lee¹, Jaewoo Jung¹, Jongmin Lee¹, Tongmin Kim¹, Hyunsung Kim¹,
Takuya Narihira², Kazumi Fukuda², Jahyeok Koo¹, Jisang Han¹,
Yuki Mitsufuji²,³,†, Seungryong Kim¹,†

¹KAIST AI  ·  ²Sony AI  ·  ³Sony Group Corporation

† Corresponding authors

TL;DR

MVTrack4Gen turns multi-view point tracking into a geometric & motion supervision signal for novel-view video diffusion. Our key finding: specific attention layers already encode intra-video temporal and inter-video cross-view correspondences. We route those features into an auxiliary tracking head and add a correspondence cross-entropy loss — yielding state-of-the-art geometric consistency with no 3D reconstruction at inference, on top of two backbones (ReCamMaster & ReDirector).

SOTA: cross-view geometric consistency (MEt3R)
No 3D Recon: no external depth or reconstruction
2 Backbones: ReCamMaster & ReDirector
DAVIS + iPhone: in-the-wild benchmarks

Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for novel-view video diffusion models. Our key finding is that attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves the model's ability to follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.

Teaser

Joint Novel-View Video Generation & Multi-View Point Tracking

MVTrack4Gen jointly generates a novel-view video and multi-view point tracks, given a monocular reference video with query points and a user-specified camera trajectory. Lifting both the reference and generated frames into 3D space using Depth Anything 3 shows that the target views and multi-view point tracks faithfully preserve the dynamic motion of the reference video while remaining geometrically consistent.
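
The lifting step described above can be sketched as a standard pinhole unprojection. This is a minimal illustration, not the project's visualization code: it assumes a per-frame z-depth map, a 3x3 intrinsic matrix `K`, and a 4x4 camera-to-world pose are already available (for example from a depth estimator), and the function name `unproject_depth` is hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H, W, 3).

    Assumes a pinhole camera with 3x3 intrinsics K, a 4x4 camera-to-world
    pose, and depth measured along the optical axis (z-depth).
    """
    H, W = depth.shape
    # Pixel centers in image coordinates; u runs along width, v along height.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous
    rays_cam = pix @ np.linalg.inv(K).T                # camera-frame rays with z = 1
    pts_cam = rays_cam * depth[..., None]              # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                           # rotate and translate to world
```

Applying the same function to the reference and generated frames, each with its own pose, places both point clouds in one shared coordinate frame so they can be compared at the same timestamp.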

Contributions

What's new

1

We reveal that specific attention layers in a novel-view video diffusion model simultaneously encode intra-video and inter-video correspondences, and strengthening these correspondences through auxiliary supervision improves novel-view generation quality.

2

We introduce a motion-aware training framework for novel-view video diffusion models, where the diffusion backbone is augmented with a multi-view point tracking head and jointly trained to estimate multi-view point tracks from its attention features — enabling explicit motion-correspondence learning across reference and target views.

3

We demonstrate that our framework generalizes across two camera-conditioning-only backbones (ReCamMaster & ReDirector), consistently improving visual quality, geometric and motion consistency, and camera-pose accuracy on DAVIS and iPhone — achieving state-of-the-art geometric consistency while maintaining competitive camera accuracy.

Analysis

Correspondence concentrates in specific attention layers

Camera-conditioning-only novel-view diffusion models exchange information across views through 3D attention. We probe where: at every diffusion layer and denoising timestep we measure how well attention matches reliable correspondences (cycle-consistency check & PCK), together with how strongly and how sharply attention concentrates on the correct match (a harmonic mean of matching accuracy, attention score, and confidence).
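
A rough sketch of such a probe for a single layer and timestep is given below. The exact definitions used in the paper are not stated on this page, so the following are assumptions: matching accuracy is PCK of the argmax match, the attention score is the attention mass within a small radius of the correct key, and confidence is the peak attention value. The function name `probe_attention` and the `pck_radius` parameter are hypothetical, and the ground-truth matches are assumed to be already filtered (for example by a cycle-consistency check).

```python
import numpy as np

def probe_attention(attn, gt_xy, grid_hw, pck_radius=1.0):
    """Score one attention map against known correspondences.

    attn:    (Nq, h*w) row-normalized attention from query to key tokens.
    gt_xy:   (Nq, 2) ground-truth (x, y) of each query's match on the key grid.
    grid_hw: (h, w) size of the key token grid.
    """
    h, w = grid_hw
    ys, xs = np.divmod(np.arange(h * w), w)              # token index -> (row, col)
    key_xy = np.stack([xs, ys], axis=-1).astype(float)   # (h*w, 2)

    # Matching accuracy: PCK of the hard (argmax) match.
    pred_xy = key_xy[attn.argmax(axis=1)]
    err = np.linalg.norm(pred_xy - gt_xy, axis=1)
    accuracy = float((err <= pck_radius).mean())

    # Attention score (assumed): mass placed near the correct key.
    dist = np.linalg.norm(key_xy[None] - gt_xy[:, None], axis=-1)   # (Nq, h*w)
    score = float((attn * (dist <= pck_radius)).sum(axis=1).mean())

    # Confidence (assumed): how peaked each row is.
    confidence = float(attn.max(axis=1).mean())

    vals = np.array([accuracy, score, confidence])
    hmean = 0.0 if np.any(vals == 0) else float(len(vals) / np.sum(1.0 / vals))
    return accuracy, score, confidence, hmean
```

The harmonic mean is low whenever any one of the three terms is low, so a layer ranks highly only if it matches correctly and does so with concentrated attention.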

Matching Accuracy and Harmonic Mean in ReCamMaster. We visualize matching accuracy (top) and the harmonic mean of matching accuracy, attention score, and confidence score (bottom) across diffusion layers and denoising timesteps. Results are shown for intra-video temporal correspondence in the reference and target views in (a) and (b), respectively, and for inter-video cross-view correspondence in (c). The rightmost column plots the top three layers with the highest cross-view correspondence scores. This figure shows that accurate cross-view and temporal matching emerge at specific intermediate layers.

Key findings. First, query-key matching within 3D attention blocks provides clear correspondence cues, capturing intra-video temporal correspondences within each view as well as inter-video cross-view correspondences between the generated and reference views. Second, temporal and cross-view correspondences become simultaneously prominent in specific intermediate diffusion layers, indicating which regions the model refers to when synthesizing each part of the generated frame. Third, in regions where dynamic objects exhibit geometric or motion inconsistencies, the attention maps at certain layers also show incorrect cross-view correspondences.

Method

How it works

We jointly train a camera-controlled video diffusion model (DiT) and a multi-view tracking module that shares the query and key features of the DiT's 3D attention layers. From these shared features the tracking module builds intra-video temporal correlation (for temporal consistency) and inter-video cross-view correlation (for geometric correspondence). We additionally supervise the 3D attention map directly with a cross-entropy multi-view correspondence loss.
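
One way to read the cross-entropy correspondence loss is as a classification problem over key tokens: for each query token with a known match, the attention row should put its probability on the matching key. The sketch below assumes scaled dot-product attention and a precomputed index of the correct key per query; the function name `correspondence_ce_loss` and the `valid` mask are hypothetical and not taken from the paper.

```python
import numpy as np

def correspondence_ce_loss(q, k, gt_index, valid=None):
    """Cross-entropy between attention rows and ground-truth matches.

    q: (Nq, C) query features; k: (Nk, C) key features.
    gt_index: (Nq,) index of the key that truly corresponds to each query.
    valid: optional (Nq,) boolean mask, e.g. for queries with a visible match.
    """
    logits = q @ k.T / np.sqrt(q.shape[1])                # scaled dot-product similarity
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_attn = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_attn[np.arange(len(gt_index)), gt_index]   # -log attention at the true key
    if valid is None:
        valid = np.ones(len(gt_index), dtype=bool)
    return float(nll[valid].mean())
```

With uninformative features the loss equals log(Nk), and it approaches zero as attention concentrates on the correct key.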

Main architecture. Features from selected DiT layers are upsampled by a lightweight MLP and correlated into cross-view cost volumes in the multi-view tracking (MVTAP) head; a multi-layer matching loss supervises the diffusion features to respect the 3D scene geometry.

1

Improved camera encoding

Condition on the extrinsics and intrinsics of both reference and target views, encoded as dense Plücker ray maps injected into every DiT layer.
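
A dense Plücker ray map can be computed per view as sketched below. This is an illustration under stated assumptions, not the authors' encoder: it assumes a pinhole camera, a camera-to-world pose, and the (moment, direction) ordering of the six channels; other orderings and normalizations are also in common use. The function name `plucker_ray_map` is hypothetical.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return a (height, width, 6) map of per-pixel Pluecker coordinates (o x d, d).

    K: 3x3 intrinsics; cam_to_world: 4x4 extrinsics (camera-to-world).
    """
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # homogeneous pixel centers
    dirs_cam = pix @ np.linalg.inv(K).T                   # back-project through K
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = dirs_cam @ R.T                                    # rotate into the world frame
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)     # unit ray directions
    o = np.broadcast_to(t, d.shape)                       # every ray starts at the camera center
    m = np.cross(o, d)                                    # moment vector o x d
    return np.concatenate([m, d], axis=-1)
```

Computing one such map for the reference view and one for the target view gives the model a per-pixel description of both cameras, which is what allows intrinsics as well as extrinsics to be conditioned on.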

2

Multi-view tracking head

A transformer head over multi-scale local 4D correlation volumes (from attention query/key similarity) predicts multi-view point tracks, visibility, and confidence.
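
To make the input of the head concrete, the sketch below builds a local 4D correlation volume between a neighborhood around a query point and a neighborhood around a candidate location, at several scales. It is a simplified illustration: dense feature maps stand in for the attention query and key features, scales are produced by average pooling, and the transformer that consumes the volumes is not shown. All function names and parameters here are hypothetical.

```python
import numpy as np

def avg_pool(feat, stride):
    """Average-pool an (H, W, C) map by an integer stride (crops any remainder)."""
    H, W, C = feat.shape
    H2, W2 = H // stride, W // stride
    return feat[:H2 * stride, :W2 * stride].reshape(H2, stride, W2, stride, C).mean(axis=(1, 3))

def local_4d_correlation(feat_q, feat_k, q_xy, k_xy, radius=1):
    """Correlate a patch around q_xy in feat_q with a patch around k_xy in feat_k.

    Returns an array of shape (P, P, P, P) with P = 2 * radius + 1, indexed by
    (query row, query col, key row, key col) offsets.
    """
    def crop(feat, xy):
        # Zero-pad so patches near the border keep a fixed size.
        padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)))
        x, y = xy
        return padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]

    pq, pk = crop(feat_q, q_xy), crop(feat_k, k_xy)
    scale = 1.0 / np.sqrt(feat_q.shape[-1])
    return np.einsum("abc,dec->abde", pq, pk) * scale

def multi_scale_correlation(feat_q, feat_k, q_xy, k_xy, radius=1, num_scales=2):
    """Stack local 4D correlations computed at strides 1, 2, 4, ..."""
    vols = []
    for s in range(num_scales):
        stride = 2 ** s
        fq, fk = avg_pool(feat_q, stride), avg_pool(feat_k, stride)
        q_s = (q_xy[0] // stride, q_xy[1] // stride)
        k_s = (k_xy[0] // stride, k_xy[1] // stride)
        vols.append(local_4d_correlation(fq, fk, q_s, k_s, radius))
    return vols
```

Using both feature maps from the same view at different frames gives a temporal correlation, while taking them from the reference and target views gives a cross-view correlation.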

3

Tracking + correspondence loss

Joint objective $\mathcal{L}_{\text{diff}} + 0.01\,\mathcal{L}_{\text{track}} + 0.01\,\mathcal{L}_{\text{corr}}$: the tracking loss enforces motion fidelity, and a cross-entropy loss on the attention map enforces geometric consistency.
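
The weighting can be written as a one-line combination. The sketch below only restates the objective with the two 0.01 weights given above; the function and argument names are hypothetical, and the three inputs are assumed to be scalar losses already computed for the current batch.

```python
def joint_loss(l_diff, l_track, l_corr, w_track=0.01, w_corr=0.01):
    """Weighted sum of the diffusion, tracking, and correspondence losses."""
    return l_diff + w_track * l_track + w_corr * l_corr

# Example with made-up loss values, for illustration only.
total = joint_loss(l_diff=0.5, l_track=2.0, l_corr=4.0)
```

With weights this small, the two auxiliary terms act as a regularizer on the attention features while the diffusion loss remains the dominant term, assuming the three losses are of comparable magnitude.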

Quantitative

Results on DAVIS

We evaluate MVTrack4Gen on DAVIS across visual quality (VBench), geometric consistency (MEt3R and MEt3R-dynamic), and camera accuracy. It improves both backbones on most metrics, with the largest gains in geometric consistency; the exceptions are small losses in subject consistency and motion smoothness on ReCamMaster and in rotation error on ReDirector. The signed value after each of our results is the change relative to the corresponding backbone. This is a condensed view of the results; the full table is provided in the paper.

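| Group | Method | Subj. Cons. ↑ | Imaging ↑ | Motion Smooth. ↑ | MEt3R ↓ | MEt3R-dyn ↓ | mRotErr ↓ | mTransErr ↓ | mCamMC ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Explicit 3D Lifting | GEN3C | 0.856 | 0.582 | 0.980 | 0.290 | 0.328 | 2.538 | 0.127 | 0.163 |
| Explicit 3D Lifting | TrajectoryCrafter | 0.847 | 0.550 | 0.970 | 0.291 | 0.306 | 10.126 | 0.190 | 0.355 |
| Explicit 3D Lifting | CogNVS | 0.811 | 0.530 | 0.978 | 0.333 | 0.346 | 10.439 | 0.228 | 0.400 |
| Explicit 3D Lifting | NeoVerse | 0.858 | 0.591 | 0.983 | 0.302 | 0.323 | 4.705 | 0.159 | 0.228 |
| Camera Conditioning Only | ReCamMaster | 0.904 | 0.652 | 0.985 | 0.337 | 0.369 | 3.660 | 0.113 | 0.169 |
| Camera Conditioning Only | ReDirector | 0.897 | 0.680 | 0.985 | 0.318 | 0.395 | 1.714 | 0.086 | 0.109 |
| Ours | MVTrack4Gen (ReCamMaster) | 0.892 (−.012) | 0.685 (+.033) | 0.984 (−.001) | 0.274 (−.063) | 0.287 (−.082) | 1.858 (−1.802) | 0.100 (−.013) | 0.125 (−.044) |
| Ours | MVTrack4Gen (ReDirector) | 0.905 (+.008) | 0.687 (+.007) | 0.986 (+.001) | 0.267 (−.051) | 0.349 (−.036) | 1.718 (+.004) | 0.073 (−.013) | 0.097 (−.012) |
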
Interactive · Depth Anything 3 lift

Geometric 4D point cloud

The reference video and the novel view generated by MVTrack4Gen (ReCamMaster backbone) are lifted with Depth Anything 3 into one shared 3D scene, frame by frame at matching timestamps. The dynamic-object point tracks can be lifted into the same scene and overlaid on both videos.

Comparison

vs. baselines on DAVIS

We compare against explicit 3D lifting methods (GEN3C, NeoVerse, TrajectoryCrafter) and a camera-conditioning-only method (ReDirector).

Camera control

Diverse camera trajectories

The same reference video rendered by MVTrack4Gen (ReCamMaster backbone) along diverse user-specified camera trajectories.

Key Finding

Emergent correspondence in attention

For the same query point on a generated frame, baselines attend to the wrong region in the reference frame, while MVTrack4Gen attends to the correct correspondence on the dynamic object — which is what keeps motion and geometry consistent across views.

Citation

BibTeX

@misc{lee2026mvtrack4genmultiviewpointtracking,
      title={MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation},
      author={JoungBin Lee and Jaewoo Jung and Jongmin Lee and Tongmin Kim and Hyunsung Kim and Takuya Narihira and Kazumi Fukuda and Jahyeok Koo and Jisang Han and Yuki Mitsufuji and Seungryong Kim},
      year={2026},
      eprint={2606.26087},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.26087},
}