KAIST AI × Sony AI

MVTrack4Gen
Multi-View Point Tracking as Geometric Supervision
for 4D Video Generation

JoungBin Lee¹, Jaewoo Jung¹, Jongmin Lee¹, Tongmin Kim¹, Hyunsung Kim¹,
Takuya Narihira², Kazumi Fukuda², Jahyeok Koo¹, Jisang Han¹,
Yuki Mitsufuji²,³,†, Seungryong Kim¹,†

¹KAIST AI · ²Sony AI · ³Sony Group Corporation

†Corresponding authors


TL;DR

MVTrack4Gen turns multi-view point tracking into a geometric & motion supervision signal for novel-view video diffusion. Our key finding: specific attention layers already encode intra-video temporal and inter-video cross-view correspondences. We route those features into an auxiliary tracking head and add a correspondence cross-entropy loss — yielding state-of-the-art geometric consistency with no 3D reconstruction at inference, on top of two backbones (ReCamMaster & ReDirector).

SOTA: cross-view geometric consistency (MEt3R)

No 3D Recon: no external depth or reconstruction

2 Backbones: ReCamMaster & ReDirector

DAVIS + iPhone: in-the-wild benchmarks

Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for novel-view video diffusion models. Our key finding is that attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves the model's ability to follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.

Teaser

Joint Novel-View Video Generation & Multi-View Point Tracking

MVTrack4Gen jointly generates a novel-view video and multi-view point tracks, given a monocular reference video with query points and a user-specified camera trajectory. Lifting both the reference and generated frames into 3D space using Depth Anything 3 shows that the target views and multi-view point tracks faithfully preserve the dynamic motion of the reference video while remaining geometrically consistent.


Contributions

What's new

We reveal that specific attention layers in a novel-view video diffusion model simultaneously encode intra-video and inter-video correspondences, and strengthening these correspondences through auxiliary supervision improves novel-view generation quality.

We introduce a motion-aware training framework for novel-view video diffusion models, where the diffusion backbone is augmented with a multi-view point tracking head and jointly trained to estimate multi-view point tracks from its attention features — enabling explicit motion-correspondence learning across reference and target views.

We demonstrate that our framework generalizes across two camera-conditioning-only backbones (ReCamMaster & ReDirector), consistently improving visual quality, geometric and motion consistency, and camera-pose accuracy on DAVIS and iPhone — achieving state-of-the-art geometric consistency while maintaining competitive camera accuracy.

Analysis

Correspondence concentrates in specific attention layers

Camera-conditioning-only novel-view diffusion models exchange information across views through 3D attention. We probe where: at every diffusion layer and denoising timestep we measure how well attention matches reliable correspondences (cycle-consistency check & PCK), together with how strongly and how sharply attention concentrates on the correct match (a harmonic mean of matching accuracy, attention score, and confidence).
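The probing described above can be sketched for a single layer and timestep as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the use of argmax matching, and the exact definitions of the attention score (mean attention mass on the ground-truth key) and confidence (mean peak attention) are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistent(attn_ab, attn_ba):
    """Mask of queries in A whose best match in B maps back to themselves."""
    fwd = attn_ab.argmax(axis=1)  # A -> B
    bwd = attn_ba.argmax(axis=1)  # B -> A
    return bwd[fwd] == np.arange(attn_ab.shape[0])

def probe_layer(q, k, gt_idx, key_xy, pck_radius=1.0):
    """Score one attention layer against known correspondences.

    q: (N, C) query features, k: (M, C) key features,
    gt_idx: (N,) index of the ground-truth key for each query,
    key_xy: (M, 2) pixel coordinates of the keys.
    All definitions below are assumptions for illustration.
    """
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (N, M), rows sum to 1
    rows = np.arange(q.shape[0])
    pred = attn.argmax(axis=1)
    # PCK-style accuracy: predicted key lies within pck_radius of the true key.
    err = np.linalg.norm(key_xy[pred] - key_xy[gt_idx], axis=1)
    accuracy = float((err <= pck_radius).mean())
    # How strongly attention lands on the correct match.
    score = float(attn[rows, gt_idx].mean())
    # How sharply attention peaks, regardless of correctness.
    confidence = float(attn.max(axis=1).mean())
    parts = np.array([accuracy, score, confidence])
    hmean = 0.0 if np.any(parts <= 0) else float(len(parts) / np.sum(1.0 / parts))
    return accuracy, score, confidence, hmean
```

The harmonic mean is low whenever any one of the three terms is low, so a layer only ranks highly if it is accurate, puts its attention mass on the right key, and is sharply peaked at the same time.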

Matching Accuracy and Harmonic Mean in ReCamMaster. We visualize matching accuracy (top) and the harmonic mean of matching accuracy, attention score, and confidence score (bottom) across diffusion layers and denoising timesteps. Results are shown for intra-video temporal correspondence in the reference and target views in (a) and (b), respectively, and for inter-video cross-view correspondence in (c). The rightmost column plots the top three layers with the highest cross-view correspondence scores. This figure shows that accurate cross-view and temporal matching emerges at specific intermediate layers.

Key findings. First, query-key matching within 3D attention blocks provides clear correspondence cues, capturing intra-video temporal correspondences within each view as well as inter-video cross-view correspondences between the generated and reference views. Second, temporal and cross-view correspondences become simultaneously prominent in specific intermediate diffusion layers, indicating which regions the model refers to when synthesizing each part of the generated frame. Third, in regions where dynamic objects exhibit geometric or motion inconsistencies, the attention maps at certain layers also show incorrect cross-view correspondences.

Method

How it works

We jointly train a camera-controlled video diffusion model (DiT) and a multi-view tracking module that shares the query and key features of the DiT's 3D attention layers. From these shared features the tracking module builds intra-video temporal correlation (for temporal consistency) and inter-video cross-view correlation (for geometric correspondence). We additionally supervise the 3D attention map directly with a cross-entropy multi-view correspondence loss.
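One way to picture how a single 3D attention map yields both kinds of correlation is to slice it into blocks. The sketch below assumes that reference and target tokens are concatenated into one sequence, ordered as reference first and target second; that layout, the function name `split_attention`, and the dictionary keys are assumptions for illustration only.

```python
import numpy as np

def split_attention(q, k, n_ref):
    """Slice a joint attention map into intra-video and inter-video blocks.

    q, k: (N, C) query/key features for the token sequence
          [reference tokens; target tokens] (assumed ordering).
    n_ref: number of reference tokens at the start of the sequence.
    """
    logits = q @ k.T / np.sqrt(q.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)  # softmax over all keys
    return {
        # Intra-video: tokens attend to other frames of the same video.
        "ref_to_ref": attn[:n_ref, :n_ref],
        "tgt_to_tgt": attn[n_ref:, n_ref:],
        # Inter-video: tokens attend across views.
        "ref_to_tgt": attn[:n_ref, n_ref:],
        "tgt_to_ref": attn[n_ref:, :n_ref],
    }
```

Under this layout, the `tgt_to_ref` block is the part that says which reference region each generated token draws on, which is where a cross-view correspondence loss would naturally apply.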

Main architecture. Features from selected DiT layers are upsampled by a lightweight MLP and correlated into cross-view cost volumes in the multi-view tracking (MVTAP) head; a multi-layer matching loss supervises the diffusion features to respect the 3D scene geometry.

Improved camera encoding

The model is conditioned on the extrinsics and intrinsics of both the reference and target views, encoded as dense Plücker ray maps and injected into every DiT layer.
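For readers unfamiliar with Plücker ray maps, the following is a minimal sketch of how one can be computed from a pinhole camera. It is not taken from the paper: the function name, the camera-to-world pose convention, the half-pixel offset, and the channel order (direction first, then moment) are all assumptions, and other codebases order the six channels differently.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Per-pixel Plücker coordinates for a pinhole camera.

    K: (3, 3) intrinsics; cam_to_world: (4, 4) pose (assumed convention).
    Returns (H, W, 6) holding [unit direction d, moment o x d] per pixel.
    """
    # Pixel centers; v indexes rows, u indexes columns.
    v, u = np.meshgrid(np.arange(height) + 0.5,
                       np.arange(width) + 0.5, indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    dirs_cam = pix @ np.linalg.inv(K).T                   # back-project
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T                                 # rotate to world
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(origin, dirs)                       # o x d
    return np.concatenate([dirs, moment], axis=-1)
```

Because each pixel carries its own ray, the map encodes intrinsics and extrinsics jointly and has the same spatial layout as the image, which is what makes it convenient to inject alongside spatial features.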

Multi-view tracking head

A transformer head over multi-scale local 4D correlation volumes (from attention query/key similarity) predicts multi-view point tracks, visibility, and confidence.
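A local 4D correlation volume can be sketched as the similarity between a small window around the query point in one feature map and a small window around the current track estimate in another. The code below is a single-scale illustration under simplifying assumptions (integer point positions, zero padding at borders, hypothetical function names); a multi-scale version would repeat it on downsampled feature maps, and a real head would likely sample at sub-pixel positions.

```python
import numpy as np

def local_patch(feat, center, radius):
    """Window of size (2r+1, 2r+1, C) around an integer (row, col) center.

    feat: (H, W, C). Positions outside the map are zero-padded.
    """
    padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)))
    r, c = center
    size = 2 * radius + 1
    # Original index r sits at padded index r + radius, so the window
    # [r - radius, r + radius] starts at padded index r.
    return padded[r:r + size, c:c + size]

def local_4d_correlation(feat_src, p_src, feat_tgt, p_tgt, radius=1):
    """Correlate every source-window position with every target-window position.

    Returns an array of shape (S, S, S, S) with S = 2 * radius + 1.
    """
    a = local_patch(feat_src, p_src, radius)  # (S, S, C)
    b = local_patch(feat_tgt, p_tgt, radius)  # (S, S, C)
    return np.einsum("ijc,klc->ijkl", a, b) / np.sqrt(feat_src.shape[-1])
```

Correlating window against window, rather than a single point against a window, gives the head context about how the neighborhood of the query point lines up with the neighborhood of the candidate match.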

Tracking + correspondence loss

The joint objective is L = L_diff + 0.01·L_track + 0.01·L_corr: the tracking loss enforces motion fidelity, and a cross-entropy loss on the attention map enforces geometric consistency.
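The weighting above can be written out as a short sketch. The 0.01 weights come from the text; the particular form of the correspondence term (mean negative log of the attention probability assigned to the ground-truth key) is an assumed, standard reading of "cross-entropy on the attention map", and the function names are hypothetical.

```python
import numpy as np

def correspondence_ce(attn, gt_idx, eps=1e-8):
    """Cross-entropy between attention rows and ground-truth matches.

    attn: (N, M) attention map whose rows sum to 1.
    gt_idx: (N,) index of the correct key for each query (assumed given
            by multi-view point tracks).
    """
    picked = attn[np.arange(attn.shape[0]), gt_idx]
    return float(-np.log(picked + eps).mean())

def joint_loss(l_diff, l_track, l_corr, w_track=0.01, w_corr=0.01):
    """Weighted sum of diffusion, tracking, and correspondence losses."""
    return l_diff + w_track * l_track + w_corr * l_corr
```

With uniform attention over M keys the correspondence term equals log M, and it falls toward zero as attention concentrates on the correct key, so minimizing it pushes the attention map toward the tracked correspondences.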

Quantitative

Results on DAVIS

We evaluate MVTrack4Gen on DAVIS across visual quality (VBench), geometric consistency (MEt3R and MEt3R-dynamic), and camera accuracy. It improves both backbones on most metrics, and geometric consistency improves for both. In our rows, the signed value after each score is the change relative to the corresponding backbone; whether it is a gain or a loss depends on the arrow in the column header. This is a condensed view of the results; the full table is provided in the paper.

Method	Subj. Cons. ↑	Imaging ↑	Motion Smooth. ↑	MEt3R ↓	MEt3R-dyn ↓	mRotErr ↓	mTransErr ↓	mCamMC ↓
Explicit 3D Lifting
GEN3C	0.856	0.582	0.980	0.290	0.328	2.538	0.127	0.163
TrajectoryCrafter	0.847	0.550	0.970	0.291	0.306	10.126	0.190	0.355
CogNVS	0.811	0.530	0.978	0.333	0.346	10.439	0.228	0.400
NeoVerse	0.858	0.591	0.983	0.302	0.323	4.705	0.159	0.228
Camera Conditioning Only
ReCamMaster	0.904	0.652	0.985	0.337	0.369	3.660	0.113	0.169
ReDirector	0.897	0.680	0.985	0.318	0.395	1.714	0.086	0.109
Ours
MVTrack4Gen_ReCamMaster	0.892 (−0.012)	0.685 (+0.033)	0.984 (−0.001)	0.274 (−0.063)	0.287 (−0.082)	1.858 (−1.802)	0.100 (−0.013)	0.125 (−0.044)
MVTrack4Gen_ReDirector	0.905 (+0.008)	0.687 (+0.007)	0.986 (+0.001)	0.267 (−0.051)	0.349 (−0.036)	1.718 (+0.004)	0.073 (−0.013)	0.097 (−0.012)

Interactive · Depth Anything 3 lift

Geometric 4D point cloud

The reference and the generated novel view are lifted with Depth Anything 3 into one shared 3D scene at the same timestamp, per frame and synced to the videos. The interactive viewer can toggle the generated cloud, per-view color, the camera frustums, and point tracks that lift the dynamic-object tracks into 3D (and onto both videos).

[Interactive viewer: Reference Video and MVTrack4Gen_ReCamMaster generated video, with the joint reference + generated point cloud.]

Comparison

vs. baselines on DAVIS

We compare against Explicit 3D Lifting (GEN3C, NeoVerse, TrajectoryCrafter) and Camera-Conditioning-Only (ReDirector) methods.

Camera control

Diverse camera trajectories

The same reference video is rendered by MVTrack4Gen_ReCamMaster along diverse user-specified camera trajectories, shown together.

Key Finding

Emergent correspondence in attention

For the same query point on a generated frame, baselines attend to the wrong region in the reference frame, while MVTrack4Gen attends to the correct correspondence on the dynamic object — which is what keeps motion and geometry consistent across views.

Citation

BibTeX

@misc{lee2026mvtrack4genmultiviewpointtracking,
      title={MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation},
      author={JoungBin Lee and Jaewoo Jung and Jongmin Lee and Tongmin Kim and Hyunsung Kim and Takuya Narihira and Kazumi Fukuda and Jahyeok Koo and Jisang Han and Yuki Mitsufuji and Seungryong Kim},
      year={2026},
      eprint={2606.26087},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.26087},
}
