Emergent Temporal Correspondences from Video Diffusion Transformers


1KAIST AI    2Korea University    3Google DeepMind
* Equal contribution. Co-corresponding author.

TL;DR: Videos generated by video diffusion transformers (DiTs) exhibit emergent temporal correspondences.

Overview of DiffTrack: DiffTrack reveals how Video Diffusion Transformers (DiTs) establish temporal correspondences during video generation. Given a prompt and starting points in the first frame, DiffTrack tracks how individual points align across subsequent frames in video DiTs, enabling the extraction of coherent motion trajectories from both generated and real-world videos in a zero-shot manner. As practical applications, DiffTrack also enables 1) zero-shot point tracking and 2) motion-enhanced video generation with a novel guidance method.

Analysis of Temporal Matching in CogVideoX-5B

  • Representation selection: Query-key matching achieves higher accuracy than intermediate feature matching, providing better representations for temporal matching.
  • Layer-wise analysis: The harmonic mean of query-key matching across layers and timesteps reveals that temporal correspondence is primarily governed by a limited set of layers.
  • Noise-level analysis: Temporal matching improves as noise decreases; early timesteps rely more on text embeddings and self-frame attention, while later timesteps shift toward cross-frame attention for enhanced coherence.

DiffTrack for Zero-Shot Point Tracking

Quantitative comparison on the TAP-Vid datasets: Video DiTs combined with DiffTrack outperform all vision foundation models trained on single images and self-supervised models trained on two-view images or videos for zero-shot tracking.

[Qualitative comparison videos: DINOv2 | VFS | DiffTrack (CogVideoX-5B)]

Qualitative comparison with prior vision foundation models: DiffTrack on CogVideoX-5B produces smoother and more accurate trajectories than DINOv2 and VFS, which struggle with temporal dynamics and often yield inconsistent tracks.


DiffTrack for Motion-Enhanced Video Generation

[Side-by-side videos: CogVideoX-5B | Motion-Enhanced CogVideoX-5B]

Prompt: "... He takes a slow, appreciative sip, his eyes closing momentarily as he savors the complex flavors. ..."

Prompt: "A determined individual, ...ascends a thick, rugged rope hanging from a towering rock face. ..."

Prompt:"A motorcycle cruising along a coastal highway."

Qualitative comparison between baseline and motion-enhanced videos: CAG (Cross-Attention Guidance) enhances temporal matching and corrects motion inconsistencies in the synthesized videos.

Analysis Framework: DiffTrack

Evaluation Dataset Curation

We collect two distinct datasets: (a) an object dataset for dynamic object-centric videos, and (b) a scene dataset for static scenes with camera motion. Each dataset includes 50 text prompts, with 50 videos generated per prompt using CogVideoX-2B. To assess temporal matching, we predefine starting points in the first frame and obtain pseudo ground-truth trajectories using an off-the-shelf tracking method, CoTracker.

Temporal Correspondence Estimation

We evaluate temporal matching by extracting pointwise correspondences across video frames using latent feature descriptors from a video diffusion transformer (DiT).


Matching Cost Computation:
At each timestep \( t \) and layer \( l \), we compute the matching cost between the descriptors of the first frame and the \( j \)-th frame (\( j \in \{2, \dots, 1+f\} \)):

\[ \mathbf{C}^{1,j}_{t,l} = \mathtt{Softmax} \left( \frac{\mathbf{D}_{t,l}^1 (\mathbf{D}_{t,l}^j)^T}{\sqrt{d}} \right) \]

where \( \mathbf{D}_{t,l}^1 \) and \( \mathbf{D}_{t,l}^j \) are latent descriptors, and \( d \) is the channel dimension.
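For concreteness, a minimal PyTorch sketch of this step is given below. The tensor names `d_first` and `d_j` and the helper `matching_cost` are illustrative assumptions, not the released implementation; with query-key matching, `d_first` would hold query tokens and `d_j` key tokens.

```python
import torch


def matching_cost(d_first: torch.Tensor, d_j: torch.Tensor) -> torch.Tensor:
    """Softmax-normalized matching cost C^{1,j}_{t,l} between two frames.

    d_first: (N, d) latent descriptors of the first frame at timestep t, layer l.
    d_j:     (N, d) latent descriptors of the j-th frame.
    Returns: (N, N) cost matrix; row i is a distribution over frame-j tokens.
    """
    d = d_first.shape[-1]
    logits = d_first @ d_j.transpose(-1, -2) / d ** 0.5  # scaled dot product
    return logits.softmax(dim=-1)                        # normalize over frame j
```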


Correspondence Estimation:
Matched points are obtained by selecting the highest scoring location for each query:

\[ \mathbf{p}^j_{t,l} = \underset{\mathbf{x} \in \Omega}{\mathtt{Argmax}} \ \mathbf{C}^{1,j}_{t,l}(\mathbf{p}^1, \mathbf{x}) \]

where \( \mathbf{p}^1 \) is the point in the first frame and \( \Omega \) is the spatial domain.
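A hedged sketch of the argmax step, continuing the assumed shapes from the previous snippet; unflattening token indices to (x, y) assumes a row-major latent layout, which may differ from the actual tokenization.

```python
import torch


def estimate_correspondence(cost: torch.Tensor, p1_idx: torch.Tensor, width: int) -> torch.Tensor:
    """Pick, for each query point p^1, the highest-scoring location in frame j.

    cost:   (N, N) matching cost C^{1,j}_{t,l} from the previous step.
    p1_idx: (K,) flattened token indices of the query points in the first frame.
    width:  number of tokens per row of the latent grid (row-major assumption).
    Returns: (K, 2) matched latent-grid coordinates p^j_{t,l} as (x, y).
    """
    best = cost[p1_idx].argmax(dim=-1)  # argmax over the spatial domain Omega
    return torch.stack((best % width, best // width), dim=-1)
```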


Trajectory Reconstruction:
The temporal motion track is formed by concatenating matched points and interpolating back to RGB space:

\[ \hat{\mathbf{T}}_{t,l} = \mathtt{Interp}\left(\mathtt{Concat}(\mathbf{p}^1, \mathbf{p}^2_{t,l}, \dots, \mathbf{p}^{1+f}_{t,l})\right) \]
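The trajectory step can be sketched as follows; using a uniform `scale` factor as a stand-in for Interp(.) is a simplifying assumption, since the actual interpolation back to RGB resolution may be more involved.

```python
import torch


def reconstruct_trajectory(points_per_frame: list[torch.Tensor], scale: float) -> torch.Tensor:
    """Concatenate per-frame matches into a track and map it to RGB coordinates.

    points_per_frame: [p^1, p^2_{t,l}, ..., p^{1+f}_{t,l}], each of shape (K, 2)
                      on the latent grid.
    scale:            latent-to-pixel upsampling factor, standing in for Interp(.).
    Returns: (1+f, K, 2) trajectory in pixel space.
    """
    track = torch.stack(points_per_frame, dim=0).float()  # Concat along the frame axis
    return track * scale                                   # coarse mapping to RGB space
```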

Evaluation Metrics

To evaluate temporal consistency in video generation, we propose three complementary metrics:

  • Matching Accuracy: Measures how precisely points are aligned across frames.
  • Confidence Score: Measures how strongly each point attends to its match in the attention map.
  • Attention Score: Measures how strongly cross-frame interactions influence the generation process.

These metrics capture different aspects of matching: high accuracy does not guarantee influence on generation if the attention score is low, and high confidence does not ensure correct matching. To assess all three jointly, we compute the harmonic mean of the normalized scores, highlighting layers and timesteps where accuracy, confidence, and attention are all high.
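As a sketch, the combined score might be computed as below, assuming each metric has already been normalized to [0, 1] per (timestep, layer); the function name and epsilon handling are illustrative choices.

```python
import torch


def combined_score(acc: torch.Tensor, conf: torch.Tensor, attn: torch.Tensor,
                   eps: float = 1e-8) -> torch.Tensor:
    """Harmonic mean of matching accuracy, confidence, and attention scores.

    All inputs are assumed normalized to [0, 1] per (timestep, layer).
    The harmonic mean is high only where all three scores are high, which is
    what isolates the layers and timesteps that drive temporal matching.
    """
    return 3.0 / (1.0 / (acc + eps) + 1.0 / (conf + eps) + 1.0 / (attn + eps))
```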

BibTeX

@misc{nam2025emergenttemporalcorrespondencesvideo,
      title={Emergent Temporal Correspondences from Video Diffusion Transformers},
      author={Jisu Nam and Soowon Son and Dahyun Chung and Jiyoung Kim and Siyoon Jin and Junhwa Hur and Seungryong Kim},
      year={2025},
      eprint={2506.17220},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.17220},
}