TL;DR: Videos generated by a video diffusion transformer (DiT) exhibit emergent temporal correspondences.
Overview of DiffTrack: DiffTrack reveals how Video Diffusion Transformers (DiTs) establish temporal correspondences during video generation. Given a prompt and starting points in the first frame, DiffTrack tracks how individual points align across subsequent frames in video DiTs, enabling the extraction of coherent motion trajectories from both generated and real-world videos in a zero-shot manner. As practical applications, DiffTrack also enables 1) zero-shot point tracking and 2) motion-enhanced video generation with a novel guidance method.
Prompt: "... He takes a slow, appreciative sip, his eyes closing momentarily as he savors the complex flavors. ..."
Prompt: "A determined individual, ...ascends a thick, rugged rope hanging from a towering rock face. ..."
Prompt:"A motorcycle cruising along a coastal highway."
We evaluate temporal matching by extracting pointwise correspondences across video frames using latent feature descriptors from a video diffusion transformer (DiT).
Matching Cost Computation:
At each timestep \( t \) and layer \( l \), we compute the matching cost between the descriptors of the first frame and the \( j \)-th frame (\( j \in \{2, \dots, 1+f\} \)):
\[ \mathbf{C}^{1,j}_{t,l} = \mathtt{Softmax} \left( \frac{\mathbf{D}_{t,l}^1 (\mathbf{D}_{t,l}^j)^T}{\sqrt{d}} \right) \]
where \( \mathbf{D}_{t,l}^1 \) and \( \mathbf{D}_{t,l}^j \) are latent descriptors, and \( d \) is the channel dimension.
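As a minimal PyTorch sketch of this step (assuming descriptors have already been extracted and flattened to shape (N, d); the function and tensor names are illustrative, not from the released code):

import torch

def matching_cost(desc_first: torch.Tensor, desc_j: torch.Tensor) -> torch.Tensor:
    """Softmax-normalized matching cost between frame 1 and frame j.

    desc_first: (N, d) latent descriptors of the first frame
    desc_j:     (N, d) latent descriptors of the j-th frame
    returns:    (N, N) cost matrix; row i is a distribution over locations in frame j
    """
    d = desc_first.shape[-1]
    logits = desc_first @ desc_j.transpose(-1, -2) / d ** 0.5  # scaled dot product
    return torch.softmax(logits, dim=-1)                       # row-wise softmax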
Correspondence Estimation:
Matched points are obtained by selecting the highest scoring location for each query:
\[ \mathbf{p}^j_{t,l} = \underset{\mathbf{x} \in \Omega}{\mathtt{Argmax}} \ \mathbf{C}^{1,j}_{t,l}(\mathbf{p}^1, \mathbf{x}) \]
where \( \mathbf{p}^1 \) is the point in the first frame and \( \Omega \) is the spatial domain.
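Continuing the sketch, the argmax reduces to a row lookup in the cost matrix; the flattened-index convention and the grid_w argument are assumptions made for illustration:

import torch

def match_point(cost: torch.Tensor, query_idx: int, grid_w: int) -> tuple[int, int]:
    """Select the best-matching location in frame j for one query point p^1.

    cost:      (N, N) matching cost between frame 1 and frame j
    query_idx: flattened index of p^1 on the first frame's latent grid
    grid_w:    width of the latent spatial grid, used to unflatten the argmax
    """
    flat = torch.argmax(cost[query_idx]).item()  # highest-scoring location in Omega
    return flat // grid_w, flat % grid_w         # (row, col) in latent coordinates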
Trajectory Reconstruction:
The temporal motion track is formed by concatenating matched points and interpolating back to RGB space:
\[ \hat{\mathbf{T}}_{t,l} = \mathtt{Interp}\left(\mathtt{Concat}(\mathbf{p}^1, \mathbf{p}^2_{t,l}, \dots, \mathbf{p}^{1+f}_{t,l})\right) \]
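A hedged sketch of the trajectory assembly, where a uniform latent-to-pixel scaling stands in for the interpolation back to RGB space; the scale argument (e.g. the VAE spatial stride) is an assumption:

import torch

def build_trajectory(p1: tuple[int, int],
                     matches: list[tuple[int, int]],
                     scale: float) -> torch.Tensor:
    """Concatenate matched points and map them back to pixel coordinates.

    p1:      query point in the first frame (latent coordinates)
    matches: matched points for frames j = 2, ..., 1+f
    scale:   latent-to-pixel upscaling factor
    returns: (1+f, 2) motion track in RGB pixel space
    """
    track = torch.tensor([p1, *matches], dtype=torch.float32)  # Concat step
    return track * scale                                       # simple stand-in for Interp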
To evaluate temporal consistency in video generation, we propose three complementary metrics that measure how accurately points are matched, how confident the matching is, and how strongly the matching contributes to generation through attention.
These metrics capture different aspects of matching: high accuracy does not guarantee influence on generation if the attention score is low, and high confidence does not ensure correct matching. To jointly assess all three, we compute the harmonic mean of the normalized scores, highlighting regions where accuracy, confidence, and attention are all high.
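As a small illustration of this combination step (assuming the three scores are already normalized to [0, 1]; the argument names simply follow the text above):

def harmonic_score(accuracy: float, confidence: float, attention: float,
                   eps: float = 1e-8) -> float:
    """Harmonic mean of the three normalized scores.

    The harmonic mean is high only when every input is high, so it highlights
    regions that are simultaneously accurate, confident, and influential via attention.
    """
    scores = (accuracy, confidence, attention)
    return len(scores) / sum(1.0 / (s + eps) for s in scores)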
@misc{nam2025emergenttemporalcorrespondencesvideo,
  title={Emergent Temporal Correspondences from Video Diffusion Transformers},
  author={Jisu Nam and Soowon Son and Dahyun Chung and Jiyoung Kim and Siyoon Jin and Junhwa Hur and Seungryong Kim},
  year={2025},
  eprint={2506.17220},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.17220},
}