DiTracker: Repurposing Video Diffusion Transformers for Robust Point Tracking

arXiv 2025

¹KAIST AI   ²Google DeepMind
†: Co-corresponding author
TL;DR

DiTracker repurposes video DiTs for point tracking with softmax-based matching, LoRA adaptation, and cost fusion, achieving stronger robustness and faster convergence.


DiTracker teaser graph

DiTracker leverages pre-trained video Diffusion Transformer (DiT) features to outperform state-of-the-art methods such as CoTracker3 in challenging real-world scenarios involving complex motion and frequent occlusions, while reaching comparable final performance with 10x faster convergence, substantially reducing training cost. These results demonstrate that pre-trained video DiT features constitute an effective and efficient foundation for robust point tracking.

Abstract

Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8x smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
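Of the three components above, the LoRA adaptation is the most standard to sketch. Below is a minimal, hypothetical illustration (not the released DiTracker code) of wrapping a frozen linear projection, e.g. a DiT attention query/key projection, with trainable low-rank adapters; all names and hyperparameters here are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # keep the pre-trained weights frozen
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: d_in -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: r -> d_out
        nn.init.zeros_(self.up.weight)            # start as an identity-preserving update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Toy usage: adapt a stand-in 3072-dim attention projection.
proj_lora = LoRALinear(nn.Linear(3072, 3072), rank=8)
x = torch.randn(2, 16, 3072)                      # (batch, tokens, dim)
print(proj_lora(x).shape)                         # torch.Size([2, 16, 3072])
```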

Analysis

To investigate whether pre-trained video DiT features can address key limitations in point tracking, we compare multiple pre-trained video DiT models (HunyuanVideo, CogVideoX-2B, CogVideoX-5B, Wan-14B) against the supervised ResNet backbone from CoTracker3 and other vision foundation models (DINOv2, DINOv3, V-JEPA2).

| Method | < δ⁰ | < δ¹ | < δ² | < δ³ | < δ⁴ | δˣ_avg |
|---|---|---|---|---|---|---|
| ResNet (CoTracker3) | 10.5 | 34.0 | 49.7 | 57.7 | 66.6 | 43.7 |
| V-JEPA2 | 2.8 | 10.5 | 30.5 | 55.4 | 69.9 | 33.8 |
| DINOv2-B/14 | 2.8 | 10.9 | 35.6 | 67.2 | 83.5 | 40.0 |
| DINOv3-B/16 | 3.0 | 11.8 | 37.7 | 68.2 | 84.7 | 41.1 |
| HunyuanVideo | 4.4 | 18.2 | 44.8 | 70.1 | 82.8 | 44.1 |
| CogVideoX-2B | 4.8 | 19.4 | 49.2 | 73.6 | 86.3 | 46.3 |
| CogVideoX-5B | 5.2 | 20.5 | 50.7 | 73.9 | 84.3 | 46.9 |
| Wan-14B | 12.4 | 31.9 | 56.7 | 72.1 | 82.7 | 51.2 |

Zero-shot point tracking performance comparison on TAP-Vid-DAVIS benchmark. Video DiTs consistently provide superior initial matching despite having no tracking-specific training.
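The zero-shot protocol is not spelled out on this page; a common way to probe the matching quality of frozen features is to compare the query point's feature against every location of a target frame and take the argmax of the cosine similarity as the predicted correspondence. The sketch below follows that assumption, with purely illustrative names and shapes.

```python
import torch
import torch.nn.functional as F

def zero_shot_match(query_feat: torch.Tensor,   # (C, H, W) features of the query frame
                    target_feat: torch.Tensor,  # (C, H, W) features of a target frame
                    query_xy: torch.Tensor      # (2,) query point (x, y) in feature-map coords
                    ) -> torch.Tensor:
    """Predict the corresponding point by nearest-neighbour feature matching."""
    C, H, W = query_feat.shape
    # Bilinearly sample the feature vector at the query location.
    grid = query_xy.view(1, 1, 1, 2).clone()
    grid[..., 0] = grid[..., 0] / (W - 1) * 2 - 1        # normalise x to [-1, 1]
    grid[..., 1] = grid[..., 1] / (H - 1) * 2 - 1        # normalise y to [-1, 1]
    q = F.grid_sample(query_feat[None], grid, align_corners=True).view(C)
    # Cosine similarity against every target location, then take the argmax.
    sim = F.cosine_similarity(q.view(C, 1), target_feat.view(C, -1), dim=0)  # (H*W,)
    idx = sim.argmax()
    return torch.stack([idx % W, idx // W]).float()      # predicted (x, y)

# Toy usage with random features standing in for backbone outputs.
pred = zero_shot_match(torch.randn(64, 32, 32), torch.randn(64, 32, 32),
                       torch.tensor([10.0, 12.0]))
print(pred)
```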

| Method | Blur L1 | Blur L2 | Blur L3 | Blur L4 | Blur L5 | Motion [0%, 0.5%) | Motion [0.5%, 1.5%) | Motion [1.5%, 5%) | Reapp. [0, 1) | Reapp. [1, 3) | Reapp. [3, 100) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet (CoTracker3) | 42.4 | 40.0 | 36.2 | 31.0 | 27.9 | 24.7 | 16.7 | 9.7 | 22.7 | 17.1 | 10.6 |
| V-JEPA2 | 29.0 | 27.3 | 24.3 | 20.3 | 19.0 | 21.8 | 23.2 | 17.4 | 24.4 | 20.9 | 17.0 |
| DINOv2 | 39.0 | 37.6 | 34.7 | 32.4 | 29.9 | 24.3 | 23.5 | 19.3 | 23.9 | 23.6 | 19.9 |
| DINOv3 | 40.0 | 38.5 | 36.5 | 34.0 | 31.8 | 30.1 | 24.4 | 20.8 | 29.1 | 25.8 | 20.1 |
| HunyuanVideo | 41.7 | 40.6 | 38.9 | 36.4 | 34.5 | 42.7 | 30.7 | 16.9 | 40.6 | 28.7 | 19.4 |
| CogVideoX-2B | 45.0 | 43.6 | 41.6 | 39.0 | 36.7 | 45.4 | 33.7 | 19.2 | 43.1 | 31.7 | 21.5 |
| CogVideoX-5B | 45.5 | 44.2 | 42.2 | 39.5 | 37.7 | 44.9 | 37.4 | 22.2 | 44.5 | 34.1 | 23.9 |
| Wan-14B | 50.7 | 49.6 | 47.6 | 44.6 | 42.5 | 46.3 | 35.3 | 21.2 | 43.4 | 34.3 | 22.9 |

Analysis of challenging real-world scenarios from the ITTO-MOSE benchmark: (1) motion blur (severity levels 1-5), (2) dynamic motion (binned by track motion), and (3) frequent occlusions (binned by reappearance frequency). The consistent superiority of all video DiTs validates that large-scale pre-training with full 3D attention provides fundamentally stronger motion priors for challenging correspondence tasks.

DiTracker

DiTracker method overview

Overall Architecture of DiTracker. For long video sequences, input frames are divided into \( N \) temporal chunks with the global first frame prepended. Individual video frames are encoded via a VAE and processed by a video DiT to extract query features \( q_i \) and key features \( k_j \).

\[ \mathcal{C}^{\mathrm{DiT}}_{i,j} = \mathrm{Softmax}\!\left( \frac{q_i k_j^\top}{\sqrt{d}} \right) \]
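Read as code, this matching cost is the row-wise softmax of the scaled dot product between the query-frame and target-frame token features. A minimal sketch with illustrative shapes:

```python
import torch

def dit_matching_cost(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Softmax-normalised matching cost between DiT query features q_i of frame i
    (shape (N_i, d)) and key features k_j of frame j (shape (N_j, d))."""
    d = q.shape[-1]
    logits = q @ k.transpose(-1, -2) / d ** 0.5   # (N_i, N_j) scaled dot products
    return logits.softmax(dim=-1)                 # each row is a distribution over frame j

cost = dit_matching_cost(torch.randn(1024, 128), torch.randn(1024, 128))
print(cost.shape, cost.sum(-1)[:3])               # rows sum to 1
```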

The DiT local cost is fused with the ResNet local cost \( \mathcal{C}^{\mathrm{ResNet}}_{i,j} \). Finally, a tracking head refines trajectories over \( T \) iterations, updating displacement \( \Delta P \), visibility \( \Delta V \), and confidence \( \Delta C \).
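The fusion operator and the internals of the tracking head are not detailed on this page, so the following is only a schematic sketch under assumed shapes: the DiT and ResNet local costs are stacked and mixed by a small learned layer, and an iterative head then predicts residual updates to position, visibility, and confidence. The fusion layer and toy head below are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CostFusion(nn.Module):
    """Assumed fusion: stack the two local cost maps and mix them with a 1x1 convolution."""
    def __init__(self):
        super().__init__()
        self.mix = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, cost_dit: torch.Tensor, cost_resnet: torch.Tensor) -> torch.Tensor:
        # cost_*: (N, H, W) local cost maps around each tracked point
        stacked = torch.stack([cost_dit, cost_resnet], dim=1)  # (N, 2, H, W)
        return self.mix(stacked).squeeze(1)                    # (N, H, W) fused cost

class ToyTrackingHead(nn.Module):
    """Stand-in head: maps the fused cost to residual updates (dP, dV, dC)."""
    def __init__(self, hw: int):
        super().__init__()
        self.fc = nn.Linear(hw, 4)   # 2 dims for position, 1 visibility, 1 confidence

    def forward(self, fused_cost: torch.Tensor):
        out = self.fc(fused_cost.flatten(1))          # (N, 4)
        return out[:, :2], out[:, 2], out[:, 3]

# Iterative refinement over T iterations, accumulating dP, dV, dC as in the figure.
fusion, head = CostFusion(), ToyTrackingHead(7 * 7)
P, V, C = torch.zeros(5, 2), torch.zeros(5), torch.zeros(5)    # 5 tracked points
fused = fusion(torch.rand(5, 7, 7), torch.rand(5, 7, 7))
for _ in range(4):                                             # T refinement iterations
    dP, dV, dC = head(fused)
    P, V, C = P + dP, V + dV, C + dC
print(P.shape, V.shape, C.shape)
```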

Quantitative Results

Qualitative Results

ITTO-MOSE

TAP-Vid-DAVIS

TAP-Vid-DAVIS with Corruptions

Citation


      @misc{son2025repurposingvideodiffusiontransformers,
            title={Repurposing Video Diffusion Transformers for Robust Point Tracking}, 
            author={Soowon Son and Honggyu An and Chaehyun Kim and Hyunah Ko and Jisu Nam and Dahyun Chung and Siyoon Jin and Jung Yi and Jaewon Min and Junhwa Hur and Seungryong Kim},
            year={2025},
            eprint={2512.20606},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2512.20606}, 
      }