


Human motion, with its inherent complexities such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other people, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite this suitability, acquiring extensive point tracking training data from human motion remains difficult due to laborious manual annotation.
We address this with AnthroTAP, an automated pipeline that generates pseudo-labeled training data by leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to humans detected in video frames, project the resulting 3D mesh vertices onto the 2D image plane to form pseudo-trajectories, handle occlusions using ray casting, and filter out unreliable tracks based on optical flow consistency (see the sketch below).
A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing models trained on real videos while using \( 10,000\times \) less data and only 1 day on 4 GPUs, compared to the 256 GPUs used by recent state-of-the-art methods.
We also provide a comparative analysis of prevalent point tracking training datasets, focusing on trajectory complexity and diversity.
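To make the pipeline concrete, here is a minimal sketch of the projection, ray-casting occlusion, and flow-consistency steps, assuming SMPL fitting, mesh depth rendering, and optical flow estimation happen upstream. All helper signatures and thresholds are illustrative assumptions, not the released implementation.

```python
import numpy as np

def project_vertices(verts_3d, K):
    """Project SMPL mesh vertices (N, 3), given in camera coordinates,
    onto the 2D image plane with intrinsics K (3, 3). Returns (N, 2)."""
    proj = (K @ verts_3d.T).T            # (N, 3) homogeneous image coords
    return proj[:, :2] / proj[:, 2:3]    # perspective divide

def visibility_by_raycast(verts_3d, depth_map, K, eps=0.02):
    """Approximate ray casting: a vertex is visible if its depth roughly
    matches the rendered mesh depth along its camera ray."""
    uv = project_vertices(verts_3d, K).round().astype(int)
    h, w = depth_map.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    vis = np.zeros(len(verts_3d), dtype=bool)
    idx = np.where(inside)[0]
    surface = depth_map[uv[idx, 1], uv[idx, 0]]   # depth of the visible surface
    vis[idx] = np.abs(verts_3d[idx, 2] - surface) < eps
    return vis

def flow_consistent(track_t, track_t1, flow_t, tol=3.0):
    """Keep a track if chaining optical flow from frame t lands near the
    SMPL-projected location at frame t+1 (tol in pixels, assumed)."""
    u, v = track_t.round().astype(int)
    pred = track_t + flow_t[v, u]        # flow_t: (H, W, 2) forward flow
    return np.linalg.norm(pred - track_t1) < tol
```

Per frame, vertices that pass both the visibility and flow-consistency checks would be kept as pseudo ground-truth trajectories with visibility flags; the exact thresholds and filtering rules in the paper may differ.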
Qualitative comparison (video): CoTracker3 · BootsTAPIR · Anthro-LocoTrack (Ours)
FlowTrack proposes a novel framework for long-range dense tracking that combines the strengths of optical flow and point tracking. It chains confident optical flow predictions and automatically switches to an error compensation module when the flow becomes unreliable.
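A rough sketch of this chaining idea, with the flow fields, confidence maps, and error-compensation module passed in as assumed inputs (not FlowTrack's actual interface):

```python
import numpy as np

def chain_tracks(p0, flows, confs, refine, tau=0.8):
    """Track a point p0 (2,) across frames by chaining optical flow.
    flows: list of (H, W, 2) flow fields; confs: list of (H, W) confidences.
    Below confidence tau, fall back to an error-compensation callable."""
    track, p = [p0], p0
    for flow, conf in zip(flows, confs):
        u, v = p.round().astype(int)
        if conf[v, u] >= tau:
            p = p + flow[v, u]      # confident: chain the flow prediction
        else:
            p = refine(p, flow)     # unreliable: compensate the error
        track.append(p)
    return np.stack(track)          # (T + 1, 2) trajectory
```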
LocoTrack introduces local 4D correlation to overcome the matching ambiguities of local 2D correlation. It incorporates a lightweight correlation encoder for computational efficiency and a compact Transformer architecture to integrate long-term temporal information.
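To illustrate what a local 4D correlation volume is, here is a minimal sketch for a single point pair: every location in a query-frame neighborhood is matched against every location in a target-frame neighborhood, giving four spatial axes instead of two. Shapes and normalization are assumptions for illustration, not LocoTrack's code.

```python
import torch
import torch.nn.functional as F

def local_4d_correlation(feat_q, feat_t, pt_q, pt_t, r=3):
    """feat_q, feat_t: (C, H, W) feature maps; pt_q, pt_t: (x, y) locations
    assumed at least r pixels from the border. Returns a (p, p, p, p)
    correlation volume with p = 2r + 1."""
    def crop(feat, pt):
        x, y = int(pt[0]), int(pt[1])
        return feat[:, y - r:y + r + 1, x - r:x + r + 1]   # (C, p, p)
    q = F.normalize(crop(feat_q, pt_q), dim=0)   # unit-norm channels
    t = F.normalize(crop(feat_t, pt_t), dim=0)
    # Cosine similarity between every query location (i, j) and target (k, l)
    return torch.einsum('cij,ckl->ijkl', q, t)
```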
Chrono designs a novel feature backbone that leverages pre-trained representations from DINOv2 with a temporal adapter, effectively capturing long-range temporal context even without a refinement stage.
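A minimal sketch of one way to attach a temporal adapter on top of frozen per-frame features; the layer sizes and attention layout here are assumptions, not Chrono's actual architecture:

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Lightweight self-attention over the time axis, applied on top of
    frozen per-frame features (e.g. from DINOv2). Illustrative only."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (B, T, N, C) — B videos, T frames, N tokens per frame
        B, T, N, C = feats.shape
        x = feats.permute(0, 2, 1, 3).reshape(B * N, T, C)  # attend over time
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        out = (x + out).reshape(B, N, T, C).permute(0, 2, 1, 3)
        return out                                          # (B, T, N, C)
```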
Seurat presents a monocular video depth estimation framework that infers depth from 2D point trajectories. It uses both spatial and temporal transformers to produce accurate and temporally consistent depth predictions across frames.
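A toy sketch of this trajectory-to-depth idea: embed 2D track coordinates, alternate spatial attention (across tracks, per frame) with temporal attention (across frames, per track), then regress per-point depth. All shapes and layer sizes are assumptions, not Seurat's actual configuration.

```python
import torch
import torch.nn as nn

class TrajectoryDepthSketch(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Linear(2, dim)
        make = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial, self.temporal = make(), make()
        self.head = nn.Linear(dim, 1)

    def forward(self, tracks):
        # tracks: (B, T, N, 2) — N point trajectories over T frames
        B, T, N, _ = tracks.shape
        x = self.embed(tracks)
        # Spatial attention: tokens are the N points within each frame
        x = self.spatial(x.reshape(B * T, N, -1)).reshape(B, T, N, -1)
        # Temporal attention: tokens are the T frames of each trajectory
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, -1)
        x = self.temporal(x).reshape(B, N, T, -1).permute(0, 2, 1, 3)
        return self.head(x).squeeze(-1)   # (B, T, N) per-point depth
```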
@article{kim2025learning,
title = {Learning to Track Any Points from Human Motion},
author = {Kim, In{\`e}s Hyeonsu and Cho, Seokju and Koo, Jahyeok and Park, Junghyun and Huang, Jiahui and Lee, Joon-Young and Kim, Seungryong},
journal = {arXiv preprint arXiv:2507.06233},
year = {2025},
}