Learning to Track Any Points from Human Motion

1KAIST AI, 2Adobe Research
*Equal Contribution
ArXiv 2025

TL;DR: AnthroTAP generates highly complex pseudo-labeled training data for point tracking by leveraging the inherent complexity of human motion captured in videos.

Abstract

Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation.

Our proposed pipeline, AnthroTAP, addresses this by automatically generating pseudo-labeled training data using the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray casting, and filter out unreliable tracks based on optical flow consistency.
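The pseudo-trajectories come from projecting mesh vertices frame by frame. Below is a minimal sketch (not the authors' code) of that projection step with a standard pinhole camera; the camera-space vertices and the intrinsic matrix `K` are illustrative assumptions, not values from the paper.

```python
# Sketch: project SMPL vertices (camera coordinates) onto the image plane.
import numpy as np

def project_vertices(vertices: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (N, 3) camera-space vertices to (N, 2) pixel coordinates."""
    proj = vertices @ K.T              # apply intrinsics: (N, 3)
    return proj[:, :2] / proj[:, 2:3]  # perspective divide by depth

if __name__ == "__main__":
    # Illustrative intrinsics; the SMPL template has 6890 vertices.
    K = np.array([[1000.0, 0.0, 512.0],
                  [0.0, 1000.0, 512.0],
                  [0.0, 0.0, 1.0]])
    verts = np.random.rand(6890, 3) + np.array([0.0, 0.0, 3.0])  # keep depth > 0
    uv = project_vertices(verts, K)
    print(uv.shape)  # (6890, 2)
```

Repeating this projection for each frame and stacking the 2D positions of a vertex yields its pseudo-trajectory.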

A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using \( 10,000\times \) less data and only 1 day on 4 GPUs, compared to the 256 GPUs used by the recent state of the art.

Overall Pipeline

AnthroTAP extracts human meshes using an off-the-shelf human mesh recovery model and tracks points by projecting the mesh vertices. Point visibility is determined using ray casting. In parallel, we extract optical flow and retain only reliable flow using a forward-backward consistency check. Finally, to enhance pseudo-label reliability, trajectories are filtered by checking the consistency between the optical flow and the trajectories generated from the human mesh.
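As a rough illustration of the forward-backward check (a sketch under our own assumptions, not the authors' implementation), a flow vector at a pixel is kept only if following the forward flow and then the backward flow returns close to the starting pixel; the threshold `tau` and the nearest-neighbour sampling are illustrative choices.

```python
# Sketch: keep only optical flow that passes a forward-backward consistency test.
import numpy as np

def fb_consistency_mask(flow_fw: np.ndarray, flow_bw: np.ndarray, tau: float = 1.5) -> np.ndarray:
    """flow_fw, flow_bw: (H, W, 2) forward/backward flow. Returns an (H, W) bool mask."""
    H, W = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]

    # Where each pixel lands under the forward flow (nearest-neighbour rounding).
    x_t = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    y_t = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)

    # Backward flow sampled at the forward-warped location.
    flow_bw_warped = flow_bw[y_t, x_t]

    # For a consistent round trip, the backward flow cancels the forward flow.
    residual = flow_fw + flow_bw_warped
    return np.linalg.norm(residual, axis=-1) < tau
```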

Qualitative Comparisons

Compared to previous state-of-the-art methods, Anthro-LocoTrack, a LocoTrack-Base model trained on the dataset generated by our pipeline, consistently demonstrates strong performance on highly deformable objects and under severe occlusions.

CoTracker3

BootsTAPIR

Anthro-LocoTrack (Ours)

Quantitative Comparison

LocoTrack-B, trained with our approach, shows a significant performance improvement across all metrics and datasets on the TAP-Vid benchmark and RoboTAP, even with an \( 11\times \) smaller training dataset than CoTracker3 in terms of the number of videos and a \( 1,000\times \) smaller dataset than that used in BootsTAPIR in terms of the number of training frames. Furthermore, in terms of position accuracy (\( < \delta^{x}_{avg} \)), our model achieves the best performance on every dataset.

Ablation Studies

To examine whether training on human points generalizes to non-human points, we grouped query points in the TAP-Vid-DAVIS dataset into human and non-human regions and compared performance. Our method shows greater improvement on non-human points.
We compare our pipeline with the self-training scheme introduced in Karaev et al. (I) is the CoTracker3 model trained on Kubric. (II) fine-tunes the model on the Let’s Dance (LD) dataset using the self-training scheme introduced in the paper. (III) trains the CoTracker3 model using our pipeline. Our method yields a greater performance boost than the self-training used in CoTracker3.
We ablate the effect of optical flow-based filtering by comparing it with a baseline that uses trajectories projected from the human mesh and occlusion prediction via ray casting. While the baseline already achieves strong performance, applying trajectory rejection yields a further performance boost.
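For intuition, the trajectory rejection can be sketched as below (a minimal sketch under our own assumptions, not the authors' implementation): a mesh-projected trajectory is discarded when its per-frame displacement disagrees too much with the reliable optical flow. The threshold `tau`, the nearest-neighbour flow lookup, and the averaging rule are illustrative assumptions.

```python
# Sketch: reject a mesh-projected trajectory if it is inconsistent with optical flow.
import numpy as np

def reject_trajectory(traj: np.ndarray, flows: list, valid: list, tau: float = 4.0) -> bool:
    """traj: (T, 2) pixel positions of one projected vertex.
    flows[t]: (H, W, 2) flow from frame t to t+1; valid[t]: (H, W) reliability mask.
    Returns True if the trajectory should be discarded."""
    errors = []
    for t in range(len(traj) - 1):
        x, y = np.round(traj[t]).astype(int)
        H, W = flows[t].shape[:2]
        if not (0 <= x < W and 0 <= y < H) or not valid[t][y, x]:
            continue  # skip frames without a reliable flow estimate
        flow_disp = flows[t][y, x]          # displacement predicted by optical flow
        traj_disp = traj[t + 1] - traj[t]   # displacement of the mesh projection
        errors.append(np.linalg.norm(traj_disp - flow_disp))
    # Reject when the average disagreement with reliable flow is too large.
    return len(errors) > 0 and float(np.mean(errors)) > tau
```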

We also provide a comparative analysis of prevalent point tracking training datasets, focusing on trajectory complexity and diversity.

Related Links

FlowTrack proposes a novel framework for long-range dense tracking by combining the strengths of both optical flow and point tracking. It chains confident optical flow predictions while automatically switching to an error compensation module when the flow becomes unreliable.

LocoTrack introduces local 4D correlation to overcome the matching ambiguities of local 2D correlation. It incorporates a lightweight correlation encoder to enhance computational efficiency and a compact Transformer architecture to integrate long-term temporal information.

Chrono designs a novel feature backbone that leverages pre-trained representations from DINOv2 with a temporal adapter. It effectively captures long-term temporal context even without a refinement stage.

Seurat presents a monocular video depth estimation framework that leverages 2D point trajectories to infer depth. It utilizes both spatial and temporal transformers to produce accurate and temporally consistent depth predictions across frames.

BibTeX

@article{kim2025learning,
  title     = {Learning to Track Any Points from Human Motion},
  author    = {Kim, In{\`e}s Hyeonsu and Cho, Seokju and Koo, Jahyeok and Park, Junghyun and Huang, Jiahui and Lee, Joon-Young and Kim, Seungryong},
  journal   = {arXiv preprint arXiv:2507.06233},
  year      = {2025},
}