CVPR 2026

AnthroTAP:
Learning Point Tracking with Real-World Motion

¹KAIST AI, ²Adobe Research, ³University of Michigan, ⁴LG AI Research
*: Equal contribution
Pseudo-labeled trajectories extracted using the AnthroTAP pipeline on videos from DanceTrack dataset
TL;DR

What if tracking any point in a video didn't require millions of labeled examples, but just people dancing? We built AnthroTAP, a pipeline that distills rich supervision from real human motion videos using 3D body mesh fitting. Trained on just 1,400 dance videos, our model outperforms methods using 10,000x more data, showing that structured human movement offers a scalable and effective source of real-world supervision for point tracking.

  • 11× fewer training videos than CoTracker3
  • 10,000× fewer training videos than BootsTAPIR
  • 1 day of training on 4 NVIDIA A6000 GPUs

Why Human Motion?

The Data Problem

  • Real-world annotation requires tracking hundreds of points across frames, which is prohibitively expensive
  • Synthetic datasets (Kubric, PointOdyssey) miss real-world visual complexity: motion blur, lighting, reflections
  • Self-training methods need 15M+ real videos and still suffer from weak supervision signals

Our Insight

  • Human motion is a free source of complex, structured supervision already captured in millions of videos
  • Non-rigid deformations, articulated joints, and frequent occlusions: exactly what trackers need to learn
  • 3D mesh models (SMPL) let us extract precise trajectories automatically, without any manual labeling
Key insight: Human dance videos freely available online are inherently annotated by the laws of physics: every person in every frame follows a consistent 3D trajectory. We just need to recover and project that 3D structure to obtain high-quality pseudo-labels at scale.

Pipeline

AnthroTAP is an automated pseudo-labeling pipeline that distills the rich supervision signal in human motion videos into 2D point tracking data. By fitting SMPL models to detected people, projecting 3D mesh vertices onto the image plane, and resolving occlusions via ray-casting, it generates trajectories with accurate occlusion labels, entirely without manual annotation.

AnthroTAP pipeline: human mesh recovery and vertex projection form the core pseudo-labeling process, with ray-casting for occlusion modeling.
① Human Mesh Recovery

TokenHMR fits SMPL 3D body meshes to every person detected in each frame, producing a temporally consistent 3D mesh with N_v vertices per person.

② Vertex Projection

3D mesh vertices are projected onto the 2D image plane using known camera parameters, forming initial pseudo-trajectories 𝒳_{p,j} across frames.
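A minimal sketch of the projection step, assuming a standard pinhole camera with an intrinsics matrix `K` and mesh vertices already expressed in camera coordinates. The function name, the array layout, and the example intrinsics are illustrative assumptions, not the authors' code.

```python
import numpy as np

def project_vertices(vertices_cam, K):
    """Project (T, N, 3) camera-space vertices to (T, N, 2) pixel coordinates.

    Assumes a pinhole model without lens distortion (an assumption of this
    sketch; the page only says the camera parameters are known).
    """
    vertices_cam = np.asarray(vertices_cam, dtype=np.float64)
    # Homogeneous image coordinates: K @ [X, Y, Z]^T for every vertex.
    uvw = vertices_cam @ K.T
    # Perspective divide by depth gives pixel coordinates (u, v).
    return uvw[..., :2] / uvw[..., 2:3]

# Hypothetical intrinsics: focal length 500 px, principal point (128, 128).
K = np.array([[500.0, 0.0, 128.0],
              [0.0, 500.0, 128.0],
              [0.0, 0.0, 1.0]])

# Two frames, two vertices each, in metres in front of the camera.
verts = np.array([[[0.0, 0.0, 2.0], [0.2, -0.1, 2.0]],
                  [[0.1, 0.0, 2.0], [0.2, -0.1, 4.0]]])

tracks = project_vertices(verts, K)
print(tracks[0, 1])  # [178. 103.]
```

Stacking the per-frame projections of one vertex along the time axis gives that vertex's 2D pseudo-trajectory.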

③ Visibility via Ray Casting

A ray cast from the camera toward each 3D vertex is tested against all human mesh triangles. Points blocked before reaching the target are marked as occluded, resolving both self-occlusion and inter-person occlusion.
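The visibility test can be sketched with the Möller–Trumbore ray-triangle intersection, a standard choice swapped in here for illustration; the page does not say which intersection routine is used. The function names, the brute-force loop over vertices, and the `margin` that ignores hits on the triangles touching the target vertex are assumptions of this sketch.

```python
import numpy as np

def segment_blocked(origin, target, triangles, eps=1e-9, margin=1e-3):
    """Return True if any triangle intersects the segment origin -> target.

    triangles: (M, 3, 3) array of triangle corner positions.
    The ray direction is left unnormalized, so t = 1 is exactly the target.
    """
    direction = target - origin
    v0 = triangles[:, 0]
    e1 = triangles[:, 1] - v0
    e2 = triangles[:, 2] - v0
    pvec = np.cross(direction, e2)
    det = np.einsum("ij,ij->i", e1, pvec)
    valid = np.abs(det) > eps            # skip triangles parallel to the ray
    inv_det = np.zeros_like(det)
    inv_det[valid] = 1.0 / det[valid]
    tvec = origin - v0
    u = np.einsum("ij,ij->i", tvec, pvec) * inv_det
    qvec = np.cross(tvec, e1)
    v = (qvec @ direction) * inv_det
    t = np.einsum("ij,ij->i", e2, qvec) * inv_det
    # A hit counts only if it lies strictly between the camera and the target.
    hit = valid & (u >= 0) & (v >= 0) & (u + v <= 1) & (t > eps) & (t < 1 - margin)
    return bool(hit.any())

def vertex_visibility(camera, vertices, triangles):
    """Boolean visibility flag per vertex (True = not occluded)."""
    return np.array([not segment_blocked(camera, p, triangles) for p in vertices])

camera = np.zeros(3)
# One occluding triangle in the plane z = 1, centred on the optical axis.
occluder = np.array([[[-1.0, -1.0, 1.0], [1.0, -1.0, 1.0], [0.0, 1.0, 1.0]]])
points = np.array([[0.0, 0.0, 2.0],    # behind the triangle
                   [5.0, 0.0, 2.0],    # off to the side
                   [0.0, 0.0, 0.5]])   # in front of the triangle
visible = vertex_visibility(camera, points, occluder)
print(visible)  # [False  True  True]
```

Passing the triangles of every fitted mesh in the frame as `triangles` covers both self-occlusion and inter-person occlusion in a single test. A practical implementation would likely use an accelerated ray caster rather than this per-vertex loop.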

+ Additional Refinement
Optical Flow Filtering

HMR-predicted displacements are compared against optical flow at each frame. Trajectory segments where the two diverge are discarded before training, catching occlusions from scene elements (e.g., furniture, background objects) outside the scope of the SMPL model.
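A sketch of this consistency check, assuming dense forward optical flow is already available as an array. The 3-pixel threshold, the nearest-neighbour flow lookup, and the function name are assumptions of this example; the page does not specify the flow estimator or the divergence criterion.

```python
import numpy as np

def flow_consistency_mask(tracks, flows, max_error_px=3.0):
    """Flag per-step agreement between projected tracks and optical flow.

    tracks: (T, N, 2) pixel coordinates (x, y) of projected mesh vertices.
    flows:  (T-1, H, W, 2) forward flow from frame t to frame t+1.
    Returns a (T-1, N) boolean array, True where the step agrees with the flow.
    """
    T = tracks.shape[0]
    H, W = flows.shape[1:3]
    # Displacement implied by the mesh projection between consecutive frames.
    disp = tracks[1:] - tracks[:-1]
    # Look up the flow at each track's starting pixel (nearest neighbour).
    xy = np.rint(tracks[:-1]).astype(int)
    x = np.clip(xy[..., 0], 0, W - 1)
    y = np.clip(xy[..., 1], 0, H - 1)
    t_idx = np.arange(T - 1)[:, None]
    sampled = flows[t_idx, y, x]
    err = np.linalg.norm(disp - sampled, axis=-1)
    return err <= max_error_px

# Toy scene: the whole image moves one pixel to the right per frame.
flows = np.zeros((2, 8, 8, 2))
flows[..., 0] = 1.0
tracks = np.array([[[1.0, 1.0], [2.0, 2.0]],
                   [[2.0, 1.0], [3.0, 2.0]],
                   [[3.0, 1.0], [8.0, 2.0]]])  # track 1 jumps 5 px in step 2
consistent = flow_consistency_mask(tracks, flows)
print(consistent.tolist())  # [[True, True], [True, False]]
```

Steps flagged `False` mark the trajectory segments that would be dropped before training.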

Dataset: Anthro-LD

We apply AnthroTAP to 1.4K videos from the Let's Dance dataset, spanning diverse dance styles from solo performances to complex multi-person scenes. Unlike competing approaches, our training data is entirely non-proprietary.

Trajectory Complexity

We measure complexity as the mean angular acceleration of trajectories, which captures how sharply and frequently a trajectory changes direction, computed over all contiguous visible segments. Anthro-LD achieves the highest complexity by a large margin, surpassing even synthetic datasets specifically designed for diversity.

  • Kubric (synthetic): 0.18
  • DriveTrack (real): 0.44
  • PointOdyssey (synthetic): 0.52
  • Anthro-LD (ours, real): 1.25

Trajectory complexity (mean angular acceleration ↑). Anthro-LD is nearly 3× more complex than the next real-world dataset, DriveTrack (1.25 vs. 0.44).
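One plausible way to compute such a complexity score is sketched below. The exact formula is an assumption: heading is taken as the angle of the frame-to-frame displacement, angular velocity as the wrapped change in heading, and angular acceleration as the change in angular velocity, averaged in absolute value over contiguous visible segments. The page does not give the precise definition or units, so the numbers from this sketch are not comparable to the ones reported above.

```python
import numpy as np

def visible_segments(visible):
    """Return (start, stop) index pairs of contiguous True runs."""
    padded = np.concatenate([[False], visible, [False]])
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2], edges[1::2]))

def mean_angular_acceleration(track, visible):
    """track: (T, 2) pixel positions; visible: (T,) bool. Result in rad/frame^2."""
    values = []
    for start, stop in visible_segments(np.asarray(visible, dtype=bool)):
        seg = track[start:stop]
        if len(seg) < 4:          # need 4 points for one acceleration sample
            continue
        vel = np.diff(seg, axis=0)
        theta = np.arctan2(vel[:, 1], vel[:, 0])        # heading per step
        omega = np.diff(theta)
        omega = (omega + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
        values.append(np.abs(np.diff(omega)))
    return float(np.concatenate(values).mean()) if values else 0.0

straight = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
zigzag = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])

score_straight = mean_angular_acceleration(straight, np.ones(4, dtype=bool))
score_zigzag = mean_angular_acceleration(zigzag, np.ones(5, dtype=bool))
print(score_straight)  # 0.0
```

Under this definition a straight-line track scores zero, while the staircase-shaped `zigzag` track, which alternates 90-degree left and right turns, scores π.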

Results

Qualitative Comparison

Compared to CoTracker3 and BootsTAPIR, Anthro-LocoTrack consistently tracks more reliably on highly deformable objects and under severe occlusions.

Quantitative Results on TAP-Vid & RoboTAP

Both LocoTrack and TAPNext trained with our approach show significant improvements across all metrics and datasets. We use 11× fewer videos than CoTracker3 and 1,000× fewer training frames than BootsTAPIR.
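The video-count ratios can be checked with a few lines of arithmetic, using the approximate counts that appear in the Training Dataset column of the results table (15K real videos for CoTracker3, 15M for BootsTAPIR, 1.4K for ours). The variable names are illustrative. The frame-count ratio cannot be reproduced from the numbers on this page, so it is not computed here.

```python
# Approximate real-video counts as listed in the Training Dataset column.
anthro_videos = 1_400
cotracker3_videos = 15_000
bootstapir_videos = 15_000_000

ratio_cotracker3 = cotracker3_videos / anthro_videos
ratio_bootstapir = bootstapir_videos / anthro_videos

print(round(ratio_cotracker3, 1))  # 10.7
print(round(ratio_bootstapir))     # 10714
```

These round to the 11× and 10,000× video-count figures quoted on this page.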

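Models evaluated at 256×256 resolution

| Method | Training Dataset | DAVIS First AJ↑ | DAVIS First δ^x_avg | DAVIS First OA↑ | DAVIS Strided AJ↑ | DAVIS Strided δ^x_avg | DAVIS Strided OA↑ | Kinetics First AJ↑ | Kinetics First δ^x_avg | Kinetics First OA↑ | RoboTAP First AJ↑ | RoboTAP First δ^x_avg | RoboTAP First OA↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OmniMotion | - | - | - | - | 51.7 | 67.5 | 85.3 | - | - | - | - | - | - |
| Dino-Tracker | - | - | - | - | 62.3 | 78.2 | 87.5 | - | - | - | - | - | - |
| TAPNet | Kub | 33.0 | 48.6 | 78.8 | 38.4 | 53.1 | 82.3 | 38.5 | 54.4 | 80.6 | - | - | - |
| TAPIR | Kub | 58.5 | 70.0 | 86.5 | 61.3 | 73.6 | 88.8 | 49.6 | 64.2 | 85.0 | 59.6 | 73.4 | 87.0 |
| Online TAPIR | Kub | 56.2 | 70.0 | 86.5 | - | - | - | 51.5 | - | - | - | - | - |
| TAPTR | Kub | 63.0 | 76.1 | 91.1 | 66.3 | 79.2 | 91.0 | 49.0 | 64.4 | 85.2 | 60.1 | 75.3 | 86.9 |
| TAPTRv2 | Kub | 63.5 | 75.9 | 91.4 | 66.4 | 78.8 | 91.3 | 49.7 | 64.2 | 85.7 | - | - | - |
| TAPTRv3 | Kub | 63.2 | 76.7 | 91.0 | - | - | - | 54.5 | 67.5 | 88.2 | - | - | - |
| BootsTAPIR | Kub+15M | 61.4 | 74.0 | 88.4 | 66.2 | 78.5 | 90.7 | 54.6 | 68.4 | 86.5 | 64.9 | 80.1 | 86.3 |
| LocoTrack | Kub | 63.0 | 75.3 | 87.2 | 67.8 | 79.6 | 89.9 | 52.9 | 66.8 | 85.3 | 62.3 | 76.2 | 87.1 |
| Anthro-LocoTrack (Ours) | Kub+1.4K | 64.8 | 77.3 | 89.1 | 69.0 | 81.0 | 90.8 | 53.9 | 68.4 | 86.4 | 64.7 | 79.2 | 88.4 |
| Improvement over baseline | | +1.8 | +2.0 | +1.9 | +1.2 | +1.4 | +0.9 | +1.0 | +1.6 | +1.1 | +2.4 | +3.0 | +1.3 |
| TAPNext | Kub | 62.4 | 76.6 | 90.5 | 65.4 | 79.7 | 88.9 | - | - | - | 59.8 | 73.1 | 88.1 |
| BootsTAPNext | Kub+15M | 65.2 | 78.5 | 91.2 | 68.9 | 82.4 | 91.6 | - | - | - | 64.1 | 75.1 | 88.8 |
| Anthro-TAPNext (Ours) | Kub+1.4K | 66.1 | 79.3 | 91.7 | 71.4 | 83.5 | 92.4 | - | - | - | 63.4 | 76.3 | 90.2 |
| Improvement over baseline | | +3.7 | +2.7 | +1.2 | +6.0 | +3.8 | +3.5 | - | - | - | +3.6 | +3.2 | +2.1 |

Models evaluated at 384×512 resolution

| Method | Training Dataset | DAVIS First AJ↑ | DAVIS First δ^x_avg | DAVIS First OA↑ | DAVIS Strided AJ↑ | DAVIS Strided δ^x_avg | DAVIS Strided OA↑ | Kinetics First AJ↑ | Kinetics First δ^x_avg | Kinetics First OA↑ | RoboTAP First AJ↑ | RoboTAP First δ^x_avg | RoboTAP First OA↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PIPs | FT | 42.2 | 64.8 | 77.7 | 52.4 | 70.0 | 83.6 | - | - | - | - | - | - |
| CoTracker2 | Kub | 62.2 | 75.7 | 89.3 | 65.9 | 79.4 | 89.9 | 48.8 | 64.5 | 85.8 | - | - | - |
| Track-On | Kub | 65.0 | 78.0 | 90.8 | - | - | - | 53.9 | 67.3 | 87.8 | - | - | - |
| CoTracker3 (online) | Kub64+15K | 64.4 | 76.9 | 91.2 | - | - | - | 54.7 | 67.8 | 87.4 | - | - | - |
| CoTracker3 (offline) | Kub64+15K | 63.8 | 76.3 | 90.2 | - | - | - | 55.8 | 68.5 | 88.3 | - | - | - |
| LocoTrack | Kub | 64.8 | 77.4 | 86.2 | 69.4 | 81.3 | 88.6 | 52.3 | 66.4 | 82.1 | - | - | - |
| Anthro-LocoTrack (Ours) | Kub+1.4K | 65.9 | 78.9 | 87.3 | 71.1 | 82.9 | 90.3 | 54.8 | 68.6 | 85.3 | - | - | - |
| Improvement over baseline | | +1.1 | +1.5 | +1.1 | +1.7 | +1.6 | +1.7 | +2.5 | +2.2 | +3.2 | - | - | - |
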
Quantitative results on TAP-Vid benchmark and RoboTAP. Kub = Kubric, Kub64 = Kubric64 (dataset rendered in CoTracker3). Best in bold, second best underlined.

Ablation Study

Abstract

Point tracking models often struggle to generalize to real-world videos because large-scale training data is predominantly synthetic, the only source currently feasible to produce at scale. Collecting real-world annotations, however, is prohibitively expensive, as it requires tracking hundreds of points across frames.

We introduce AnthroTAP, an automated pipeline that generates large-scale pseudo-labeled point tracking data from real human motion videos. Leveraging the structured complexity of human movement, including non-rigid deformations, articulated motion, and frequent occlusions, AnthroTAP fits Skinned Multi-Person Linear (SMPL) models to detected humans, projects mesh vertices onto image planes, resolves occlusions via ray-casting, and filters unreliable tracks using optical flow consistency.

A model trained on the AnthroTAP dataset achieves state-of-the-art performance on TAP-Vid, a challenging general-domain benchmark for tracking any point on diverse rigid and non-rigid objects (e.g., humans, animals, robots, and vehicles). Our approach outperforms recent self-training methods trained on vastly larger real datasets, while requiring only one day of training on 4 GPUs.

AnthroTAP shows that structured human motion offers a scalable and effective source of real-world supervision for point tracking.

Citation

@inproceedings{kim2026anthrotap,
  title     = {AnthroTAP: Learning Point Tracking with Real-World Motion},
  author    = {Kim, In{\`e}s Hyeonsu and Cho, Seokju and Koo, Jahyeok and Park, Junghyun and Huang, Jiahui and Lee, Honglak and Lee, Joon-Young and Kim, Seungryong},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}