CVPR 2026

AnthroTAP:
Learning Point Tracking with Real-World Motion

Inès Hyeonsu Kim¹,³* Seokju Cho¹* Jahyeok Koo¹ Junghyun Park¹

Jiahui Huang² Honglak Lee³,⁴ Joon-Young Lee² Seungryong Kim¹

¹KAIST AI ²Adobe Research ³University of Michigan ⁴LG AI Research

*: Equal contribution

Pseudo-labeled trajectories extracted with the AnthroTAP pipeline on videos from the DanceTrack dataset.

TL;DR

What if tracking any point in a video didn't require millions of labeled examples, but just people dancing? We built AnthroTAP, a pipeline that distills rich supervision from real human motion videos using 3D body mesh fitting. Trained on just 1,400 dance videos, our model outperforms methods that use up to 10,000× more videos, showing that structured human movement offers a scalable and effective source of real-world supervision for point tracking.

11× fewer training videos than CoTracker3

10,000× fewer training videos than BootsTAPIR

1 day of training on 4 NVIDIA A6000 GPUs

Why Human Motion?

The Data Problem

Real-world annotation requires tracking hundreds of points across frames, which is prohibitively expensive
Synthetic datasets (Kubric, PointOdyssey) miss real-world visual complexity: motion blur, lighting, reflections
Self-training methods need 15M+ real videos and still suffer from weak supervision signals

Our Insight

Human motion is a free source of complex, structured supervision already captured in millions of videos
Non-rigid deformations, articulated joints, and frequent occlusions: exactly what trackers need to learn
3D mesh models (SMPL) let us extract precise trajectories automatically, without any manual labeling

💡

Key insight: Human dance videos freely available online are inherently annotated by the laws of physics: every person in every frame follows a consistent 3D trajectory. We just need to recover and project that 3D structure to obtain high-quality pseudo-labels at scale.

Pipeline

AnthroTAP is an automated pseudo-labeling pipeline that distills the rich supervision signal in human motion videos into 2D point tracking data. By fitting SMPL models to detected people, projecting 3D mesh vertices onto the image plane, and resolving occlusions via ray-casting, it generates trajectories with accurate occlusion labels, entirely without manual annotation.

AnthroTAP pipeline: human mesh recovery and vertex projection form the core pseudo-labeling process, with ray-casting for occlusion modeling.

① Human Mesh Recovery

TokenHMR fits SMPL 3D body meshes to every person detected in each frame, producing a temporally consistent 3D mesh with N_v vertices per person.

② Vertex Projection

3D mesh vertices are projected onto the 2D image plane using known camera parameters, forming initial pseudo-trajectories 𝒳_p,j across frames.
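The projection step can be illustrated with a minimal sketch, assuming a simple pinhole camera with vertices already expressed in camera coordinates. The function name `project_vertices`, the intrinsics values, and the array layout are illustrative assumptions, not the pipeline's actual interface.

```python
import numpy as np

def project_vertices(vertices_cam, focal, principal_point):
    """Project camera-frame 3D vertices to 2D pixel coordinates (pinhole model).

    vertices_cam: (T, N_v, 3) array of vertices per frame, camera coordinates.
    focal: (fx, fy) in pixels. principal_point: (cx, cy) in pixels.
    Returns a (T, N_v, 2) array of pixel coordinates, one trajectory per vertex.
    """
    vertices_cam = np.asarray(vertices_cam, dtype=np.float64)
    fx, fy = focal
    cx, cy = principal_point
    z = vertices_cam[..., 2]
    # Guard against division by zero for points lying on the camera plane.
    z = np.where(np.abs(z) < 1e-8, 1e-8, z)
    u = fx * vertices_cam[..., 0] / z + cx
    v = fy * vertices_cam[..., 1] / z + cy
    return np.stack([u, v], axis=-1)

# Toy input (hypothetical values): 2 frames, 2 vertices.
verts = np.array([
    [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]],  # frame 0
    [[0.1, 0.0, 2.0], [1.0, 0.2, 2.0]],  # frame 1
])
tracks = project_vertices(verts, focal=(500.0, 500.0), principal_point=(128.0, 128.0))
print(tracks.shape)  # (2, 2, 2)
```

Stacking the per-frame projections of one vertex along the time axis gives its pseudo-trajectory; visibility is handled separately.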

③ Visibility via Ray Casting

A ray cast from the camera toward each 3D vertex is tested against all human mesh triangles. Points blocked before reaching the target are marked as occluded, resolving both self-occlusion and inter-person occlusion.
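As a rough illustration of the visibility test, the sketch below casts a ray from the camera to a vertex and checks every triangle with the standard Möller-Trumbore intersection routine. This is an assumed, brute-force formulation; the helper names, the tolerance `tol`, and the camera placed at the origin are illustrative choices, and a real implementation would likely use an accelerated ray-casting library.

```python
import numpy as np

def ray_hits_triangle(origin, direction, tri, eps=1e-9):
    """Möller-Trumbore test. Returns the hit distance t along the ray, or None."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return t if t > eps else None

def vertex_visible(vertex, triangles, camera=None, tol=1e-4):
    """True if no triangle blocks the ray before it reaches the vertex."""
    camera = np.zeros(3) if camera is None else camera
    to_vertex = vertex - camera
    dist = np.linalg.norm(to_vertex)
    direction = to_vertex / dist
    for tri in triangles:
        t = ray_hits_triangle(camera, direction, tri)
        # Hits at (almost) the vertex distance come from triangles that
        # contain the vertex itself, so only strictly closer hits occlude.
        if t is not None and t < dist - tol:
            return False
    return True

# Toy scene (hypothetical): one triangle in the plane z = 1.
occluder = np.array([[-1.0, -1.0, 1.0], [1.0, -1.0, 1.0], [0.0, 1.0, 1.0]])
behind = np.array([0.0, 0.0, 2.0])    # directly behind the triangle
off_side = np.array([5.0, 0.0, 2.0])  # ray passes beside the triangle
print(vertex_visible(behind, [occluder]), vertex_visible(off_side, [occluder]))
```

Passing the triangles of every person's mesh in the frame to `vertex_visible` covers both self-occlusion and inter-person occlusion in the same loop.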

+ Additional Refinement

Optical Flow Filtering

HMR-predicted displacements are compared against optical flow at each frame. Trajectory segments where the two diverge are discarded before training, catching occlusions from scene elements (e.g., furniture, background objects) outside the scope of the SMPL model.
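One possible way to express this filter is sketched below. It assumes dense forward optical flow is already available per frame pair and flags a step as inconsistent when the two displacements differ by more than a pixel threshold. The function name, the nearest-neighbour flow lookup, and the `threshold` value are assumptions for illustration; the source does not specify how divergence is measured.

```python
import numpy as np

def flow_consistency_mask(tracks, flows, threshold=2.0):
    """Compare HMR-derived displacements against optical flow.

    tracks: (T, N, 2) pixel coordinates (x, y) of projected vertices.
    flows:  (T-1, H, W, 2) forward flow (dx, dy) from frame t to t+1.
    Returns a (T-1, N) boolean mask, True where the two displacements agree.
    """
    num_frames = tracks.shape[0]
    height, width = flows.shape[1:3]
    hmr_disp = tracks[1:] - tracks[:-1]  # (T-1, N, 2)
    # Nearest-neighbour lookup of the flow at each track's start position.
    xs = np.clip(np.rint(tracks[:-1, :, 0]).astype(int), 0, width - 1)
    ys = np.clip(np.rint(tracks[:-1, :, 1]).astype(int), 0, height - 1)
    t_idx = np.arange(num_frames - 1)[:, None]
    flow_disp = flows[t_idx, ys, xs]  # (T-1, N, 2)
    error = np.linalg.norm(hmr_disp - flow_disp, axis=-1)
    return error <= threshold  # assumed hard threshold in pixels

# Toy data (hypothetical): uniform flow of +1 px in x; track 1 moves too fast.
flows = np.zeros((2, 64, 64, 2))
flows[..., 0] = 1.0
tracks = np.array([
    [[10.0, 10.0], [20.0, 20.0]],
    [[11.0, 10.0], [25.0, 20.0]],
    [[12.0, 10.0], [30.0, 20.0]],
])
mask = flow_consistency_mask(tracks, flows)
print(mask)
```

Steps where the mask is False would be dropped from the training trajectories, since disagreement suggests an occluder that the body meshes do not model.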

Dataset: Anthro-LD

We apply AnthroTAP to 1.4K videos from the Let's Dance dataset, spanning diverse dance styles from solo performances to complex multi-person scenes. Unlike competing approaches, our training data is entirely non-proprietary.

Trajectory Complexity

We define complexity as the mean angular acceleration of a trajectory, which captures how sharply and how often it changes direction, computed over all contiguous visible segments. Anthro-LD achieves the highest complexity by a large margin, surpassing even synthetic datasets specifically designed for diversity.

Dataset	Type	Trajectory complexity
Kubric	Synthetic	0.18
DriveTrack	Real	0.44
PointOdyssey	Synthetic	0.52
Anthro-LD (Ours)	Real	1.25

Trajectory complexity (mean angular acceleration ↑). Anthro-LD is nearly 3× more complex than the next real-world dataset, DriveTrack (1.25 vs. 0.44).
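The exact formula behind the complexity score is not given here, so the following is only one plausible reading: take the heading angle of the per-frame velocity, difference it once for angular velocity and again for angular acceleration, and average the magnitude over each contiguous visible segment. All function names and the minimum segment length are assumptions.

```python
import numpy as np

def visible_segments(visible):
    """Return (start, end) pairs for each run of True values (end exclusive)."""
    padded = np.concatenate([[False], visible, [False]]).astype(int)
    changes = np.flatnonzero(np.diff(padded))
    return list(zip(changes[0::2], changes[1::2]))

def mean_angular_acceleration(track, visible):
    """track: (T, 2) pixel positions; visible: (T,) boolean visibility flags."""
    values = []
    for start, end in visible_segments(visible):
        segment = track[start:end]
        if len(segment) < 4:
            continue  # need 4 points for one angular-acceleration sample
        velocity = np.diff(segment, axis=0)
        heading = np.arctan2(velocity[:, 1], velocity[:, 0])
        # Wrap heading differences into [-pi, pi) before differencing again.
        angular_velocity = (np.diff(heading) + np.pi) % (2 * np.pi) - np.pi
        angular_acceleration = np.diff(angular_velocity)
        values.append(np.abs(angular_acceleration))
    if not values:
        return 0.0
    return float(np.mean(np.concatenate(values)))

# Toy trajectories (hypothetical): a straight line and a zigzag.
always_visible = np.ones(5, dtype=bool)
straight = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]], dtype=float)
zigzag = np.array([[0, 0], [1, 1], [2, 0], [3, 1], [4, 0]], dtype=float)
print(mean_angular_acceleration(straight, always_visible))
print(mean_angular_acceleration(zigzag, always_visible))
```

Under this reading a straight-line track scores zero and a track that keeps reversing direction scores high, which matches the intuition that articulated human motion yields more complex trajectories than rigid or driving scenes.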

Results

Qualitative Comparison

Compared to CoTracker3 and BootsTAPIR, Anthro-LocoTrack consistently demonstrates stronger tracking on highly deformable objects and severe occlusions.

Video panels: CoTracker3 | BootsTAPIR | Anthro-LocoTrack (Ours)

Quantitative Results on TAP-Vid & RoboTAP

Both LocoTrack and TAPNext trained with our approach show significant improvements across all metrics and datasets. We use 11× fewer videos than CoTracker3 and 1,000× fewer training frames than BootsTAPIR.
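The video-count ratios can be checked with a few lines of arithmetic, using the rounded dataset sizes from the Training Dataset column of the results table (1.4K, 15K, 15M). The constant and function names are illustrative. The frame-level ratio is not reproduced because per-video frame counts are not stated here.

```python
# Rounded video counts taken from the "Training Dataset" column of the table.
ANTHRO_LD_VIDEOS = 1_400        # Kub+1.4K (ours)
COTRACKER3_VIDEOS = 15_000      # Kub64+15K
BOOTSTAPIR_VIDEOS = 15_000_000  # Kub+15M

def video_ratio(other: int, ours: int = ANTHRO_LD_VIDEOS) -> float:
    """How many times more real videos another method uses than Anthro-LD."""
    return other / ours

cotracker3_ratio = video_ratio(COTRACKER3_VIDEOS)
bootstapir_ratio = video_ratio(BOOTSTAPIR_VIDEOS)
print(f"CoTracker3: {cotracker3_ratio:.1f}x more videos")    # 10.7x, i.e. about 11x
print(f"BootsTAPIR: {bootstapir_ratio:,.0f}x more videos")   # 10,714x, i.e. about 10,000x
```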

Quantitative results on the TAP-Vid benchmark and RoboTAP. Kub = Kubric, Kub64 = Kubric64 (dataset rendered in CoTracker3). Best in bold, second best underlined.
Method	Training Dataset	DAVIS First			DAVIS Strided			Kinetics First			RoboTAP First
Method	Training Dataset	AJ↑	<δ^x_avg↑	OA↑	AJ↑	<δ^x_avg↑	OA↑	AJ↑	<δ^x_avg↑	OA↑	AJ↑	<δ^x_avg↑	OA↑
Models evaluated at 256×256 resolution
OmniMotion	-	-	-	-	51.7	67.5	85.3	-	-	-	-	-	-
Dino-Tracker	-	-	-	-	62.3	78.2	87.5	-	-	-	-	-	-

TAPNet	Kub	33.0	48.6	78.8	38.4	53.1	82.3	38.5	54.4	80.6	-	-	-
TAPIR	Kub	58.5	70.0	86.5	61.3	73.6	88.8	49.6	64.2	85.0	59.6	73.4	87.0
Online TAPIR	Kub	56.2	70.0	86.5	-	-	-	51.5	-	-	-	-	-
TAPTR	Kub	63.0	76.1	91.1	66.3	79.2	91.0	49.0	64.4	85.2	60.1	75.3	86.9
TAPTRv2	Kub	63.5	75.9	91.4	66.4	78.8	91.3	49.7	64.2	85.7	-	-	-
TAPTRv3	Kub	63.2	76.7	91.0	-	-	-	54.5	67.5	88.2	-	-	-
BootsTAPIR	Kub+15M	61.4	74.0	88.4	66.2	78.5	90.7	54.6	68.4	86.5	64.9	80.1	86.3

LocoTrack	Kub	63.0	75.3	87.2	67.8	79.6	89.9	52.9	66.8	85.3	62.3	76.2	87.1
Anthro-LocoTrack (Ours)	Kub+1.4K	64.8	77.3	89.1	69.0	81.0	90.8	53.9	68.4	86.4	64.7	79.2	88.4
Improvement over baseline		+1.8	+2.0	+1.9	+1.2	+1.4	+0.9	+1.0	+1.6	+1.1	+2.4	+3.0	+1.3

TAPNext	Kub	62.4	76.6	90.5	65.4	79.7	88.9	-	-	-	59.8	73.1	88.1
BootsTAPNext	Kub+15M	65.2	78.5	91.2	68.9	82.4	91.6	-	-	-	64.1	75.1	88.8
Anthro-TAPNext (Ours)	Kub+1.4K	66.1	79.3	91.7	71.4	83.5	92.4	-	-	-	63.4	76.3	90.2
Improvement over baseline		+3.7	+2.7	+1.2	+6.0	+3.8	+3.5	-	-	-	+3.6	+3.2	+2.1
Models evaluated at 384×512 resolution
PIPs	FT	42.2	64.8	77.7	52.4	70.0	83.6	-	-	-	-	-	-
CoTracker2	Kub	62.2	75.7	89.3	65.9	79.4	89.9	48.8	64.5	85.8	-	-	-
Track-On	Kub	65.0	78.0	90.8	-	-	-	53.9	67.3	87.8	-	-	-
CoTracker3 (online)	Kub64+15K	64.4	76.9	91.2	-	-	-	54.7	67.8	87.4	-	-	-
CoTracker3 (offline)	Kub64+15K	63.8	76.3	90.2	-	-	-	55.8	68.5	88.3	-	-	-

LocoTrack	Kub	64.8	77.4	86.2	69.4	81.3	88.6	52.3	66.4	82.1	-	-	-
Anthro-LocoTrack (Ours)	Kub+1.4K	65.9	78.9	87.3	71.1	82.9	90.3	54.8	68.6	85.3	-	-	-
Improvement over baseline		+1.1	+1.5	+1.1	+1.7	+1.6	+1.7	+2.5	+2.2	+3.2	-	-	-

Abstract

Point tracking models often struggle to generalize to real-world videos because large-scale training data is predominantly synthetic, the only source currently feasible to produce at scale. Collecting real-world annotations, however, is prohibitively expensive, as it requires tracking hundreds of points across frames.

We introduce AnthroTAP, an automated pipeline that generates large-scale pseudo-labeled point tracking data from real human motion videos. Leveraging the structured complexity of human movement, including non-rigid deformations, articulated motion, and frequent occlusions, AnthroTAP fits Skinned Multi-Person Linear (SMPL) models to detected humans, projects mesh vertices onto image planes, resolves occlusions via ray-casting, and filters unreliable tracks using optical flow consistency.

A model trained on the AnthroTAP dataset achieves state-of-the-art performance on TAP-Vid, a challenging general-domain benchmark for tracking any point on diverse rigid and non-rigid objects (e.g., humans, animals, robots, and vehicles). Our approach outperforms recent self-training methods trained on vastly larger real datasets, while requiring only one day of training on 4 GPUs.

AnthroTAP shows that structured human motion offers a scalable and effective source of real-world supervision for point tracking.

Citation

@inproceedings{kim2026anthrotap,
  title     = {AnthroTAP: Learning Point Tracking with Real-World Motion},
  author    = {Kim, In{\`e}s Hyeonsu and Cho, Seokju and Koo, Jahyeok and Park, Junghyun and Huang, Jiahui and Lee, Honglak and Lee, Joon-Young and Kim, Seungryong},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
