arXiv 2026
Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that suffers from a significant sim-to-real gap. We propose TETO, a teacher-student framework that learns event motion estimation from only ~25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow from events, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve competitive point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
We distill knowledge from a pretrained RGB tracker (AllTracker) that generates pseudo trajectories and optical flow on real event-RGB paired sequences. A lightweight concentration network \(\mathcal{C}\) converts multi-scale event stacks into a 3-channel representation compatible with the pretrained encoder, bridging the modality gap without architectural modification.
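The exact architecture of \(\mathcal{C}\) is not specified here, but its role is simple: collapse a multi-channel event representation into three channels so the frozen RGB encoder can consume it. A minimal sketch, assuming a per-pixel (1×1-convolution-style) learned linear projection followed by a squashing nonlinearity; the weight shapes and the `tanh` are illustrative assumptions, not the paper's design:

```python
import numpy as np

def concentrate(event_stacks, W, b):
    """Collapse multi-scale event stacks into a 3-channel image.

    event_stacks: (C, H, W) array, C = temporal bins x scales
                  (all scales assumed pre-resized to the same H x W).
    W, b: hypothetical learned 1x1-conv weights (3, C) and bias (3,).
    Returns a (3, H, W) tensor an RGB-pretrained encoder can ingest.
    """
    c, h, w = event_stacks.shape
    flat = event_stacks.reshape(c, -1)      # (C, H*W): one vector per pixel
    out = W @ flat + b[:, None]             # per-pixel linear mix of channels
    out = np.tanh(out)                      # squash into [-1, 1], image-like range
    return out.reshape(3, h, w)

rng = np.random.default_rng(0)
stacks = rng.standard_normal((10, 4, 4))    # e.g. 10 event channels
W = rng.standard_normal((3, 10)) * 0.1
b = np.zeros(3)
img = concentrate(stacks, W, b)
print(img.shape)  # → (3, 4, 4)
```

In practice the projection would be trained jointly with the tracking loss, so the network learns which temporal bins and scales matter for the downstream encoder.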
The student is trained end-to-end with a combined objective \(\mathcal{L} = \mathcal{L}_{\text{track}} + \lambda\,\mathcal{L}_{\text{flow}}\), jointly predicting point trajectories and dense optical flow from a single forward pass.
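The combined objective can be sketched as follows. The specific penalty forms (Huber on trajectories, L1 on flow) and the value of \(\lambda\) are assumptions for illustration; the source only specifies the two-term structure \(\mathcal{L} = \mathcal{L}_{\text{track}} + \lambda\,\mathcal{L}_{\text{flow}}\):

```python
import numpy as np

def huber(err, delta=1.0):
    """Smooth-L1 penalty, elementwise (assumed form of the tracking loss)."""
    a = np.abs(err)
    return np.where(a < delta, 0.5 * a**2 / delta, a - 0.5 * delta)

def combined_loss(pred_tracks, gt_tracks, pred_flow, gt_flow, lam=0.5):
    """L = L_track + lam * L_flow against teacher pseudo-labels.

    pred_tracks, gt_tracks: (N, T, 2) point trajectories in pixels.
    pred_flow, gt_flow:     (2, H, W) dense optical flow.
    lam: hypothetical weighting; the paper's lambda value is not given here.
    """
    l_track = huber(pred_tracks - gt_tracks).mean()
    l_flow = np.abs(pred_flow - gt_flow).mean()   # assumed L1 flow term
    return l_track + lam * l_flow
```

Both terms are computed from one forward pass of the student, so the flow head acts as a dense regularizer on the same features that produce the trajectories.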
Camera ego-motion dominates the motion distribution in limited data, causing naive sampling to overfit to global motion patterns. We decompose the teacher's optical flow via RANSAC to isolate object-specific movement, then oversample 90% of query points from dynamic regions.
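The decomposition above can be sketched with a numpy-only RANSAC. Here the global ego-motion is modeled as a single affine flow field (the paper's exact parametric model is not specified here), residuals against that fit mark dynamic pixels, and 90% of queries are drawn from them:

```python
import numpy as np

def fit_affine_ransac(xy, uv, iters=200, thresh=1.0, rng=None):
    """RANSAC-fit a global affine flow model uv ~ A @ [x, y, 1].

    xy: (N, 2) pixel coordinates; uv: (N, 2) teacher flow at those pixels.
    Returns the (2, 3) affine matrix with the most inliers.
    """
    rng = rng or np.random.default_rng(0)
    X = np.hstack([xy, np.ones((len(xy), 1))])       # homogeneous coords (N, 3)
    best_A, best_inl = None, -1
    for _ in range(iters):
        idx = rng.choice(len(xy), 3, replace=False)  # minimal sample
        A, *_ = np.linalg.lstsq(X[idx], uv[idx], rcond=None)
        resid = np.linalg.norm(X @ A - uv, axis=1)
        inl = (resid < thresh).sum()
        if inl > best_inl:
            best_A, best_inl = A, inl
    return best_A.T                                  # (2, 3)

def sample_queries(xy, uv, n=100, dyn_frac=0.9, thresh=1.0, rng=None):
    """Oversample queries from pixels whose flow deviates from ego-motion."""
    rng = rng or np.random.default_rng(0)
    A = fit_affine_ransac(xy, uv, rng=rng)
    X = np.hstack([xy, np.ones((len(xy), 1))])
    resid = np.linalg.norm(X @ A.T - uv, axis=1)
    dyn = np.flatnonzero(resid >= thresh)            # object-specific motion
    sta = np.flatnonzero(resid < thresh)             # ego-motion background
    n_dyn = min(int(n * dyn_frac), len(dyn))
    pick = np.concatenate([
        rng.choice(dyn, n_dyn, replace=False),
        rng.choice(sta, n - n_dyn, replace=False),
    ])
    return xy[pick]
```

Because the affine fit absorbs camera motion, even a small dynamic object contributes the bulk of the training queries rather than being drowned out by the background.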
Point tracking on EVIMO2.
| Method | Event data | AJ↑ | δ_avg^x↑ | OA↑ |
|---|---|---|---|---|
| CoTracker | - | 53.1 | 66.3 | 86.1 |
| AllTracker | - | 57.9 | 72.4 | 91.7 |
| ETAP | ~5hr Synth. | 66.1 | 78.9 | 89.5 |
| TETO | ~25min Real | 67.9 | 81.4 | 92.2 |
Optical flow on DSEC.
| Method | Setting | EPE↓ | AE↓ |
|---|---|---|---|
| MultiCM | In-domain | 3.47 | 13.98 |
| Paredes et al. | In-domain | 2.33 | 10.56 |
| VSA-SM | In-domain | 2.22 | 8.86 |
| E2FAI | In-domain | 1.78 | 6.44 |
| TETO | Zero-shot | 2.15 | 6.08 |
| TETO | In-domain | 1.39 | 4.31 |
Point tracking on EVIMO2. Ground-truth trajectories in green, predictions in blue.
Event cameras capture continuous brightness changes where RGB sensors fail. Through distillation, TETO inherits this advantage — the student exploits temporal motion cues unavailable to the teacher, enabling robust tracking where the teacher itself breaks down.
In nighttime and high-speed scenarios, AllTracker (teacher) fails while TETO succeeds.
Appearance-based RGB trackers confuse visually similar objects or fail under visual noise like reflections. TETO tracks through these cases using event-derived motion cues instead of appearance matching.
Visually similar juggling pins confuse appearance-based matching. TETO distinguishes each object through its distinct motion trajectory.
Reflections and bubbles create appearance noise that breaks RGB matching. TETO tracks through these distractors using motion signals.
Event cameras provide continuous motion signals during blind time between RGB frames — exactly the information needed for video frame interpolation. If this motion estimation is accurate, frame synthesis naturally follows. We condition a pretrained video diffusion transformer with three complementary signals from TETO: optical flow, point trajectories, and an event motion mask.
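Of these signals, flow-based warping (\(\mathcal{F}_{\text{warp}}\) in the ablation below) is the most standard; how the conditioning is wired into the diffusion transformer is not detailed here. A minimal sketch of the warping step itself, assuming plain bilinear backward warping of a frame along the predicted flow:

```python
import numpy as np

def backward_warp(img, flow):
    """Warp img toward the target time by bilinearly sampling along flow.

    img:  (H, W) grayscale frame (apply per channel for RGB).
    flow: (2, H, W) target->source displacements (dx, dy) in pixels.
    Returns the warped (H, W) frame, a coarse prior for the synthesized frame.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    sx = np.clip(xs + flow[0], 0, w - 1)             # source sample coords
    sy = np.clip(ys + flow[1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0                        # bilinear weights
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

The warped frame, the trajectory map, and the event motion mask would then be stacked as conditioning inputs; the diffusion model only has to correct residual errors rather than hallucinate the motion.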
| Model | FID↓ | LPIPS↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|
| TETO-VFI | 10.11 | 0.0821 | 25.42 | 0.8022 |
| − \(\mathcal{M}_{\text{event}}\) | 11.06 | 0.0867 | 23.58 | 0.8076 |
| − \(\mathcal{L}_{\text{attn}}\) | 10.82 | 0.0869 | 23.37 | 0.8014 |
| − \(\mathcal{F}_{\text{warp}}\) | 13.88 | 0.1203 | 21.96 | 0.7573 |
Explicit motion signals (flow warping, trajectory attention, event motion mask) guide the diffusion model to faithfully reconstruct fine-grained motion patterns such as ball seams and elastic deformations.
Frame interpolation on BS-ERGB.
| Method | ×1 FID↓ | ×1 LPIPS↓ | ×1 PSNR↑ | ×1 SSIM↑ | ×3 FID↓ | ×3 LPIPS↓ | ×3 PSNR↑ | ×3 SSIM↑ | ×6 FID↓ | ×6 LPIPS↓ | ×6 PSNR↑ | ×6 SSIM↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TimeLens-XL | 15.29 | 0.0808 | 28.27 | 0.8223 | 38.11 | 0.0695 | 29.73 | 0.8367 | 38.11 | 0.1314 | 22.43 | 0.7574 |
| CBMNet-Large | 10.64 | 0.1667 | 29.23 | 0.7737 | 11.23 | 0.1730 | 28.46 | 0.7580 | 13.31 | 0.1791 | 27.67 | 0.7511 |
| RE-VDM | 16.37 | 0.1004 | 28.04 | 0.8631 | 16.66 | 0.1268 | 27.03 | 0.8119 | 19.40 | 0.1258 | 26.93 | 0.8426 |
| TETO-VFI | 10.41 | 0.0684 | 25.73 | 0.8346 | 7.65 | 0.0689 | 26.21 | 0.8314 | 7.88 | 0.0859 | 25.15 | 0.7955 |
| Method | ×1 FID↓ | ×1 LPIPS↓ | ×1 PSNR↑ | ×1 SSIM↑ | ×15 FID↓ | ×15 LPIPS↓ | ×15 PSNR↑ | ×15 SSIM↑ |
|---|---|---|---|---|---|---|---|---|
| TimeLens-XL | 26.50 | 0.0601 | 25.51 | 0.8856 | 28.83 | 0.0579 | 25.67 | 0.8942 |
| CBMNet-Large | 18.92 | 0.0717 | 29.00 | 0.8849 | 26.19 | 0.0827 | 27.38 | 0.8767 |
| RE-VDM | 21.18 | 0.0605 | 26.95 | 0.8714 | 33.42 | 0.0801 | 25.56 | 0.8565 |
| TETO-VFI | 17.39 | 0.0401 | 26.75 | 0.9093 | 19.89 | 0.0501 | 25.36 | 0.8923 |
Without fine-tuning on HQ-EVFI, TETO-VFI achieves the best perceptual quality (FID, LPIPS).
TETO-VFI produces sharper boundaries and more faithful reconstructions in dynamic motion regions.
@misc{yang2026tetotrackingeventsteacher,
title={TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation},
author={Jini Yang and Eunbeen Hong and Soowon Son and Hyunkoo Lee and Sunghwan Hong and Sunok Kim and Seungryong Kim},
year={2026},
eprint={2603.23487},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.23487},
}