TETO
Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

arXiv 2026

¹KAIST AI  ²ETH Zurich  ³Korea Aerospace University
*Equal contribution
TL;DR

TETO learns event-based motion estimation from only ~25 minutes of real-world data via RGB teacher distillation, achieving strong point tracking and optical flow, and enabling high-quality video frame interpolation.

~25 min of real-world data · 0 manual annotations · 2-in-1 flow + tracking estimator

Abstract

Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that suffers from a significant sim-to-real gap. We propose TETO, a teacher-student framework that learns event motion estimation from only ~25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow from events, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve competitive point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

Event-based Motion Estimation

Why Real-World Data?

  • Synthetic events from interpolated RGB frames oversimplify continuous motion dynamics
  • Synthetic events exhibit periodic artifacts and unrealistic noise statistics
  • Rendering + interpolation + event simulation pipelines are computationally expensive and hard to scale
  • The resulting sim-to-real gap limits generalization to real-world dynamic scenes
[Figure: training data scale comparison; real vs. synthetic event analysis]

Teacher-Student Distillation

We distill knowledge from a pretrained RGB tracker (AllTracker) that generates pseudo trajectories and optical flow on real event-RGB paired sequences. A lightweight concentration network \(\mathcal{C}\) converts multi-scale event stacks into a 3-channel representation compatible with the pretrained encoder, bridging the modality gap without architectural modification.
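As a rough sketch of this idea, the concentration network can be thought of as a learned projection that collapses the temporal bins of an event stack into a 3-channel, image-like tensor the frozen RGB encoder can consume. The snippet below approximates \(\mathcal{C}\) with a single 1×1 projection; the shapes, weights, and function name are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def concentrate(event_stack, weight, bias):
    """Collapse a multi-bin event stack (C_in, H, W) into a 3-channel,
    image-like tensor via a learned 1x1 projection (here: an einsum).
    `weight` has shape (3, C_in); `bias` has shape (3,)."""
    out = np.einsum('oc,chw->ohw', weight, event_stack) + bias[:, None, None]
    return np.tanh(out)  # squash into a bounded, image-like range

# Toy example: 10 temporal bins at 8x8 resolution.
rng = np.random.default_rng(0)
stack = rng.standard_normal((10, 8, 8))
w, b = rng.standard_normal((3, 10)) * 0.1, np.zeros(3)
rgb_like = concentrate(stack, w, b)  # shape (3, 8, 8)
```

In the paper, \(\mathcal{C}\) is a lightweight trainable network; the key design point preserved here is that only the input adapter changes, so the pretrained encoder is reused without architectural modification.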

Training Pipeline

The student is trained end-to-end with a combined objective \(\mathcal{L} = \mathcal{L}_{\text{track}} + \lambda\,\mathcal{L}_{\text{flow}}\), jointly predicting point trajectories and dense optical flow from a single forward pass.
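The combined objective can be written out as a minimal sketch. The two-term structure and the weighting \(\lambda\) come from the text above; the concrete choice of L2 penalties, array shapes, and names here are illustrative assumptions.

```python
import numpy as np

def combined_loss(pred_tracks, gt_tracks, pred_flow, teacher_flow, lam=0.5):
    """L = L_track + lambda * L_flow (illustrative L2 forms).

    pred_tracks, gt_tracks: (N, T, 2) point trajectories (pseudo-GT from teacher)
    pred_flow, teacher_flow: (H, W, 2) dense optical flow fields
    """
    # L_track: mean distance between predicted and pseudo-ground-truth tracks
    l_track = np.mean(np.linalg.norm(pred_tracks - gt_tracks, axis=-1))
    # L_flow: mean endpoint error against the teacher's dense flow
    l_flow = np.mean(np.linalg.norm(pred_flow - teacher_flow, axis=-1))
    return l_track + lam * l_flow
```

Both terms are computed from the same forward pass, matching the joint flow-plus-tracking design of the estimator.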

Motion-aware Query Sampling

Camera ego-motion dominates the motion distribution in limited data, so naive uniform sampling overfits the student to global motion patterns. We decompose the teacher's optical flow via RANSAC to isolate object-specific movement, then draw 90% of query points from these dynamic regions.

[Figure: motion-aware query sampling]

Quantitative Results

Point Tracking on EVIMO2

Method     | Event data   | AJ↑  | δx_avg↑ | OA↑
CoTracker  | –            | 53.1 | 66.3    | 86.1
AllTracker | –            | 57.9 | 72.4    | 91.7
ETAP       | ~5 hr synth. | 66.1 | 78.9    | 89.5
TETO       | ~25 min real | 67.9 | 81.4    | 92.2

Optical Flow on DSEC

Method         | Setting   | EPE↓ (px) | AE↓ (°)
MultiCM        | In-domain | 3.47      | 13.98
Paredes et al. | In-domain | 2.33      | 10.56
VSA-SM         | In-domain | 2.22      | 8.86
E2FAI          | In-domain | 1.78      | 6.44
TETO           | Zero-shot | 2.15      | 6.08
TETO           | In-domain | 1.39      | 4.31

Qualitative Results

[Figure: EVIMO2 qualitative comparison]

Point tracking on EVIMO2. Ground-truth trajectories in green, predictions in blue.

[Figure: tracking on BS-ERGB]

Beyond the Teacher

Event cameras capture continuous brightness changes where RGB sensors fail. Through distillation, TETO inherits this advantage — the student exploits temporal motion cues unavailable to the teacher, enabling robust tracking where the teacher itself breaks down.

Extreme Conditions

In nighttime and high-speed scenarios, AllTracker (teacher) fails while TETO succeeds.

Motion over Appearance

Appearance-based RGB trackers confuse visually similar objects or fail under visual noise like reflections. TETO tracks through these cases using event-derived motion cues instead of appearance matching.

Juggling scene

Visually similar juggling pins confuse appearance-based matching. TETO distinguishes each object through its distinct motion trajectory.

Fish tank scene

Reflections and bubbles create appearance noise that breaks RGB matching. TETO tracks through these distractors using motion signals.

Video Frame Interpolation

Event cameras provide continuous motion signals during the blind time between RGB frames — exactly the information needed for video frame interpolation. If this motion estimation is accurate, frame synthesis naturally follows. We condition a pretrained video diffusion transformer with three complementary signals from TETO: optical flow, point trajectories, and an event motion mask.

Architecture

[Figure: VFI architecture]
  • Flow warping — boundary frames warped to target timestamp for motion-aligned denoising initialization
  • Event motion mask — \(\mathcal{M}_{\text{event}}\) from raw events directs generation toward dynamic regions
  • Trajectory-guided attention — point trajectories supervise DiT self-attention for fine-grained correspondence
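The flow-warping step above can be sketched as follows: a boundary frame is pulled toward the target timestamp t by linearly scaling the estimated flow, giving a motion-aligned starting point for the denoiser. This uses nearest-neighbor backward sampling and a linear-motion assumption for brevity; the actual pipeline is presumably more careful, so treat it as illustrative.

```python
import numpy as np

def warp_to_t(frame, flow_01, t):
    """Warp frame0 toward timestamp t in [0, 1] using the 0->1 flow.

    Approximates I_t(p) ~ I_0(p - t * F_{0->1}(p)) with nearest-neighbor
    sampling; the result initializes the diffusion denoiser in
    motion-aligned coordinates.
    """
    H, W = frame.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs - t * flow_01[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys - t * flow_01[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]
```

Warping both boundary frames (the second with the reversed flow and weight 1 − t) and blending them is a common way to build the initialization the bullet above describes.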

Ablation Study

Component Analysis on BS-ERGB

Model                            | FID↓  | LPIPS↓ | PSNR↑ | SSIM↑
TETO-VFI                         | 10.11 | 0.0821 | 25.42 | 0.8022
− \(\mathcal{M}_{\text{event}}\) | 11.06 | 0.0867 | 23.58 | 0.8076
− \(\mathcal{L}_{\text{attn}}\)  | 10.82 | 0.0869 | 23.37 | 0.8014
− \(\mathcal{F}_{\text{warp}}\)  | 13.88 | 0.1203 | 21.96 | 0.7573
[Figure: VFI ablation, qualitative comparison]

Explicit motion signals (flow warping, trajectory attention, event motion mask) guide the diffusion model to faithfully reconstruct fine-grained motion patterns such as ball seams and elastic deformations.

Quantitative Results

Video Frame Interpolation on BS-ERGB

Method       | ×1: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑ | ×3: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑ | ×6: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑
TimeLens-XL  | 15.29 / 0.0808 / 28.27 / 0.8223 | 38.11 / 0.0695 / 29.73 / 0.8367 | 38.11 / 0.1314 / 22.43 / 0.7574
CBMNet-Large | 10.64 / 0.1667 / 29.23 / 0.7737 | 11.23 / 0.1730 / 28.46 / 0.7580 | 13.31 / 0.1791 / 27.67 / 0.7511
RE-VDM       | 16.37 / 0.1004 / 28.04 / 0.8631 | 16.66 / 0.1268 / 27.03 / 0.8119 | 19.40 / 0.1258 / 26.93 / 0.8426
TETO-VFI     | 10.41 / 0.0684 / 25.73 / 0.8346 |  7.65 / 0.0689 / 26.21 / 0.8314 |  7.88 / 0.0859 / 25.15 / 0.7955

Zero-shot Generalization on HQ-EVFI

Method       | ×1: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑ | ×15: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑
TimeLens-XL  | 26.50 / 0.0601 / 25.51 / 0.8856 | 28.83 / 0.0579 / 25.67 / 0.8942
CBMNet-Large | 18.92 / 0.0717 / 29.00 / 0.8849 | 26.19 / 0.0827 / 27.38 / 0.8767
RE-VDM       | 21.18 / 0.0605 / 26.95 / 0.8714 | 33.42 / 0.0801 / 25.56 / 0.8565
TETO-VFI     | 17.39 / 0.0401 / 26.75 / 0.9093 | 19.89 / 0.0501 / 25.36 / 0.8923

Without fine-tuning on HQ-EVFI, TETO-VFI achieves the best perceptual quality (FID, LPIPS).

Qualitative Results

TETO-VFI produces sharper boundaries and more faithful reconstructions in dynamic motion regions.

Citation

@misc{yang2026tetotrackingeventsteacher,
      title={TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation}, 
      author={Jini Yang and Eunbeen Hong and Soowon Son and Hyunkoo Lee and Sunghwan Hong and Sunok Kim and Seungryong Kim},
      year={2026},
      eprint={2603.23487},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.23487}, 
}