TETO
Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

arXiv 2026

¹KAIST AI  ²ETH Zurich  ³Korea Aerospace University
*Equal contribution
TL;DR

TETO learns event-based motion estimation from only ~25 minutes of real-world data via RGB teacher distillation, achieving strong point tracking and optical flow, and enabling high-quality video frame interpolation.

~25 min of real-world data · 0 manual annotations · 2-in-1 flow + tracking estimator

Abstract

Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that suffers from a significant sim-to-real gap. We propose TETO, a teacher-student framework that learns event motion estimation from only ~25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow from events, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve competitive point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

Event-based Motion Estimation

Why Real-World Data?

  • Synthetic events from interpolated RGB frames oversimplify continuous motion dynamics
  • Synthetic events exhibit periodic artifacts and unrealistic noise statistics
  • Rendering + interpolation + event simulation pipelines are computationally expensive and hard to scale
  • The resulting sim-to-real gap limits generalization to real-world dynamic scenes
[Figure: training data scale comparison; real vs. synthetic event analysis]

Teacher-Student Distillation

We distill knowledge from a pretrained RGB tracker (AllTracker) that generates pseudo trajectories and optical flow on real event-RGB paired sequences. A lightweight concentration network \(\mathcal{C}\) converts multi-scale event stacks into a 3-channel representation compatible with the pretrained encoder, bridging the modality gap without architectural modification.
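As a rough sketch of this idea, the concentration network can be thought of as a learned projection that collapses the temporal bins of an event stack into a 3-channel, image-like tensor the frozen RGB encoder can consume. The snippet below approximates \(\mathcal{C}\) with a single 1×1 projection; the shapes, weights, and function name are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def concentrate(event_stack, weight, bias):
    """Collapse a multi-bin event stack (C_in, H, W) into a 3-channel,
    image-like tensor via a learned 1x1 projection (here: an einsum).
    `weight` has shape (3, C_in); `bias` has shape (3,)."""
    out = np.einsum('oc,chw->ohw', weight, event_stack) + bias[:, None, None]
    return np.tanh(out)  # squash into a bounded, image-like range

# Toy example: 10 temporal bins at 8x8 resolution.
rng = np.random.default_rng(0)
stack = rng.standard_normal((10, 8, 8))
w, b = rng.standard_normal((3, 10)) * 0.1, np.zeros(3)
rgb_like = concentrate(stack, w, b)  # shape (3, 8, 8)
```

In the paper, \(\mathcal{C}\) is a lightweight trainable network; the key design point preserved here is that only the input adapter changes, so the pretrained encoder is reused without architectural modification.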

Training Pipeline

The student is trained end-to-end with a combined objective \(\mathcal{L} = \mathcal{L}_{\text{track}} + \lambda\,\mathcal{L}_{\text{flow}}\), jointly predicting point trajectories and dense optical flow from a single forward pass.
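The combined objective can be written out as a minimal sketch. The two-term structure and the weighting \(\lambda\) come from the text above; the concrete choice of L2 penalties, array shapes, and names here are illustrative assumptions.

```python
import numpy as np

def combined_loss(pred_tracks, gt_tracks, pred_flow, teacher_flow, lam=0.5):
    """L = L_track + lambda * L_flow (illustrative L2 forms).

    pred_tracks, gt_tracks: (N, T, 2) point trajectories (pseudo-GT from teacher)
    pred_flow, teacher_flow: (H, W, 2) dense optical flow fields
    """
    # L_track: mean distance between predicted and pseudo-ground-truth tracks
    l_track = np.mean(np.linalg.norm(pred_tracks - gt_tracks, axis=-1))
    # L_flow: mean endpoint error against the teacher's dense flow
    l_flow = np.mean(np.linalg.norm(pred_flow - teacher_flow, axis=-1))
    return l_track + lam * l_flow
```

Both terms are computed from the same forward pass, matching the joint flow-plus-tracking design of the estimator.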

Motion-aware Query Sampling

Camera ego-motion dominates the motion distribution in limited data, so naive uniform sampling overfits the student to global motion patterns. We decompose the teacher's optical flow via RANSAC to isolate object-specific movement, then draw 90% of query points from these dynamic regions.

[Figure: motion-aware query sampling]

Quantitative Results

Point Tracking on EVIMO2

Method     | Event data   | AJ↑  | δx_avg↑ | OA↑
CoTracker  | –            | 53.1 | 66.3    | 86.1
AllTracker | –            | 57.9 | 72.4    | 91.7
ETAP       | ~5 hr synth. | 66.1 | 78.9    | 89.5
TETO       | ~25 min real | 67.9 | 81.4    | 92.2

Optical Flow on DSEC

Method         | Setting   | EPE↓ (px) | AE↓ (°)
MultiCM        | In-domain | 3.47      | 13.98
Paredes et al. | In-domain | 2.33      | 10.56
VSA-SM         | In-domain | 2.22      | 8.86
E2FAI          | In-domain | 1.78      | 6.44
TETO           | Zero-shot | 2.15      | 6.08
TETO           | In-domain | 1.39      | 4.31

Qualitative Results

[Figure: EVIMO2 qualitative comparison]

Point tracking on EVIMO2. Ground-truth trajectories in green, predictions in blue.

[Figure: tracking on BS-ERGB]

Beyond the Teacher

Event cameras capture continuous brightness changes where RGB sensors fail. Through distillation, TETO inherits this advantage — the student exploits temporal motion cues unavailable to the teacher, enabling robust tracking where the teacher itself breaks down.

Extreme Conditions

In nighttime and high-speed scenarios, AllTracker (teacher) fails while TETO succeeds.

Motion over Appearance

Appearance-based RGB trackers confuse visually similar objects or fail under visual noise like reflections. TETO tracks through these cases using event-derived motion cues instead of appearance matching.

Juggling scene

Visually similar juggling pins confuse appearance-based matching. TETO distinguishes each object through its distinct motion trajectory.

Fish tank scene

Reflections and bubbles create appearance noise that breaks RGB matching. TETO tracks through these distractors using motion signals.

Video Frame Interpolation

Event cameras provide continuous motion signals during the blind time between RGB frames — exactly the information needed for video frame interpolation. If this motion estimation is accurate, frame synthesis naturally follows. We condition a pretrained video diffusion transformer with three complementary signals from TETO: optical flow, point trajectories, and an event motion mask.

Architecture

[Figure: VFI architecture]
  • Flow warping — boundary frames warped to target timestamp for motion-aligned denoising initialization
  • Event motion mask — \(\mathcal{M}_{\text{event}}\) from raw events directs generation toward dynamic regions
  • Trajectory-guided attention — point trajectories supervise DiT self-attention for fine-grained correspondence
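The flow-warping step above can be sketched as follows: a boundary frame is pulled toward the target timestamp t by linearly scaling the estimated flow, giving a motion-aligned starting point for the denoiser. This uses nearest-neighbor backward sampling and a linear-motion assumption for brevity; the actual pipeline is presumably more careful, so treat it as illustrative.

```python
import numpy as np

def warp_to_t(frame, flow_01, t):
    """Warp frame0 toward timestamp t in [0, 1] using the 0->1 flow.

    Approximates I_t(p) ~ I_0(p - t * F_{0->1}(p)) with nearest-neighbor
    sampling; the result initializes the diffusion denoiser in
    motion-aligned coordinates.
    """
    H, W = frame.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs - t * flow_01[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys - t * flow_01[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]
```

Warping both boundary frames (the second with the reversed flow and weight 1 − t) and blending them is a common way to build the initialization the bullet above describes.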

Ablation Study

Component Analysis on BS-ERGB

Model                            | FID↓  | LPIPS↓ | PSNR↑ | SSIM↑
TETO-VFI                         | 10.11 | 0.0821 | 25.42 | 0.8022
− \(\mathcal{M}_{\text{event}}\) | 11.06 | 0.0867 | 23.58 | 0.8076
− \(\mathcal{L}_{\text{attn}}\)  | 10.82 | 0.0869 | 23.37 | 0.8014
− \(\mathcal{F}_{\text{warp}}\)  | 13.88 | 0.1203 | 21.96 | 0.7573
[Figure: VFI ablation, qualitative comparison]

Explicit motion signals (flow warping, trajectory attention, event motion mask) guide the diffusion model to faithfully reconstruct fine-grained motion patterns such as ball seams and elastic deformations.

Quantitative Results

Video Frame Interpolation on BS-ERGB

Method       | ×1: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑ | ×3: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑ | ×6: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑
TimeLens-XL  | 15.29 / 0.0808 / 28.27 / 0.8223 | 38.11 / 0.0695 / 29.73 / 0.8367 | 38.11 / 0.1314 / 22.43 / 0.7574
CBMNet-Large | 10.64 / 0.1667 / 29.23 / 0.7737 | 11.23 / 0.1730 / 28.46 / 0.7580 | 13.31 / 0.1791 / 27.67 / 0.7511
RE-VDM       | 16.37 / 0.1004 / 28.04 / 0.8631 | 16.66 / 0.1268 / 27.03 / 0.8119 | 19.40 / 0.1258 / 26.93 / 0.8426
TETO-VFI     | 10.41 / 0.0684 / 25.73 / 0.8346 |  7.65 / 0.0689 / 26.21 / 0.8314 |  7.88 / 0.0859 / 25.15 / 0.7955

Zero-shot Generalization on HQ-EVFI

Method       | ×1: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑ | ×15: FID↓ / LPIPS↓ / PSNR↑ / SSIM↑
TimeLens-XL  | 26.50 / 0.0601 / 25.51 / 0.8856 | 28.83 / 0.0579 / 25.67 / 0.8942
CBMNet-Large | 18.92 / 0.0717 / 29.00 / 0.8849 | 26.19 / 0.0827 / 27.38 / 0.8767
RE-VDM       | 21.18 / 0.0605 / 26.95 / 0.8714 | 33.42 / 0.0801 / 25.56 / 0.8565
TETO-VFI     | 17.39 / 0.0401 / 26.75 / 0.9093 | 19.89 / 0.0501 / 25.36 / 0.8923

Without fine-tuning on HQ-EVFI, TETO-VFI achieves the best perceptual quality (FID, LPIPS).

Qualitative Results

TETO-VFI produces sharper boundaries and more faithful reconstructions in dynamic motion regions.

Citation

@misc{yang2026tetotrackingeventsteacher,
      title={TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation}, 
      author={Jini Yang and Eunbeen Hong and Soowon Son and Hyunkoo Lee and Sunghwan Hong and Sunok Kim and Seungryong Kim},
      year={2026},
      eprint={2603.23487},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.23487}, 
}