arXiv 2026
Optical flow models trained on high-quality data degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. We formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from corrupted video. Our key insight is that intermediate representations of image restoration diffusion models are inherently corruption-aware. We lift such a model to attend to multiple frames through full spatio-temporal attention and show that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.
We investigate whether pretrained image restoration diffusion models can provide degradation-aware features for geometric correspondence. By lifting the DiT4SR model with full spatio-temporal attention over multi-frame inputs, we analyze the zero-shot correspondence quality of diffusion features under severe degradation.
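The lifting step can be illustrated with a minimal numpy sketch: instead of each frame's tokens attending only within that frame, tokens from all frames are flattened into a single sequence so attention spans frames. This is an illustrative single-head toy, not the DiT4SR implementation; the function name and weight matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lifted_attention(frame_tokens, Wq, Wk, Wv):
    """Full spatio-temporal attention (toy, single head).

    frame_tokens: (T, N, C) -- T frames, N spatial tokens each.
    Flattening time into the token axis lets every token attend
    to tokens of *all* frames, i.e. cross-frame attention,
    rather than per-frame self-attention.
    """
    T, N, C = frame_tokens.shape
    x = frame_tokens.reshape(T * N, C)          # merge time into tokens
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))        # (T*N, T*N) cross-frame map
    return (attn @ v).reshape(T, N, C)          # restore per-frame layout
```

The (T·N, T·N) attention map is where zero-shot correspondence can be read off: entries linking tokens of frame 0 to tokens of frame 1 score cross-frame similarity.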
Comparison of layer-wise average EPE (end-point error) over timesteps. Across nearly all feature indices, the lifted features exhibit substantially lower EPE. This demonstrates that lifting enhances correspondence capability by enabling the model to attend to cross-frame information.
Comparison of zero-shot geometric correspondence between Baseline and Lifting features. (Left) Top-10 layers ranked by timestep-averaged EPE (lower is better). Lifting consistently achieves lower EPE across all ranks. (Right) EPE over denoising steps for the top-4 layers of each method. Baseline features are highly sensitive to the denoising step, while Lifting features remain stable across steps.
Overall Architecture of DA-Flow. DA-Flow retains the standard correlation operator and iterative update operator from RAFT, while incorporating the lifted diffusion model alongside a conventional CNN feature encoder. Given a pair of degraded input frames, the lifted diffusion model extracts query and key features from full spatio-temporal MM-Attention layers. Features from the top-\(L\) layers with strongest correspondence are aggregated via DPT-based upsampling with three separate heads (query, key, context), then concatenated with CNN encoder features to form hybrid representations for cost volume construction and iterative flow refinement.
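The fusion step described above can be sketched in a few lines of numpy: per-pixel diffusion and CNN features are concatenated into hybrid representations, from which a RAFT-style all-pairs correlation volume is built. This is a simplified single-level sketch under assumed channel layouts; the function name and normalization are illustrative, not taken from DA-Flow.

```python
import numpy as np

def hybrid_cost_volume(diff1, diff2, cnn1, cnn2):
    """Build an all-pairs correlation volume from hybrid features.

    diff1, diff2: (H, W, Cd) diffusion features for frames 1 and 2.
    cnn1,  cnn2:  (H, W, Cc) CNN encoder features for the same frames.
    Returns a (H, W, H, W) volume: corr[i, j, k, l] scores how well
    pixel (i, j) of frame 1 matches pixel (k, l) of frame 2.
    """
    f1 = np.concatenate([diff1, cnn1], axis=-1)   # hybrid features, frame 1
    f2 = np.concatenate([diff2, cnn2], axis=-1)   # hybrid features, frame 2
    H, W, C = f1.shape
    # All-pairs dot products, scaled by sqrt(C) as in RAFT.
    corr = (f1.reshape(H * W, C) @ f2.reshape(H * W, C).T) / np.sqrt(C)
    return corr.reshape(H, W, H, W)
```

In the full model this volume is then sampled by the iterative update operator to refine the flow estimate; the sketch only covers the construction.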
Coming soon.