All videos were generated from a single prompt, with no re-caching, and are shown at 1× speed.
Deep Forcing is a training-free framework that enables long-video generation in autoregressive video diffusion models by combining Deep Sink and Participative Compression.
Deep Forcing achieves more than 12× length extrapolation (5s → 60s+) without fine-tuning. This approach delivers superior imaging and aesthetic quality compared to training-based methods while preserving real-time streaming capability.
Motivation
Deep Sink: Our analysis reveals that attention is distributed across both early and intermediate frames, unlike in LLMs, where preserving only a few initial sink tokens suffices. Maintaining fidelity beyond the training length therefore requires deep sink preservation (40-60% of the cache); see the measurement sketch below.
Participative Compression: Prevents quality degradation from attention dilution, which is caused by error accumulation during extreme out-of-distribution long-context extrapolation.
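The deep-sink behavior can be made concrete by measuring how much attention mass each cached frame receives. The sketch below is illustrative only: the tensor shapes, the per_frame_attention_mass helper, and the aggregation over heads and queries are assumptions, not the exact analysis pipeline used in the paper.

```python
import torch

def per_frame_attention_mass(attn, tokens_per_frame):
    """Average attention mass received by each cached frame.

    attn: [heads, query_len, key_len] attention weights from one
          denoising step (illustrative shape, not any model's internals).
    """
    # Sum over heads and queries, then normalize to a distribution over keys.
    mass_per_key = attn.sum(dim=(0, 1))
    mass_per_key = mass_per_key / mass_per_key.sum()
    # Aggregate key positions into frames.
    num_frames = attn.shape[-1] // tokens_per_frame
    mass_per_key = mass_per_key[: num_frames * tokens_per_frame]
    return mass_per_key.view(num_frames, tokens_per_frame).sum(dim=-1)

# Example: attention spread over 20 cached frames of 16 tokens each.
attn = torch.rand(8, 16, 20 * 16).softmax(dim=-1)
print(per_frame_attention_mass(attn, tokens_per_frame=16))
```

If attention were a shallow LLM-style sink, almost all of this mass would land on the first one or two frames; a flatter profile over early and intermediate frames is what motivates keeping a much deeper sink.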
Deep Forcing
Deep Sink maintains a substantially enlarged attention sink (~50% of the cache) together with a temporal RoPE adjustment that keeps sink tokens temporally coherent with the current frames.
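As a rough illustration of the cache layout, the sketch below keeps a deep sink plus the most recent frames and re-indexes the kept frames with contiguous temporal positions for RoPE. The deep_sink_evict name, the per-frame cache list, and the eviction schedule are assumptions for illustration, not the paper's implementation.

```python
import torch

def deep_sink_evict(kv_frames, max_frames, sink_ratio=0.5):
    """Keep a deep attention sink (~sink_ratio of the cache) plus the most
    recent frames, evicting the frames in between.

    kv_frames: list of per-frame (key, value) tensors, oldest first.
    Returns the kept frames and contiguous temporal positions for RoPE.
    """
    if len(kv_frames) <= max_frames:
        kept = kv_frames
    else:
        num_sink = int(max_frames * sink_ratio)   # e.g. ~50% of the cache
        num_recent = max_frames - num_sink
        kept = kv_frames[:num_sink] + kv_frames[-num_recent:]
    # Re-index kept frames with contiguous positions so RoPE sees no gap
    # between the sink frames and the recent window.
    positions = torch.arange(len(kept))
    return kept, positions
```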
Participative Compression selectively prunes the cache using attention scores computed from recent frames: it retains only the top-C most contextually relevant tokens and evicts redundant or degraded ones.
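A minimal sketch of top-C selection, assuming the cache is a flat list of key/value tokens and that relevance is scored by how much attention recent-frame queries assign to each cached token; the shapes and the participative_compression name are illustrative, not the paper's exact implementation.

```python
import torch

def participative_compression(keys, values, recent_queries, top_c):
    """Keep only the top-C cached tokens that recent frames attend to most.

    keys, values:    [num_cached_tokens, dim]  cached KV entries
    recent_queries:  [num_recent_tokens, dim]  queries from recent frames
    """
    # Attention of recent queries against every cached key.
    scores = recent_queries @ keys.T / keys.shape[-1] ** 0.5     # [R, N]
    relevance = scores.softmax(dim=-1).sum(dim=0)                # [N] mass per token
    # Retain the top-C most attended tokens, preserving their original order.
    keep = relevance.topk(min(top_c, relevance.numel())).indices.sort().values
    return keys[keep], values[keep]

# Example: compress a 4096-token cache down to 1024 tokens.
K, V = torch.randn(4096, 64), torch.randn(4096, 64)
Q_recent = torch.randn(256, 64)
K_small, V_small = participative_compression(K, V, Q_recent, top_c=1024)
```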
Qualitative Results
Qualitative results demonstrating the capabilities of Deep Forcing.
Deep Forcing on diverse prompts
Comparison with baselines on shared prompts
Qualitative comparisons of Deep Forcing against Self Forcing and other baselines on shared prompts.
Ablation
Sink Size Ablation: Larger sinks (10-15 frames) reduce degradation and aesthetic drift, but an excessive size (18 frames) causes over-preservation and repetition.
Component Analysis: Deep Sink (DS) and Participative Compression (PC) can be layered on top of the vanilla baseline. The clips below highlight how each component progressively improves long video generation.
Conclusion
In this work, we propose Deep Forcing, a training-free framework that mitigates error accumulation in autoregressive long video generation through two key mechanisms: Deep Sink and Participative Compression. By exploiting the inherent deep attention sink behavior in pre-trained models, our method enables minute-long video generation while preserving visual fidelity and motion dynamics. Extensive experiments across VBench-Long, user studies, and VLM evaluation demonstrate that our training-free framework achieves strong performance competitive with training-based methods.
Citation
If you use this work or find it helpful, please consider citing: