All videos, except those in 'Interactive Prompting' and 'World Models', were generated from a single prompt, with no re-caching, and are shown at 1× speed.
Deep Forcing is a training-free framework that enables long-video generation in autoregressive video diffusion models by combining Deep Sink and Participative Compression.
Deep Forcing achieves more than 12× length extrapolation (5s → 60s+) without fine-tuning. This approach delivers superior imaging and aesthetic quality compared to training-based methods while preserving real-time streaming capability.
Motivation
Deep Sink: Our analysis reveals that attention distributes across both early and intermediate frames, unlike LLM attention sinks, which concentrate on only a few initial tokens. Maintaining fidelity beyond the training length therefore requires preserving a deep sink (40–60% of the cache).
Participative Compression: Prevents quality degeneration from attention dilution caused by error accumulation during extreme out-of-distribution long-context extrapolation.
Deep Forcing
Deep Sink maintains a substantially enlarged attention sink (~50% of the cache) with a temporal RoPE adjustment that keeps sink tokens temporally coherent with the current frames.
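A minimal sketch of how such a deep-sink cache could be maintained. The function name, shapes, and the contiguous re-indexing used to stand in for the temporal RoPE adjustment are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def update_deep_sink_cache(cache_k, cache_v, new_k, new_v, max_len, sink_frac=0.5):
    """Append new tokens; when the cache overflows, evict from the middle so
    that the oldest `sink_frac` of the cache (the deep sink) and the most
    recent tokens are always kept. Shapes are [tokens, dim] for simplicity.

    Returns (keys, values, positions); `positions` re-indexes the surviving
    tokens contiguously, a stand-in for the temporal RoPE adjustment that
    keeps sink tokens and current frames temporally coherent (assumption).
    """
    k = np.concatenate([cache_k, new_k], axis=0)
    v = np.concatenate([cache_v, new_v], axis=0)
    if k.shape[0] <= max_len:
        return k, v, np.arange(k.shape[0])
    sink_len = int(max_len * sink_frac)    # deep sink: ~50% of the cache
    recent_len = max_len - sink_len        # rolling window of recent tokens
    k = np.concatenate([k[:sink_len], k[-recent_len:]], axis=0)
    v = np.concatenate([v[:sink_len], v[-recent_len:]], axis=0)
    return k, v, np.arange(max_len)
```

Without the position re-indexing, the gap left by the evicted middle tokens would put the sink far from the current frames in rotary-embedding space, which is the coherence problem the temporal RoPE adjustment addresses.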
Participative Compression selectively prunes the cache by computing attention scores from recent frames, retaining only the top-C most contextually relevant tokens while evicting redundant and degraded ones.
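The selection step might be sketched as follows. Scoring cached tokens by the total attention they receive from recent-frame queries is an assumption about the scoring rule; in a real model this would happen per attention head inside the transformer:

```python
import numpy as np

def participative_compression(cache_k, cache_v, recent_q, top_c):
    """Score each cached token by the total softmax attention it receives
    from recent-frame queries, then keep only the top_c highest-scoring
    tokens in their original temporal order. Illustrative sketch only."""
    d = cache_k.shape[-1]
    # [recent_tokens, cached_tokens] attention weights from recent frames
    logits = recent_q @ cache_k.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    scores = weights.sum(axis=0)                 # "participation" of each cached token
    keep = np.sort(np.argsort(scores)[-top_c:])  # top-C, original order preserved
    return cache_k[keep], cache_v[keep]
```

Sorting the kept indices preserves the temporal order of the surviving tokens, so the compressed cache still reads as a (sparser) timeline.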
Qualitative Results
Qualitative results demonstrating the capabilities of Deep Forcing.
Deep Forcing on diverse prompts
Comparison with baselines on the same prompts
Qualitative comparisons of Deep Forcing against Self Forcing and other baselines on the same prompts.
Ablation
Sink Size Ablation: Larger sinks (10–15 frames) reduce degradation and aesthetic drift, but an excessive size (18 frames) causes over-preservation and repetition.
Component Analysis: Deep Sink (DS) and Participative Compression (PC) can be layered on top of the vanilla baseline. The clips below highlight how each component progressively improves long video generation.
Interactive Prompting
Deep Forcing can be used with interactive prompting, where users modify prompts during streaming to interactively generate video in real time. Input prompts can be viewed via the "Prompt" button in the video.
Application: World Models
Our methods can be applied to world models such as Matrix-Game 2.0, mitigating error accumulation and color drift.
Dynamic Scene
When action inputs (e.g., mouse or keyboard controls) are provided in real time, Deep Forcing substantially reduces visual fidelity degradation (e.g., color shifts and vehicle distortion).
Static Scene
When no action input (e.g., mouse or keyboard controls) is provided, Matrix-Game 2.0 begins to degrade rapidly at around 10s. In contrast, Deep Forcing maintains a stable world even without action inputs, preventing such degradation.
On Causal Forcing
Our methods can also be applied to Causal Forcing. Deep Sink and Participative Compression effectively mitigate error accumulation and color drift in Causal Forcing.
Conclusion
In this work, we propose Deep Forcing, a training-free framework that mitigates error accumulation in autoregressive long video generation through two key mechanisms: Deep Sink and Participative Compression. By exploiting the inherent deep attention sink behavior in pre-trained models, our method enables minute-long video generation while preserving visual fidelity and motion dynamics. Extensive experiments across VBench-Long, user studies, and VLM evaluation demonstrate that our training-free framework achieves strong performance competitive with training-based methods.
Citation
If you use this work or find it helpful, please consider citing:
@article{yi2025deep,
title={Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression},
author={Yi, Jung and Jang, Wooseok and Cho, Paul Hyunbin and Nam, Jisu and Yoon, Heeji and Kim, Seungryong},
journal={arXiv preprint arXiv:2512.05081},
year={2025}
}