Teaser

All videos were generated from a single prompt, with no re-caching, and are shown at 1× speed.

Rolling Forcing
Deep Forcing (Training-Free)
LongLive
Deep Forcing (Training-Free)
TL;DR: Deep Forcing amplifies Self Forcing's emergent deep sink and performs training-free KV cache compression, mitigating quality degradation in long-horizon video diffusion while preserving dynamic, high-fidelity visuals far beyond the base model's context.

Deep Forcing is a training-free framework that enables long-video generation in autoregressive video diffusion models by combining Deep Sink and Participative Compression.

Deep Forcing achieves more than 12× length extrapolation (5s → 60s+) without fine-tuning. This approach delivers superior imaging and aesthetic quality compared to training-based methods while preserving real-time streaming capability.

Motivation

Comparisons diagram
Additional comparisons diagram

Deep Sink: Our analysis reveals that attention distributes across both early and intermediate frames, unlike LLM attention sinks, which concentrate on only a few initial tokens. Maintaining fidelity beyond the training length therefore requires preserving a deep sink (40-60% of the cache).
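As a rough illustration of this analysis, the snippet below shows one way to measure how attention mass distributes over past frames from a layer's post-softmax attention weights. It is a minimal sketch under our own conventions; the function name, tensor shapes, and per-frame grouping are assumptions, not the released code:

```python
import torch

def framewise_attention_mass(attn: torch.Tensor, tokens_per_frame: int) -> torch.Tensor:
    """Distribution of attention mass over past frames (sketch).

    attn: (heads, q_len, k_len) post-softmax attention weights hooked
    from one transformer layer; k_len is assumed to be a multiple of
    tokens_per_frame.
    """
    # Total mass each cached key token receives, summed over heads and queries.
    mass_per_key = attn.sum(dim=(0, 1))                                # (k_len,)
    # Group key tokens into frames and normalize to a distribution.
    mass_per_frame = mass_per_key.view(-1, tokens_per_frame).sum(dim=-1)
    return mass_per_frame / mass_per_frame.sum()
```

If this distribution stays heavy well past the first few frames, a shallow LLM-style sink would evict frames the model still attends to.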

Participative Compression: Prevents the quality degeneration that arises when error accumulation dilutes attention during extreme out-of-distribution long-context extrapolation.

Deep Forcing

Deep Forcing framework illustration

Deep Sink maintains a substantially enlarged attention sink (~50% of the cache) together with a temporal RoPE adjustment that keeps sink tokens temporally coherent with the current frames.
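A minimal sketch of how such a cache could be laid out, assuming keys are stored pre-rotation so RoPE can be re-applied with the adjusted positions (the function name, shapes, and defaults below are our assumptions, not the paper's implementation):

```python
import torch

def build_deep_sink_cache(k, v, tokens_per_frame, sink_frac=0.5, cache_frames=20):
    """Deep Sink cache layout (minimal sketch; names/defaults are assumptions).

    k, v: (n_tokens, dim) cached keys/values in generation order, with
    n_tokens > cache_frames * tokens_per_frame.
    """
    budget = cache_frames * tokens_per_frame
    sink_len = int(budget * sink_frac)        # ~50% of the cache as deep sink
    recent_len = budget - sink_len

    # Oldest tokens form the deep sink; newest tokens form the local window.
    new_k = torch.cat([k[:sink_len], k[-recent_len:]], dim=0)
    new_v = torch.cat([v[:sink_len], v[-recent_len:]], dim=0)

    # Temporal RoPE adjustment: contiguous frame indices 0..cache_frames-1
    # place sink tokens directly before the recent window instead of far away.
    frame_ids = torch.arange(budget, device=k.device) // tokens_per_frame
    return new_k, new_v, frame_ids
```

Re-indexing the sink onto contiguous frame positions is one way to realize the temporal coherence the adjustment targets, since it keeps sink and current frames within a familiar positional range.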

Participative Compression selectively prunes the KV cache by computing attention scores from recent frames, retaining only the top-C most contextually relevant tokens and evicting redundant or degraded ones.
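A minimal sketch of the selection step, written for a single head and with illustrative names (recent_q and top_c are our assumptions; the actual scoring granularity may differ):

```python
import torch

def participative_compression(k, v, recent_q, top_c):
    """Top-C token retention (minimal sketch; names are assumptions).

    k, v: (n_tokens, dim) cached keys/values; recent_q: (n_recent, dim)
    queries from the most recent frames.
    """
    # Attention each cached token receives from recent-frame queries.
    attn = torch.softmax(recent_q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    participation = attn.sum(dim=0)                            # (n_tokens,)

    # Keep the top-C most attended tokens, restoring temporal order.
    keep = participation.topk(min(top_c, k.shape[0])).indices.sort().values
    return k[keep], v[keep]
```

Sorting the kept indices preserves the tokens' original temporal order, so the compressed cache can be consumed like an ordinary, shorter one.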

Qualitative Results

Qualitative results demonstrating the capabilities of Deep Forcing.

Deep Forcing on diverse prompts

Corgi
Underwater
Burger Boy
Kitchen
Romantic
Pigeon
Dwarf
Flame Wisp
Running Pig
Dogs
Dynamic Puddle
Fluffy Monster

Comparison with baselines on shared prompts

Qualitative comparisons of Deep Forcing against Self Forcing and other baselines on shared prompts.

Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive
Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive
Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive
Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive

Ablation

Sink Size Ablation: Larger sinks (10-15 frames) reduce degradation and aesthetic drift, but an excessive size (18 frames) causes over-preservation and repetition.

Sink Size 0
Sink Size 4
Sink Size 9
Sink Size 12
Sink Size 14
Sink Size 18

Component Analysis: Deep Sink (DS) and Participative Compression (PC) can be layered on top of the vanilla baseline. The clips below highlight how each component progressively improves long video generation.

Baseline
Baseline + Deep Sink
Baseline + DS + PC

Conclusion

In this work, we propose Deep Forcing, a training-free framework that mitigates error accumulation in autoregressive long video generation through two key mechanisms: Deep Sink and Participative Compression. By exploiting the inherent deep attention sink behavior of pre-trained models, our method enables minute-long video generation while preserving visual fidelity and motion dynamics. Extensive experiments across VBench-Long, user studies, and VLM evaluation demonstrate that our training-free framework achieves performance competitive with training-based methods.

Citation

If you use this work or find it helpful, please consider citing: