New! Check out Deep Forcing on Interactive Prompting, World Models & Causal Forcing!

Teaser

All videos, except 'Interactive Prompting' & 'World Models', were generated from a single prompt, with no re-caching, and are shown at 1× speed.

Rolling Forcing
Deep Forcing (Training-Free)
LongLive
Deep Forcing (Training-Free)

Deep Forcing is a training-free framework that enables long-video generation in autoregressive video diffusion models by combining Deep Sink and Participative Compression.

Deep Forcing achieves more than 12× length extrapolation (5s → 60s+) without fine-tuning. This approach delivers superior imaging and aesthetic quality compared to training-based methods while preserving real-time streaming capability.

Motivation

Comparisons diagram
Additional comparisons diagram

Deep Sink: Our analysis reveals that attention in video diffusion models is distributed across both early and intermediate frames, unlike LLM attention sinks, which concentrate on only a few initial tokens. Maintaining fidelity beyond the training length therefore requires preserving a deep sink (40-60% of the cache).

Participative Compression: Prevents quality degeneration from attention dilution, which arises from error accumulation during extreme out-of-distribution long-context extrapolation.

Deep Forcing

Deep Forcing framework illustration

Deep Sink maintains a substantially enlarged attention sink (~50% of the cache) with temporal RoPE adjustment, ensuring temporal coherence between sink tokens and current frames.
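As a rough sketch of the cache policy described above (the function names, the flat per-frame cache layout, and the position re-indexing are our own simplification, not the authors' implementation), deep-sink eviction keeps a large prefix of frames plus a recent window, then re-indexes temporal positions so the retained tokens remain contiguous:

```python
import numpy as np

def deep_sink_evict(kv_cache, cache_limit, sink_ratio=0.5):
    """Keep a large prefix of frames as the attention sink (~50% of the
    cache budget) plus the most recent frames; evict the middle.

    kv_cache: list of per-frame KV entries, oldest first (illustrative).
    """
    if len(kv_cache) <= cache_limit:
        return kv_cache
    n_sink = int(cache_limit * sink_ratio)   # deep sink: ~50% of the budget
    n_recent = cache_limit - n_sink          # sliding window of recent frames
    return kv_cache[:n_sink] + kv_cache[-n_recent:]

def temporal_rope_positions(n_kept, current_pos):
    """Re-index temporal positions so sink tokens sit contiguously with the
    current frame window (an illustrative stand-in for the RoPE adjustment)."""
    return np.arange(current_pos - n_kept + 1, current_pos + 1)
```

With a 20-frame budget and a 0.5 sink ratio, a 100-frame history collapses to frames 0-9 plus 90-99, and the kept tokens are assigned 20 consecutive temporal positions ending at the current frame.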

Participative Compression selectively prunes the cache by computing attention scores from recent frames, retaining only the top-C most contextually relevant tokens while evicting redundant and degraded ones.
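The selection step can be sketched as follows (a minimal illustration under our own naming and shape conventions, not the paper's implementation): score each cached key by its total attention mass from recent-frame queries, then keep the C highest-scoring tokens.

```python
import numpy as np

def participative_compression(keys, recent_queries, top_c):
    """Score each cached key by its total attention mass from recent-frame
    queries and retain only the top-C most attended tokens.

    keys:           (N, d) cached key vectors
    recent_queries: (M, d) query vectors from recent frames
    top_c:          number of cache tokens to retain
    """
    d = keys.shape[1]
    logits = recent_queries @ keys.T / np.sqrt(d)            # (M, N)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                  # softmax over keys
    scores = attn.sum(axis=0)                                # participation per key
    keep = np.sort(np.argsort(scores)[-top_c:])              # top-C, original order
    return keys[keep], keep
```

Keys that no recent query attends to receive near-zero participation scores and are evicted, which is how redundant or degraded tokens drop out of the cache.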

Qualitative Results

Qualitative results demonstrating the capabilities of Deep Forcing.

Deep Forcing on diverse prompts

Corgi
Underwater
Burger Boy
Kitchen
Romantic
Pigeon
Dwarf
Flame Wisp
Running Pig
Dogs
Dynamic Puddle
Fluffy Monster

Comparison with baselines on the same prompts

Qualitative comparisons of Deep Forcing against Self Forcing and other baselines on the same prompts.

Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive
Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive
Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive
Deep Forcing (Ours)
Self Forcing
Rolling Forcing
LongLive

Ablation

Sink Size Ablation: Larger sinks (10-15 frames) reduce degradation and aesthetic drift, but an excessive size (18 frames) causes over-preservation and repetition.

Sink Size 0
Sink Size 4
Sink Size 9
Sink Size 12
Sink Size 14
Sink Size 18

Component Analysis: Deep Sink (DS) and Participative Compression (PC) can be layered on top of the vanilla baseline. The clips below highlight how each component progressively improves long video generation.

Baseline
Baseline + Deep Sink
Baseline + DS + PC

Interactive Prompting

Deep Forcing can be used with interactive prompting, where users modify prompts during streaming to interactively generate video in real time. Input prompts can be viewed via the "Prompt" button in the video.

LongLive (Trained)
Deep Forcing (Training-Free)
LongLive (Trained)
Deep Forcing (Training-Free)

Application: World Models

Our methods can be applied to world models such as Matrix-Game 2.0, mitigating error accumulation and color drift.

Dynamic Scene

When action inputs (e.g., mouse or keyboard controls) are provided in real time, Deep Forcing substantially reduces visual fidelity degradation (e.g., color shifts and vehicle distortion).

Matrix-Game 2.0
Ours (Training-Free)

Static Scene

When no action input (e.g., mouse or keyboard controls) is provided, Matrix-Game 2.0 begins to degrade rapidly at around 10s. In contrast, Deep Forcing maintains a stable world even without action inputs, preventing such degradation.

Matrix-Game 2.0
Ours (Training-Free)

On Causal Forcing

Our methods can also be applied to Causal Forcing. Deep Sink and Participative Compression effectively mitigate error accumulation and color drift in Causal Forcing.

Causal Forcing
Ours (Training-Free)
Causal Forcing
Ours (Training-Free)

Conclusion

In this work, we propose Deep Forcing, a training-free framework that mitigates error accumulation in autoregressive long video generation through two key mechanisms: Deep Sink and Participative Compression. By exploiting the inherent deep attention sink behavior in pre-trained models, our method enables minute-long video generation while preserving visual fidelity and motion dynamics. Extensive experiments across VBench-Long, user studies, and VLM evaluation demonstrate that our training-free framework achieves strong performance competitive with training-based methods.

Citation

If you use this work or find it helpful, please consider citing:

    @article{yi2025deep,
      title={Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression},
      author={Yi, Jung and Jang, Wooseok and Cho, Paul Hyunbin and Nam, Jisu and Yoon, Heeji and Kim, Seungryong},
      journal={arXiv preprint arXiv:2512.05081},
      year={2025}
    }