All videos, except those in 'Interactive Prompting' and 'World Models', were generated from a single prompt, with no re-caching, and are shown at 1× speed.
Deep Forcing is a training-free framework that enables long-video generation in autoregressive video diffusion models by combining Deep Sink and Participative Compression.
Deep Forcing achieves more than 12× length extrapolation (5s → 60s+) without fine-tuning. This approach delivers superior imaging and aesthetic quality compared to training-based methods while preserving real-time streaming capability.
Motivation
Deep Sink: Our analysis reveals that attention distributes across both early and intermediate frames, unlike LLM attention sinks, which concentrate on only a few initial tokens. Maintaining fidelity beyond the training length therefore requires preserving a deep sink (40–60% of the cache).
Participative Compression: Prevents quality degeneration from attention dilution caused by error accumulation during extreme out-of-distribution long-context extrapolation.
Deep Forcing
Deep Sink maintains a substantially enlarged attention sink (~50% of the cache) with a temporal RoPE adjustment that keeps sink tokens temporally coherent with the current frames.
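A minimal sketch of how such a deep-sink cache could be maintained. The function name, shapes, and the contiguous re-indexing used to stand in for the temporal RoPE adjustment are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def update_deep_sink_cache(cache_k, cache_v, new_k, new_v, max_len, sink_frac=0.5):
    """Append new tokens; when the cache overflows, evict from the middle so
    that the oldest `sink_frac` of the cache (the deep sink) and the most
    recent tokens are always kept. Shapes are [tokens, dim] for simplicity.

    Returns (keys, values, positions); `positions` re-indexes the surviving
    tokens contiguously, a stand-in for the temporal RoPE adjustment that
    keeps sink tokens and current frames temporally coherent (assumption).
    """
    k = np.concatenate([cache_k, new_k], axis=0)
    v = np.concatenate([cache_v, new_v], axis=0)
    if k.shape[0] <= max_len:
        return k, v, np.arange(k.shape[0])
    sink_len = int(max_len * sink_frac)    # deep sink: ~50% of the cache
    recent_len = max_len - sink_len        # rolling window of recent tokens
    k = np.concatenate([k[:sink_len], k[-recent_len:]], axis=0)
    v = np.concatenate([v[:sink_len], v[-recent_len:]], axis=0)
    return k, v, np.arange(max_len)
```

Without the position re-indexing, the gap left by the evicted middle tokens would put the sink far from the current frames in rotary-embedding space, which is the coherence problem the temporal RoPE adjustment addresses.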
Participative Compression selectively prunes the cache by computing attention scores from recent frames, retaining only the top-C most contextually relevant tokens while evicting redundant and degraded ones.
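The selection step might be sketched as follows. Scoring cached tokens by the total attention they receive from recent-frame queries is an assumption about the scoring rule; in a real model this would happen per attention head inside the transformer:

```python
import numpy as np

def participative_compression(cache_k, cache_v, recent_q, top_c):
    """Score each cached token by the total softmax attention it receives
    from recent-frame queries, then keep only the top_c highest-scoring
    tokens in their original temporal order. Illustrative sketch only."""
    d = cache_k.shape[-1]
    # [recent_tokens, cached_tokens] attention weights from recent frames
    logits = recent_q @ cache_k.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    scores = weights.sum(axis=0)                 # "participation" of each cached token
    keep = np.sort(np.argsort(scores)[-top_c:])  # top-C, original order preserved
    return cache_k[keep], cache_v[keep]
```

Sorting the kept indices preserves the temporal order of the surviving tokens, so the compressed cache still reads as a (sparser) timeline.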
Qualitative Results
Qualitative results demonstrating the capabilities of Deep Forcing.
Deep Forcing on diverse prompts
Comparison with baselines on the same prompts
Qualitative comparisons of Deep Forcing against Self Forcing and other baselines on the same prompts.
Ablation
Sink Size Ablation: Larger sinks (10–15 frames) reduce degradation and aesthetic drift, but an excessive size (18 frames) causes over-preservation and repetition.
Component Analysis: Deep Sink (DS) and Participative Compression (PC) can be layered on top of the vanilla baseline. The clips below highlight how each component progressively improves long video generation.
Interactive Prompting
Deep Forcing can be used with interactive prompting, where users modify prompts during streaming to interactively generate video in real time. Input prompts can be viewed via the "Prompt" button in the video.
Application: World Models
Our methods can be applied to world models such as Matrix-Game 2.0, mitigating error accumulation and color drift.
Dynamic Scene
When action inputs (e.g., mouse or keyboard controls) are provided in real time, Deep Forcing substantially reduces visual fidelity degradation (e.g., color shifts and vehicle distortion).
Static Scene
When no action input (e.g., mouse or keyboard controls) is provided, Matrix-Game 2.0 begins to degrade rapidly at around 10s. In contrast, Deep Forcing maintains a stable world even without action inputs, preventing such degradation.
On Causal Forcing
Our methods can also be applied to Causal Forcing. Deep Sink and Participative Compression effectively mitigate error accumulation and color drift in Causal Forcing.
Conclusion
In this work, we propose Deep Forcing, a training-free framework that mitigates error accumulation in autoregressive long video generation through two key mechanisms: Deep Sink and Participative Compression. By exploiting the inherent deep attention sink behavior in pre-trained models, our method enables minute-long video generation while preserving visual fidelity and motion dynamics. Extensive experiments across VBench-Long, user studies, and VLM evaluation demonstrate that our training-free framework achieves strong performance competitive with training-based methods.
Citation
If you use this work or find it helpful, please consider citing:
@article{yi2025deep,
title={Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression},
author={Yi, Jung and Jang, Wooseok and Cho, Paul Hyunbin and Nam, Jisu and Yoon, Heeji and Kim, Seungryong},
journal={arXiv preprint arXiv:2512.05081},
year={2025}
}