Teaser

Input Image
Input image: a yellow duckling swimming on a sunlit pond with water lilies
WorldKV (Ours)
Full KV
Sliding Window

WorldKV is a training-free framework that enables efficient world memory in autoregressive video world models by combining World Retrieval and World Compression.

WorldKV matches, and in some cases exceeds, full KV-cache memory fidelity at roughly 2× the throughput and with almost half the KV-cache VRAM/RAM footprint on Matrix-Game 2.0 and LingBot-World-Fast.

Introduction

Emergent Memory

The KV cache of autoregressive video world models is not merely a computational buffer. Even models trained only on short clips can leverage the full KV history as an emergent long-term visual memory, faithfully reproducing previously seen viewpoints upon revisit.

Matrix-Game 2.0
LingBot-World-Fast

Motivation

VRAM & Attention Cost Efficiency

Full KV-cache attention preserves long-term memory, but its VRAM footprint quickly exceeds GPU capacity, and its attention cost grows with every cached frame until inference falls below real-time speed. Efficient world memory therefore requires retaining only the chunks that matter for revisits.
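To make the scaling concrete, the following back-of-the-envelope sketch estimates KV-cache size as frames accumulate. All model dimensions (tokens per frame, layer count, head count, head dimension) and the window length are hypothetical placeholders, not the configuration of Matrix-Game 2.0 or LingBot-World-Fast.

```python
def kv_cache_bytes(num_frames, tokens_per_frame, num_layers,
                   num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache size: one key and one value per cached token."""
    cached_tokens = num_frames * tokens_per_frame
    # The factor 2 accounts for storing both a key and a value per token.
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * bytes_per_elem


# Hypothetical model dimensions, chosen only for illustration.
CONFIG = dict(tokens_per_frame=880, num_layers=30, num_kv_heads=12, head_dim=128)
WINDOW_FRAMES = 16  # hypothetical sliding-window length

for frames in (16, 256, 1024):
    full = kv_cache_bytes(frames, **CONFIG)
    windowed = kv_cache_bytes(min(frames, WINDOW_FRAMES), **CONFIG)
    print(f"{frames:5d} frames: full KV {full / 2**30:7.2f} GiB, "
          f"sliding window {windowed / 2**30:5.2f} GiB")
```

Under these assumptions the full cache grows linearly with the number of frames, and so does the number of keys each new query must attend to, while the sliding window stays constant in size but drops evicted content.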

WorldKV efficiency analysis
WorldKV cache efficiency comparison

Attention Analysis

Our analysis reveals that attention concentrates on past KV chunks whose viewpoints overlap with the current frame. This motivates camera/action-indexed retrieval of the top-k viewpoint-relevant chunks back into the active attention window.
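One way such concentration could be measured is to sum a query's attention weights over the tokens of each past chunk and rank the chunks by that mass. The sketch below is illustrative only and is not the authors' analysis code; the function names and the toy weights are assumptions.

```python
def attention_mass_per_chunk(attn_weights, chunk_ids):
    """Sum one query's attention weights over the tokens of each past chunk."""
    mass = {}
    for weight, chunk in zip(attn_weights, chunk_ids):
        mass[chunk] = mass.get(chunk, 0.0) + weight
    return mass


def top_k_chunks(mass, k):
    """Return the k chunk ids that receive the most attention mass."""
    return sorted(mass, key=mass.get, reverse=True)[:k]


# Toy example: 6 past tokens belonging to 3 chunks; the weights sum to 1.
weights = [0.05, 0.05, 0.40, 0.35, 0.10, 0.05]
chunks = [0, 0, 1, 1, 2, 2]
mass = attention_mass_per_chunk(weights, chunks)
print(top_k_chunks(mass, k=1))  # chunk 1 holds most of the mass
```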

Additional comparisons diagram

WorldKV

WorldKV framework illustration

World Retrieval stores evicted KV-cache chunks in GPU/CPU memory indexed by camera/action state, and selectively retrieves the top-k viewpoint-relevant chunks back into the active attention window at revisit time, with no re-encoding required.
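The retrieval step can be sketched as a small store keyed by camera state. This is a minimal illustration under stated assumptions, not the WorldKV implementation: `ChunkStore`, `pose_distance`, the `(x, y, yaw)` state layout, and the distance used as a relevance proxy are all hypothetical, and strings stand in for the actual KV tensors.

```python
import heapq
import math


def pose_distance(a, b, yaw_weight=1.0):
    """Assumed relevance proxy: position distance plus wrapped yaw difference.

    Each state is a tuple (x, y, yaw_radians).
    """
    dx, dy = a[0] - b[0], a[1] - b[1]
    # atan2(sin, cos) wraps the yaw difference into [-pi, pi].
    dyaw = abs(math.atan2(math.sin(a[2] - b[2]), math.cos(a[2] - b[2])))
    return math.hypot(dx, dy) + yaw_weight * dyaw


class ChunkStore:
    """Toy store for evicted KV chunks, indexed by camera state."""

    def __init__(self):
        self._entries = []  # list of (camera_state, kv_chunk)

    def evict(self, camera_state, kv_chunk):
        """Save a chunk that left the active window, keyed by its camera state."""
        self._entries.append((camera_state, kv_chunk))

    def retrieve(self, query_state, k):
        """Return the k stored chunks whose camera state is closest to the query."""
        nearest = heapq.nsmallest(
            k, self._entries, key=lambda e: pose_distance(e[0], query_state)
        )
        return [kv_chunk for _, kv_chunk in nearest]


store = ChunkStore()
store.evict((0.0, 0.0, 0.0), "chunk_A")
store.evict((5.0, 0.0, 0.0), "chunk_B")          # far away
store.evict((0.1, 0.0, math.pi), "chunk_C")      # nearby, but facing the opposite way
print(store.retrieve((0.0, 0.1, 0.0), k=2))
```

Because the stored entries are already-computed keys and values, the retrieved chunks can be placed back into the attention window as they are, which is what makes re-encoding unnecessary.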

World Compression designates the first frame of each chunk as an anchor and prunes tokens by key-key cosine similarity. It retains only the low-similarity, distinctive tokens that encode newly revealed regions and temporally changing content, halving per-chunk storage.
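A minimal sketch of anchor-based pruning follows. It assumes that each non-anchor token is compared against the anchor key at the same spatial position and that a fixed fraction `keep_ratio` of the least similar tokens is kept. Both choices, along with the function names, are assumptions made for illustration; how the anchor frame itself is stored and how the budget is split are not specified here. Plain Python lists stand in for key tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compress_chunk(frame_keys, keep_ratio=0.5):
    """Select which non-anchor tokens of a chunk to keep.

    frame_keys[f][p] is the key vector of the token at position p in frame f.
    Frame 0 is the anchor. Returns a sorted list of (frame, position) pairs.
    """
    anchor = frame_keys[0]
    scored = [
        (cosine(key, anchor[p]), f, p)
        for f, frame in enumerate(frame_keys[1:], start=1)
        for p, key in enumerate(frame)
    ]
    num_keep = int(len(scored) * keep_ratio)
    scored.sort()  # ascending similarity: most distinctive tokens first
    return sorted((f, p) for _, f, p in scored[:num_keep])


keys = [
    [[1.0, 0.0], [0.0, 1.0]],  # anchor frame
    [[1.0, 0.0], [1.0, 0.0]],  # position 1 differs from the anchor
    [[0.0, 1.0], [0.1, 1.0]],  # position 0 differs from the anchor
]
print(compress_chunk(keys))  # [(1, 1), (2, 0)]
```

In this toy chunk the two tokens that differ from the anchor are kept and the two near-duplicates are dropped, so half of the non-anchor tokens remain.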

Qualitative Results

Qualitative results demonstrating the capabilities of WorldKV.

WorldKV on diverse scenes

Astronaut
Underwater
Dawn
Flying Sword
Hero
Reed
Squirrel
Cat1
Magic Girl
Cat2
Track
Urban

Comparison with baselines

Qualitative comparisons of WorldKV against Full KV and other baselines on the same prompts.

WorldKV (Ours)
Full KV
Sliding Window

Cases when WorldKV outperforms Full KV

On LingBot-World-Fast, WorldKV sometimes recalls revisited scenes more consistently than full KV-cache attention, likely because restricting attention to viewpoint-relevant chunks avoids the attention dilution that arises when Full KV attends over many viewpoint-irrelevant caches.
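The dilution effect can be illustrated with a toy softmax in which one relevant key competes with many irrelevant keys. The logit values are arbitrary assumptions chosen for illustration, not measurements from either model.

```python
import math


def relevant_attention_weight(relevant_logit, num_irrelevant, irrelevant_logit=0.0):
    """Softmax weight on one relevant key competing with many irrelevant keys."""
    relevant = math.exp(relevant_logit)
    irrelevant = num_irrelevant * math.exp(irrelevant_logit)
    return relevant / (relevant + irrelevant)


# The relevant key keeps the same logit; only the number of distractors grows.
for n in (10, 100, 1000, 10000):
    print(n, round(relevant_attention_weight(5.0, n), 3))
```

Even though the relevant key's score is unchanged, its share of attention shrinks as irrelevant keys accumulate, which is consistent with the explanation above.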

Full KV
WorldKV (Ours)

On Inspatio-World

Our methods can also be applied to Inspatio-World, where World Retrieval and World Compression effectively mitigate memory drift.

Inspatio-World
Ours (Training-Free)

Conclusion

In this work, we propose WorldKV, a training-free framework that enables long-term world memory in autoregressive video world models through World Retrieval and World Compression. Experiments show memory fidelity competitive with full KV-cache attention and memory-trained baselines, while preserving real-time inference.

Citation

If you use this work or find it helpful, please consider citing: