arXiv 2026

MORPHOS

Autoregressive 4D Generation with Temporal Structured Latents

Minkyung Kwon* Jinhyeok Choi* Youngjin Shin Jaeyeong Kim JongMin Lee Seungryong Kim
KAIST AI
* Equal contribution  ·   Corresponding author
MORPHOS takes a video as input and autoregressively generates unified dynamic 3D assets — meshes, 3D Gaussians, and radiance fields — while handling complex motion and evolving topologies.
TL;DR

We present MORPHOS, an autoregressive 4D generative framework that produces dynamic 3D assets from video across diverse representations: meshes, 3D Gaussians, and radiance fields. We introduce Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance over time. With causal attention, MORPHOS conditions each frame on its preceding history, and a temporal-structural augmentation strategy mitigates error accumulation for robust long-horizon generation.

Results

Video Results

Each sample shows the input video alongside novel-view renderings of the 4D mesh generated by MORPHOS.

Motivation

Why is video-to-4D generation hard?

Existing 4D generation methods perform well in narrow settings, but face three recurring limitations:

1. Representation fragmentation

Most frameworks specialize in a single 3D format (meshes or Gaussians), restricting generalization across modalities.

2. Fixed-topology constraints

Deformation-based modeling keeps temporal consistency for rigid motion but cannot handle topological changes or large structural shifts.

3. Error accumulation

Long-horizon generation drifts as self-generated history degrades, breaking temporal consistency over long videos.

MORPHOS addresses all three: a unified 4D latent (T-SLAT) decodable to meshes / Gaussians / radiance fields, autoregressive causal generation that accommodates evolving topologies, and temporal-structural augmentation for stable long-horizon rollouts.

A single model across representations and long horizons

Method            Mesh   3D Gaussian   Radiance Field   Long-horizon
Motion324
ActionMesh
Mesh4D
L4GM
GVFD
MORPHOS (Ours)
Method

Temporal Structured Latents (T-SLAT)

We extend Structured Latents (SLAT) into the temporal domain to obtain a unified 4D representation that jointly encodes geometry and appearance along time. An animated mesh sequence is normalized into a shared canonical space via a global union axis-aligned bounding box, then encoded by a sparse VAE into T-SLAT. A single T-SLAT can be decoded into meshes, 3D Gaussians, and radiance fields.
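The normalization step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the target cube [-0.5, 0.5]^3, and the choice of a single uniform scale are assumptions made here. The source only states that a global union axis-aligned bounding box places the whole sequence in one canonical space.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def union_aabb(frames: List[List[Vec3]]) -> Tuple[Vec3, Vec3]:
    """Return (min_corner, max_corner) of the box enclosing every vertex of every frame."""
    points = [p for frame in frames for p in frame]
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi


def normalize_sequence(frames: List[List[Vec3]]) -> List[List[Vec3]]:
    """Map all frames into [-0.5, 0.5]^3 with one shared center and one uniform scale.

    The target cube and uniform scaling are assumptions of this sketch.
    """
    lo, hi = union_aabb(frames)
    center = tuple((lo[i] + hi[i]) / 2.0 for i in range(3))
    # Longest box edge; fall back to 1.0 for a degenerate (single-point) sequence.
    extent = max(hi[i] - lo[i] for i in range(3)) or 1.0
    return [
        [tuple((p[i] - center[i]) / extent for i in range(3)) for p in frame]
        for frame in frames
    ]


# Two frames of a toy "mesh" that moves along x.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(2.0, 0.0, 0.0), (4.0, 1.0, 1.0)],
]
normalized = normalize_sequence(frames)
print(normalized[0][0])  # (-0.5, -0.125, -0.125)
print(normalized[1][1])  # (0.5, 0.125, 0.125)
```

Because the box is computed once over all frames rather than per frame, global translation of the object is preserved as motion inside the canonical space instead of being normalized away.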

T-SLAT generation. The input video and noise pass through a flow transformer Φ_L with causal and cross attention to produce T-SLAT, which dedicated decoders turn into dynamic Gaussian, mesh, and radiance-field outputs.

Autoregressive 4D Generation

MORPHOS factorizes 4D generation as a Markovian process: each latent z_t is conditioned on the preceding history z_{<t} and the current video frame. Two rectified-flow transformers generate first the sparse structure (voxels) and then the T-SLAT conditioned on it. A causal attention architecture with a sliding window restricts each query to its history, enabling arbitrarily long generation and KV caching for efficient inference.
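Written out, the factorization described above might look as follows. The symbols v_t (video frame at step t), T (number of frames), W (window size), and θ (model parameters) are introduced here for illustration and do not appear in the source. Whether the window counts the current frame is likewise an assumption.

```latex
p_\theta(z_{1:T} \mid v_{1:T})
  = \prod_{t=1}^{T} p_\theta\bigl(z_t \mid z_{<t},\, v_t\bigr)
  \approx \prod_{t=1}^{T} p_\theta\bigl(z_t \mid z_{\max(1,\,t-W)\,:\,t-1},\, v_t\bigr)
```

The first product is the autoregressive factorization. The second reflects the sliding window: only the most recent W latents remain visible, so the per-step cost stays bounded regardless of sequence length.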

Autoregressive streaming. A flow transformer generates each frame's structured latent conditioned on the previously generated frames, which supports long-horizon video inputs.
Causal attention. Over the temporal dimension, queries at frame t attend only to current and past frames within the sliding window, enabling KV caching.
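A frame-level version of this mask can be sketched in a few lines. The function name and the convention that the window counts the current frame are assumptions. In a real model each frame holds many voxel tokens, so the same pattern would be expanded block-wise over tokens.

```python
from typing import List


def sliding_window_causal_mask(num_frames: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True when the query at frame q may attend to the key at frame k.

    A key is visible if it is not in the future (k <= q) and lies within the
    last `window` frames, counting the current frame (q - k < window).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(k <= q) and (q - k < window) for k in range(num_frames)]
        for q in range(num_frames)
    ]


mask = sliding_window_causal_mask(num_frames=5, window=3)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

Since no row ever looks at a later frame, keys and values computed for past frames never change and can be cached. Entries that fall out of the window can be evicted from the cache.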

Training with Temporal-Structural Augmentation

Autoregressive models are trained on ground-truth history but must generalize to imperfect, self-generated history at inference. To bridge this gap, MORPHOS trains both flow transformers with a temporal-structural augmentation strategy.

Temporal augmentation

Assigns an independent noise level per frame during training, exposing the model to histories of varying quality so it stays robust to cumulative errors in autoregressive rollouts.

Structural augmentation

Randomly drops voxels in the sparse structure conditioning T-SLAT generation, making the model robust to structural inaccuracies propagated from the structure-generation stage.
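Both augmentations can be illustrated with a small sketch. Everything specific here is an assumption rather than the authors' recipe: the function names, the uniform sampling of noise levels, the linear blend toward Gaussian noise (a common rectified-flow convention), and the per-voxel drop probability.

```python
import random
from typing import List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def noise_frames_independently(
    latents: Sequence[Sequence[float]], rng: random.Random
) -> Tuple[List[List[float]], List[float]]:
    """Temporal augmentation: draw a separate noise level for every frame.

    Each frame is blended toward Gaussian noise as (1 - s) * x + s * eps,
    with s drawn uniformly from [0, 1) (an assumed schedule).
    """
    noised, levels = [], []
    for frame in latents:
        s = rng.random()  # independent level for this frame
        levels.append(s)
        noised.append([(1.0 - s) * x + s * rng.gauss(0.0, 1.0) for x in frame])
    return noised, levels


def drop_voxels(
    voxels: Sequence[Voxel], drop_prob: float, rng: random.Random
) -> List[Voxel]:
    """Structural augmentation: remove each occupied voxel with probability drop_prob."""
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must lie in [0, 1]")
    return [v for v in voxels if rng.random() >= drop_prob]


rng = random.Random(0)
latents = [[0.5, -1.0, 2.0] for _ in range(4)]            # 4 frames, 3 channels each
noised, levels = noise_frames_independently(latents, rng)
voxels = [(x, y, 0) for x in range(4) for y in range(4)]  # 16 occupied voxels
kept = drop_voxels(voxels, drop_prob=0.25, rng=rng)
print(len(levels), len(kept) <= len(voxels))
```

Drawing one level per frame, rather than one per sequence, means the history seen during training mixes clean and corrupted frames, which is closer to what the model encounters when it conditions on its own outputs.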

Evaluation

Qualitative Results

Appearance

MORPHOS produces visually consistent, high-fidelity appearance with stable textures across the sequence, while deformation-based baselines cannot model topological changes and frame-wise TRELLIS loses temporal consistency.

Qualitative results on appearance. MORPHOS maintains stable, high-fidelity textures throughout the sequence.

Geometry

Rendered normal maps show MORPHOS generates geometrically faithful structures and handles complex topological transitions where fixed-topology methods distort or fail.

Qualitative results on geometry. MORPHOS preserves accurate shape and temporal consistency through topological changes.

Generalization to Real Domain

Generalization to real-world videos. Given real-world videos from DAVIS (four scenes shown), MORPHOS generates temporally consistent 4D results with stable geometry and coherent appearance. Our method further enables novel-view video generation.

Quantitative Results

We compare MORPHOS against state-of-the-art video-to-4D baselines and the image-to-3D model TRELLIS. Bold marks the best result and underline the second best in each column.

Quantitative evaluation on Motion80, split into short and long sequences (long > 128 frames).

Method            LPIPS↓   CLIP↑    DreamSim↓  FVD↓      CD↓      F-score↑  P2S↓
Short
Motion324         0.2118   0.8051   0.2347     336.63    0.0615   0.3259    0.0308
ActionMesh        -        -        -          -         0.1062   0.2597    0.0528
Mesh4D            0.2023   0.7186   0.3465     592.56    0.1791   0.0712    0.0813
TRELLIS           0.2031   0.8643   0.1861     796.51    0.2033   0.1354    0.1022
L4GM              0.1296   0.8663   0.1605     188.32    -        -         -
GVFD              0.1661   0.8439   0.1998     328.14    -        -         -
MORPHOS (Ours)    0.1505   0.8751   0.1512     246.22    0.0761   0.1455    0.0320
Long
Motion324         0.2347   0.7905   0.2407     889.93    0.0701   0.2335    0.0353
ActionMesh        -        -        -          -         0.1614   0.1718    0.0786
Mesh4D            0.2408   0.5860   0.5170     1327.54   0.4724   0.0265    0.2250
TRELLIS           0.2118   0.8359   0.2005     1527.19   0.2383   0.0875    0.1177
L4GM              0.1355   0.8578   0.1535     487.44    -        -         -
GVFD              0.1796   0.8049   0.2319     827.03    -        -         -
MORPHOS (Ours)    0.1494   0.8670   0.1526     330.59    0.0792   0.1371    0.0350

A dash marks a metric that is not reported for that method.

MORPHOS achieves the best CLIP and DreamSim across both splits and the best long-sequence FVD, demonstrating robust long-horizon appearance consistency.

Quantitative evaluation on ActionBench (128 animated scenes, 16 frames each).

Method            LPIPS↓   CLIP↑    DreamSim↓  FVD↓      CD↓      F-score↑
Motion324         0.2025   0.8304   0.2257     195.25    0.1082   0.2013
ActionMesh        -        -        -          -         0.0898   0.2146
Mesh4D            0.1700   0.8114   0.2447     403.20    0.1776   0.1235
TRELLIS           0.2005   0.8367   0.2199     547.21    0.1903   0.1367
L4GM              0.1908   0.8071   0.2522     211.55    -        -
GVFD              0.1687   0.8301   0.2335     188.67    -        -
MORPHOS (Ours)    0.1904   0.8551   0.1857     203.02    0.0972   0.2138

A dash marks a metric that is not reported for that method.

MORPHOS leads on perceptual appearance (CLIP, DreamSim). On geometry (CD, F-score) it ranks second overall, behind ActionMesh, which reports geometry metrics only, and first among methods that also model appearance.

Appearance evaluation on Consist4D (7 videos, 32 frames each).

Method            LPIPS↓   CLIP↑    DreamSim↓  FVD↓
Motion324         0.2044   0.8285   0.2013     936.68
Mesh4D            0.1769   0.7968   0.2507     1189.67
TRELLIS           0.2479   0.8044   0.2962     1488.32
L4GM              0.1633   0.8374   0.2063     825.64
GVFD              0.1487   0.8142   0.1706     821.75
MORPHOS (Ours)    0.1531   0.8571   0.1849     1013.11

MORPHOS achieves the best CLIP similarity (0.8571) and the second-best LPIPS and DreamSim, behind GVFD, on real-world video inputs.

Ablation studies on ActionBench. Each variant trained for 10k iterations.

Component                 LPIPS↓   CLIP↑    DreamSim↓  FVD↓      CD↓      F-score↑
(a) w/o Causal attn.      0.1578   0.8487   0.1979     323.20    0.1305   0.1986
(b) w/o Temporal aug.     0.1668   0.8415   0.2084     424.43    0.1291   0.1909
(c) w/o Structural aug.   0.1670   0.8399   0.2087     397.78    0.1294   0.1906
(d) w/o Φ_L training      0.1569   0.8503   0.1956     370.87    0.1209   0.2026
(d) w/o Φ_S training      0.1670   0.8400   0.2104     506.20    0.1770   0.1553
MORPHOS (Ours)            0.1576   0.8450   0.1966     321.37    0.1219   0.2088

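Every component contributes. The full model attains the best FVD (321.37) and F-score (0.2088), and removing causal attention, temporal or structural augmentation, or the training of either flow transformer (Φ_S, Φ_L) degrades at least one of geometry, appearance, or video consistency. Each is therefore needed for a balanced result.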

Inference-time analysis on a single B200 GPU for a 16-frame video.

Setting               Time (s)  LPIPS↓   CLIP↑    DreamSim↓  FVD↓      CD↓      F-score↑
w/o Cache             111.42    0.1591   0.8448   0.1987     325.39    0.1189   0.2084
w/ Cache              54.90     0.1576   0.8450   0.1966     321.37    0.1219   0.2088
w/ Cache & Few-step   28.03     0.1606   0.8399   0.2086     326.58    0.1221   0.2029

KV caching gives a 2.03× speedup with negligible quality change; few-step sampling reaches a 3.98× speedup overall.
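The quoted speedups follow directly from the wall-clock times in the inference-time table. A short check, with the dictionary keys chosen here for readability:

```python
# Wall-clock times in seconds, copied from the inference-time table.
times = {
    "w/o Cache": 111.42,
    "w/ Cache": 54.90,
    "w/ Cache & Few-step": 28.03,
}

baseline = times["w/o Cache"]
speedups = {name: baseline / t for name, t in times.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# w/o Cache: 1.00x
# w/ Cache: 2.03x
# w/ Cache & Few-step: 3.98x
```

Both factors are measured against the uncached baseline, so the 3.98× figure combines the gains from caching and from few-step sampling.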

MORPHOS attains state-of-the-art appearance and competitive geometry across benchmarks, while a single unified model generalizes across representations and remains robust on long videos.

Analysis

Attention captures geometric correspondence

For a query voxel (green star) in the target frame, we visualize its attention over the conditioning frames' voxel tokens. MORPHOS establishes geometric correspondences in 3D and even exploits symmetry across frames — e.g., a query on one hand attends to both hands — supporting temporally consistent voxel generation.

Attention analysis in T-SLAT space. A query voxel attends to geometrically corresponding and symmetric voxels in the conditioning frames.

Minimal error accumulation over long horizons

As video length grows, baselines degrade progressively in both appearance (CLIP↑) and geometry (P2S↓). MORPHOS stays substantially more stable across the full sequence, confirming that causal generation with temporal-structural augmentation mitigates error accumulation.

Error accumulation analysis. CLIP (higher is better) and P2S (lower is better) are plotted against frame index: MORPHOS maintains quality as the frame index increases, while baselines degrade.

Citation

@article{kwon2026morphos,
  title   = {MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents},
  author  = {Kwon, Minkyung and Choi, Jinhyeok and Shin, Youngjin and
             Kim, Jaeyeong and Lee, JongMin and Kim, Seungryong},
  journal = {arXiv preprint},
  year    = {2026}
}