Geometric Action Model for Robot Policy Learning

Overview

Geometry is the missing substrate of robot policies.

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models and video world-action models inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving the geometry required for contact-rich manipulation implicit.

We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained Geometric Foundation Model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, while a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks, allowing a single backbone to produce both future geometry and actions. Across simulation and real-robot benchmarks, GAM matches saturated LIBERO success rates, is more robust under perturbation, and runs faster with fewer parameters than current foundation-model-scale baselines.

Method

One geometric backbone for perception, prediction, and action

GAM cuts a pretrained GFM at an intermediate layer and inserts a causal transformer in between, turning a static 3D perception model into a language-conditioned world-action model with minimal architectural change, while preserving its rich geometric priors.
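The split described above can be illustrated with a toy data-flow sketch. It is not the authors' implementation: `SplitBackbone`, the list-of-floats token type, and the arithmetic "blocks" are hypothetical stand-ins chosen only to show how one stack of layers is cut at `split`, with a predictor placed between the two halves and two heads reading the same final features.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]                 # toy stand-in for a latent token tensor
Block = Callable[[Tokens], Tokens]   # one backbone layer


@dataclass
class SplitBackbone:
    """Hypothetical wrapper: a stack of pretrained blocks cut at `split`."""
    blocks: Sequence[Block]
    split: int
    predictor: Callable[[Tokens, Tokens], Tokens]  # (latent, conditioning) -> future latent
    depth_head: Block
    action_head: Block

    def encode(self, obs: Tokens) -> Tokens:
        # Shallow layers act as the observation encoder.
        z = obs
        for block in self.blocks[: self.split]:
            z = block(z)
        return z

    def forward(self, obs: Tokens, cond: Tokens) -> Tuple[Tokens, Tokens]:
        z_now = self.encode(obs)
        z_future = self.predictor(z_now, cond)   # inserted at the split layer
        h = z_future
        for block in self.blocks[self.split:]:   # remaining (deep) layers
            h = block(h)
        # Both heads read the same propagated features.
        return self.depth_head(h), self.action_head(h)


def add_one(x: Tokens) -> Tokens:
    return [v + 1.0 for v in x]


def double(x: Tokens) -> Tokens:
    return [v * 2.0 for v in x]


model = SplitBackbone(
    blocks=[add_one, add_one, double, double],
    split=2,
    predictor=lambda z, c: [v + sum(c) for v in z],  # toy conditioning
    depth_head=lambda h: list(h),
    action_head=lambda h: [sum(h) / len(h)],
)
future_depth, action = model.forward(obs=[0.0, 1.0], cond=[0.5, 0.5])
print(future_depth, action)  # [12.0, 16.0] [14.0]
```

In a real model the blocks would be transformer layers and the conditioning would carry language, proprioception, and action history. The sketch keeps only the routing.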

1. Observation Encoding

The GFM's shallow layers are reused as the observation encoder, mapping multi-view RGB frames into spatially meaningful latent geometric states. No task-specific encoder is trained from scratch.

2. Causal Future Predictor

A causal transformer inserted at the split layer forecasts the next latent geometric state, conditioned on the task instruction, proprioception, and action history: next-token prediction, but for 3D world states.

3. Propagation & Decoding

Predicted future tokens flow through the remaining GFM blocks. The original depth head decodes future geometry while a lightweight action head regresses the executable action chunk, so both outputs come from one forward pass.
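The causal constraint in the future predictor means that the output at time step t may depend only on steps 0..t. The sketch below illustrates that property with a lower-triangular mask and uniform attention weights. This is a simplification for clarity, not GAM's actual attention, and the helper names `causal_mask` and `causal_mean` are hypothetical.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when step i is allowed to attend to step j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def causal_mean(states: List[float]) -> List[float]:
    """Uniform-weight causal attention: output t is the mean of states[0..t]."""
    out = []
    for row in causal_mask(len(states)):
        visible = [s for s, keep in zip(states, row) if keep]
        out.append(sum(visible) / len(visible))
    return out


history = [1.0, 3.0, 5.0]
print(causal_mean(history))  # [1.0, 2.0, 3.0]

# Changing a later step must not change earlier outputs.
edited = [1.0, 3.0, 100.0]
assert causal_mean(edited)[:2] == causal_mean(history)[:2]
```

Because earlier outputs never depend on later inputs, per-step results can be cached and reused as new observations arrive.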

Why not just another VLA or world model?

**Three paradigms.** Video world-action models predict futures in 2D pixel space, geometry-aware VLAs use passive distillation, and GAM unifies perception, geometry prediction, and action decoding inside a single GFM. GAM moves beyond 2D latents and passive feature distillation by making the geometric model itself the policy.

(a) Video WAMs

2D pixel-space futures

Strong temporal prior from video generation
Depth, scale, and occlusion remain implicit
Diffusion-style inference is slow

(b) Geometry-aware VLAs

Geometry as a side signal

Injects geometric features into a VLA policy
Uses the GFM mostly as a passive feature extractor
Does not make the GFM's layered geometry the policy substrate

(c) GAM (ours)

The GFM is the policy

Preserves rich 3D geometric priors for policy learning
Grounds action decoding in explicit future geometry
Runs perception, prediction, and action in one forward pass

Simulation Results

State-of-the-art robustness with lower inference cost

On LIBERO, GAM matches saturated state-of-the-art success rates. On LIBERO-Plus, which perturbs cameras, lighting, backgrounds, and layouts, it takes the overall lead, with the smallest degradation of any method and a 9.7 percentage-point margin over the next-best method on the camera-perturbation split.

Method	Size	LIBERO ↑	LIBERO-Plus ↑	Camera ↑
π_0.5	3.3B	96.9	84.6 (↓12.3)	72.0
OpenVLA-OFT	7B	97.1	69.6 (↓27.5)	56.4
π₀	3.3B	91.3	69.3 (↓22.0)	61.0
Cosmos-Policy	2B	98.5	82.4 (↓16.1)	73.4
Fast-WAM	6B	97.6	50.0 (↓47.5)	16.4
π_0.5 + Spatial Forcing	3.3B	94.0	25.7 (↓58.3)	0.1
π_0.5 + ROCKET	3.3B	95.3	47.5 (↓46.6)	30.9
GAM (Ours)	1.4B	97.6	85.5 (↓12.1)	83.1

Success rates (%) on LIBERO and LIBERO-Plus. "Camera" is the camera-perturbation split of LIBERO-Plus. Full tables in the paper.
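The camera-split margin quoted above can be checked directly from the "Camera" column of the table. The dictionary below simply transcribes that column. The variable names are illustrative.

```python
# "Camera" column of the table above (success rate, %).
camera = {
    "π_0.5": 72.0,
    "OpenVLA-OFT": 56.4,
    "π₀": 61.0,
    "Cosmos-Policy": 73.4,
    "Fast-WAM": 16.4,
    "π_0.5 + Spatial Forcing": 0.1,
    "π_0.5 + ROCKET": 30.9,
    "GAM (Ours)": 83.1,
}

baselines = {name: rate for name, rate in camera.items() if name != "GAM (Ours)"}
runner_up = max(baselines, key=baselines.get)

# Round to one decimal to avoid floating-point noise in the subtraction.
margin = round(camera["GAM (Ours)"] - baselines[runner_up], 1)
print(runner_up, margin)  # Cosmos-Policy 9.7
```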

Success rate vs. camera difficulty

Graceful degradation. As camera perturbations intensify (L1→L5), GAM consistently stays on top.

Single-pass inference latency

Cosmos-Policy (video diffusion-based): 382.4 ms
OpenVLA-OFT (VLA): 77.8 ms
π_0.5 (VLA): 29.2 ms
GAM (Ours, 1.4B): 6.9 ms (≈145 Hz)

One feed-forward pass with KV-cached history, up to 55× faster than diffusion-based policies.
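The ≈145 Hz and 55× figures follow from the latencies listed above. This short check reproduces the arithmetic and assumes only that the rate is the reciprocal of the single-pass latency.

```python
# Single-pass inference latency in milliseconds, as reported above.
latency_ms = {
    "Cosmos-Policy": 382.4,
    "OpenVLA-OFT": 77.8,
    "π_0.5": 29.2,
    "GAM (Ours)": 6.9,
}

gam_ms = latency_ms["GAM (Ours)"]
control_hz = 1000.0 / gam_ms  # passes per second

# Speed-up of GAM relative to each baseline.
speedup = {
    name: ms / gam_ms for name, ms in latency_ms.items() if name != "GAM (Ours)"
}

print(f"{control_hz:.1f} Hz")
for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
```

The largest ratio is against the video diffusion-based Cosmos-Policy, which is where the "up to 55×" figure comes from.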


Zero-shot robustness across every perturbation factor: camera, robot, light, background, noise, layout, and language, reported across difficulty levels on LIBERO-Plus.


Real-Robot Results

Robustness on real-robot tasks

Four contact-rich manipulation tasks, trained from ~200 teleoperated demonstrations each. Half of all evaluation trials shift the external camera by 85 cm and 45°, an out-of-distribution setting where 2D-based policies collapse and GAM keeps working.
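To make the size of this shift concrete, the sketch below applies an 85 cm translation and a 45° rotation to a toy camera pose. The page does not state the translation direction or rotation axis, so the choice of +x for the shift and +z (yaw) for the rotation is an assumption, as are the example pose and the helper names.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def rotate_z(v: Vec3, degrees: float) -> Vec3:
    """Rotate a vector about the +z axis (assumed rotation axis)."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def perturb_camera(
    position: Vec3, forward: Vec3, shift_m: float = 0.85, yaw_deg: float = 45.0
) -> Tuple[Vec3, Vec3]:
    """Shift the camera along +x (assumed direction) and yaw its viewing direction."""
    x, y, z = position
    return (x + shift_m, y, z), rotate_z(forward, yaw_deg)


# Hypothetical in-distribution pose: 1 m in front of the robot, looking back at it.
pos, fwd = perturb_camera(position=(1.0, 0.0, 0.5), forward=(-1.0, 0.0, 0.0))
print(pos, fwd)
```

A policy that reads only 2D image features sees a very different image after such a change, while the underlying 3D scene is unchanged.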

**Real-world success rates.** GAM, π_0.5, and Spatial Forcing across four tasks. Light bars: in-distribution. Dark bars: camera-perturbed OOD. GAM leads on every task in both settings.

**Task suite.** Pick & place · stack milk & cube · place pot & pan on cooktop · insert cube into covered pot.

**ID vs. OOD.** The external camera is translated 85 cm and rotated 45° for OOD trials.


Qualitative Results

GAM predicts the geometry of what happens next

Because future prediction lives in the GFM's latent space, GAM's forecasts decode into dense future depth maps that closely track ground truth.

**LIBERO-Spatial** · "Move the bowl from the table center to the plate": current RGB/depth, ground-truth future, and GAM's predicted future depth over time.

**LIBERO-Object** · "Put the tomato sauce in the basket": predicted future depth stays aligned with the true scene evolution.

**LIBERO-Long** · "Put the cream cheese and butter in the basket": coherent geometric futures across a long-horizon task.

**LIBERO-Goal** · "Put the wine bottle on the cabinet": the same backbone that acts also predicts the 3D future.

**Action-token attention.** Action tokens learn to attend to task-relevant objects and the end effector, which yields geometry-grounded credit assignment for free.

Simulation Rollouts

Policy rollouts from simulation

Uncut open-loop rollouts of GAM on LIBERO Original and LIBERO-Plus. Each clip shows the external camera (left) and the wrist camera (right).

Training & Setup

Pretrained on 784K trajectories. Post-trained per benchmark.

GAM builds on DA3-Giant and is pretrained on a mixture of Open-X Embodiment (72%), MimicGen (18%), and RoboCasa365 (10%) single-arm robot data, then fine-tuned on each benchmark with future-depth supervision from simulators or teacher pseudo-depth.
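One common way to realize such a mixture is to draw each training sample's source with probability proportional to its share. The sketch below assumes the percentages above act as sampling weights, which the page does not state. `MIXTURE` and `sample_sources` are hypothetical names.

```python
import random
from collections import Counter
from typing import List

# Source shares from the text above, treated here as sampling weights (assumption).
MIXTURE = {"Open-X Embodiment": 72, "MimicGen": 18, "RoboCasa365": 10}
assert sum(MIXTURE.values()) == 100


def sample_sources(n: int, seed: int = 0) -> List[str]:
    """Draw n dataset names with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000))
for name in MIXTURE:
    print(f"{name}: {counts[name] / 10_000:.1%}")
```

With enough draws the empirical fractions approach 72%, 18%, and 10%.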

Pretraining mixture. High-level source mixture (left) and the share of each constituent dataset relative to the entire training corpus (right).

Citation

BibTeX

@misc{han2026geometricactionmodelrobot,
      title={Geometric Action Model for Robot Policy Learning},
      author={Jisang Han and Seonghu Jeon and Jaewoo Jung and Ren{\'e} Zurbr{\"u}gg and Honggyu An and Tifanny Portela and Marco Hutter and Marc Pollefeys and Seungryong Kim and Sunghwan Hong},
      year={2026},
      eprint={2606.17046},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17046},
}