Geometric Action Model for Robot Policy Learning

A language-conditioned manipulation policy that repurposes a pretrained geometric foundation model as one shared backbone for perception, future prediction, and action. It is more accurate, more robust, faster, and lighter than foundation-model-scale baselines.

Jisang Han*1, Seonghu Jeon*1, Jaewoo Jung1,2, René Zurbrügg2,3, Honggyu An1, Tifanny Portela2,3, Marco Hutter2, Marc Pollefeys2, Seungryong Kim†1, Sunghwan Hong†2,3

1KAIST AI  ·  2ETH Zurich  ·  3ETH AI Center

*Equal contribution  ·  †Co-corresponding authors

Teaser figure summarizing the Geometric Action Model pipeline, geometry prediction, benchmark latency, and real-world success rates.
GAM reuses a geometric foundation model as the policy backbone, predicting future geometry and actions in one efficient pass.

Real-robot comparison of GAM against π0.5 and Spatial Forcing across four manipulation tasks, in nominal (ID) and camera-perturbed out-of-distribution (OOD) settings.

85.5% LIBERO-Plus success: best overall under perturbations
+9.7%p camera robustness: over the next-best baseline
6.9 ms inference latency: ≈145 Hz, up to 55× faster
1.4B total parameters: vs 2–8.5B for baselines

Overview

Geometry is the missing substrate of robot policies.

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models and video world-action models inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving the geometry required for contact-rich manipulation implicit.

We propose Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained Geometric Foundation Model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, while a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks, allowing a single backbone to produce both future geometry and actions. On a broad suite of simulation and real-robot benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.

Method

One geometric backbone for perception, prediction, and action

GAM cuts a pretrained GFM at an intermediate layer and inserts a causal transformer in between, turning a static 3D perception model into a language-conditioned world-action model with minimal architectural change, while preserving its rich geometric priors.

GAM architecture: observation encoding with GFM shallow layers, a causal future predictor at the split layer, and feature propagation through GFM deeper layers for action decoding with block-causal attention.
GAM architecture. Multi-view RGB is encoded by the GFM's shallow layers; a block-causal transformer predicts future latent geometry from language, proprioception, and action history; the GFM's deep layers decode the predicted tokens into future depth and executable action chunks.
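The caption above mentions block-causal attention. As a minimal sketch of what such a mask usually looks like (an assumption about the general technique, not the authors' code): tokens are grouped into one block per timestep, every token may attend to all tokens in its own block and in earlier blocks, and never to later blocks. The function name `block_causal_mask` and the block sizes are hypothetical.

```python
from typing import List


def block_causal_mask(num_steps: int, tokens_per_step: int) -> List[List[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Tokens are laid out step by step; attention is allowed within a step
    (bidirectional inside the block) and toward all earlier steps.
    """
    total = num_steps * tokens_per_step
    mask = []
    for q in range(total):
        q_step = q // tokens_per_step  # timestep that query token q belongs to
        mask.append([(k // tokens_per_step) <= q_step for k in range(total)])
    return mask


# Illustrative sizes only: 3 timesteps with 2 tokens each.
mask = block_causal_mask(num_steps=3, tokens_per_step=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Unlike a token-level causal mask, the first token of a step can see the second token of the same step; only future steps are hidden.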
1. Observation Encoding

The GFM's shallow layers are reused as the observation encoder, mapping multi-view RGB frames into spatially meaningful latent geometric states. No task-specific encoder is trained from scratch.

2. Causal Future Predictor

A causal transformer inserted at the split layer forecasts the next latent geometric state, conditioned on the task instruction, proprioception, and action history: next-token prediction, but for 3D world states.

3. Propagation & Decoding

Predicted future tokens flow through the remaining GFM blocks: the original depth head decodes future geometry while a lightweight action head regresses the executable action chunk. Both outputs come from one forward pass.
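The three stages can be sketched as plain function composition. This is a toy illustration under stated assumptions, not the released implementation: the names `split_backbone`, `run_layers`, and `gam_step` are hypothetical, tokens are plain lists of floats, and the layers, predictor, and heads are stand-in lambdas rather than real transformer blocks.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]
Layer = Callable[[Tokens], Tokens]


def split_backbone(layers: Sequence[Layer], split_index: int) -> Tuple[List[Layer], List[Layer]]:
    """Divide an ordered layer stack into shallow (encoder) and deep (decoder) parts."""
    if not 0 < split_index < len(layers):
        raise ValueError("split_index must leave at least one layer on each side")
    return list(layers[:split_index]), list(layers[split_index:])


def run_layers(layers: Sequence[Layer], tokens: Tokens) -> Tokens:
    """Apply the layers in order."""
    for layer in layers:
        tokens = layer(tokens)
    return tokens


def gam_step(shallow, deep, predictor, depth_head, action_head, observation, context):
    latent_now = run_layers(shallow, observation)    # 1. observation encoding
    latent_future = predictor(latent_now, context)   # 2. future prediction at the split
    features = run_layers(deep, latent_future)       # 3. propagation through deep layers
    return depth_head(features), action_head(features)  # both heads read the same features


# Stand-ins (hypothetical): four "layers" that each add 1, a predictor that
# shifts the latent by a scalar context, an identity depth head, a mean action head.
backbone = [lambda t: [x + 1.0 for x in t] for _ in range(4)]
shallow, deep = split_backbone(backbone, split_index=2)
predictor = lambda latent, context: [x + context for x in latent]
depth_head = lambda feats: list(feats)
action_head = lambda feats: [sum(feats) / len(feats)]

future_depth, action_chunk = gam_step(
    shallow, deep, predictor, depth_head, action_head,
    observation=[0.0, 1.0], context=0.5,
)
print(future_depth, action_chunk)  # [4.5, 5.5] [5.0]
```

The point of the sketch is structural: the deep layers are never run on the current observation's tokens, only on the predicted future tokens, so a single pass yields both the future-geometry output and the action output.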

Why not just another VLA or world model?

Comparison of paradigms: video world-action models in 2D pixel space, geometry-aware VLAs with passive distillation, and GAM unifying perception, geometry prediction, and action decoding inside a single GFM.
Three paradigms. GAM moves beyond 2D latents and passive feature distillation by making the geometric model itself the policy.
(a) Video WAMs

2D pixel-space futures

  • Strong temporal prior from video generation
  • Depth, scale, and occlusion remain implicit
  • Diffusion-style inference is slow
(b) Geometry-aware VLAs

Geometry as a side signal

  • Injects geometric features into a VLA policy
  • Uses the GFM mostly as a passive feature extractor
  • Does not make the GFM's layered geometry the policy substrate
(c) GAM (ours)

The GFM is the policy

  • Preserves rich 3D geometric priors for policy learning
  • Grounds action decoding in explicit future geometry
  • Runs perception, prediction, and action in one forward pass
Simulation Results

State-of-the-art robustness with lower inference cost

On LIBERO, GAM matches saturated state-of-the-art success rates. On LIBERO-Plus, which perturbs cameras, lighting, backgrounds, and layouts, it takes the overall lead, with the smallest degradation of any method and a +9.7%p margin in the camera-perturbation split.

Method                    Size   LIBERO ↑   LIBERO-Plus ↑   Camera ↑
π0.5                      3.3B   96.9       84.6 (↓12.3)    72.0
OpenVLA-OFT               7B     97.1       69.6 (↓27.5)    56.4
π0                        3.3B   91.3       69.3 (↓22.0)    61.0
Cosmos-Policy             2B     98.5       82.4 (↓16.1)    73.4
Fast-WAM                  6B     97.6       50.0 (↓47.5)    16.4
π0.5 + Spatial Forcing    3.3B   94.0       25.7 (↓58.3)    0.1
π0.5 + ROCKET             3.3B   95.3       47.5 (↓46.6)    30.9
GAM (Ours)                1.4B   97.6       85.5 (↓12.1)    83.1

Success rates (%) on LIBERO and LIBERO-Plus. "Camera" is the camera-perturbation split of LIBERO-Plus. Full tables in the paper.
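The headline margins can be checked directly from the table. The snippet below is only a sanity-check sketch: the scores are copied from the table above, and the helper name `margin_over_next_best` is hypothetical.

```python
from typing import Dict, Tuple


def margin_over_next_best(scores: Dict[str, float], ours: str) -> Tuple[str, float]:
    """Return the best competing method and our lead over it, in percentage points."""
    rivals = {name: score for name, score in scores.items() if name != ours}
    runner_up = max(rivals, key=rivals.get)
    return runner_up, round(scores[ours] - rivals[runner_up], 1)


# Success rates (%) copied from the table above.
libero_plus = {
    "π0.5": 84.6, "OpenVLA-OFT": 69.6, "π0": 69.3, "Cosmos-Policy": 82.4,
    "Fast-WAM": 50.0, "π0.5 + Spatial Forcing": 25.7, "π0.5 + ROCKET": 47.5,
    "GAM (Ours)": 85.5,
}
camera = {
    "π0.5": 72.0, "OpenVLA-OFT": 56.4, "π0": 61.0, "Cosmos-Policy": 73.4,
    "Fast-WAM": 16.4, "π0.5 + Spatial Forcing": 0.1, "π0.5 + ROCKET": 30.9,
    "GAM (Ours)": 83.1,
}

overall_rival, overall_margin = margin_over_next_best(libero_plus, "GAM (Ours)")
camera_rival, camera_margin = margin_over_next_best(camera, "GAM (Ours)")
print(f"LIBERO-Plus: +{overall_margin}%p over {overall_rival}")
print(f"Camera split: +{camera_margin}%p over {camera_rival}")
```

The camera-split lead is measured against Cosmos-Policy, while the overall LIBERO-Plus lead is measured against π0.5.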

Success rate vs. camera difficulty

Graceful degradation. As camera perturbations intensify (L1→L5), GAM consistently stays on top.

Single-pass inference latency

Cosmos-Policy (video diffusion-based): 382.4 ms
OpenVLA-OFT (VLA): 77.8 ms
π0.5 (VLA): 29.2 ms
GAM (Ours, 1.4B): 6.9 ms · ≈145 Hz

One feed-forward pass with KV-cached history, up to 55× faster than diffusion-based policies.
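The rate and speedup figures follow from the latencies listed above. A short sketch of the arithmetic, assuming one action chunk is produced per forward pass and ignoring any overhead outside the model (helper names are hypothetical):

```python
# Single-pass latencies in milliseconds, copied from the chart above.
latency_ms = {
    "Cosmos-Policy": 382.4,
    "OpenVLA-OFT": 77.8,
    "π0.5": 29.2,
    "GAM (Ours)": 6.9,
}


def control_rate_hz(ms_per_pass: float) -> float:
    """Maximum inference rate if each forward pass takes ms_per_pass milliseconds."""
    return 1000.0 / ms_per_pass


def speedup(baseline_ms: float, ours_ms: float) -> float:
    """How many times faster ours is than the baseline."""
    return baseline_ms / ours_ms


ours = latency_ms["GAM (Ours)"]
print(f"GAM rate: {control_rate_hz(ours):.1f} Hz")
for name, ms in latency_ms.items():
    if name != "GAM (Ours)":
        print(f"vs {name}: {speedup(ms, ours):.1f}x faster")
```

The "up to 55×" figure corresponds to the slowest baseline in the chart (Cosmos-Policy); against the two VLA baselines the ratio is smaller.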

Task suites: Spatial · Object · Goal · Long
Zero-shot robustness across every perturbation factor: camera, robot, light, background, noise, layout, and language, reported across difficulty levels on LIBERO-Plus.

Real-Robot Results

Robustness on real-robot tasks

Four contact-rich manipulation tasks, trained from ~200 teleoperated demonstrations each. Half of all evaluation trials shift the external camera by 85 cm and 45°, an out-of-distribution setting where 2D-based policies collapse and GAM keeps working.

Real-world success rates for GAM, π0.5, and Spatial Forcing across four tasks under in-distribution and out-of-distribution settings, with task illustrations.
Real-world success rates. Light bars: in-distribution. Dark bars: camera-perturbed OOD. GAM leads on every task in both settings.
Four real-world manipulation tasks with instructions: pick and place, stack milk and cube, place pot and pan on cooktop, insert cube into covered pot.
Task suite. Pick & place · stack milk & cube · pot & pan on cooktop · insert cube into covered pot.
In-distribution versus out-of-distribution camera setup used for real-robot evaluation.
ID vs. OOD. The external camera is translated 85 cm and rotated 45° for OOD trials.
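To give a feel for the size of this perturbation, the sketch below applies an 85 cm translation and a 45° rotation to a camera pose. The source states only the magnitudes, so everything else here is an assumption for illustration: the nominal pose, the translation direction (along +y), and the rotation axis (yaw about the vertical z axis) are all hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def rotate_about_z(v: Vec3, degrees: float) -> Vec3:
    """Rotate a vector about the vertical (z) axis by the given angle."""
    a = math.radians(degrees)
    x, y, z = v
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a), z)


def perturb_camera(position: Vec3, view_dir: Vec3, shift_m: float,
                   shift_axis: Vec3, yaw_deg: float) -> Tuple[Vec3, Vec3]:
    """Translate the camera along a unit axis and yaw its viewing direction."""
    new_position = tuple(p + shift_m * a for p, a in zip(position, shift_axis))
    return new_position, rotate_about_z(view_dir, yaw_deg)


def angle_between_deg(u: Vec3, v: Vec3) -> float:
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Hypothetical nominal pose: 1 m in front of the workspace, looking back along -x.
id_position, id_view = (1.0, 0.0, 0.5), (-1.0, 0.0, 0.0)
ood_position, ood_view = perturb_camera(
    id_position, id_view, shift_m=0.85, shift_axis=(0.0, 1.0, 0.0), yaw_deg=45.0
)
print("moved by", round(math.dist(id_position, ood_position), 2), "m")
print("view rotated by", round(angle_between_deg(id_view, ood_view), 1), "deg")
```

Under these assumed axes the camera ends up well outside the viewpoints seen during training, which is why the setting is treated as out-of-distribution.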

Qualitative Results

GAM predicts the geometry of what happens next

Because future prediction lives in the GFM's latent space, GAM's forecasts decode into dense future depth maps that closely track ground truth.

Simulation Rollouts

Policy rollouts from simulation

Uncut open-loop rollouts of GAM on LIBERO Original and LIBERO-Plus. Each clip shows the external camera (left) and the wrist camera (right).

Training & Setup

Pretrained on 784K trajectories. Post-trained per benchmark.

GAM builds on DA3-Giant and is pretrained on a mixture of Open-X Embodiment (72%), MimicGen (18%), and RoboCasa365 (10%) single-arm robot data, then fine-tuned on each benchmark with future-depth supervision from simulators or teacher pseudo-depth.

Pretraining mixture. High-level source mixture (left) and the share of each constituent dataset relative to the entire training corpus (right).
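One common way to realize such a mixture is to draw the source of each training example with probability equal to its share. The sketch below assumes the percentages act as per-example sampling weights, which the page does not state; the function name `sample_sources` and the sample count are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List

# High-level shares from the text above, written as fractions.
MIXTURE: Dict[str, float] = {
    "Open-X Embodiment": 0.72,
    "MimicGen": 0.18,
    "RoboCasa365": 0.10,
}


def sample_sources(num_samples: int, rng: random.Random) -> List[str]:
    """Draw a data source for each training example according to MIXTURE."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


rng = random.Random(0)  # fixed seed so the draw is reproducible
num_samples = 10_000
counts = Counter(sample_sources(num_samples, rng))
for name, share in MIXTURE.items():
    print(f"{name}: target {share:.2f}, sampled {counts[name] / num_samples:.3f}")
```

With enough draws the empirical fractions approach the target shares; other schemes (for example, fixed per-batch quotas) would give the same proportions with less variance.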
Citation

BibTeX

@misc{han2026geometricactionmodelrobot,
      title={Geometric Action Model for Robot Policy Learning},
      author={Jisang Han and Seonghu Jeon and Jaewoo Jung and Ren{\'e} Zurbr{\"u}gg and Honggyu An and Tifanny Portela and Marco Hutter and Marc Pollefeys and Seungryong Kim and Sunghwan Hong},
      year={2026},
      eprint={2606.17046},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17046},
}