C3G: Learning Compact 3D Representations with 2K Gaussians

arXiv 2025

1KAIST AI 2ETH AI Center, ETH Zurich 3SONY AI 4Sony Group Corporation
*Co-first authors Co-corresponding authors
TL;DR

We propose a feed-forward framework for learning compact 3D representations from unposed images. Our approach estimates only 2K Gaussians, allocated in meaningful regions, to enable generalizable scene reconstruction and understanding.

Overview


Compact 3D scene Representations (C3G) is a generalizable framework that uses only 2K Gaussians to reconstruct and understand 3D scenes from unposed images. By aggregating multi-view features into Gaussian queries, each Gaussian is positioned at an essential spatial location and integrates the relevant visual features, enabling efficient feature lifting.

Comparison with Previous Works

  • Previous per-pixel estimators: predict one Gaussian per pixel, resulting in redundant Gaussians, misalignment, and depth errors.
  • C3G: predicts only a compact set of Gaussians at essential locations, avoiding redundancy, misalignment, and depth errors, and achieving superior 3D scene understanding and novel view synthesis performance.
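A back-of-the-envelope comparison makes the gap in Gaussian count concrete. The view count and resolution below are illustrative assumptions, not numbers from the paper; only the 2K budget comes from the source:

```python
# Illustrative comparison of Gaussian counts (assumed views/resolution).
views, H, W = 4, 256, 256

per_pixel_gaussians = views * H * W  # per-pixel estimators: one Gaussian per pixel per view
c3g_gaussians = 2048                 # C3G: fixed "2K" Gaussian budget, independent of resolution

print(per_pixel_gaussians)                    # 262144
print(per_pixel_gaussians // c3g_gaussians)   # 128
```

Because the per-pixel count scales with both resolution and the number of input views, the redundancy grows quickly, while C3G's budget stays constant.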

Our Gaussian Decoder: C3G-$\mathcal{G}$

We introduce learnable queries that aggregate multi-view features through self-attention to guide Gaussian generation, enabling Gaussians to be positioned at essential spatial locations with only RGB supervision.

(a) Our framework decodes compact 3D Gaussians by processing multi-view features through transformer-based learnable queries.
(b) Visualization of learned attention between a target Gaussian (red dots) and features. Query tokens attend to spatially coherent regions across views, naturally discovering correspondences.
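The query-based aggregation in (a) can be sketched with single-head attention. This is a minimal NumPy illustration, not the paper's implementation: the query/token counts and feature width are assumed, and the final parameter head is only indicated in a comment:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_queries = 2048        # the "2K" Gaussian query budget
num_tokens = 4 * 196      # e.g. 4 views of 196 patch tokens each (assumed)
dim = 64                  # feature width (assumed)

queries = rng.normal(size=(num_queries, dim))   # learnable Gaussian queries
features = rng.normal(size=(num_tokens, dim))   # concatenated multi-view features

# Each query attends over all views' tokens and aggregates their features;
# a downstream head (not shown) would map each row of `aggregated` to
# Gaussian parameters (position, scale, rotation, opacity, color).
attn = softmax(queries @ features.T / np.sqrt(dim))  # (num_queries, num_tokens)
aggregated = attn @ features                         # (num_queries, dim)

print(aggregated.shape)  # (2048, 64)
```

The attention map `attn` is exactly the kind of quantity visualized in (b): each query's row shows which spatial regions across views it draws from.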

Our Feature Decoder: C3G-$\mathcal{F}$

We leverage the learned attention from the Gaussian decoder C3G-$\mathcal{G}$ to efficiently train a view-invariant feature decoder C3G-$\mathcal{F}$. It uses the same architecture as C3G-$\mathcal{G}$ but shares its attention weights, efficiently aggregating multi-view features for feature decoding.

With this training procedure, any 2D features can be lifted to 3D by learning only the value projection layers, enabling efficient training.
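The feature-lifting step reduces to reusing frozen attention with a new value projection. The sketch below is an assumption-laden illustration (random stand-ins for the shared attention and the 2D features, and an assumed feature width); the value matrix `W_v` is the only learnable component:

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries = 2048        # Gaussian queries from C3G-G
num_tokens = 4 * 196      # multi-view patch tokens (assumed count)
feat_dim = 384            # 2D feature width, e.g. a ViT backbone (assumed)

# Attention weights shared (frozen) from the Gaussian decoder C3G-G;
# a random row-stochastic stand-in here.
attn = rng.random((num_queries, num_tokens))
attn /= attn.sum(axis=1, keepdims=True)

features_2d = rng.normal(size=(num_tokens, feat_dim))  # any 2D features to lift
W_v = rng.normal(size=(feat_dim, feat_dim)) * 0.02     # the ONLY learned weights

# Lift: project the 2D features, then aggregate them per Gaussian
# with the shared attention.
gaussian_features = attn @ (features_2d @ W_v)

print(gaussian_features.shape)  # (2048, 384)
```

Since the attention is frozen, training C3G-$\mathcal{F}$ touches only `W_v`, which is what makes lifting a new 2D feature family cheap.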


Quantitative Results

Qualitative Results

Citation