arXiv 2025
Compact 3D Scene Representations (C3G) is a generalizable framework that reconstructs and understands 3D scenes from unposed images using only 2K Gaussians. By aggregating multi-view features into Gaussian queries, each Gaussian is placed at an essential spatial location and integrates the relevant visual features, enabling efficient feature lifting.
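The sketch below illustrates this query-based aggregation with a single cross-attention layer in PyTorch. It is a minimal sketch, not the released code: the module names, feature dimensions, number of queries, and the 14-value Gaussian parameterization are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): learnable Gaussian queries
# cross-attend to multi-view image features and are decoded into Gaussian parameters.
import torch
import torch.nn as nn

class GaussianQueryDecoder(nn.Module):
    def __init__(self, num_queries=2048, dim=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # ~2K Gaussian queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Head predicting per-Gaussian parameters, e.g. 3 (mean) + 4 (rotation) +
        # 3 (scale) + 1 (opacity) + 3 (color) = 14 values; this split is an assumption.
        self.head = nn.Linear(dim, 14)

    def forward(self, mv_feats):
        # mv_feats: (B, V*H*W, dim) multi-view features flattened over views and pixels
        B = mv_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        agg, attn = self.cross_attn(q, mv_feats, mv_feats)  # aggregate features into queries
        return self.head(agg), attn  # Gaussian parameters and query-to-feature attention

# Usage: features from V unposed views, e.g. V=4, H=W=32, dim=256
feats = torch.randn(1, 4 * 32 * 32, 256)
params, attn = GaussianQueryDecoder()(feats)
print(params.shape, attn.shape)  # (1, 2048, 14), (1, 2048, 4096)
```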
(a) Our framework decodes compact 3D Gaussians by processing multi-view features through transformer-based learnable queries.
(b) Visualization of learned attention between a target Gaussian (red dots) and features. Query tokens attend to spatially coherent regions across views, naturally discovering correspondences.
We leverage the learned attention from the Gaussian decoder C3G-$\mathcal{G}$ to efficiently train a view-invariant feature decoder C3G-$\mathcal{F}$. C3G-$\mathcal{F}$ uses the same architecture as C3G-$\mathcal{G}$ but shares its attention weights to efficiently aggregate multi-view features for feature decoding.
With this training procedure, we can lift arbitrary 2D features by learning only the value projection layers, enabling efficient training; a sketch of this step follows.
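The following is a minimal sketch of this feature-lifting step, assuming the attention weights from C3G-$\mathcal{G}$ are reused unchanged and only a value projection is trained. All names and shapes are illustrative assumptions, not the released C3G-$\mathcal{F}$ implementation.

```python
# Minimal sketch: reuse the Gaussian decoder's attention weights and train only a
# value projection to lift 2D features (e.g. from a frozen 2D backbone) onto the
# same 3D Gaussians.
import torch
import torch.nn as nn

class FeatureLifter(nn.Module):
    def __init__(self, feat_dim=512, out_dim=512):
        super().__init__()
        # Only this value projection is learned; attention comes from C3G-G.
        self.value_proj = nn.Linear(feat_dim, out_dim)

    def forward(self, feats_2d, attn):
        # feats_2d: (B, V*H*W, feat_dim) 2D features to lift
        # attn:     (B, num_gaussians, V*H*W) frozen attention from the Gaussian decoder
        values = self.value_proj(feats_2d)
        return attn @ values  # (B, num_gaussians, out_dim) per-Gaussian lifted features

feats_2d = torch.randn(1, 4 * 32 * 32, 512)   # e.g. features from a frozen 2D encoder
attn = torch.rand(1, 2048, 4 * 32 * 32)       # attention reused from the Gaussian decoder
attn = attn / attn.sum(-1, keepdim=True)      # rows normalized to sum to 1
lifted = FeatureLifter()(feats_2d, attn)
print(lifted.shape)  # (1, 2048, 512)
```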