Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

1 KAIST AI  2 New York University  3 ETH AI Center, ETH Zurich  4 UC Berkeley 
* Equal contribution · Co-corresponding authors

We reveal that the Visual Geometry Grounded Transformer (VGGT) has a built-in ability to detect outliers, which we leverage to perform outlier-view rejection without any fine-tuning.


Our key contributions. We reveal that VGGT’s internal attention and feature representations exhibit emergent noise-suppressing behavior, and we exploit this to design a simple, training-free filtering mechanism that selects geometrically consistent views via a single global threshold on internal signals. Across diverse benchmarks and noise settings, this approach consistently improves robustness and reconstruction quality over strong baselines.

Abstract

Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with the rest of the collection. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that an existing feed-forward reconstruction model, VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable effective noise filtering, which we leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.


Layer analysis

Layer analysis. We measure the gap between clean and distractor views for attention and feature similarity across all of VGGT’s layers. The separation grows with depth and peaks at the final layer, indicating emergent noise suppression.
Feature/attention visualization. We show cross-view attention maps and intermediate feature–similarity maps from the final layer of VGGT on mixed sets containing clean and distractor images. Distractors are marked with red boxes; the query image is marked with a blue box. For each context image, scores are computed with respect to the query, averaged over all query tokens, and normalized for display. Both probes clearly downweight distractor views, revealing the model’s emergent view selectivity.
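
To make the two probes concrete, the sketch below shows one way to compute per-view scores and the clean–distractor gap at a single layer. It assumes the per-layer tokens (and, for the attention probe, that layer's query/key projection weights) have already been pulled out of VGGT, e.g., via forward hooks; all tensor names and shapes here are illustrative, not the released API.

```python
import torch
import torch.nn.functional as F

def feature_similarity_scores(query_tokens, context_tokens):
    """Mean cosine similarity of each context view's tokens to the query view's tokens.

    query_tokens:   (Tq, C)    tokens of the query image at one layer
    context_tokens: (V, Tc, C) tokens of V context images at the same layer
    returns:        (V,)       one relevance score per context view
    """
    q = F.normalize(query_tokens, dim=-1)
    k = F.normalize(context_tokens, dim=-1)
    sim = torch.einsum('qc,vtc->vqt', q, k)    # (V, Tq, Tc) pairwise token similarities
    return sim.mean(dim=(1, 2))

def attention_scores(query_tokens, context_tokens, w_q, w_k):
    """Cross-view attention mass each context view receives from the query tokens."""
    q = query_tokens @ w_q                     # (Tq, D) projected queries
    k = context_tokens @ w_k                   # (V, Tc, D) projected keys
    logits = torch.einsum('qd,vtd->qvt', q, k) / q.shape[-1] ** 0.5
    attn = logits.flatten(1).softmax(dim=-1).unflatten(1, (k.shape[0], k.shape[1]))
    return attn.sum(dim=-1).mean(dim=0)        # (V,) attention per view, averaged over query tokens

def clean_distractor_gap(scores, is_clean):
    """Separation between clean and distractor views at a given layer."""
    return scores[is_clean].mean() - scores[~is_clean].mean()
```

Sweeping either score over all layers yields the depth-wise trend shown above, with the clean–distractor gap largest at the final layer.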

Architecture

Overview of our framework
Our framework. We compute per-view relevance from VGGT’s internal representations using two probes: (i) cross-view attention from query–key projections and (ii) cosine similarity of intermediate dense features. The resulting score is thresholded with a single global value to filter distractors, and the filtered set is re-fed to VGGT for reconstruction—without retraining or architectural changes.
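
A rough sketch of this two-pass loop is given below. The names are illustrative only: `relevance_fn` stands in for the probe-based per-view score, `vggt_forward` for a standard VGGT inference call, and the threshold value and fallback rule are assumptions rather than the exact released implementation.

```python
def filter_views(images, relevance_fn, tau=0.5):
    """Keep views whose probe-based relevance exceeds a single global threshold."""
    scores = relevance_fn(images)      # (V,) per-view relevance from the attention/feature probes
    keep = scores >= tau               # one global threshold, no per-scene tuning
    return [img for img, k in zip(images, keep) if k]

def robust_reconstruct(images, vggt_forward, relevance_fn, tau=0.5):
    """Two-pass inference: score and filter views, then re-feed the kept subset to VGGT."""
    kept = filter_views(images, relevance_fn, tau)
    if len(kept) < 2:                  # fallback if filtering removes (nearly) everything
        kept = images
    return vggt_forward(kept)          # unchanged VGGT reconstruction on the filtered set
```

Because the threshold is a single global value, the same setting is reused across datasets and noise levels, with no oracle knowledge of how many clean views a scene contains.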

Visualization from internet-collected images

Visualization from various datasets

Baseline Comparison Table

Methods | Small (ATE ↓, RPEtrans ↓, RPErot ↓) | Medium (ATE ↓, RPEtrans ↓, RPErot ↓) | Large (ATE ↓, RPEtrans ↓, RPErot ↓) | Avg (ATE ↓, RPEtrans ↓, RPErot ↓)
Phototourism
MAST3R-SfM 1.3556 2.4084 12.2887 1.2683 2.4735 11.9315 1.2329 2.3143 11.2861 1.2856 2.3987 11.8354
VGGT 0.3068 0.4553 0.9906 0.3612 0.5314 1.1987 0.3833 0.5649 1.3304 0.3504 0.5172 1.1732
MegaLoc+VGGT 0.2735 0.4076 0.8841 0.2999 0.4460 0.9872 0.3161 0.4700 1.0714 0.2965 0.4412 0.9809
MegaLoc*+VGGT 0.2689 0.4013 0.8686 0.2873 0.4274 0.9316 0.2997 0.4458 1.0020 0.2853 0.4248 0.9341
DINOv3+VGGT 0.3068 0.4553 0.9909 0.3612 0.5315 1.1989 0.3833 0.5649 1.3307 0.3504 0.5172 1.1735
DINOv2†+VGGT 0.3068 0.4553 0.9906 0.3612 0.5314 1.1987 0.3833 0.5649 1.3304 0.3504 0.5172 1.1732
RobustVGGT-A 0.2732 0.4094 0.8719 0.2792 0.4153 0.8806 0.2930 0.4349 0.9310 0.2818 0.4199 0.8945
RobustVGGT-F 0.2641 0.3936 0.8420 0.2645 0.3949 0.8402 0.2664 0.3973 0.8388 0.2650 0.3953 0.8403
On-the-Go
MAST3R-SfM 0.0754 0.1548 2.0281 0.0749 0.1474 1.9934 0.0768 0.1510 2.0328 0.0757 0.1511 2.0181
VGGT 0.0788 0.1239 1.0261 0.1281 0.1963 1.4194 0.1562 0.2393 1.7315 0.1210 0.1865 1.3923
MegaLoc+VGGT 0.0658 0.1056 0.9090 0.1044 0.1612 1.1516 0.1229 0.1881 1.3315 0.0977 0.1516 1.1307
MegaLoc*+VGGT 0.0529 0.0872 0.8366 0.0580 0.0942 0.8812 0.0637 0.1033 0.9464 0.0582 0.0949 0.8881
DINOv3+VGGT 0.0675 0.1077 0.9184 0.1075 0.1667 1.2201 0.1312 0.2025 1.4602 0.1021 0.1589 1.1996
DINOv2†+VGGT 0.0788 0.1239 1.0261 0.1281 0.1963 1.4194 0.1562 0.2393 1.7315 0.1210 0.1865 1.3923
RobustVGGT-A 0.0578 0.0952 0.8800 0.0790 0.1253 1.2922 0.0697 0.1126 1.0361 0.0688 0.1110 1.0694
RobustVGGT-F 0.0521 0.0861 0.8179 0.0568 0.0931 0.8914 0.0660 0.1055 0.9872 0.0583 0.0949 0.8988
RobustNeRF
MAST3R-SfM 0.1153 0.2597 2.7124 0.1196 0.2650 2.7617 0.1182 0.2626 2.6882 0.1177 0.2624 2.7208
VGGT 0.1519 0.2742 1.2052 0.1598 0.2908 1.2311 0.1680 0.3062 1.3493 0.1599 0.2904 1.2619
MegaLoc+VGGT 0.1496 0.2692 1.1920 0.1573 0.2847 1.2199 0.1618 0.2934 1.2721 0.1562 0.2824 1.2280
MegaLoc*+VGGT 0.1352 0.2418 1.1707 0.1358 0.2416 1.1663 0.1356 0.2394 1.1552 0.1355 0.2409 1.1641
DINOv3+VGGT 0.1516 0.2738 1.2032 0.1595 0.2902 1.2296 0.1677 0.3055 1.3426 0.1596 0.2898 1.2585
DINOv2†+VGGT 0.1519 0.2742 1.2052 0.1598 0.2908 1.2311 0.1680 0.3062 1.3493 0.1599 0.2904 1.2619
RobustVGGT-A 0.1361 0.2433 1.1766 0.1379 0.2447 1.1657 0.1406 0.2478 1.1632 0.1382 0.2453 1.1685
RobustVGGT-F 0.1388 0.2480 1.1656 0.1374 0.2432 1.1670 0.1374 0.2415 1.1514 0.1379 0.2442 1.1613
ETH3D
MAST3R-SfM 2.3871 4.2586 77.368 2.3931 4.2766 78.521 2.3796 4.1958 76.4021 2.3866 4.2437 77.4303
VGGT 0.8572 1.3675 6.3908 0.9182 1.5028 9.6272 1.0165 1.7348 15.2774 0.9306 1.5350 10.4318
MegaLoc+VGGT 0.9275 1.4470 5.5320 0.9233 1.4789 7.5607 0.9639 1.5977 11.4364 0.9382 1.5079 8.1764
MegaLoc*+VGGT 0.9170 1.4904 3.7593 0.9418 1.5194 4.3126 0.9800 1.5802 5.7108 0.9463 1.5300 4.5942
DINOv3+VGGT 0.8551 1.3667 6.2305 0.9113 1.4891 9.4103 1.0027 1.7022 14.9265 0.9230 1.5193 10.1891
DINOv2†+VGGT 0.8556 1.3661 6.2515 0.9188 1.5022 9.5374 1.0165 1.7346 15.2582 0.9303 1.5343 10.3490
RobustVGGT-A 0.7447 1.1724 3.8938 0.7673 1.2123 3.7325 0.6874 1.0708 4.4779 0.7331 1.1518 4.0347
RobustVGGT-F 0.6224 1.0300 2.7304 0.8038 1.3159 2.9959 0.8636 1.3882 3.4866 0.7633 1.2447 3.0710

Table 1. Camera pose estimation across noise levels. * denotes per-dataset hyperparameter tuning with oracle knowledge of the number of clean images in the test set; these entries are shaded as they are not directly comparable. † uses DINOv2 features extracted from VGGT.
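
For context, ATE and RPE above are the standard trajectory-error metrics. The sketch below gives simplified reference implementations (rigid Kabsch alignment without scale for ATE, consecutive-frame pairing for RPE); the exact evaluation protocol behind the numbers above may differ in detail.

```python
import numpy as np

def ate(pred_c, gt_c):
    """Absolute Trajectory Error: RMSE of camera centers after rigid (Kabsch) alignment.

    pred_c, gt_c: (N, 3) predicted and ground-truth camera centers.
    """
    mu_p, mu_g = pred_c.mean(0), gt_c.mean(0)
    P, G = pred_c - mu_p, gt_c - mu_g
    U, _, Vt = np.linalg.svd(P.T @ G)               # covariance of the centered point sets
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # reflection correction
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T         # rotation mapping pred onto gt
    aligned = P @ R.T + mu_g
    return float(np.sqrt(((aligned - gt_c) ** 2).sum(1).mean()))

def rpe(pred_T, gt_T):
    """Relative Pose Error over consecutive frames.

    pred_T, gt_T: (N, 4, 4) camera-to-world poses.
    Returns mean translation error (pose units) and mean rotation error (degrees).
    """
    t_errs, r_errs = [], []
    for i in range(len(pred_T) - 1):
        dp = np.linalg.inv(pred_T[i]) @ pred_T[i + 1]
        dg = np.linalg.inv(gt_T[i]) @ gt_T[i + 1]
        e = np.linalg.inv(dg) @ dp                  # residual relative motion
        t_errs.append(np.linalg.norm(e[:3, 3]))
        cos = np.clip((np.trace(e[:3, :3]) - 1) / 2, -1.0, 1.0)
        r_errs.append(np.degrees(np.arccos(cos)))
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```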

BibTeX


need to fill in

Acknowledgements

need to fill in