Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

1 KAIST AI  2 New York University  3 ETH AI Center, ETH Zurich  4 UC Berkeley 
* Equal contribution · Co-corresponding authors

We reveal that the Visual Geometry Grounded Transformer (VGGT) has a built-in ability to detect outliers, which we leverage to perform outlier-view rejection without any fine-tuning.


Our key contributions. We reveal that VGGT’s internal attention and feature representations exhibit emergent noise-suppressing behavior, and we exploit this to design a simple, training-free filtering mechanism that selects geometrically consistent views via a single global threshold on internal signals. Across diverse benchmarks and noise settings, this approach consistently improves robustness and reconstruction quality over strong baselines.

Abstract

Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with the rest of the collection. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that an existing feed-forward reconstruction model, VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable effective noise filtering, which we leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.


Layer analysis

Layer analysis. We measure the gap between clean and distractor views for attention and feature similarity across all of VGGT’s layers. The separation grows with depth and peaks at the final layer, indicating emergent noise suppression.
Feature/attention visualization. We show cross-view attention maps and intermediate feature–similarity maps from the final layer of VGGT on mixed sets containing clean and distractor images. Distractors are marked with red boxes; the query image is marked with a blue box. For each context image, scores are computed with respect to the query, averaged over all query tokens, and normalized for display. Both probes clearly downweight distractor views, revealing the model’s emergent view selectivity.
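
To make the two probes concrete, the sketch below shows one way to compute per-view scores and the clean–distractor gap at a single layer. It assumes the per-layer tokens (and, for the attention probe, that layer's query/key projection weights) have already been pulled out of VGGT, e.g., via forward hooks; all tensor names and shapes here are illustrative, not the released API.

```python
import torch
import torch.nn.functional as F

def feature_similarity_scores(query_tokens, context_tokens):
    """Mean cosine similarity of each context view's tokens to the query view's tokens.

    query_tokens:   (Tq, C)    tokens of the query image at one layer
    context_tokens: (V, Tc, C) tokens of V context images at the same layer
    returns:        (V,)       one relevance score per context view
    """
    q = F.normalize(query_tokens, dim=-1)
    k = F.normalize(context_tokens, dim=-1)
    sim = torch.einsum('qc,vtc->vqt', q, k)    # (V, Tq, Tc) pairwise token similarities
    return sim.mean(dim=(1, 2))

def attention_scores(query_tokens, context_tokens, w_q, w_k):
    """Cross-view attention mass each context view receives from the query tokens."""
    q = query_tokens @ w_q                     # (Tq, D) projected queries
    k = context_tokens @ w_k                   # (V, Tc, D) projected keys
    logits = torch.einsum('qd,vtd->qvt', q, k) / q.shape[-1] ** 0.5
    attn = logits.flatten(1).softmax(dim=-1).unflatten(1, (k.shape[0], k.shape[1]))
    return attn.sum(dim=-1).mean(dim=0)        # (V,) attention per view, averaged over query tokens

def clean_distractor_gap(scores, is_clean):
    """Separation between clean and distractor views at a given layer."""
    return scores[is_clean].mean() - scores[~is_clean].mean()
```

Sweeping either score over all layers yields the depth-wise trend shown above, with the clean–distractor gap largest at the final layer.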

Architecture

Overview of our framework
Our framework. We compute per-view relevance from VGGT’s internal representations using two probes: (i) cross-view attention from query–key projections and (ii) cosine similarity of intermediate dense features. The resulting score is thresholded with a single global value to filter distractors, and the filtered set is re-fed to VGGT for reconstruction—without retraining or architectural changes.
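
A rough sketch of this two-pass loop is given below. The names are illustrative only: `relevance_fn` stands in for the probe-based per-view score, `vggt_forward` for a standard VGGT inference call, and the threshold value and fallback rule are assumptions rather than the exact released implementation.

```python
def filter_views(images, relevance_fn, tau=0.5):
    """Keep views whose probe-based relevance exceeds a single global threshold."""
    scores = relevance_fn(images)      # (V,) per-view relevance from the attention/feature probes
    keep = scores >= tau               # one global threshold, no per-scene tuning
    return [img for img, k in zip(images, keep) if k]

def robust_reconstruct(images, vggt_forward, relevance_fn, tau=0.5):
    """Two-pass inference: score and filter views, then re-feed the kept subset to VGGT."""
    kept = filter_views(images, relevance_fn, tau)
    if len(kept) < 2:                  # fallback if filtering removes (nearly) everything
        kept = images
    return vggt_forward(kept)          # unchanged VGGT reconstruction on the filtered set
```

Because the threshold is a single global value, the same setting is reused across datasets and noise levels, with no oracle knowledge of how many clean views a scene contains.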

Visualization from internet-collected images

Visualization from various datasets

Baseline Comparison Table

Methods | Small (ATE ↓, RPEtrans ↓, RPErot ↓) | Medium (ATE ↓, RPEtrans ↓, RPErot ↓) | Large (ATE ↓, RPEtrans ↓, RPErot ↓) | Avg (ATE ↓, RPEtrans ↓, RPErot ↓)
Phototourism
MAST3R-SfM 1.3556 2.4084 12.2887 1.2683 2.4735 11.9315 1.2329 2.3143 11.2861 1.2856 2.3987 11.8354
VGGT 0.3068 0.4553 0.9906 0.3612 0.5314 1.1987 0.3833 0.5649 1.3304 0.3504 0.5172 1.1732
MegaLoc+VGGT 0.2735 0.4076 0.8841 0.2999 0.4460 0.9872 0.3161 0.4700 1.0714 0.2965 0.4412 0.9809
MegaLoc*+VGGT 0.2689 0.4013 0.8686 0.2873 0.4274 0.9316 0.2997 0.4458 1.0020 0.2853 0.4248 0.9341
DINOv3+VGGT 0.3068 0.4553 0.9909 0.3612 0.5315 1.1989 0.3833 0.5649 1.3307 0.3504 0.5172 1.1735
DINOv2†+VGGT 0.3068 0.4553 0.9906 0.3612 0.5314 1.1987 0.3833 0.5649 1.3304 0.3504 0.5172 1.1732
RobustVGGT-A 0.2732 0.4094 0.8719 0.2792 0.4153 0.8806 0.2930 0.4349 0.9310 0.2818 0.4199 0.8945
RobustVGGT-F 0.2641 0.3936 0.8420 0.2645 0.3949 0.8402 0.2664 0.3973 0.8388 0.2650 0.3953 0.8403
On-the-Go
MAST3R-SfM 0.0754 0.1548 2.0281 0.0749 0.1474 1.9934 0.0768 0.1510 2.0328 0.0757 0.1511 2.0181
VGGT 0.0788 0.1239 1.0261 0.1281 0.1963 1.4194 0.1562 0.2393 1.7315 0.1210 0.1865 1.3923
MegaLoc+VGGT 0.0658 0.1056 0.9090 0.1044 0.1612 1.1516 0.1229 0.1881 1.3315 0.0977 0.1516 1.1307
MegaLoc*+VGGT 0.0529 0.0872 0.8366 0.0580 0.0942 0.8812 0.0637 0.1033 0.9464 0.0582 0.0949 0.8881
DINOv3+VGGT 0.0675 0.1077 0.9184 0.1075 0.1667 1.2201 0.1312 0.2025 1.4602 0.1021 0.1589 1.1996
DINOv2†+VGGT 0.0788 0.1239 1.0261 0.1281 0.1963 1.4194 0.1562 0.2393 1.7315 0.1210 0.1865 1.3923
RobustVGGT-A 0.0578 0.0952 0.8800 0.0790 0.1253 1.2922 0.0697 0.1126 1.0361 0.0688 0.1110 1.0694
RobustVGGT-F 0.0521 0.0861 0.8179 0.0568 0.0931 0.8914 0.0660 0.1055 0.9872 0.0583 0.0949 0.8988
RobustNeRF
MAST3R-SfM 0.1153 0.2597 2.7124 0.1196 0.2650 2.7617 0.1182 0.2626 2.6882 0.1177 0.2624 2.7208
VGGT 0.1519 0.2742 1.2052 0.1598 0.2908 1.2311 0.1680 0.3062 1.3493 0.1599 0.2904 1.2619
MegaLoc+VGGT 0.1496 0.2692 1.1920 0.1573 0.2847 1.2199 0.1618 0.2934 1.2721 0.1562 0.2824 1.2280
MegaLoc*+VGGT 0.1352 0.2418 1.1707 0.1358 0.2416 1.1663 0.1356 0.2394 1.1552 0.1355 0.2409 1.1641
DINOv3+VGGT 0.1516 0.2738 1.2032 0.1595 0.2902 1.2296 0.1677 0.3055 1.3426 0.1596 0.2898 1.2585
DINOv2†+VGGT 0.1519 0.2742 1.2052 0.1598 0.2908 1.2311 0.1680 0.3062 1.3493 0.1599 0.2904 1.2619
RobustVGGT-A 0.1361 0.2433 1.1766 0.1379 0.2447 1.1657 0.1406 0.2478 1.1632 0.1382 0.2453 1.1685
RobustVGGT-F 0.1388 0.2480 1.1656 0.1374 0.2432 1.1670 0.1374 0.2415 1.1514 0.1379 0.2442 1.1613
ETH3D
MAST3R-SfM 2.3871 4.2586 77.368 2.3931 4.2766 78.521 2.3796 4.1958 76.4021 2.3866 4.2437 77.4303
VGGT 0.8572 1.3675 6.3908 0.9182 1.5028 9.6272 1.0165 1.7348 15.2774 0.9306 1.5350 10.4318
MegaLoc+VGGT 0.9275 1.4470 5.5320 0.9233 1.4789 7.5607 0.9639 1.5977 11.4364 0.9382 1.5079 8.1764
MegaLoc*+VGGT 0.9170 1.4904 3.7593 0.9418 1.5194 4.3126 0.9800 1.5802 5.7108 0.9463 1.5300 4.5942
DINOv3+VGGT 0.8551 1.3667 6.2305 0.9113 1.4891 9.4103 1.0027 1.7022 14.9265 0.9230 1.5193 10.1891
DINOv2†+VGGT 0.8556 1.3661 6.2515 0.9188 1.5022 9.5374 1.0165 1.7346 15.2582 0.9303 1.5343 10.3490
RobustVGGT-A 0.7447 1.1724 3.8938 0.7673 1.2123 3.7325 0.6874 1.0708 4.4779 0.7331 1.1518 4.0347
RobustVGGT-F 0.6224 1.0300 2.7304 0.8038 1.3159 2.9959 0.8636 1.3882 3.4866 0.7633 1.2447 3.0710

Table 1. Camera pose estimation across noise levels. * denotes per-dataset hyperparameter tuning with oracle knowledge of the number of clean images in the test set; these entries are shaded as they are not directly comparable. † uses DINOv2 features extracted from VGGT.
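
For context, ATE and RPE above are the standard trajectory-error metrics. The sketch below gives simplified reference implementations (rigid Kabsch alignment without scale for ATE, consecutive-frame pairing for RPE); the exact evaluation protocol behind the numbers above may differ in detail.

```python
import numpy as np

def ate(pred_c, gt_c):
    """Absolute Trajectory Error: RMSE of camera centers after rigid (Kabsch) alignment.

    pred_c, gt_c: (N, 3) predicted and ground-truth camera centers.
    """
    mu_p, mu_g = pred_c.mean(0), gt_c.mean(0)
    P, G = pred_c - mu_p, gt_c - mu_g
    U, _, Vt = np.linalg.svd(P.T @ G)               # covariance of the centered point sets
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # reflection correction
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T         # rotation mapping pred onto gt
    aligned = P @ R.T + mu_g
    return float(np.sqrt(((aligned - gt_c) ** 2).sum(1).mean()))

def rpe(pred_T, gt_T):
    """Relative Pose Error over consecutive frames.

    pred_T, gt_T: (N, 4, 4) camera-to-world poses.
    Returns mean translation error (pose units) and mean rotation error (degrees).
    """
    t_errs, r_errs = [], []
    for i in range(len(pred_T) - 1):
        dp = np.linalg.inv(pred_T[i]) @ pred_T[i + 1]
        dg = np.linalg.inv(gt_T[i]) @ gt_T[i + 1]
        e = np.linalg.inv(dg) @ dp                  # residual relative motion
        t_errs.append(np.linalg.norm(e[:3, 3]))
        cos = np.clip((np.trace(e[:3, :3]) - 1) / 2, -1.0, 1.0)
        r_errs.append(np.degrees(np.arccos(cos)))
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```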

BibTeX


need to fill in

Acknowledgements

need to fill in