Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs that share little or no view overlap with the rest of the collection. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, which degrades their performance under in-the-wild conditions. In this paper, we find that an existing feed-forward reconstruction model such as VGGT, despite having neither an explicit outlier-rejection mechanism nor noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable effective noise filtering, which we leverage directly to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
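To make the idea of outlier-view rejection from internal representations concrete, the following is a minimal sketch and not the procedure used in this paper. It assumes that one pooled descriptor per view has already been extracted from some intermediate layer, that the first view serves as the reference, and that a fixed cosine-similarity threshold separates clean views from distractors. The function name `filter_views`, the `ref_index` and `threshold` parameters, and the scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def filter_views(features, ref_index=0, threshold=0.5):
    """Return (keep_mask, scores) for a set of per-view descriptors.

    features:  (N, D) array, one pooled descriptor per view. How these are
               extracted from the model is NOT specified here (assumption).
    ref_index: index of the view treated as the reference (assumption).
    threshold: minimum cosine similarity to the reference for a view to be
               kept (hypothetical value, would need tuning in practice).
    """
    features = np.asarray(features, dtype=np.float64)
    # L2-normalise so that dot products become cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-12, None)
    # Score every view by its similarity to the reference view.
    scores = normed @ normed[ref_index]
    keep = scores >= threshold
    # The reference view is always retained.
    keep[ref_index] = True
    return keep, scores


if __name__ == "__main__":
    # Three mutually similar "clean" descriptors followed by two distractors.
    feats = np.array([
        [1.00, 0.00, 0.00],
        [0.90, 0.10, 0.00],
        [0.95, 0.05, 0.00],
        [0.00, 1.00, 0.00],
        [0.00, 0.00, 1.00],
    ])
    keep, scores = filter_views(feats)
    print(keep, np.round(scores, 3))
```

Only the views flagged in `keep` would then be passed to the reconstruction model. A real system would have to choose the layer, the pooling, and the threshold with care, none of which this sketch addresses.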
| Methods | Small ATE ↓ | Small RPEtrans ↓ | Small RPErot ↓ | Medium ATE ↓ | Medium RPEtrans ↓ | Medium RPErot ↓ | Large ATE ↓ | Large RPEtrans ↓ | Large RPErot ↓ | Avg ATE ↓ | Avg RPEtrans ↓ | Avg RPErot ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Phototourism | ||||||||||||
| MAST3R-SfM | 1.3556 | 2.4084 | 12.2887 | 1.2683 | 2.4735 | 11.9315 | 1.2329 | 2.3143 | 11.2861 | 1.2856 | 2.3987 | 11.8354 |
| VGGT | 0.3068 | 0.4553 | 0.9906 | 0.3612 | 0.5314 | 1.1987 | 0.3833 | 0.5649 | 1.3304 | 0.3504 | 0.5172 | 1.1732 |
| MegaLoc+VGGT | 0.2735 | 0.4076 | 0.8841 | 0.2999 | 0.4460 | 0.9872 | 0.3161 | 0.4700 | 1.0714 | 0.2965 | 0.4412 | 0.9809 |
| MegaLoc*+VGGT | 0.2689 | 0.4013 | 0.8686 | 0.2873 | 0.4274 | 0.9316 | 0.2997 | 0.4458 | 1.0020 | 0.2853 | 0.4248 | 0.9341 |
| DINOv3+VGGT | 0.3068 | 0.4553 | 0.9909 | 0.3612 | 0.5315 | 1.1989 | 0.3833 | 0.5649 | 1.3307 | 0.3504 | 0.5172 | 1.1735 |
| DINOv2†+VGGT | 0.3068 | 0.4553 | 0.9906 | 0.3612 | 0.5314 | 1.1987 | 0.3833 | 0.5649 | 1.3304 | 0.3504 | 0.5172 | 1.1732 |
| RobustVGGT-A | 0.2732 | 0.4094 | 0.8719 | 0.2792 | 0.4153 | 0.8806 | 0.2930 | 0.4349 | 0.9310 | 0.2818 | 0.4199 | 0.8945 |
| RobustVGGT-F | 0.2641 | 0.3936 | 0.8420 | 0.2645 | 0.3949 | 0.8402 | 0.2664 | 0.3973 | 0.8388 | 0.2650 | 0.3953 | 0.8403 |
| On-the-Go | ||||||||||||
| MAST3R-SfM | 0.0754 | 0.1548 | 2.0281 | 0.0749 | 0.1474 | 1.9934 | 0.0768 | 0.1510 | 2.0328 | 0.0757 | 0.1511 | 2.0181 |
| VGGT | 0.0788 | 0.1239 | 1.0261 | 0.1281 | 0.1963 | 1.4194 | 0.1562 | 0.2393 | 1.7315 | 0.1210 | 0.1865 | 1.3923 |
| MegaLoc+VGGT | 0.0658 | 0.1056 | 0.9090 | 0.1044 | 0.1612 | 1.1516 | 0.1229 | 0.1881 | 1.3315 | 0.0977 | 0.1516 | 1.1307 |
| MegaLoc*+VGGT | 0.0529 | 0.0872 | 0.8366 | 0.0580 | 0.0942 | 0.8812 | 0.0637 | 0.1033 | 0.9464 | 0.0582 | 0.0949 | 0.8881 |
| DINOv3+VGGT | 0.0675 | 0.1077 | 0.9184 | 0.1075 | 0.1667 | 1.2201 | 0.1312 | 0.2025 | 1.4602 | 0.1021 | 0.1589 | 1.1996 |
| DINOv2†+VGGT | 0.0788 | 0.1239 | 1.0261 | 0.1281 | 0.1963 | 1.4194 | 0.1562 | 0.2393 | 1.7315 | 0.1210 | 0.1865 | 1.3923 |
| RobustVGGT-A | 0.0578 | 0.0952 | 0.8800 | 0.0790 | 0.1253 | 1.2922 | 0.0697 | 0.1126 | 1.0361 | 0.0688 | 0.1110 | 1.0694 |
| RobustVGGT-F | 0.0521 | 0.0861 | 0.8179 | 0.0568 | 0.0931 | 0.8914 | 0.0660 | 0.1055 | 0.9872 | 0.0583 | 0.0949 | 0.8988 |
| RobustNeRF | ||||||||||||
| MAST3R-SfM | 0.1153 | 0.2597 | 2.7124 | 0.1196 | 0.2650 | 2.7617 | 0.1182 | 0.2626 | 2.6882 | 0.1177 | 0.2624 | 2.7208 |
| VGGT | 0.1519 | 0.2742 | 1.2052 | 0.1598 | 0.2908 | 1.2311 | 0.1680 | 0.3062 | 1.3493 | 0.1599 | 0.2904 | 1.2619 |
| MegaLoc+VGGT | 0.1496 | 0.2692 | 1.1920 | 0.1573 | 0.2847 | 1.2199 | 0.1618 | 0.2934 | 1.2721 | 0.1562 | 0.2824 | 1.2280 |
| MegaLoc*+VGGT | 0.1352 | 0.2418 | 1.1707 | 0.1358 | 0.2416 | 1.1663 | 0.1356 | 0.2394 | 1.1552 | 0.1355 | 0.2409 | 1.1641 |
| DINOv3+VGGT | 0.1516 | 0.2738 | 1.2032 | 0.1595 | 0.2902 | 1.2296 | 0.1677 | 0.3055 | 1.3426 | 0.1596 | 0.2898 | 1.2585 |
| DINOv2†+VGGT | 0.1519 | 0.2742 | 1.2052 | 0.1598 | 0.2908 | 1.2311 | 0.1680 | 0.3062 | 1.3493 | 0.1599 | 0.2904 | 1.2619 |
| RobustVGGT-A | 0.1361 | 0.2433 | 1.1766 | 0.1379 | 0.2447 | 1.1657 | 0.1406 | 0.2478 | 1.1632 | 0.1382 | 0.2453 | 1.1685 |
| RobustVGGT-F | 0.1388 | 0.2480 | 1.1656 | 0.1374 | 0.2432 | 1.1670 | 0.1374 | 0.2415 | 1.1514 | 0.1379 | 0.2442 | 1.1613 |
| ETH3D | ||||||||||||
| MAST3R-SfM | 2.3871 | 4.2586 | 77.368 | 2.3931 | 4.2766 | 78.521 | 2.3796 | 4.1958 | 76.4021 | 2.3866 | 4.2437 | 77.4303 |
| VGGT | 0.8572 | 1.3675 | 6.3908 | 0.9182 | 1.5028 | 9.6272 | 1.0165 | 1.7348 | 15.2774 | 0.9306 | 1.5350 | 10.4318 |
| MegaLoc+VGGT | 0.9275 | 1.4470 | 5.5320 | 0.9233 | 1.4789 | 7.5607 | 0.9639 | 1.5977 | 11.4364 | 0.9382 | 1.5079 | 8.1764 |
| MegaLoc*+VGGT | 0.9170 | 1.4904 | 3.7593 | 0.9418 | 1.5194 | 4.3126 | 0.9800 | 1.5802 | 5.7108 | 0.9463 | 1.5300 | 4.5942 |
| DINOv3+VGGT | 0.8551 | 1.3667 | 6.2305 | 0.9113 | 1.4891 | 9.4103 | 1.0027 | 1.7022 | 14.9265 | 0.9230 | 1.5193 | 10.1891 |
| DINOv2†+VGGT | 0.8556 | 1.3661 | 6.2515 | 0.9188 | 1.5022 | 9.5374 | 1.0165 | 1.7346 | 15.2582 | 0.9303 | 1.5343 | 10.3490 |
| RobustVGGT-A | 0.7447 | 1.1724 | 3.8938 | 0.7673 | 1.2123 | 3.7325 | 0.6874 | 1.0708 | 4.4779 | 0.7331 | 1.1518 | 4.0347 |
| RobustVGGT-F | 0.6224 | 1.0300 | 2.7304 | 0.8038 | 1.3159 | 2.9959 | 0.8636 | 1.3882 | 3.4866 | 0.7633 | 1.2447 | 3.0710 |
Table 1. Camera pose estimation across noise levels. * denotes per-dataset hyperparameter tuning with oracle knowledge of the number of clean images in the test set; these entries are shaded as they are not directly comparable. † uses DINOv2 features extracted from VGGT.
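The evaluation protocol behind the ATE and RPE columns is not spelled out in this section. As a hedged illustration, the sketch below follows one common convention: ATE as the root-mean-square position error after a least-squares similarity (Umeyama) alignment of the estimated camera centers to the ground truth, and rotation error as the geodesic angle between two rotation matrices. The function names are hypothetical, and the alignment and averaging choices are assumptions rather than the settings used to produce Table 1.

```python
import numpy as np


def umeyama_align(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src onto dst.

    src, dst: (N, 3) arrays of corresponding camera centers.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    # Cross-covariance between the centered target and source points.
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Reflection guard: force det(R) = +1.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t


def ate_rmse(est_pos, gt_pos):
    """RMSE of camera-center error after similarity alignment (assumed ATE)."""
    s, R, t = umeyama_align(est_pos, gt_pos)
    aligned = s * (est_pos @ R.T) + t
    return float(np.sqrt(((aligned - gt_pos) ** 2).sum(axis=1).mean()))


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

A relative rotation error under this convention would apply `rotation_error_deg` to relative rotations between pairs of views instead of absolute ones; which pairs are used is again an assumption left open here.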