Performance Gains over Baseline
CORAL consistently improves over the baseline, preserving global garment shape while recovering fine local details. CORAL is model-agnostic and plugs into any DiT architecture with a person→garment attention map, where the CORAL loss directly supervises correspondence.
Application: Person-to-Person
We evaluate CORAL on more challenging, less curated inputs using paired and unpaired sets built from PPR10K. Person images are in-the-wild full-body photos with broad variation in pose, viewpoint, background, framing, and occlusion. Garment references are worn-garment images observed on another clothed person, a setting we refer to as person-to-person garment transfer, where it is harder to determine what should be transferred from the reference.
Abstract
Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person→garment correspondence is required. These methods do not explicitly enforce person→garment alignment, nor do they explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architectures and reveal that person→garment correspondence critically depends on precise person→garment query-key matching within the full 3D attention. Building on this insight, we introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns person→garment attention with reliable matches, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol that better reflects human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
Motivation
We compare attention-derived correspondences with pseudo ground truth from DINOv3 and find that query-key matching accuracy tracks try-on quality: high-quality outputs show sharp, well-localized person→garment attention, while low-quality outputs show dispersed or misplaced attention. Across 540 VITON-HD test pairs, higher PCK correlates with higher SSIM and lower LPIPS, motivating direct supervision of attention toward stronger correspondence. Pink markers denote the attention query points.
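The PCK measurement above can be sketched as follows. This is a minimal illustration, not the paper's code: `attn_match` and `gt_match` are hypothetical arrays holding, for the same person-side query points, the garment-image locations predicted by attention argmax and by the DINOv3 pseudo ground truth, and the pixel threshold is an assumed value.

```python
import numpy as np

def pck(pred_pts: np.ndarray, gt_pts: np.ndarray, thresh: float) -> float:
    """Percentage of Correct Keypoints: fraction of predicted garment
    locations within `thresh` pixels of the pseudo ground truth."""
    dists = np.linalg.norm(pred_pts - gt_pts, axis=-1)
    return float((dists <= thresh).mean())

# Toy example: 4 person-side query points matched into the garment image.
gt_match   = np.array([[10.0, 12.0], [40.0, 8.0], [25.0, 30.0], [5.0, 50.0]])
attn_match = np.array([[11.0, 13.0], [42.0, 9.0], [60.0, 70.0], [5.0, 49.0]])

print(pck(attn_match, gt_match, thresh=5.0))  # 3 of 4 within 5 px -> 0.75
```

The paper's correlation analysis then compares this per-image PCK against SSIM/LPIPS of the corresponding try-on output.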
Method Overview
CORAL builds upon a baseline architecture that constructs the noisy latent by horizontally concatenating the noisy garment and person latents, and then channel-wise concatenates the conditioning canvas and mask canvas with the noisy latent before the input projection layer. Pose is injected by adding pose latents as tokens, with RoPE set to share spatial positions between person and pose tokens. CORAL is applied to the person→garment matching cost estimated from MM-Attention within DiT blocks using two complementary loss terms: the correspondence distillation loss aligns the matching to pseudo ground-truth correspondences, while the entropy loss encourages sharper, more localized matches.
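The two loss terms can be sketched roughly as below. This is our illustration under stated assumptions, not the released implementation: `cost` stands in for the person→garment matching cost read out of MM-Attention, `gt_idx` for the pseudo-ground-truth garment token index of each person query (e.g. from DINOv3 correspondences), and the softmax readout and uniform averaging over queries are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exp
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coral_losses(cost: np.ndarray, gt_idx: np.ndarray, eps: float = 1e-9):
    """cost:   (Nq, Nk) person->garment matching cost (e.g. scaled QK^T).
    gt_idx: (Nq,) pseudo-GT garment token index per person query.
    Returns (distillation loss, entropy loss), averaged over queries."""
    p = softmax(cost, axis=-1)  # per-query attention distribution
    # Correspondence distillation: cross-entropy against pseudo-GT matches.
    l_distill = -np.log(p[np.arange(len(gt_idx)), gt_idx] + eps).mean()
    # Entropy minimization: encourage sharp, localized attention.
    l_entropy = -(p * np.log(p + eps)).sum(axis=-1).mean()
    return l_distill, l_entropy

rng = np.random.default_rng(0)
cost = rng.normal(size=(8, 16))          # 8 person queries, 16 garment keys
gt_idx = rng.integers(0, 16, size=8)
l_d, l_e = coral_losses(cost, gt_idx)
print(l_d, l_e)
```

In practice these terms would be weighted and added to the diffusion training objective; the weighting is not shown here.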
Results
Qualitative Results
Quantitative Results
| Methods | SSIM (↑) | LPIPS (↓) | Paired FID (↓) | Paired KID (↓) | Unpaired FID (↓) | Unpaired KID (↓) |
|---|---|---|---|---|---|---|
| GPVTON | 0.878 | 0.067 | 8.938 | 4.257 | 11.993 | 4.570 |
| StableVITON | 0.888 | 0.073 | 8.233 | 0.490 | 9.026 | 3.029 |
| OOTDiffusion | 0.842 | 0.087 | 6.619 | 0.845 | 9.938 | 1.302 |
| IDM-VTON | 0.866 | 0.062 | 6.009 | 0.838 | 9.198 | 1.203 |
| CatVTON | 0.874 | 0.058 | 5.458 | 0.439 | 9.076 | 1.184 |
| Any2AnyTryOn | 0.838 | 0.087 | 5.482 | 0.384 | 9.623 | 1.601 |
| CORAL (w/o L_CORAL) | 0.889 | 0.055 | 5.543 | 0.870 | 9.641 | 1.323 |
| CORAL (w/ L_CORAL) | 0.907 | 0.048 | 4.962 | 0.565 | 8.763 | 0.880 |
| Methods | SSIM (↑) | LPIPS (↓) | Paired FID (↓) | Paired KID (↓) | Unpaired FID (↓) | Unpaired KID (↓) |
|---|---|---|---|---|---|---|
| GPVTON | 0.918 | 0.068 | 8.423 | 2.439 | 9.144 | 3.936 |
| OOTDiffusion | 0.886 | 0.069 | 5.082 | 1.377 | 9.276 | 4.009 |
| IDM-VTON | 0.904 | 0.052 | 3.472 | 0.882 | 5.343 | 1.321 |
| CatVTON | 0.875 | 0.075 | 5.384 | 1.903 | 7.998 | 3.242 |
| Any2AnyTryOn* | - | - | - | - | 5.573 | 1.458 |
| CORAL (w/o L_CORAL) | 0.908 | 0.045 | 2.896 | 0.418 | 5.221 | 1.315 |
| CORAL (w/ L_CORAL) | 0.927 | 0.029 | 2.333 | 0.401 | 4.692 | 0.846 |
| Methods | SSIM (↑) | LPIPS (↓) | Paired FID (↓) | Paired KID (↓) | Unpaired FID (↓) | Unpaired KID (↓) |
|---|---|---|---|---|---|---|
| OOTDiffusion | 0.804 | 0.130 | 99.492 | 15.592 | 87.818 | 15.467 |
| IDM-VTON | 0.844 | 0.111 | 64.250 | 1.911 | 63.638 | 3.600 |
| CatVTON | 0.743 | 0.147 | 81.722 | 10.979 | 76.417 | 11.207 |
| Any2AnyTryOn* | - | - | - | - | 49.728 | 0.421 |
| CORAL (w/o L_CORAL) | 0.877 | 0.078 | 44.117 | 0.015 | 56.644 | 1.202 |
| CORAL (w/ L_CORAL) | 0.915 | 0.060 | 43.648 | 0.011 | 53.164 | 1.101 |
Train/Test settings: VITON-HD/VITON-HD (left three metric columns), DressCode/DressCode (middle three), DressCode/PPR10K (right three).

| Methods | VTC (↑) | TAC (↑) | FPC (↑) | VTC (↑) | TAC (↑) | FPC (↑) | VTC (↑) | TAC (↑) | FPC (↑) |
|---|---|---|---|---|---|---|---|---|---|
| GPVTON | 3.93 | 4.28 | 4.17 | 3.19 | 3.31 | 3.14 | - | - | - |
| OOTDiffusion | 3.97 | 4.34 | 4.12 | 3.25 | 3.24 | 3.45 | 1.04 | 2.06 | 1.57 |
| IDM-VTON | 3.96 | 4.40 | 4.25 | 3.25 | 3.28 | 3.61 | 1.72 | 2.21 | 2.37 |
| CatVTON | 3.82 | 4.20 | 4.06 | 3.07 | 3.22 | 3.53 | 1.05 | 2.00 | 2.01 |
| Any2AnyTryOn | 3.81 | 4.27 | 4.19 | 2.99 | 3.28 | 3.36 | 1.92 | 2.45 | 2.48 |
| CORAL (Ours) | 3.99 | 4.40 | 4.26 | 3.47 | 3.31 | 3.83 | 2.07 | 2.56 | 2.89 |
Conclusion
We analyze how person–garment correspondence is established within DiT-based VTON and show that RGB-space alignment depends on accurate query–key correspondence. We introduce CORAL, a framework that aligns person–garment query–key matches with robust DINOv3 correspondences through a correspondence distillation loss and an entropy minimization loss to sharpen attention. CORAL achieves state-of-the-art performance across all standard benchmarks as well as an in-the-wild benchmark. Extensive analyses and ablation studies further demonstrate the effectiveness of our design choices.
Citation
@misc{kim2026coralcorrespondencealignmentimproved,
  title={CORAL: Correspondence Alignment for Improved Virtual Try-On},
  author={Jiyoung Kim and Youngjin Shin and Siyoon Jin and Dahyun Chung and Jisu Nam and Tongmin Kim and Jongjae Park and Hyeonwoo Kang and Seungryong Kim},
  year={2026},
  eprint={2602.17636},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.17636},
}
If you have any questions, please contact: kplove01@kaist.ac.kr