Performance Gains over Baseline

CORAL consistently improves performance over the baseline, preserving global garment shape while recovering fine local details. CORAL is model-agnostic: it plugs into any DiT architecture that exposes a person→garment attention map, which the CORAL loss directly supervises to enforce correspondence.
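To make this plug-in point concrete, the sketch below slices a person→garment attention map out of a block's full attention matrix. The token layout (garment tokens first, then person tokens) and all shapes are illustrative assumptions, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def person_to_garment_attention(attn, n_garment):
    # attn: (n_tokens, n_tokens) row-softmaxed full attention from one
    # DiT block; garment tokens are assumed to come first in the sequence.
    # Rows index queries and columns index keys, so this sub-block holds
    # person queries attending to garment keys: the map CORAL supervises.
    return attn[n_garment:, :n_garment]

n_garment, n_person = 6, 10
attn = softmax(np.random.randn(n_garment + n_person, n_garment + n_person))
p2g = person_to_garment_attention(attn, n_garment)
```

Because the slicing only needs the token layout, any DiT variant with joint person/garment attention could expose this map.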


[Interactive comparison gallery: VITON-HD and DressCode examples (upper, lower, and dress garments). Each example pairs the garment and person inputs with a slider comparing the try-on result w/o CORAL against w/ CORAL.]

Application: Person-to-Person

We evaluate CORAL on more challenging, less curated inputs using paired and unpaired sets built from PPR10K. The person images are in-the-wild full-body photos with broad variation in pose, viewpoint, background, framing, and occlusion. The garment references are worn garments observed on another clothed person, a setting we refer to as person-to-person garment transfer, in which it is harder to determine what should be transferred from the reference.


[Interactive gallery: person-to-person garment transfer on PPR10K, shown for CORAL and CORAL-P2P (upper, lower, and dress garments). Each example pairs the input person with the try-on result via a slider.]

Abstract

Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person→garment correspondence is required. These methods do not explicitly enforce person→garment alignment, nor do they explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architectures and reveal that person→garment correspondence critically depends on precise query-key matching within the full 3D attention. Building on this insight, we introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person→garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol that better reflects human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation, and extensive ablations validate our design choices.

Motivation

We compare attention-derived correspondences with pseudo ground truth from DINOv3 and find that query-key matching accuracy correlates with try-on quality. High-quality outputs exhibit sharp, well-localized person→garment attention, whereas low-quality outputs show dispersed or misplaced attention. Across 540 VITON-HD test pairs, higher PCK correlates with higher SSIM and lower LPIPS, motivating direct supervision of the attention toward stronger correspondence. Pink markers denote the attention query points.
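The analysis relies on PCK between attention-derived matches and the DINOv3 pseudo ground truth. Below is a minimal sketch of that metric, assuming the common threshold of α·max(H, W) and predicted points taken as the argmax of each query's attention row mapped to pixel coordinates; the paper's exact protocol may differ.

```python
import numpy as np

def pck(pred_pts, gt_pts, img_size, alpha=0.1):
    # Percentage of Correct Keypoints: a predicted garment location (e.g.
    # the argmax of a person query's attention row, mapped to pixels) is
    # correct if it lies within alpha * max(H, W) of the pseudo-GT point.
    thresh = alpha * max(img_size)
    dists = np.linalg.norm(np.asarray(pred_pts, float) - np.asarray(gt_pts, float), axis=-1)
    return float((dists <= thresh).mean())

pred = [[12, 14], [40, 80], [90, 10]]
gt = [[10, 15], [70, 75], [88, 12]]
score = pck(pred, gt, img_size=(256, 192), alpha=0.1)  # 2 of 3 within 25.6 px
```

Computing this per image and plotting it against SSIM/LPIPS reproduces the kind of correlation analysis described above.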


Method Overview


CORAL builds on a baseline architecture that constructs the noisy latent by horizontally concatenating the noisy garment and person latents, then channel-wise concatenates the conditioning canvas and mask canvas with the noisy latent before the input projection layer. Pose is injected by adding pose latents as extra tokens, with RoPE configured so that person and pose tokens share spatial positions. CORAL supervises the person→garment matching cost estimated from MM-Attention within the DiT blocks using two complementary loss terms: a correspondence distillation loss that aligns the matching with pseudo ground-truth correspondences, and an entropy loss that encourages sharper, more localized matches.
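A minimal sketch of the two loss terms on a person→garment attention map. The shapes, the use of a single hard pseudo-GT index per query, and the validity mask for unreliable DINOv3 matches are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def coral_losses(p2g_attn, gt_match, eps=1e-8):
    # p2g_attn: (n_person, n_garment), each row a softmaxed attention
    # distribution over garment tokens.
    # gt_match: (n_person,) index of the pseudo-GT garment token for each
    # person query, or -1 where DINOv3 yields no reliable match.
    valid = gt_match >= 0
    # Correspondence distillation: cross-entropy pulling each reliable
    # person query's attention mass onto its matched garment token.
    l_distill = -np.log(p2g_attn[valid, gt_match[valid]] + eps).mean()
    # Entropy minimization: sharpen every query's attention distribution.
    l_entropy = -(p2g_attn * np.log(p2g_attn + eps)).sum(-1).mean()
    return l_distill, l_entropy

attn = np.array([[0.9, 0.1], [0.5, 0.5]])
l_d, l_e = coral_losses(attn, np.array([0, -1]))
```

In training, the two terms would be weighted and added to the diffusion objective; the weighting is not shown here.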

Results

Qualitative Results


[Interactive qualitative comparison: garment and person inputs with try-on results from OOTDiffusion, CatVTON, IDM-VTON, Any2AnyTryOn, and Ours.]

Quantitative Results

VITON-HD

| Methods | SSIM (↑) | LPIPS (↓) | Paired FID (↓) | Paired KID (↓) | Unpaired FID (↓) | Unpaired KID (↓) |
|---|---|---|---|---|---|---|
| GPVTON | 0.878 | 0.067 | 8.938 | 4.257 | 11.993 | 4.570 |
| StableVITON | 0.888 | 0.073 | 8.233 | 0.490 | 9.026 | 3.029 |
| OOTDiffusion | 0.842 | 0.087 | 6.619 | 0.845 | 9.938 | 1.302 |
| IDM-VTON | 0.866 | 0.062 | 6.009 | 0.838 | 9.198 | 1.203 |
| CatVTON | 0.874 | 0.058 | 5.458 | 0.439 | 9.076 | 1.184 |
| Any2AnyTryOn | 0.838 | 0.087 | 5.482 | 0.384 | 9.623 | 1.601 |
| CORAL (w/o L_CORAL) | 0.889 | 0.055 | 5.543 | 0.870 | 9.641 | 1.323 |
| CORAL (w/ L_CORAL) | 0.907 | 0.048 | 4.962 | 0.565 | 8.763 | 0.880 |

SSIM and LPIPS are computed in the paired setting.
DressCode

| Methods | SSIM (↑) | LPIPS (↓) | Paired FID (↓) | Paired KID (↓) | Unpaired FID (↓) | Unpaired KID (↓) |
|---|---|---|---|---|---|---|
| GPVTON | 0.918 | 0.068 | 8.423 | 2.439 | 9.144 | 3.936 |
| OOTDiffusion | 0.886 | 0.069 | 5.082 | 1.377 | 9.276 | 4.009 |
| IDM-VTON | 0.904 | 0.052 | 3.472 | 0.882 | 5.343 | 1.321 |
| CatVTON | 0.875 | 0.075 | 5.384 | 1.903 | 7.998 | 3.242 |
| Any2AnyTryOn* | - | - | - | - | 5.573 | 1.458 |
| CORAL (w/o L_CORAL) | 0.908 | 0.045 | 2.896 | 0.418 | 5.221 | 1.315 |
| CORAL (w/ L_CORAL) | 0.927 | 0.029 | 2.333 | 0.401 | 4.692 | 0.846 |

\* Does not support the paired setting.
PPR10K

| Methods | SSIM (↑) | LPIPS (↓) | Paired FID (↓) | Paired KID (↓) | Unpaired FID (↓) | Unpaired KID (↓) |
|---|---|---|---|---|---|---|
| OOTDiffusion | 0.804 | 0.130 | 99.492 | 15.592 | 87.818 | 15.467 |
| IDM-VTON | 0.844 | 0.111 | 64.250 | 1.911 | 63.638 | 3.600 |
| CatVTON | 0.743 | 0.147 | 81.722 | 10.979 | 76.417 | 11.207 |
| Any2AnyTryOn* | - | - | - | - | 49.728 | 0.421 |
| CORAL (w/o L_CORAL) | 0.877 | 0.078 | 44.117 | 0.015 | 56.644 | 1.202 |
| CORAL (w/ L_CORAL) | 0.915 | 0.060 | 43.648 | 0.011 | 53.164 | 1.101 |

\* Does not support the paired setting.
VLM-based Evaluation

Train/Test: VITON-HD/VITON-HD (left), DressCode/DressCode (middle), DressCode/PPR10K (right). All metrics VTC, TAC, FPC are higher-is-better (↑).

| Methods | VTC | TAC | FPC | VTC | TAC | FPC | VTC | TAC | FPC |
|---|---|---|---|---|---|---|---|---|---|
| GPVTON | 3.93 | 4.28 | 4.17 | 3.19 | 3.31 | 3.14 | - | - | - |
| OOTDiffusion | 3.97 | 4.34 | 4.12 | 3.25 | 3.24 | 3.45 | 1.04 | 2.06 | 1.57 |
| IDM-VTON | 3.96 | 4.40 | 4.25 | 3.25 | 3.28 | 3.61 | 1.72 | 2.21 | 2.37 |
| CatVTON | 3.82 | 4.20 | 4.06 | 3.07 | 3.22 | 3.53 | 1.05 | 2.00 | 2.01 |
| Any2AnyTryOn | 3.81 | 4.27 | 4.19 | 2.99 | 3.28 | 3.36 | 1.92 | 2.45 | 2.48 |
| CORAL (Ours) | 3.99 | 4.40 | 4.26 | 3.47 | 3.31 | 3.83 | 2.07 | 2.56 | 2.89 |

For more information about the VLM-based evaluation metrics, please refer to the paper.
User Study
[User study results figure.]

Conclusion

We analyze how person–garment correspondence is established within DiT-based VTON and show that RGB-space alignment depends on accurate query–key correspondence. We introduce CORAL, a framework that aligns person–garment query–key matches with robust DINOv3 correspondences through a correspondence distillation loss and an entropy minimization loss to sharpen attention. CORAL achieves state-of-the-art performance across all standard benchmarks as well as an in-the-wild benchmark. Extensive analyses and ablation studies further demonstrate the effectiveness of our design choices.

Citation

@misc{kim2026coralcorrespondencealignmentimproved,
  title={CORAL: Correspondence Alignment for Improved Virtual Try-On},
  author={Jiyoung Kim and Youngjin Shin and Siyoon Jin and Dahyun Chung and Jisu Nam and Tongmin Kim and Jongjae Park and Hyeonwoo Kang and Seungryong Kim},
  year={2026},
  eprint={2602.17636},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.17636}
}

If you have any questions, please contact: kplove01@kaist.ac.kr