Overview of UniT

We present UniT (Unified Diffusion Transformer), a unified framework that combines a Diffusion Transformer (DiT), a Vision–Language Model (VLM), and a Text Spotting Module (TSM) to perform high-fidelity text-aware image restoration. Each component serves a distinct role. The VLM extracts textual content from degraded images to provide initial textual guidance. The TSM generates intermediate OCR predictions at each denoising timestep, allowing the VLM to iteratively correct potential textual errors. The DiT-based restoration module, with its strong representational capacity, fully leverages the textual guidance to achieve fine-grained text restoration while suppressing text hallucinations.
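The interaction between the three components can be sketched as the toy loop below. The class and method names (`extract_text`, `spot_text`, `correct_text`, `denoise_step`) are hypothetical stand-ins for illustration, not the authors' actual API, and the stub models only mimic the described behavior.

```python
# Toy stand-ins for the three UniT components; the real models and their
# interfaces are not public here, so these classes are illustrative only.

class VLM:
    def extract_text(self, image):
        # Initial textual guidance read off the degraded image.
        return "HELL0 W0RLD"  # degraded reading with OCR-style confusions

    def correct_text(self, guidance, ocr_pred):
        # Reconcile current guidance with the TSM's OCR prediction.
        return ocr_pred if ocr_pred else guidance

class TSM:
    def spot_text(self, latent, step):
        # Intermediate OCR prediction; improves as denoising progresses.
        return "HELLO WORLD" if step < 5 else ""

class DiT:
    def denoise_step(self, latent, t, cond, guidance):
        # One denoising step conditioned on the degraded image and the
        # textual guidance (identity in this toy sketch).
        return latent

def restore(degraded, vlm, tsm, dit, num_steps=10):
    # 1) VLM provides initial textual guidance from the degraded input.
    guidance = vlm.extract_text(degraded)
    x = degraded  # stand-in for an initial noisy latent
    for t in reversed(range(num_steps)):
        # 2) TSM produces an intermediate OCR prediction at this timestep.
        ocr_pred = tsm.spot_text(x, t)
        # 3) VLM iteratively corrects potential textual errors.
        guidance = vlm.correct_text(guidance, ocr_pred)
        # 4) DiT denoising step leverages the corrected guidance.
        x = dit.denoise_step(x, t, degraded, guidance)
    return x, guidance
```

Note how the guidance starts from a noisy VLM reading and converges to the TSM-corrected transcription as denoising proceeds, which mirrors the iterative correction described above.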

[Figure: UniT overview (teaser)]

Qualitative Results

Text Restoration Results on SA-Text and Real-Text Benchmarks

We present qualitative text restoration results on the SA-Text and Real-Text benchmarks. Despite severe degradations affecting readability and style, UniT leverages a rich visual–linguistic prior, guided by precise character-level OCR predictions, to provide accurate textual guidance to the DiT restoration module, effectively recovering the degraded text. In contrast, existing methods frequently fail, often producing hallucinated text.

SA-Text (Level 1)

Qualitative Result on SA-Text (Level 1)

SA-Text (Level 2)

Qualitative Result on SA-Text (Level 2)

SA-Text (Level 3)

Qualitative Result on SA-Text (Level 3)

Real-Text

Qualitative Result on Real-Text

Quantitative Results

Text Restoration Results on SA-Text and Real-Text Benchmarks

Text restoration performance is evaluated with text-spotting metrics: detection and end-to-end (E2E) recognition. Detection metrics measure the accuracy of locating text regions, while E2E metrics additionally assess the correctness of the recognized text against ground-truth annotations. Our proposed UniT framework outperforms all prior models by leveraging VLM-derived visual–linguistic priors and TSM OCR predictions, achieving the highest restoration accuracy across all SA-Text degradation levels and the Real-Text benchmark.
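The distinction between the two metrics can be illustrated with a simplified scorer: boxes are matched by IoU to give detection precision/recall/F1, and a matched box counts toward E2E only if its transcription also matches. This is a minimal sketch under a greedy one-to-one matching assumption; the actual benchmark protocol may use different matching rules and thresholds.

```python
def iou(a, b):
    # Axis-aligned boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def spotting_f1(preds, gts, iou_thr=0.5, e2e=False):
    # preds/gts: lists of (box, text) pairs; greedy one-to-one matching.
    matched_gt, tp = set(), 0
    for pbox, ptext in preds:
        for i, (gbox, gtext) in enumerate(gts):
            if i in matched_gt or iou(pbox, gbox) < iou_thr:
                continue
            # In E2E mode a spatial match only counts if the text matches.
            if e2e and ptext != gtext:
                continue
            matched_gt.add(i)
            tp += 1
            break
    p = tp / len(preds) if preds else 0.0
    r = tp / len(gts) if gts else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, a prediction that localizes every text region correctly but misreads one word scores a perfect detection F1 while losing E2E F1, which is exactly the gap that text hallucination produces.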

SA-Text (Level 1, Level 2, Level 3)

[Table: quantitative results on SA-Text (Levels 1–3)]

Real-Text

[Table: quantitative results on Real-Text]

Citation

If you use this work or find it helpful, please consider citing:

@article{kim2025unit,
title={Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration},
author={Kim, Jin Hyeon and Cho, Paul Hyunbin and Kim, Claire and Min, Jaewon and Lee, Jaeeun and Park, Jihye and Choi, Yeji and Kim, Seungryong},
journal={arXiv preprint arXiv:2512.08922},
year={2025},
}