Overview of UniT
We present UniT (Unified Diffusion Transformer), a text-aware image restoration framework that unifies a Diffusion Transformer (DiT), a Vision–Language Model (VLM), and a Text Spotting Module (TSM) for high-fidelity restoration of text in degraded images. Each component serves a distinct role. The VLM extracts textual content from the degraded image to provide initial textual guidance. The TSM generates intermediate OCR predictions at each denoising timestep, which the VLM uses to iteratively correct potential textual errors. The DiT-based restoration module, with its strong representational capacity, uses this textual guidance to achieve fine-grained text restoration while suppressing text hallucinations.
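The interaction between the three components can be pictured as a loop over denoising timesteps. The following is only a minimal sketch under stated assumptions, not the UniT implementation: the function names (`restore_with_text_guidance`, `vlm_read`, `vlm_correct`, `tsm_spot`, `dit_step`), the list-of-floats latent, and the order of operations within a step are all hypothetical, and the toy stubs stand in for real models.

```python
from typing import Callable, List, Tuple

Latent = List[float]  # toy stand-in for an image latent (assumption)


def restore_with_text_guidance(
    degraded: Latent,
    vlm_read: Callable[[Latent], str],
    vlm_correct: Callable[[Latent, str, str], str],
    tsm_spot: Callable[[Latent, int], str],
    dit_step: Callable[[Latent, str, int], Latent],
    num_steps: int,
) -> Tuple[Latent, List[str]]:
    """Hypothetical guidance loop: VLM prompt -> DiT step -> TSM OCR -> VLM correction."""
    prompt = vlm_read(degraded)      # initial textual guidance from the VLM
    latent = list(degraded)
    prompts = [prompt]               # keep the guidance history for inspection
    for t in reversed(range(num_steps)):
        latent = dit_step(latent, prompt, t)          # one text-conditioned denoising step
        ocr = tsm_spot(latent, t)                     # intermediate OCR prediction
        prompt = vlm_correct(degraded, prompt, ocr)   # VLM revises the guidance
        prompts.append(prompt)
    return latent, prompts


# Toy stubs (assumptions, for illustration only).
def toy_vlm_read(image: Latent) -> str:
    return "HELL0"  # the initial reading contains an error


def toy_tsm_spot(latent: Latent, t: int) -> str:
    # Pretend the OCR prediction becomes correct late in denoising.
    return "HELLO" if t <= 1 else "HELL0"


def toy_vlm_correct(image: Latent, prompt: str, ocr: str) -> str:
    return ocr or prompt  # adopt the OCR prediction when one is available


def toy_dit_step(latent: Latent, prompt: str, t: int) -> Latent:
    return [0.5 * (x + 1.0) for x in latent]  # move halfway toward a "clean" value of 1.0


restored, prompts = restore_with_text_guidance(
    [0.0, 0.0], toy_vlm_read, toy_vlm_correct, toy_tsm_spot, toy_dit_step, num_steps=4
)
print(prompts)
```

In this toy run the guidance starts from an erroneous reading and is corrected once the intermediate OCR prediction improves, which mirrors the iterative correction described above.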
Qualitative Results
Text Restoration Results on SA-Text and Real-Text Benchmarks
We present qualitative text restoration results on the SA-Text and Real-Text benchmarks. Even under severe degradations that impair both readability and style, UniT combines the VLM's visual–linguistic prior with character-level OCR predictions from the TSM to give the DiT restoration module accurate textual guidance, and it recovers the degraded text. In contrast, existing methods frequently fail and often produce hallucinated text.
[Figure: qualitative comparisons on SA-Text (Level 1), SA-Text (Level 2), SA-Text (Level 3), and Real-Text]
Quantitative Results
Text Restoration Results on SA-Text and Real-Text Benchmarks
Text restoration performance is evaluated using detection and end-to-end (E2E) text-spotting metrics. Detection metrics measure how accurately text regions are localized, while E2E metrics assess whether the recognized text matches the ground-truth annotations. UniT outperforms all prior models by leveraging VLM-derived visual–linguistic priors and TSM OCR predictions, achieving the highest restoration accuracy across all SA-Text degradation levels and on the Real-Text benchmark.
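To make the difference between the two metric families concrete, here is a small sketch of how detection and E2E scores might be computed. It is an illustration under assumptions, not the benchmarks' official protocol: it uses axis-aligned boxes, greedy matching at an IoU threshold of 0.5, and exact string comparison, and the names `iou` and `spotting_scores` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned (assumption)
Word = Tuple[Box, str]                   # a text region and its transcription


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spotting_scores(
    preds: List[Word], gts: List[Word], iou_thr: float = 0.5, require_text: bool = False
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1).

    require_text=False scores detection only; require_text=True also demands
    that the transcription equals the ground truth (E2E-style scoring).
    """
    matched = set()
    tp = 0
    for p_box, p_text in preds:
        best_j, best_iou = -1, iou_thr
        for j, (g_box, g_text) in enumerate(gts):
            if j in matched:
                continue  # each ground-truth word can be matched once
            if require_text and p_text != g_text:
                continue
            overlap = iou(p_box, g_box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy example: both words are localized, but one is misread.
gts = [((0, 0, 10, 10), "OPEN"), ((20, 0, 30, 10), "CAFE")]
preds = [((0, 0, 10, 10), "OPEN"), ((20, 0, 30, 10), "CAFF")]

det = spotting_scores(preds, gts)                     # detection only
e2e = spotting_scores(preds, gts, require_text=True)  # detection + recognition
print(det, e2e)
```

In the toy example the misread word still counts as a correct detection but not as a correct E2E match, so a restoration that yields legible but wrong text would score well on detection and poorly on E2E.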
[Results shown for: SA-Text (Level 1, Level 2, Level 3); Real-Text]
Citation
If you use this work or find it helpful, please consider citing:
@article{kim2025unit,
title={Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration},
author={Kim, Jin Hyeon and Cho, Paul Hyunbin and Kim, Claire and Min, Jaewon and Lee, Jaeeun and Park, Jihye and Choi, Yeji and Kim, Seungryong},
journal={arXiv preprint arXiv:2512.08922},
year={2025},
}