Text-Aware Image Restoration
with Diffusion Models
arXiv 2025

Jaewon Min1*, Jin Hyeon Kim2*, Paul Hyunbin Cho1, Jaeeun Lee3, Jihye Park4,
Minkyu Park4, Sangpil Kim2†, Hyunhee Park4†, Seungryong Kim1†
1KAIST AI    2Korea University    3Yonsei University    4Samsung Electronics   
* Equal Contribution    † Co-corresponding Author


TL;DR
We introduce Text-Aware Image Restoration (TAIR), a new task focused on restoring both visual quality and text fidelity in degraded images. To support this, we present SA-Text, a large-scale dataset with detailed text annotations, and propose a multi-task diffusion framework, called TeReDiff, that leverages text-spotting features to enhance restoration.

Abstract

Image restoration aims to recover high-quality images from degraded inputs. However, despite their success on natural images, existing diffusion-based restoration methods often fail to faithfully reconstruct textual regions in degraded images, instead producing plausible-looking but incorrect text patterns, a failure mode we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of scene appearance and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose TeReDiff, a multi-task diffusion framework that feeds internal features from the diffusion model into a text-spotting module; the two are jointly trained so that each boosts the other's performance. This yields rich text representations, which are then used as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy.


SA-Text Dataset

SA-Text curation pipeline. A text-spotting module first detects text regions in the full image, and detection is then re-applied to image patches to recover instances missed at full resolution. The detected text is transcribed and verified by two VLMs, and only patches with consistent recognition results are annotated.
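A minimal sketch of this curation logic is shown below, assuming hypothetical detector and VLM interfaces (detect, crop, transcribe_a, transcribe_b); it illustrates the two-stage detection and the consistency check, not the released pipeline code.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    Box = Tuple[int, int, int, int]                      # (x0, y0, x1, y1) pixel box

    @dataclass
    class TextInstance:
        box: Box
        text: str

    def curate_image(
        image,                                           # full scene image
        detect: Callable[[object], List[Box]],           # text-spotting detector (assumed interface)
        crop: Callable[[object, Box], object],           # crops a patch around a box
        transcribe_a: Callable[[object], str],           # transcription from VLM #1
        transcribe_b: Callable[[object], str],           # transcription from VLM #2
    ) -> List[TextInstance]:
        annotations: List[TextInstance] = []
        # 1) detect text regions on the full image
        for region in detect(image):
            patch = crop(image, region)
            # 2) re-apply detection on the patch to recover instances missed at full scale
            for box in detect(patch):
                word = crop(patch, box)
                # 3) transcribe with two VLMs and keep only consistently recognized text
                t1, t2 = transcribe_a(word).strip(), transcribe_b(word).strip()
                if t1 and t1 == t2:
                    annotations.append(TextInstance(box=box, text=t1))
        return annotations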
Example images from SA-Text. Built on SA-1B, SA-Text comprises high-quality, diverse images featuring text in varied sizes, styles, and layouts, including curved, rotated, and otherwise complex forms, providing a robust foundation for the proposed TAIR task.

Architecture

Overview of the TeReDiff architecture, training, and inference pipeline. TeReDiff integrates a text-spotting module into a diffusion-based image restoration framework, using text supervision during training and the recognized text as a prompt at inference to enhance text-aware image restoration.
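To make the training and inference flow concrete, the sketch below shows a joint objective (a denoising loss plus a text-spotting loss computed on the denoiser's internal features) and the inference-time feedback in which recognized text becomes the prompt for later denoising steps. The module names and shapes (RestorationUNet, TextSpottingHead, the toy sampler update) are simplified stand-ins, not the actual TeReDiff implementation.

    import torch
    import torch.nn as nn

    class RestorationUNet(nn.Module):
        """Stand-in denoiser: predicts noise and exposes internal features."""
        def __init__(self, dim=64, prompt_dim=64):
            super().__init__()
            self.body = nn.Conv2d(3, dim, 3, padding=1)
            self.out = nn.Conv2d(dim, 3, 3, padding=1)
            self.prompt_proj = nn.Linear(prompt_dim, dim)

        def forward(self, x_t, prompt_emb):
            feats = self.body(x_t) + self.prompt_proj(prompt_emb)[..., None, None]
            return self.out(feats), feats            # (noise prediction, internal features)

    class TextSpottingHead(nn.Module):
        """Stand-in spotting head: maps denoiser features to character logits."""
        def __init__(self, dim=64, vocab=97, max_len=25):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(dim, vocab * max_len)
            self.vocab, self.max_len = vocab, max_len

        def forward(self, feats):
            logits = self.fc(self.pool(feats).flatten(1))
            return logits.view(-1, self.max_len, self.vocab)

    unet, spotter = RestorationUNet(), TextSpottingHead()
    opt = torch.optim.AdamW(list(unet.parameters()) + list(spotter.parameters()), lr=1e-4)

    # --- joint training step: denoising loss + text-spotting loss ---
    x_t = torch.randn(2, 3, 64, 64)                  # noisy input
    noise = torch.randn_like(x_t)                    # denoising target
    gt_chars = torch.randint(0, 97, (2, 25))         # ground-truth transcription ids
    prompt = torch.zeros(2, 64)                      # embedding of a generic prompt

    noise_pred, feats = unet(x_t, prompt)
    char_logits = spotter(feats)
    loss = (nn.functional.mse_loss(noise_pred, noise)
            + nn.functional.cross_entropy(char_logits.transpose(1, 2), gt_chars))
    loss.backward(); opt.step(); opt.zero_grad()

    # --- inference: recognized text is fed back as the prompt for later steps ---
    with torch.no_grad():
        x_t, prompt = torch.randn(1, 3, 64, 64), torch.zeros(1, 64)
        for _ in range(4):                           # a few denoising steps
            noise_pred, feats = unet(x_t, prompt)
            x_t = x_t - 0.1 * noise_pred             # toy update, not a real sampler
            chars = spotter(feats).argmax(-1)        # decoded character ids
            # a real system would convert `chars` to a string and embed it as the
            # next prompt; here we keep a zero embedding as a placeholder
            prompt = torch.zeros(1, 64)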

Qualitative Results


Quantitative Results

Quantitative results of text spotting on SA-Text. Each block shows the performance of various image restoration methods under different degradation strengths, evaluated using two text-spotting models. 'None' refers to recognition without a lexicon, and 'Full' denotes recognition with a full lexicon. Best results are in bold and second-best are underlined.
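As a point of reference for the 'None' vs. 'Full' settings: lexicon-based text-spotting evaluation typically snaps each raw prediction to the closest lexicon word by edit distance before comparing it against the ground truth. Below is a minimal sketch of that matching step (an illustration of the common protocol, not necessarily the exact evaluation code used here).

    def edit_distance(a: str, b: str) -> int:
        # standard dynamic-programming Levenshtein distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (ca != cb))) # substitution
            prev = cur
        return prev[-1]

    def match_with_lexicon(prediction: str, lexicon: list[str]) -> str:
        # 'Full' setting: replace the raw prediction with the closest lexicon word;
        # 'None' would compare the raw prediction directly
        return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))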

BibTeX


    @article{min2025text,
      title   = {Text-Aware Image Restoration with Diffusion Models},
      author  = {Jaewon Min and Jin Hyeon Kim and Paul Hyunbin Cho and Jaeeun Lee and Jihye Park and Minkyu Park and Sangpil Kim and Hyunhee Park and Seungryong Kim},
      journal = {arXiv preprint arXiv:2506.09993},
      year    = {2025}
    }