ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

¹Johns Hopkins University, ²AMD

*Equal contribution. †Work done during an internship at AMD.

Abstract

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm, in which the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality, achieving an improvement of 10% over scalar-based reward models.

Verifier

We employ ImageDoctor as a verifier to distinguish subtle differences among generated images and reliably select the best candidate. It consistently favors images that align more closely with the prompt, often preferring those with more realistic and coherent details.
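The selection step described above can be sketched as a best-of-N loop. This is only an illustration under stated assumptions: the `Diagnosis` container, the `select_best` helper, and the `diagnose` callable are hypothetical names introduced here, not ImageDoctor's actual interface. The only detail taken from the page is that the model reports four scores (plausibility, semantic alignment, aesthetics, overall quality).

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")  # any image representation (path, array, PIL image, ...)


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical container for the four scores named in the abstract."""
    plausibility: float
    semantic_alignment: float
    aesthetics: float
    overall: float


def select_best(
    prompt: str,
    candidates: Sequence[Image],
    diagnose: Callable[[str, Image], Diagnosis],
    key: Callable[[Diagnosis], float] = lambda d: d.overall,
) -> Tuple[int, Diagnosis]:
    """Return the index and diagnosis of the highest-scoring candidate.

    `diagnose` stands in for a call to the evaluator (an assumption);
    `key` picks which score drives the ranking. Ties go to the earliest
    candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate image")
    diagnoses = [diagnose(prompt, image) for image in candidates]
    best_index = max(range(len(diagnoses)), key=lambda i: key(diagnoses[i]))
    return best_index, diagnoses[best_index]
```

Ranking by the overall score is a default chosen for this sketch; passing, for example, `key=lambda d: d.semantic_alignment` would instead favor the candidate that follows the prompt most closely.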

BibTeX

@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
  author        = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
  title         = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
  eprint        = {2510.01010},
  archivePrefix = {arXiv},
  year          = {2025},
  url           = {https://arxiv.org/abs/2510.01010},
}
