ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

1Johns Hopkins University, 2AMD
*Equal Contribution, Work done during internship at AMD

Comparison between ImageDoctor and scalar-based reward functions.


Left: ImageDoctor follows a look-think-predict paradigm, providing rich feedback with four-dimensional scores and heatmaps that highlight misalignment and artifact locations.


Right: Leveraging this fine-grained feedback, DenseFlow-GRPO generates images with more faithful and realistic local details, outperforming Flow-GRPO, which relies on the scalar-based reward PickScore.

Abstract

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality—achieving an improvement of 10% over scalar-based reward models.

Model Architecture

[Figure: ImageDoctor pipeline]

Given a prompt-image pair, the MLLM follows a "look-think-predict" paradigm for T2I evaluation: it localizes potential flaw regions, analyzes them, and generates holistic scores and special task tokens. Each task token, together with a learned heatmap token and the image features, is fed into the heatmap decoder to produce the misalignment and artifact heatmaps.
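The data flow into the heatmap decoder can be illustrated with a toy sketch. The page does not specify the decoder's internals, so everything below is an assumption made for illustration: the function name `decode_heatmap`, fusing the two tokens by addition, and scoring each image patch with a dot product followed by a sigmoid. The actual decoder is likely a learned network.

```python
import math
from typing import List

Vector = List[float]


def decode_heatmap(
    task_token: Vector,
    heatmap_token: Vector,
    patch_features: List[Vector],
    grid_width: int,
) -> List[List[float]]:
    """Toy stand-in for a heatmap decoder (hypothetical, not the authors' design)."""
    # Fuse the task token and the learned heatmap token into one query.
    # Simple addition is an assumption chosen to keep the sketch short.
    query = [t + h for t, h in zip(task_token, heatmap_token)]
    # Score each patch by its dot product with the query, squashed to (0, 1).
    flat = [
        1.0 / (1.0 + math.exp(-sum(q * f for q, f in zip(query, feat))))
        for feat in patch_features
    ]
    # Reshape the flat list of patch scores into rows of the patch grid.
    return [flat[i:i + grid_width] for i in range(0, len(flat), grid_width)]


# Four patches on a 2x2 grid with 2-dimensional features.
heat = decode_heatmap(
    task_token=[1.0, 0.0],
    heatmap_token=[0.0, 1.0],
    patch_features=[[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0], [0.0, 0.0]],
    grid_width=2,
)
```

Running the same function once with the misalignment task token and once with the artifact task token would yield the two heatmaps described above.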

Quantitative Results

[Figure: quantitative results]

Qualitative Results

Heatmap Visualization

[Figure: heatmap visualization]

Example Response

Given an image-prompt pair, ImageDoctor first localizes potential flaw regions, and its reasoning closely matches its heatmap predictions. It also detects artifacts in the image. Finally, the heatmaps accurately mark the misaligned and implausible areas, demonstrating ImageDoctor's strong localization and reasoning capabilities and its alignment with human preferences.

[Figure: example response]

Downstream Applications

Verifier

[Figure: verifier examples]

We employ ImageDoctor as a verifier to distinguish subtle differences among generated images and reliably select the best candidate. It consistently favors images that align more closely with the prompt, often preferring those with more realistic and coherent details.
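Verifier-style selection of this kind amounts to best-of-N sampling: score every candidate and keep the highest-ranked one. The sketch below assumes a hypothetical `score_fn` callable that wraps the evaluator and returns the four scores named in the abstract. The `Scores` class, the `select_best` helper, and the choice to rank by the overall score by default are illustrative assumptions, not the released interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


@dataclass(frozen=True)
class Scores:
    """The four evaluation dimensions (field names are illustrative)."""
    plausibility: float
    alignment: float
    aesthetics: float
    overall: float


def select_best(
    candidates: Sequence[Image],
    score_fn: Callable[[Image], Scores],
    key: Callable[[Scores], float] = lambda s: s.overall,
) -> Image:
    """Return the candidate whose score ranks highest under `key`."""
    if not candidates:
        raise ValueError("need at least one candidate image")
    return max(candidates, key=lambda image: key(score_fn(image)))


# Stand-in scorer: a lookup table instead of a real model call.
fake_scores = {
    "img_a": Scores(0.6, 0.7, 0.9, 0.70),
    "img_b": Scores(0.9, 0.9, 0.6, 0.85),
    "img_c": Scores(0.5, 0.4, 0.7, 0.50),
}
best_overall = select_best(list(fake_scores), fake_scores.__getitem__)
best_looking = select_best(
    list(fake_scores), fake_scores.__getitem__, key=lambda s: s.aesthetics
)
```

Passing a different `key` ranks the candidates by a single dimension, such as semantic alignment, rather than by the overall score.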

Reward Function

[Figure: reward function examples]

We incorporate heatmap-guided dense rewards from ImageDoctor to enable finer-grained optimization. With these rewards, DenseFlow-GRPO directly targets and refines local regions, reducing localized flaws.
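One way to picture a dense reward is as a scalar advantage spread unevenly over the image according to the flaw heatmap. The sketch below is a guess at the general idea, not the DenseFlow-GRPO formulation, which this page does not spell out. The group-normalized advantage follows the usual GRPO recipe. The weighting rule (penalties concentrate on flawed pixels, credit concentrates on clean pixels) and both function names are hypothetical.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one sample group to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dense_advantage(
    advantage: float, flaw_heatmap: List[List[float]]
) -> List[List[float]]:
    """Spread a scalar advantage over pixels (hypothetical weighting rule).

    Heatmap values lie in [0, 1], where 1 marks a flawed pixel.
    A negative advantage is weighted by the heatmap, so flawed pixels take the
    penalty; a non-negative advantage is weighted by (1 - heatmap), so clean
    pixels take the credit.
    """
    if advantage < 0:
        return [[advantage * h for h in row] for row in flaw_heatmap]
    return [[advantage * (1.0 - h) for h in row] for row in flaw_heatmap]


# Two sampled images in a group, with scalar rewards 1.0 and 3.0.
advantages = group_advantages([1.0, 3.0])
heatmap = [[0.0, 1.0]]  # one clean pixel, one flawed pixel
penalty_map = dense_advantage(-1.0, heatmap)
credit_map = dense_advantage(1.0, heatmap)
```

Under this rule, a poorly scored image is penalized only where the heatmap reports a flaw, which is the kind of local targeting the paragraph above describes.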

BibTeX

@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
  author    = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
  title     = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning}, 
  eprint    = {2510.01010},
  archivePrefix={arXiv},
  year      = {2025},
  url       = {https://arxiv.org/abs/2510.01010}, 
}