Vision Language Model for Medical Image Analysis

Liu, Yunyi

Access status:

Open Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Liu, Yunyi

Abstract

Vision-language models (VLMs) are increasingly important for medical image understanding by enabling joint reasoning over visual and textual information. This thesis advances three core tasks: Visual Question Answering (VQA), Radiology Report Generation (RRG), and Visual Grounding (VG). It proposes methods that improve both model performance and clinically aligned evaluation. For VQA, we propose Q2ATransformer, a unified architecture for close-ended and open-ended questions. By integrating learnable answer embeddings into a Transformer decoder, it performs answer-aware attention over fused image–question features, combining the stability of classification with the flexibility of generative models. The model achieves strong results on VQA-RAD and PathVQA. For RRG, we develop SAT-RRG, a self-adaptive training framework that uses a large language model to detect semantically incorrect spans. Two losses, CTAL to reinforce correct predictions and ETAPL to penalize confident errors, guide the model toward clinically meaningful improvements without increasing inference cost. Experiments on MIMIC-CXR and IU-Xray show consistent gains in factual accuracy and coherence. We also conduct a systematic evaluation of GPT-4V across VQA, RRG, and VG. GPT-4V produces plausible outputs but struggles with localization, highlighting gaps between automatic metrics and human judgment. To improve evaluation, we introduce MRScore, an LLM-based metric trained on synthetic accepted–rejected report pairs that capture nuanced clinical differences, achieving higher correlation with expert ratings. Finally, we present ReFINE, a reward-based evaluation framework producing interpretable fine-grained sub-scores via a margin-based loss. ReFINE decomposes report quality into criterion-specific dimensions and demonstrates robustness across datasets, offering a transparent and clinically meaningful evaluation of radiology report generation systems.

Date

2025

Licence

The author retains copyright of this thesis

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering

Awarding institution

The University of Sydney

Subjects

Medical Image Understanding
Vision-Language Models (VLMs)
Visual Question Answering (VQA)
Radiology Report Generation (RRG)
LLM-based Evaluation Metrics
MLLM