Vision Language Model for Medical Image Analysis
Access status:
Open Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Liu, Yunyi
Abstract
Vision-language models (VLMs) are increasingly important for medical image understanding by enabling joint reasoning over visual and textual information. This thesis advances three core tasks, namely Visual Question Answering (VQA), Radiology Report Generation (RRG), and Visual Grounding (VG), and proposes methods that improve both model performance and clinically aligned evaluation. For VQA, we propose Q2ATransformer, a unified architecture for close-ended and open-ended questions. By integrating learnable answer embeddings into a Transformer decoder, it performs answer-aware attention over fused image–question features, combining the stability of classification with the flexibility of generative models. The model achieves strong results on VQA-RAD and PathVQA. For RRG, we develop SAT-RRG, a self-adaptive training framework that uses a large language model to detect semantically incorrect spans. Two losses (CTAL to reinforce correct predictions and ETAPL to penalize confident errors) guide the model toward clinically meaningful improvements without increasing inference cost. Experiments on MIMIC-CXR and IU-Xray show consistent gains in factual accuracy and coherence. We also conduct a systematic evaluation of GPT-4V across VQA, RRG, and VG. GPT-4V produces plausible outputs but struggles with localization, highlighting gaps between automatic metrics and human judgment. To improve evaluation, we introduce MRScore, an LLM-based metric trained on synthetic accepted–rejected report pairs that capture nuanced clinical differences, achieving higher correlation with expert ratings. Finally, we present ReFINE, a reward-based evaluation framework producing interpretable fine-grained sub-scores via a margin-based loss. ReFINE decomposes report quality into criterion-specific dimensions and demonstrates robustness across datasets, offering a transparent and clinically meaningful evaluation of radiology report generation systems.
Date
2025
Licence
The author retains copyright of this thesis
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering
Awarding institution
The University of Sydney