Vision Language Model for Medical Image Analysis

Liu, Yunyi

Access status: Open Access

Field	Value	Language
dc.contributor.author	Liu, Yunyi
dc.date.accessioned	2026-02-19T00:14:17Z
dc.date.available	2026-02-19T00:14:17Z
dc.date.issued	2025	en
dc.identifier.uri	https://hdl.handle.net/2123/34867
dc.description.abstract	Vision-language models (VLMs) are increasingly important for medical image understanding because they enable joint reasoning over visual and textual information. This thesis advances three core tasks: Visual Question Answering (VQA), Radiology Report Generation (RRG), and Visual Grounding (VG). It proposes methods that improve both model performance and clinically aligned evaluation. For VQA, we propose Q2ATransformer, a unified architecture for closed-ended and open-ended questions. By integrating learnable answer embeddings into a Transformer decoder, it performs answer-aware attention over fused image–question features, combining the stability of classification with the flexibility of generative models. The model achieves strong results on VQA-RAD and PathVQA. For RRG, we develop SAT-RRG, a self-adaptive training framework that uses a large language model to detect semantically incorrect spans. Two losses, CTAL to reinforce correct predictions and ETAPL to penalize confident errors, guide the model toward clinically meaningful improvements without increasing inference cost. Experiments on MIMIC-CXR and IU-Xray show consistent gains in factual accuracy and coherence. We also conduct a systematic evaluation of GPT-4V across VQA, RRG, and VG. GPT-4V produces plausible outputs but struggles with localization, highlighting gaps between automatic metrics and human judgment. To improve evaluation, we introduce MRScore, an LLM-based metric trained on synthetic accepted–rejected report pairs that capture nuanced clinical differences, achieving higher correlation with expert ratings. Finally, we present ReFINE, a reward-based evaluation framework that produces interpretable, fine-grained sub-scores via a margin-based loss. ReFINE decomposes report quality into criterion-specific dimensions and demonstrates robustness across datasets, offering a transparent and clinically meaningful evaluation of radiology report generation systems.	en
dc.language.iso	en	en
dc.rights	The author retains copyright of this thesis
dc.subject	Medical Image Understanding	en
dc.subject	Vision-Language Models (VLMs)	en
dc.subject	Visual Question Answering (VQA)	en
dc.subject	Radiology Report Generation (RRG)	en
dc.subject	LLM-based Evaluation Metrics	en
dc.subject	MLLM	en
dc.title	Vision Language Model for Medical Image Analysis	en
dc.type	Thesis
dc.type.thesis	Doctor of Philosophy	en
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en
usyd.faculty	SeS faculties schools::Faculty of Engineering	en
usyd.degree	Doctor of Philosophy Ph.D.	en
usyd.awardinginst	The University of Sydney	en
usyd.advisor	Zhou, Luping
usyd.include.pub	No	en