| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Liu, Yunyi | |
| dc.date.accessioned | 2026-02-19T00:14:17Z | |
| dc.date.available | 2026-02-19T00:14:17Z | |
| dc.date.issued | 2025 | en |
| dc.identifier.uri | https://hdl.handle.net/2123/34867 | |
| dc.description.abstract | Vision-language models (VLMs) are increasingly important for medical image understanding by enabling joint reasoning over visual and textual information. This thesis advances three core tasks—Visual Question Answering (VQA), Radiology Report Generation (RRG), and Visual Grounding (VG)—and proposes methods that improve both model performance and clinically aligned evaluation. For VQA, we propose Q2ATransformer, a unified architecture for close-ended and open-ended questions. By integrating learnable answer embeddings into a Transformer decoder, it performs answer-aware attention over fused image–question features, combining the stability of classification with the flexibility of generative models. The model achieves strong results on VQA-RAD and PathVQA. For RRG, we develop SAT-RRG, a self-adaptive training framework that uses a large language model to detect semantically incorrect spans. Two losses—CTAL to reinforce correct predictions and ETAPL to penalize confident errors—guide the model toward clinically meaningful improvements without increasing inference cost. Experiments on MIMIC-CXR and IU-Xray show consistent gains in factual accuracy and coherence. We also conduct a systematic evaluation of GPT-4V across VQA, RRG, and VG. GPT-4V produces plausible outputs but struggles with localization, highlighting gaps between automatic metrics and human judgment. To improve evaluation, we introduce MRScore, an LLM-based metric trained on synthetic accepted–rejected report pairs that capture nuanced clinical differences, achieving higher correlation with expert ratings. Finally, we present ReFINE, a reward-based evaluation framework producing interpretable fine-grained sub-scores via a margin-based loss. ReFINE decomposes report quality into criterion-specific dimensions and demonstrates robustness across datasets, offering a transparent and clinically meaningful evaluation of radiology report generation systems. | en |
| dc.language.iso | en | en |
| dc.rights | The author retains copyright of this thesis | |
| dc.subject | Medical Image Understanding | en |
| dc.subject | Vision-Language Models (VLMs) | en |
| dc.subject | Visual Question Answering (VQA) | en |
| dc.subject | Radiology Report Generation (RRG) | en |
| dc.subject | LLM-based Evaluation Metrics | en |
| dc.subject | MLLM | en |
| dc.title | Vision Language Model for Medical Image Analysis | en |
| dc.type | Thesis | |
| dc.type.thesis | Doctor of Philosophy | en |
| dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en |
| usyd.faculty | SeS faculties schools::Faculty of Engineering | en |
| usyd.degree | Doctor of Philosophy Ph.D. | en |
| usyd.awardinginst | The University of Sydney | en |
| usyd.advisor | Zhou, Luping | |
| usyd.include.pub | No | en |

