Vision Language Model for Medical Image Analysis
| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Liu, Yunyi | |
| dc.date.accessioned | 2026-02-19T00:14:17Z | |
| dc.date.available | 2026-02-19T00:14:17Z | |
| dc.date.issued | 2025 | en |
| dc.identifier.uri | https://hdl.handle.net/2123/34867 | |
| dc.description.abstract | Vision-language models (VLMs) are increasingly important for medical image understanding by enabling joint reasoning over visual and textual information. This thesis advances three core tasks: Visual Question Answering (VQA), Radiology Report Generation (RRG), and Visual Grounding (VG). It also proposes methods that improve both model performance and clinically aligned evaluation. For VQA, we propose Q2ATransformer, a unified architecture for closed-ended and open-ended questions. By integrating learnable answer embeddings into a Transformer decoder, it performs answer-aware attention over fused image–question features, combining the stability of classification with the flexibility of generative models. The model achieves strong results on VQA-RAD and PathVQA. For RRG, we develop SAT-RRG, a self-adaptive training framework that uses a large language model to detect semantically incorrect spans. Two losses guide the model toward clinically meaningful improvements without increasing inference cost: CTAL, which reinforces correct predictions, and ETAPL, which penalizes confident errors. Experiments on MIMIC-CXR and IU-Xray show consistent gains in factual accuracy and coherence. We also conduct a systematic evaluation of GPT-4V across VQA, RRG, and VG. GPT-4V produces plausible outputs but struggles with localization, highlighting gaps between automatic metrics and human judgment. To improve evaluation, we introduce MRScore, an LLM-based metric trained on synthetic accepted–rejected report pairs that capture nuanced clinical differences, achieving higher correlation with expert ratings. Finally, we present ReFINE, a reward-based evaluation framework producing interpretable fine-grained sub-scores via a margin-based loss. ReFINE decomposes report quality into criterion-specific dimensions and demonstrates robustness across datasets, offering a transparent and clinically meaningful evaluation of radiology report generation systems. | en |
| dc.language.iso | en | en |
| dc.rights | The author retains copyright of this thesis | |
| dc.subject | Medical Image Understanding | en |
| dc.subject | Vision-Language Models (VLMs) | en |
| dc.subject | Visual Question Answering (VQA) | en |
| dc.subject | Radiology Report Generation (RRG) | en |
| dc.subject | LLM-based Evaluation Metrics | en |
| dc.subject | MLLM | en |
| dc.title | Vision Language Model for Medical Image Analysis | en |
| dc.type | Thesis | |
| dc.type.thesis | Doctor of Philosophy | en |
| dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en |
| usyd.faculty | SeS faculties schools::Faculty of Engineering | en |
| usyd.degree | Doctor of Philosophy Ph.D. | en |
| usyd.awardinginst | The University of Sydney | en |
| usyd.advisor | Zhou, Luping | |
| usyd.include.pub | No | en |