Evaluating the Quality and Safety of Retrieval-Augmented Large Language Models for a Post-Discharge Patient Question Answering System

Shao, Lexuan

Access status: Open Access

Field	Value	Language
dc.contributor.author	Shao, Lexuan
dc.date.accessioned	2026-03-24T09:00:12Z
dc.date.available	2026-03-24T09:00:12Z
dc.date.issued	2026	en
dc.identifier.uri	https://hdl.handle.net/2123/35026
dc.description.abstract	Evidence on the quality and safety of large language models (LLMs) used to support patient communication remains limited. Recent advances in AI have increased interest in tools that assist patients during transitions of care, such as after hospital discharge. However, many evaluations rely on language similarity metrics or expert judgement without considering patient preferences or safety implications. This thesis develops and evaluates a real-time question–answering (QA) system to support patients following hospital discharge. The QA system used two language models (GPT-4o and Qwen) within a retrieval-augmented generation (RAG) framework and could incorporate domain-specific knowledge bases, including MIMIC-IV-Note and a synthetic clinical question–answer dataset. The system was evaluated using 111 patient questions derived from 37 discharge summaries from MIMIC-IV. Three studies examined patient preference, response safety, and language similarity metrics. In study one, patient experts ranked responses from QA system configurations and clinical expert answers based on preference and perceived empathy. AI-generated responses were frequently preferred, particularly when RAG and clinical question datasets were included. In study two, clinical experts assessed the likelihood and severity of safety issues. Unsafe responses were relatively rare and comparable between AI-generated and clinician answers. In study three, language similarity metrics (BLEU, ROUGE, and BERTScore) showed no correlation with patient preference or safety outcomes. These findings suggest that QA systems using discharge information can produce responses acceptable to patients and generally safe under certain configurations. The results highlight limitations of standard language metrics and demonstrate the value of structured safety evaluation. Future systems may benefit from recognising question intent and routing queries to configurations optimised for retrieval, safety, or explanation.	en
dc.language.iso	en	en
dc.rights	The author retains copyright of this thesis
dc.subject	Large Language Models	en
dc.subject	Retrieval-Augmented Generation	en
dc.subject	Patient Communication	en
dc.subject	Clinical Safety	en
dc.subject	Question Answering Systems	en
dc.subject	Hospital Discharge Communication	en
dc.title	Evaluating the Quality and Safety of Retrieval-Augmented Large Language Models for a Post-Discharge Patient Question Answering System	en
dc.type	Thesis
dc.type.thesis	Masters by Research	en
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en
usyd.faculty	SeS faculties schools::Faculty of Medicine and Health::The University of Sydney School of Public Health	en
usyd.degree	Master of Philosophy M.Phil	en
usyd.awardinginst	The University of Sydney	en
usyd.advisor	Dunn, Adam
usyd.include.pub	No	en