Evaluating the Quality and Safety of Retrieval-Augmented Large Language Models for a Post-Discharge Patient Question Answering System
Access status:
Open Access
Type
Thesis
Thesis type
Masters by Research
Author/s
Shao, Lexuan
Abstract
Evidence on the quality and safety of large language models (LLMs) used to support patient communication remains limited. Recent advances in AI have increased interest in tools that assist patients during transitions of care, such as after hospital discharge. However, many evaluations rely on language similarity metrics or expert judgement without considering patient preferences or safety implications. This thesis develops and evaluates a real-time question–answering (QA) system to support patients following hospital discharge. The QA system used two language models (GPT-4o and QWen) within a retrieval-augmented generation (RAG) framework and could incorporate domain-specific knowledge bases, including MIMIC-IV-Note and a synthetic clinical question–answer dataset. The system was evaluated using 111 patient questions derived from 37 discharge summaries from MIMIC-IV. Three studies examined patient preference, response safety, and language similarity metrics. In study one, patient experts ranked responses from QA system configurations and clinical expert answers based on preference and perceived empathy. AI-generated responses were frequently preferred, particularly when RAG and clinical question datasets were included. In study two, clinical experts assessed the likelihood and severity of safety issues. Unsafe responses were relatively rare and comparable between AI-generated and clinician answers. In study three, language similarity metrics (BLEU, ROUGE, and BERTScore) showed no correlation with patient preference or safety outcomes. These findings suggest that QA systems using discharge information can produce responses acceptable to patients and generally safe under certain configurations. The results highlight limitations of standard language metrics and demonstrate the value of structured safety evaluation. Future systems may benefit from recognising question intent and routing queries to configurations optimised for retrieval, safety, or explanation.
Date
2026
Licence
The author retains copyright of this thesis
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Medicine and Health, The University of Sydney School of Public Health
Awarding institution
The University of Sydney