Evaluating the Quality and Safety of Retrieval-Augmented Large Language Models for a Post-Discharge Patient Question Answering System

Shao, Lexuan

Access status:

Open Access

Type

Thesis

Thesis type

Masters by Research

Author/s

Shao, Lexuan

Abstract

Evidence on the quality and safety of large language models (LLMs) used to support patient communication remains limited. Recent advances in AI have increased interest in tools that assist patients during transitions of care, such as after hospital discharge. However, many evaluations rely on language similarity metrics or expert judgement without considering patient preferences or safety implications. This thesis develops and evaluates a real-time question–answering (QA) system to support patients following hospital discharge. The QA system used two language models (GPT-4o and QWen) within a retrieval-augmented generation (RAG) framework and could incorporate domain-specific knowledge bases, including MIMIC-IV-Note and a synthetic clinical question–answer dataset. The system was evaluated using 111 patient questions derived from 37 discharge summaries from MIMIC-IV. Three studies examined patient preference, response safety, and language similarity metrics. In study one, patient experts ranked responses from QA system configurations and clinical expert answers based on preference and perceived empathy. AI-generated responses were frequently preferred, particularly when RAG and clinical question datasets were included. In study two, clinical experts assessed the likelihood and severity of safety issues. Unsafe responses were relatively rare and comparable between AI-generated and clinician answers. In study three, language similarity metrics (BLEU, ROUGE, and BERTScore) showed no correlation with patient preference or safety outcomes. These findings suggest that QA systems using discharge information can produce responses acceptable to patients and generally safe under certain configurations. The results highlight limitations of standard language metrics and demonstrate the value of structured safety evaluation. Future systems may benefit from recognising question intent and routing queries to configurations optimised for retrieval, safety, or explanation.

Date

2026

Licence

The author retains copyright of this thesis

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Medicine and Health, The University of Sydney School of Public Health

Awarding institution

The University of Sydney

Subjects

Large Language Models
Retrieval-Augmented Generation
Patient Communication
Clinical Safety
Question Answering Systems
Hospital Discharge Communication