Short Text Clustering with Large Language Models

Miller, Justin

Access status

Open Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Miller, Justin

Abstract

This thesis addresses the challenges of clustering short text data, focusing on human interpretability and validation metrics. Employing Gaussian Mixture Models with embeddings from Large Language Models, it demonstrates that these methods produce clusters that are more interpretable than those produced by traditional approaches. The thesis introduces the concept of multi-level clustering, an approach that examines how clusters form and evolve as the number of clusters in an algorithm increases. It also introduces a method to maximise the information conveyed in each cluster while minimising the cognitive load required to understand the clusters. The findings bridge the gap between automated metrics and human evaluation, offering insights into optimal clustering techniques for short text. These techniques are then used to examine human identity in Twitter bios, to create visualisations that provide a better understanding of the clusters, and, through linguistic methodology, to identify key distinctions between the clusters.

Date

2025

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Science, School of Physics

Department, Discipline or Centre

Physics

Awarding institution

The University of Sydney

Subjects

Clustering
Large Language Models
Short Text