Short Text Clustering with Large Language Models
Access status:
Open Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Miller, Justin
Abstract
This thesis addresses the challenges of clustering short text data, focusing on human interpretability and validation metrics. Employing Gaussian Mixture Models with embeddings from Large Language Models, this thesis demonstrates that these methods produce clusters that are more interpretable than traditional approaches. The thesis introduces the concept of multi-level clustering, an approach that examines how clusters form and evolve as the number of clusters requested from an algorithm increases. It also introduces a method to maximise the information conveyed in each cluster while minimising the cognitive load required to understand the clusters. The findings bridge the gap between automated metrics and human evaluation, offering insights into optimal clustering techniques for short text. These techniques are then used to examine human identity in Twitter bios, to create visualisations that provide a better understanding of the clusters, and to apply linguistic methodology to identify key distinctions between the clusters.
Date
2025
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Science, School of Physics
Department, Discipline or Centre
Physics
Awarding institution
The University of Sydney