Optimizing Large Language Models: Algorithmic Advancements and Model Design Strategies

Bie, Fengxiang

Access status:

Open Access

Field	Value	Language
dc.contributor.author	Bie, Fengxiang
dc.date.accessioned	2026-06-15T23:06:39Z
dc.date.available	2026-06-15T23:06:39Z
dc.date.issued	2026	en_AU
dc.identifier.uri	https://hdl.handle.net/2123/35423
dc.description.abstract	Large language models have achieved remarkable performance across diverse tasks, but they face two critical deployment challenges: (1) the key-value (KV) cache memory bottleneck, which limits deployment in resource-constrained environments, and (2) the latency of sequential autoregressive generation, which reduces inference throughput and degrades user experience. This thesis presents two complementary contributions that address these distinct challenges. First, CARE (Covariance-Aware and Rank-Enhanced) tackles the KV-cache memory bottleneck by converting pretrained Grouped Query Attention (GQA) models into memory-efficient Multi-Head Latent Attention (MLA) architectures. Unlike naive SVD approaches that ignore activation patterns, CARE introduces an activation-preserving factorization based on covariance-weighted SVD, together with adaptive rank allocation via water-filling algorithms. Second, Infinigram-based speculative decoding addresses inference latency by leveraging large-scale n-gram statistics to predict multiple tokens in parallel, achieving significant speedup through CPU-optimized data structures and confidence-based acceptance strategies. Experimental results on Llama-3.1-8B demonstrate that CARE achieves up to 331% relative improvement in zero-shot accuracy over baseline conversion methods while maintaining an identical KV-cache footprint. Post-conversion healing fully recovers the original model's performance with minimal fine-tuning. Infinigram delivers significant inference speedups across various sequence lengths and batch sizes, with acceptance rates improving for longer context matches and higher-frequency patterns. This work contributes novel methodologies that combine model design strategies and algorithmic advancements for efficient deployment of large generative models, providing practical solutions to key memory and computational challenges without compromising model capabilities.	en_AU
dc.language.iso	en	en_AU
dc.subject	Large Language Models	en_AU
dc.subject	KV-Cache Compression	en_AU
dc.subject	Speculative Decoding	en_AU
dc.subject	Inference Acceleration	en_AU
dc.title	Optimizing Large Language Models: Algorithmic Advancements and Model Design Strategies	en_AU
dc.type	Thesis
dc.type.thesis	Masters by Research	en_AU
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en
usyd.faculty	SeS faculties schools::Faculty of Engineering	en_AU
usyd.degree	Master of Philosophy M.Phil	en_AU
usyd.awardinginst	The University of Sydney	en_AU
usyd.advisor	Song, Shuaiwen
usyd.include.pub	No	en_AU
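
Illustrative note on the abstract's "covariance-weighted SVD". The abstract contrasts a naive SVD, which ignores activation patterns, with an activation-preserving factorization. The sketch below shows one generic way such a weighting can work. It is not taken from the thesis and should not be read as CARE's actual implementation. The function names, the ridge term `eps`, and the use of a Cholesky factor of the activation covariance are all assumptions made for illustration. Rank allocation via water-filling is not shown.

```python
import numpy as np

def plain_svd_factors(W, rank):
    """Baseline: truncate the SVD of W itself, ignoring activations."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def covariance_weighted_factors(W, X, rank, eps=1e-8):
    """Hypothetical sketch: low-rank factors A @ B of W that aim to keep
    X @ W.T close to X @ (A @ B).T, rather than keeping W close to A @ B.

    W: (d_out, d_in) weight matrix; X: (n, d_in) sample activations.
    """
    n, d_in = X.shape
    # Activation covariance; eps (an assumption) keeps it positive definite.
    C = X.T @ X / n + eps * np.eye(d_in)
    S = np.linalg.cholesky(C)                    # C = S @ S.T
    # Truncate the SVD of the weighted matrix W @ S instead of W.
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * s[:rank]                   # (d_out, rank)
    # Undo the weighting: B = Vt[:rank] @ inv(S), computed via a solve.
    B = np.linalg.solve(S.T, Vt[:rank].T).T      # (rank, d_in)
    return A, B

def activation_error(W, W_hat, X):
    """Frobenius error measured on the layer outputs, not on the weights."""
    return np.linalg.norm(X @ (W - W_hat).T)
```

On activations whose feature variances differ widely, the weighted factors would be expected to give an output error no larger than the plain truncation at the same rank, since the truncation is then optimal in the covariance-weighted norm (up to the small `eps` ridge).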