Optimizing Large Language Models: Algorithmic Advancements and Model Design Strategies
Access status:
Open Access
Type
Thesis
Thesis type
Masters by Research
Author/s
Bie, Fengxiang
Abstract
Large language models have achieved remarkable performance across diverse tasks, but face two critical deployment challenges: (1) the key-value (KV) cache memory bottleneck, which limits model deployment in resource-constrained environments, and (2) the latency of sequential autoregressive generation, which reduces inference throughput and degrades user experience. This thesis presents two complementary contributions addressing these distinct challenges. First, CARE (Covariance-Aware and Rank-Enhanced) tackles the KV-cache memory bottleneck by converting pretrained Grouped Query Attention (GQA) models into memory-efficient Multi-Head Latent Attention (MLA) architectures. Unlike naive SVD approaches that ignore activation patterns, CARE introduces activation-preserving factorization using covariance-weighted SVD and adaptive rank allocation via water-filling algorithms. Second, Infinigram-based speculative decoding addresses inference latency by leveraging large-scale n-gram statistics to predict multiple tokens in parallel, achieving significant speedup through CPU-optimized data structures and confidence-based acceptance strategies. Experimental results on Llama-3.1-8B demonstrate that CARE achieves up to 331% relative improvement in zero-shot accuracy over baseline conversion methods while maintaining an identical KV-cache footprint. Post-conversion healing fully recovers the original model's performance with minimal fine-tuning. Infinigram delivers significant inference speedups across various sequence lengths and batch sizes, with acceptance rates improving for longer context matches and higher-frequency patterns. This work contributes novel methodologies that combine model design strategies and algorithmic advancements for efficient deployment of large generative models, providing practical solutions to key memory and computational challenges without compromising model capabilities.
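The abstract contrasts naive SVD with covariance-weighted SVD but does not give the factorization itself. The following is a minimal sketch of one common form of activation-aware low-rank approximation, not CARE's actual procedure: it assumes the goal is to minimize the output error ||(W - W_r) X||_F for calibration activations X, which is solved by truncating the SVD of W C^{1/2} (with C the input covariance) and mapping back with C^{-1/2}. All function names, shapes, and the synthetic data are hypothetical, and the water-filling rank allocation is not covered here.

```python
import numpy as np


def plain_svd_lowrank(W, r):
    """Best rank-r approximation of W in the unweighted Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def covariance_weighted_lowrank(W, C, r, eps=1e-6):
    """Rank-r approximation of W minimizing ||(W - W_r) C^{1/2}||_F.

    W: (d_out, d_in) weight matrix.
    C: (d_in, d_in) symmetric input covariance estimate, E[x x^T].
    eps: floor on eigenvalues of C so the inverse square root exists
         (an assumption of this sketch, not taken from the thesis).
    """
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, eps, None)
    S = (evecs * np.sqrt(evals)) @ evecs.T      # symmetric C^{1/2}
    S_inv = (evecs / np.sqrt(evals)) @ evecs.T  # symmetric C^{-1/2}
    # Truncate in the whitened space, then undo the whitening.
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r] @ S_inv


def output_error(W, W_r, X):
    """Frobenius norm of the output mismatch on activations X (d_in, n)."""
    return np.linalg.norm((W - W_r) @ X)


# Synthetic demo: anisotropic activations, so some input directions matter more.
rng = np.random.default_rng(0)
d_out, d_in, n, r = 12, 16, 400, 4
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n)) * np.linspace(0.2, 3.0, d_in)[:, None]
C = X @ X.T / n

print("plain SVD output error:   ", output_error(W, plain_svd_lowrank(W, r), X))
print("weighted SVD output error:", output_error(W, covariance_weighted_lowrank(W, C, r), X))
```

Under these assumptions, the weighted factorization can never have a larger output error on X than plain SVD at the same rank, because it is the exact minimizer of that objective. How CARE estimates the covariance, and how it distributes rank across layers or heads, is described in the thesis rather than here.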
Date
2026
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering
Awarding institution
The University of Sydney