Optimizing the Use of Heterogeneous Memory
| Field | Value | Language |
| dc.contributor.author | Wu, Xiaoxiang | |
| dc.date.accessioned | 2026-03-30T01:52:01Z | |
| dc.date.available | 2026-03-30T01:52:01Z | |
| dc.date.issued | 2026 | en |
| dc.identifier.uri | https://hdl.handle.net/2123/35044 | |
| dc.description.abstract | Chapter 2 studies persistent key-value stores and isolates the impact of individual design techniques within a unified code base. Unlike prior work that evaluates complete systems, our methodology enables an apples-to-apples comparison of trade-offs. We show that random allocation achieves performance comparable to log-structured persistence while avoiding garbage-collection latency spikes; that persistent CPU caches, such as Extended Asynchronous DRAM Refresh or Compute Express Link global flush, often hinder rather than help performance, necessitating explicit flushes; and that recovery mechanisms require careful handling of allocator metadata, with transactions imposing nontrivial overhead. Chapter 3 introduces the concept of software pre-storing, the converse of prefetching, which issues instructions to proactively move data down the memory hierarchy. Implemented via existing CPU instructions, pre-storing benefits write-intensive workloads, especially on architectures with heterogeneous memories such as PMem or CXL-attached DRAM. We develop DirtBuster, a tool that identifies applications and code regions where pre-storing is beneficial. Evaluations on ARM and x86 systems with PMem and cache-coherent DRAM demonstrate performance improvements of up to 2.3× across key-value stores, HPC applications, message-passing systems, and TensorFlow. Chapter 4 examines unified memory architectures that combine high-bandwidth access with a coherent, shared address space, thereby addressing the limitations of conventional iGPU (bandwidth-bound) and dGPU (PCIe-bound) designs. Using a state-of-the-art unified memory architecture platform, we characterize performance under diverse workloads, identify scenarios where unified memory architectures excel, and reveal the costs of fully shared memory. Our analysis provides practical guidelines for memory management in unified memory architecture systems and highlights their significant potential for balanced CPU–GPU workloads. | en |
| dc.language.iso | en | en |
| dc.subject | Heterogeneous systems | en |
| dc.subject | Persistent memory | en |
| dc.subject | Unified memory | en |
| dc.subject | CPU caches | en |
| dc.subject | Pre-store | en |
| dc.subject | Pre-fetch | en |
| dc.title | Optimizing the Use of Heterogeneous Memory | en |
| dc.type | Thesis | |
| dc.type.thesis | Doctor of Philosophy | en |
| dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en |
| usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en |
| usyd.degree | Doctor of Philosophy Ph.D. | en |
| usyd.awardinginst | The University of Sydney | en |
| usyd.advisor | Zwaenepoel, Willy | |
| usyd.include.pub | No | en |