Block Arithmetic Techniques for the Implementation of Deep Neural Networks
| Field | Value | Language |
| dc.contributor.author | Zhou, Wenjie | |
| dc.date.accessioned | 2026-01-15T03:01:26Z | |
| dc.date.available | 2026-01-15T03:01:26Z | |
| dc.date.issued | 2025 | en |
| dc.identifier.uri | https://hdl.handle.net/2123/34707 | |
| dc.description | Includes publication | |
| dc.description.abstract | Performance is crucial to the evolution of deep neural networks (DNNs), particularly as computational requirements surge. As DNN models grow in scale, training cost is becoming a new problem. One of the critical techniques for energy-efficient training is low-precision arithmetic. Block arithmetic is a promising technique that reduces precision requirements and power consumption: it shortens the word length of each element, while an exponent shared across the block extends the elements' dynamic range. This thesis aims to develop an improved block arithmetic algorithm and implementation methodology. At the arithmetic level, this work investigates the implementation of block arithmetic. At the GEMM kernel level, it examines kernel design under different block arithmetic implementations. For rescaling, it addresses the challenges associated with block arithmetic and introduces a delayed scaling method called the delay update. At the application level, it uses N-BEATS-based inference and training accelerators to demonstrate the advantages of block arithmetic. The contributions of this work are as follows. Firstly, we propose a block minifloat (BM) implementation for inference, the first FPGA-based accelerator using BM arithmetic at the time of publication, demonstrating hardware efficiency and accuracy benefits over integer and floating point on N-BEATS. Secondly, we propose a BM implementation for training, in the form of the first FPGA implementation of 4-bit BM, mixed-precision neural network training of N-BEATS. Thirdly, we propose the delay update method to reduce the rescaling computation in block arithmetic. Empirical studies show that the delay update scheme achieves nearly the same accuracy as the commonly used maximum calibration method, with a significant hardware implementation advantage. | en |
| dc.language.iso | en | en |
| dc.subject | block arithmetic | en |
| dc.subject | neural network training | en |
| dc.subject | FPGA | en |
| dc.subject | low-precision | en |
| dc.subject | microscaling | en |
| dc.subject | block minifloat | en |
| dc.title | Block Arithmetic Techniques for the Implementation of Deep Neural Networks | en |
| dc.type | Thesis | |
| dc.type.thesis | Doctor of Philosophy | en |
| dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en |
| usyd.faculty | SeS faculties schools::Faculty of Engineering | en |
| usyd.degree | Doctor of Philosophy Ph.D. | en |
| usyd.awardinginst | The University of Sydney | en |
| usyd.advisor | Leong, Philip | |
| usyd.include.pub | Yes | en |
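
The abstract's central idea, short per-element words that share one exponent per block, can be illustrated with a minimal sketch. This is not the thesis's block minifloat format or its delay update method. It assumes the simpler block floating point variant, in which each element is a signed integer mantissa and the whole block shares a single power-of-two scale taken from the block maximum. The function names and the `mantissa_bits` parameter are hypothetical, chosen only for illustration.

```python
import math


def block_quantize(values, mantissa_bits=4):
    """Quantize a block of floats to signed integers sharing one exponent.

    Illustrative block floating point only (an assumption, not the
    thesis's block minifloat): value ~= mantissa * 2**shared_exp.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0

    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1.
    _, e = math.frexp(max_abs)

    # Pick the shared exponent so the largest element fits the mantissa width.
    shared_exp = e - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp

    # Symmetric clamp: rounding can push the largest element one step too far.
    q_max = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return mantissas, shared_exp


def block_dequantize(mantissas, shared_exp):
    """Recover approximate floats from mantissas and the shared exponent."""
    scale = 2.0 ** shared_exp
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    block = [0.5, -0.25, 0.1, 0.03]
    mantissas, shared_exp = block_quantize(block, mantissa_bits=4)
    print(mantissas, shared_exp)                    # [4, -2, 1, 0] -3
    print(block_dequantize(mantissas, shared_exp))  # [0.5, -0.25, 0.125, 0.0]
```

In this sketch the small elements (0.1 and 0.03) lose precision because the shared exponent is set by the largest element in the block. That trade-off is why block size and rescaling policy matter in formats of this kind.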