Block Arithmetic Techniques for the Implementation of Deep Neural Networks

Zhou, Wenjie

Access status: USyd Access

Field	Value	Language
dc.contributor.author	Zhou, Wenjie
dc.date.accessioned	2026-01-15T03:01:26Z
dc.date.available	2026-01-15T03:01:26Z
dc.date.issued	2025	en
dc.identifier.uri	https://hdl.handle.net/2123/34707
dc.description	Includes publication
dc.description.abstract	Performance is crucial to the evolution of DNNs, particularly as computational requirements surge. As DNN models grow in scale, training cost is becoming a new problem. One of the key techniques for energy-efficient training is low-precision arithmetic. Block arithmetic is a promising technique that reduces precision requirements and power consumption: it further shortens the word length of each element, while a shared exponent extends the dynamic range of the block. This thesis aims to develop an improved block arithmetic algorithm and implementation methodology. At the arithmetic level, this work investigates the implementation of block arithmetic. At the GEMM kernel level, this dissertation examines kernel design under different block arithmetic implementations. For rescaling, this work further addresses the challenges associated with block arithmetic and introduces a delayed scaling method called delay update. At the application level, this work uses N-BEATS-based inference and training accelerators to demonstrate the advantages of block arithmetic. The contributions of this work are as follows. First, we propose a block minifloat (BM) implementation for inference, which at the time of publication was the first FPGA-based accelerator using BM arithmetic, demonstrating hardware efficiency and accuracy benefits over integer and floating point on N-BEATS. Second, we propose a BM implementation for training, in the form of the first FPGA implementation of 4-bit BM mixed-precision neural network training of N-BEATS. Third, we propose the delay update method to reduce the rescaling computation in block arithmetic. Empirical studies show that the delay update scheme achieves nearly the same accuracy as the commonly used maximum calibration method, with a significant hardware implementation advantage.	en
dc.language.iso	en	en
dc.subject	block arithmetic	en
dc.subject	neural network training	en
dc.subject	FPGA	en
dc.subject	low-precision	en
dc.subject	microscaling	en
dc.subject	block minifloat	en
dc.title	Block Arithmetic Techniques for the Implementation of Deep Neural Networks	en
dc.type	Thesis
dc.type.thesis	Doctor of Philosophy	en
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en
usyd.faculty	SeS faculties schools::Faculty of Engineering	en
usyd.degree	Doctor of Philosophy Ph.D.	en
usyd.awardinginst	The University of Sydney	en
usyd.advisor	Leong, Philip
usyd.include.pub	Yes	en
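The abstract describes block arithmetic as short per-element words plus a shared exponent that extends dynamic range, and names maximum calibration as the usual way to set that scale. The sketch below illustrates only that generic idea as plain block floating point (signed integer mantissas sharing one power-of-two exponent chosen from the block maximum). It is an assumption-laden illustration, not the thesis's block minifloat format (which also gives each element its own small exponent) and not its delay update method; the function names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math
from typing import List, Tuple


def quantize_block(values: List[float], mantissa_bits: int = 4) -> Tuple[int, List[int]]:
    """Hypothetical helper: generic block floating point, not the BM format.

    All elements share one power-of-two exponent. The exponent comes from the
    block's maximum magnitude (max calibration), so the largest element fits
    in the signed mantissa range.
    """
    qmax = 2 ** (mantissa_bits - 1) - 1  # e.g. 7 for 4-bit signed mantissas
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest integer exponent e such that max_abs / 2**e <= qmax.
    shared_exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** shared_exp
    # Round each element to the shared grid; clamp guards against float error.
    mantissas = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exp, mantissas


def dequantize_block(shared_exp: int, mantissas: List[int]) -> List[float]:
    """Hypothetical helper: rebuild approximate values from the shared exponent."""
    scale = 2.0 ** shared_exp
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    block = [0.5, -1.25, 3.0, 0.1]
    exp, mants = quantize_block(block)
    print(exp, mants, dequantize_block(exp, mants))
```

Because the exponent is set by the largest element, small values in the same block (here 0.1) can round to zero. This trade-off between range and per-element precision is what makes the choice and timing of rescaling significant in block arithmetic.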