Low Latency and Scalable Machine Learning on FPGA-based System-on-Chip
Access status: Open Access
Type: Thesis
Thesis type: Doctor of Philosophy
Author/s: Lou, Binglei
Abstract:
Machine learning (ML) is highly effective for data analysis, decision-making, and solving complex problems, especially when explicit mathematical models are difficult to derive. Field-Programmable Gate Arrays (FPGAs) provide a powerful platform for ML tasks at the edge, where low latency and real-time responsiveness are essential. However, deploying ML on customized FPGA hardware involves balancing latency, accuracy, and flexibility under resource constraints. This thesis pursues three objectives: (1) enhancing ML accelerator accuracy in an area-efficient manner, (2) integrating these accelerators into unified system-on-chip (SoC) architectures, and (3) developing reconfigurable blocks that adapt to changing environments.

At the circuit level, the thesis introduces LUTEnsemble, a specialized LUT-based architecture for fast, scalable DNN inference. By combining sparsely connected PolyLUT sub-neurons through adder tree structures, LUTEnsemble mitigates the exponential resource scaling of traditional LUT-based DNNs, achieving superior accuracy, latency, and resource efficiency.

At the system level, the thesis applies FPGA-based neural networks to qubit state measurement in trapped-ion quantum processing. Using LUTEnsemble and Vision Transformer (ViT) architectures, the system achieves both low latency and high accuracy: optimized interfacing with an EMCCD camera reduced detection latency by a factor of 119 for single-qubit tests and 94 for three-qubit tests compared with a GPU baseline.

Finally, the thesis proposes a flexible FPGA framework for anomaly detection (fSEAD). It connects partially reconfigurable blocks (pblocks) via an AXI switch, supporting dynamic composition of ensemble results. Experiments on the PYNQ platform showed speed-ups of 3 to 8 times over CPU implementations across four datasets.
This work advances FPGA-based ML design across circuit, system, and tool levels, providing innovative solutions for real-time, resource-efficient, and reconfigurable edge applications.
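The abstract's point about exponential resource scaling can be made concrete with a back-of-the-envelope entry count. The sketch below is an illustrative cost model only, not the thesis's actual formulation: it assumes a fully enumerated LUT neuron with N inputs at beta-bit precision needs a truth table of 2^(N·beta) entries, while splitting it into fan-in-limited sub-neurons combined by an adder tree needs roughly ceil(N/F) tables of 2^(F·beta) entries each. The function names are hypothetical.

```python
import math

def full_lut_entries(n_inputs: int, bits: int) -> int:
    # One fully enumerated LUT neuron: truth-table size is
    # exponential in the total number of input bits.
    return 2 ** (n_inputs * bits)

def ensemble_lut_entries(n_inputs: int, bits: int, fan_in: int) -> int:
    # Illustrative model of the adder-tree decomposition: each
    # sub-neuron sees only `fan_in` inputs, so its table stays small,
    # and the count of tables grows only linearly with n_inputs.
    n_sub = math.ceil(n_inputs / fan_in)
    return n_sub * 2 ** (fan_in * bits)

# Example: 16 inputs at 2-bit precision.
print(full_lut_entries(16, 2))         # 2**32 entries: infeasible on-chip
print(ensemble_lut_entries(16, 2, 4))  # 4 * 2**8 = 1024 entries
```

Under this toy model the decomposition turns a 2^32-entry table into 1024 entries, which is the qualitative effect the abstract attributes to LUTEnsemble.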
Date: 2024
Rights statement: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution: The University of Sydney