Deep Learning and Job Scheduling in Clusters
Access status:
USyd Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Zhang, Shaojun
Abstract
The recent surge of deep learning challenges scheduling and resource management in clusters. Deep learning workloads are composed of large numbers of parallel tasks with complex intra-job dependencies, and the cluster scheduler must keep resource utilization high to maintain service quality. However, the underlying optimization problem is NP-hard, and traditional heuristics can hardly capture the diversity of workload patterns. Recent work therefore introduces deep reinforcement learning (DRL) as an alternative for scheduling. DRL tackles the optimization problem effectively but has its own limitations: the deep neural network behind the policy is hard to interpret, so the policy operates as a black box, and the model is vulnerable to input perturbations, which can destabilize the system. In this research, we address key problems in scheduling deep learning workloads in clusters with both heuristic and DRL-based schedulers. First, we propose a scheduling system for deep learning inference that incorporates fine-grained batching and fair scheduling; it cooperates with existing deep learning frameworks to deliver high throughput and low latency. Second, we develop a multi-level explanation framework for the DRL-based policy, which uses interpretable features, simple machine learning models, and heuristics to approximate and explain the policy. Third, we propose job perturbation to investigate the robustness of the DRL-based scheduler: we show that a user can craft gradient-guided perturbations to job features or structures to obtain more computational resources and have her tasks dispatched sooner. Finally, we propose an adversarial training framework that improves the robustness of the DRL-based scheduler: by learning from deliberate perturbations during training, the scheduler lowers both the success rate of perturbations and the benefit a perturbed job gains at test time.
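To make the gradient-guided perturbation concrete, the sketch below shows a minimal, hypothetical FGSM-style attack on a job's feature vector against a toy stand-in policy. The SchedulerPolicy class, the feature dimension, and the epsilon budget are illustrative assumptions rather than the thesis's actual models or attack; the point is only that ascending the policy's gradient with respect to the job features can raise the job's priority score.

# Hypothetical sketch: gradient-guided (FGSM-style) perturbation of a job's
# feature vector against a stand-in DRL scheduling policy. Names and sizes
# are illustrative assumptions, not the thesis's implementation.
import torch
import torch.nn as nn

class SchedulerPolicy(nn.Module):
    """Toy stand-in policy: maps job features to a priority score
    (higher score -> dispatched sooner / more resources)."""
    def __init__(self, feature_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, job_features: torch.Tensor) -> torch.Tensor:
        return self.net(job_features).squeeze(-1)

def perturb_job(policy: nn.Module, job_features: torch.Tensor,
                epsilon: float = 0.05) -> torch.Tensor:
    """Craft a small perturbation that raises the job's priority score,
    analogous to FGSM applied to the job's feature vector."""
    x = job_features.clone().detach().requires_grad_(True)
    score = policy(x)
    # The "attacker" wants a higher score, so ascend the gradient.
    score.sum().backward()
    x_adv = x + epsilon * x.grad.sign()
    # Keep features in a plausible range (e.g. normalized resource demands).
    return x_adv.clamp(0.0, 1.0).detach()

if __name__ == "__main__":
    policy = SchedulerPolicy()
    job = torch.rand(1, 8)          # one job's normalized features
    job_adv = perturb_job(policy, job)
    print("score before:", policy(job).item())
    print("score after: ", policy(job_adv).item())

In practice such a perturbation would also have to respect what a user can legitimately change about a submitted job (e.g. requested resources or the job's task graph), which is why the sketch clamps the features to a plausible range.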
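The adversarial training idea can be sketched in the same toy setting: generate a deliberate perturbation during training and fit the policy on both the clean and the perturbed job, so that the perturbation stops paying off. This continues the previous sketch (it reuses SchedulerPolicy and perturb_job); the MSE loss and the fixed target score are placeholders for the thesis's actual DRL objective, which is not specified here.

# Hypothetical sketch of adversarial training for the stand-in policy.
# Reuses SchedulerPolicy and perturb_job from the previous sketch; the
# supervised MSE loss is a placeholder for the real DRL objective.
import torch
import torch.nn as nn

def adversarial_training_step(policy: nn.Module,
                              optimizer: torch.optim.Optimizer,
                              job_features: torch.Tensor,
                              target_score: torch.Tensor,
                              epsilon: float = 0.05) -> float:
    loss_fn = nn.MSELoss()

    # 1) Craft the deliberate perturbation against the current policy.
    job_adv = perturb_job(policy, job_features, epsilon)

    # 2) Train on clean and perturbed jobs toward the same target, which
    #    discourages the policy from rewarding the perturbation.
    optimizer.zero_grad()
    loss = loss_fn(policy(job_features), target_score) + \
           loss_fn(policy(job_adv), target_score)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    policy = SchedulerPolicy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    job = torch.rand(1, 8)
    target = torch.zeros(1)          # placeholder "fair" priority target
    for _ in range(100):
        adversarial_training_step(policy, optimizer, job, target)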
Date
2021
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
The University of Sydney