Show simple item record

| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Egwutuoha, Ifeanyi Paulinus | |
| dc.date.accessioned | 2014-07-10 | |
| dc.date.available | 2014-07-10 | |
| dc.date.issued | 2013-12-10 | |
| dc.identifier.uri | http://hdl.handle.net/2123/11484 | |
| dc.description.abstract | High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some computations on traditional HPC systems run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. The cloud computing pricing model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems. | en_AU |
| dc.publisher | University of Sydney. | en_AU |
| dc.publisher | Faculty of Engineering & IT | en_AU |
| dc.publisher | School of Electrical and Information Engineering | en_AU |
| dc.subject | HPC systems in the cloud | en_AU |
| dc.subject | Fault tolerance | en_AU |
| dc.subject | Cloud computing | en_AU |
| dc.subject | Checkpoint/restart | en_AU |
| dc.subject | HaaS | en_AU |
| dc.title | A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud | en_AU |
| dc.type | PhD Doctorate | en_AU |
| dc.date.valid | 2014-01-01 | en_AU |
| dc.type.pubtype | Doctor of Philosophy Ph.D. | en_AU |





There are no previous versions of the item available.