University of Sydney

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

Thesis (PDF, 1.88MB)
Date
2013-12-10
Author
Egwutuoha, Ifeanyi Paulinus
Abstract
High Performance Computing (HPC) systems are widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most of these problems are either data-intensive or computation-intensive, and they may take hours, days or even weeks to complete. For example, some computations on traditional HPC systems run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments, and scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud, and the cloud computing pricing model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as the dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud.
Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared with the checkpoint and redundancy techniques used in traditional HPC systems.
URI
http://hdl.handle.net/2123/11484
Collections
  • Sydney Digital Theses (Open Access) [4718]
