The Journal of Supercomputing, Volume 68, Issue 3, pp 1630–1651

Reliability-aware performance model for optimal GPU-enabled cluster environment

  • Supada Laosooksathit
  • Raja Nassar
  • Chokchai Leangsuksun
  • Mihaela Paun


Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted, and the effect of a checkpoint/restart mechanism on application performance is studied and discussed.
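The trade-off between system size, reliability, and checkpoint cost can be illustrated with a minimal sketch. It assumes independent nodes with exponentially distributed failures, so that the system mean time between failures (MTBF) is the node MTBF divided by the node count, and it uses Young's first-order approximation for the checkpoint interval. These are textbook simplifications chosen for illustration, not the scheduling model proposed in this paper; the function names and numbers are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of a system that fails as soon as any one node fails.

    Assumption (not from the paper): nodes fail independently with
    exponentially distributed lifetimes, so failure rates add up.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * cost * MTBF).

    Assumption: checkpoint cost is small relative to the MTBF.
    """
    if checkpoint_cost_hours <= 0 or mtbf_hours <= 0:
        raise ValueError("cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 h node MTBF, 0.5 h per checkpoint.
    for nodes in (10, 100, 1000):
        mtbf = system_mtbf(1000.0, nodes)
        interval = young_interval(0.5, mtbf)
        print(f"{nodes} nodes: MTBF {mtbf:.1f} h, interval {interval:.2f} h")
```

With these hypothetical inputs, 10 nodes give a system MTBF of 100 hours and a checkpoint interval of 10 hours. Adding nodes shortens both, which is why checkpoint overhead grows with system scale.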


Keywords: GPUs; Reliability; Fault tolerance; Checkpoint scheduling



This work was partially supported by the grants CNS-0834483, EPS-1003897 and TE97/2010.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Supada Laosooksathit (1)
  • Raja Nassar (2)
  • Chokchai Leangsuksun (1)
  • Mihaela Paun (2, 3)

  1. Department of Computer Science, Louisiana Tech University, Ruston, USA
  2. Department of Mathematics and Statistics, Louisiana Tech University, Ruston, USA
  3. National Institute for Research and Development for Biological Sciences, Bucharest, Romania
