Supervised Workpools for Reliable Massively Parallel Computing

  • Robert Stewart
  • Phil Trinder
  • Patrick Maier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7829)

Abstract

The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPC-GAP project we aim to coordinate symbolic computations on architectures with 10^6 cores. At that scale, failures are a real issue. Functional languages are well known to offer advantages for both parallelism and reliability, e.g. stateless computations can be scheduled and replicated freely.
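As a toy illustration of the closing point, consider the sketch below (the names are ours, not the paper's): a pure task has no side effects, so a scheduler may execute it twice, or on two different nodes, without the duplication being observable in the result.

    task :: Integer -> Integer
    task n = sum [1 .. n]                 -- pure: no state, no I/O

    main :: IO ()
    main = do
      let original   = map task [1 .. 8]
          replicated = map task [1 .. 8]  -- a second, "replicated" execution
      print (original == replicated)      -- True: replicas always agree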

This paper presents a software-level reliability mechanism, namely supervised fault-tolerant workpools, implemented in a Haskell DSL for parallel programming on distributed-memory architectures. The workpool hides task scheduling, failure detection and task replication from the programmer. To the best of our knowledge, this is a novel construct. We demonstrate how to abstract over supervised workpools by providing fault-tolerant instances of existing algorithmic skeletons. We evaluate the runtime performance of these skeletons both in the presence and absence of faults, and report low supervision overheads.
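To make the construct concrete, the following is a minimal single-node sketch of the supervised workpool idea in plain Haskell. It substitutes GHC's forkIO and MVars for HdpH's distributed primitives, and every name in it (Task, supervise, and so on) is illustrative rather than the paper's actual API: tasks carry write-once result slots, a supervisor runs them on a forked "worker", and any task whose slot is still empty after a failure is replicated.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import Control.Exception (SomeException, evaluate, try)
    import Control.Monad (filterM, forM_)

    -- A task: a computation paired with a write-once slot for its result.
    data Task a = Task { runTask :: IO a, slot :: MVar a }

    -- Run tasks on a forked "worker"; if it dies, re-submit every task
    -- whose slot is still empty. Purity makes replication safe: re-running
    -- a task can only recompute the value it would have produced anyway.
    -- (A real implementation would bound the number of retries.)
    supervise :: [Task a] -> IO ()
    supervise tasks = do
      done <- newEmptyMVar
      _ <- forkIO $ do
        r <- try (forM_ tasks (\t -> runTask t >>= putMVar (slot t)))
        putMVar done (r :: Either SomeException ())
      outcome <- takeMVar done
      case outcome of
        Right () -> return ()                       -- every slot filled
        Left _   -> do                              -- worker "failed"
          pending <- filterM (isEmptyMVar . slot) tasks
          supervise pending                         -- replicate survivors

    main :: IO ()
    main = do
      -- One deliberately flaky task: it throws on its first attempt,
      -- simulating the loss of a node, and succeeds when replicated.
      firstTry <- newMVar True
      let square i = evaluate (i * i :: Int)
          flaky i = do
            first <- modifyMVar firstTry (\b -> return (False, b))
            if first then ioError (userError "simulated node loss")
                     else square i
      slots <- mapM (const newEmptyMVar) [1 .. 5 :: Int]
      let tasks = zipWith3 (\f i s -> Task (f i) s)
                           (flaky : repeat square) [1 .. 5] slots
      supervise tasks
      mapM_ (\s -> readMVar s >>= print) slots      -- 1, 4, 9, 16, 25

In HdpH itself the tasks are serialisable closures scheduled on remote nodes and the result slots are IVars, so failure detection and replication happen across machines rather than threads.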

Keywords

Fault tolerance · Workpools · Parallel computing · Haskell

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Robert Stewart (1)
  • Phil Trinder (1)
  • Patrick Maier (1)

  1. School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK
