Implementing Reliable Data Structures for MPI Services in High Component Count Systems

  • Justin M. Wozniak
  • Bryan Jacobs
  • Robert Latham
  • Sam Lang
  • Seung Woo Son
  • Robert Ross
Conference paper

DOI: 10.1007/978-3-642-03770-2_39

Part of the Lecture Notes in Computer Science book series (LNCS, volume 5759)
Cite this paper as:
Wozniak J.M., Jacobs B., Latham R., Lang S., Son S.W., Ross R. (2009) Implementing Reliable Data Structures for MPI Services in High Component Count Systems. In: Ropo M., Westerholm J., Dongarra J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2009. Lecture Notes in Computer Science, vol 5759. Springer, Berlin, Heidelberg

Abstract

High performance computing systems continue to grow: currently deployed systems exceed 160,000 cores and systems exceeding 1,000,000 cores are planned. Without significant improvements in component reliability, partial system failure modes could become an unacceptably regular occurrence, limiting the usability of advanced computing infrastructures. In this work, we intend to ease the development of survivable systems and applications through the implementation of a reliable key/value data store based on a distributed hash table (DHT). Borrowing from techniques developed for unreliable wide-area systems, we implemented a distributed data service built with MPI [1] that enables user data structures to survive partial system failure. The service is based on a new implementation of the Kademlia [2] distributed hash table.
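As a rough illustration of the idea in the abstract, the sketch below shows how a Kademlia-style key/value store can keep a value readable after one node is lost. It relies only on two well-known Kademlia properties: 160-bit identifiers compared by XOR distance, and replication of each value onto the k nodes closest to the key. Everything else is an assumption made for illustration: the names `ToyStore`, `make_id`, `closest_nodes`, and `fail` are hypothetical, the nodes are simulated in one process rather than as MPI ranks, and none of this is taken from the authors' implementation.

```python
import hashlib


def make_id(name: str) -> int:
    """Hash a string to a 160-bit integer identifier using SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: bitwise XOR read as an integer."""
    return a ^ b


def closest_nodes(key_id, node_ids, k):
    """Return the k node identifiers nearest to key_id by XOR distance."""
    return sorted(node_ids, key=lambda n: xor_distance(n, key_id))[:k]


class ToyStore:
    """Hypothetical in-process stand-in for a set of DHT nodes."""

    def __init__(self, node_names, k=3):
        self.k = k
        # Each simulated node owns a private dict of key_id -> value.
        self.nodes = {make_id(n): {} for n in node_names}

    def put(self, key, value):
        """Replicate the value onto the k nodes closest to the key."""
        key_id = make_id(key)
        for node_id in closest_nodes(key_id, self.nodes, self.k):
            self.nodes[node_id][key_id] = value

    def get(self, key):
        """Ask the k closest surviving nodes; return the first copy found."""
        key_id = make_id(key)
        for node_id in closest_nodes(key_id, self.nodes, self.k):
            if key_id in self.nodes[node_id]:
                return self.nodes[node_id][key_id]
        return None

    def fail(self, node_id):
        """Simulate the loss of one node and everything stored on it."""
        del self.nodes[node_id]


# Usage: store a value, lose the node nearest to the key, read it back.
store = ToyStore([f"rank-{i}" for i in range(16)], k=3)
store.put("checkpoint", 42)
nearest = closest_nodes(make_id("checkpoint"), store.nodes, 1)[0]
store.fail(nearest)
recovered = store.get("checkpoint")
```

Because the value was written to the three closest nodes, removing the single closest one still leaves two replicas among the three closest survivors, so the lookup succeeds. A real service would also need to re-replicate after a failure to restore the replica count, which this sketch omits.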

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Justin M. Wozniak (1)
  • Bryan Jacobs (1)
  • Robert Latham (1)
  • Sam Lang (1)
  • Seung Woo Son (1)
  • Robert Ross (1)

  1. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA
