FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

  • Carsten Weinhold
  • Adam Lackorzynski
  • Jan Bierbaum
  • Martin Küttler
  • Maksym Planeta
  • Hermann Härtig
  • Amnon Shiloh
  • Ely Levy
  • Tal Ben-Nun
  • Amnon Barak
  • Thomas Steinke
  • Thorsten Schütt
  • Jan Fajerski
  • Alexander Reinefeld
  • Matthias Lieber
  • Wolfgang E. Nagel
Conference paper
Part of the Lecture Notes in Computational Science and Engineering book series (LNCSE, volume 113)

Abstract

In this paper we describe the hardware and application-inherent challenges that future exascale systems pose to high-performance computing (HPC) and propose a system architecture that addresses them. This architecture is based on proven building blocks and a few principles: (1) a fast light-weight kernel that is supported by a virtualized Linux for tasks that are not performance critical, (2) decentralized load and health management using fault-tolerant gossip-based information dissemination, (3) a maximally parallel checkpoint store for cheap checkpoint/restart in the presence of frequent component failures, and (4) a runtime that enables applications to interact with the underlying system platform through new interfaces. The paper discusses the vision behind FFMK and the current state of a prototype implementation of the system, which is based on a microkernel and an adapted MPI runtime.
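
Principle (2), gossip-based load and health management, is described in detail in references [5] and [22]. Purely as an illustration of the generic push-gossip pattern such protocols build on (periodic random peer selection plus per-node record merging by timestamp), the following self-contained Python sketch may help; the class names, node count, fan-out, and merge rule are illustrative assumptions and not the FFMK implementation.

    import random

    # Minimal sketch of push-style gossip dissemination (illustrative only, not FFMK code):
    # each node keeps one record per known node (last-heard time, load) and periodically
    # pushes its table to a few random peers, which keep the newer record per node.

    NUM_NODES = 8   # assumed cluster size for this example
    FANOUT = 2      # peers contacted per gossip round (assumption)

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.clock = 0
            # table: node_id -> (last_heard_time, load)
            self.table = {node_id: (0, random.random())}

        def tick(self):
            """Advance local time and refresh this node's own record."""
            self.clock += 1
            self.table[self.id] = (self.clock, random.random())

        def push_to(self, peer):
            """Send the local table to a peer; the peer keeps the newer entry per node."""
            for nid, (t, load) in self.table.items():
                if nid not in peer.table or peer.table[nid][0] < t:
                    peer.table[nid] = (t, load)

    def gossip_round(nodes):
        for node in nodes:
            node.tick()
            for peer in random.sample([n for n in nodes if n is not node], FANOUT):
                node.push_to(peer)

    nodes = [Node(i) for i in range(NUM_NODES)]
    for _ in range(5):
        gossip_round(nodes)

    # Every node now holds (possibly slightly stale) load records for all other nodes.
    print(sorted(nodes[0].table.keys()))

After a few rounds each node has an approximate, decentralized view of the load and liveness of all other nodes; records that stop being refreshed can then serve as a health signal, without any central monitoring component.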

Acknowledgements

This research and the work presented in this paper are supported by the German priority program 1648 “Software for Exascale Computing” via the research project FFMK [16]. We also thank the cluster of excellence “Center for Advancing Electronics Dresden” (cfaed). The authors acknowledge the Jülich Supercomputing Centre, the Gauss Centre for Supercomputing, and the John von Neumann Institute for Computing for providing compute time on the JUQUEEN supercomputer.

References

  1. Acun, B., Gupta, A., Jain, N., Langer, A., Menon, H., Mikida, E., Ni, X., Robson, M., Sun, Y., Totoni, E., Wesolowski, L., Kale, L.: Parallel programming with migratable objects: Charm++ in practice. In: Proceedings of the Supercomputing 2014, Leipzig, pp. 647–658. IEEE (2014)
  2. Arnold, D.C., Miller, B.P.: Scalable failure recovery for high-performance data aggregation. In: Proceedings of the IPDPS 2010, Atlanta, pp. 1–11. IEEE (2010)
  3. Barak, A., Guday, S., Wheeler, R.: The MOSIX Distributed Operating System: Load Balancing for UNIX. Lecture Notes in Computer Science, vol. 672. Springer, Berlin/New York (1993)
  4. Barak, A., Margolin, A., Shiloh, A.: Automatic resource-centric process migration for MPI. In: Proceedings of the EuroMPI 2012. Lecture Notes in Computer Science, vol. 7490, pp. 163–172. Springer, Berlin/New York (2012)
  5. Barak, A., Drezner, Z., Levy, E., Lieber, M., Shiloh, A.: Resilient gossip algorithms for collecting online management information in exascale clusters. Concurr. Comput. Pract. Exper. 27(17), 4797–4818 (2015)
  6. Beckman, P., et al.: Argo: an exascale operating system. http://www.argo-osr.org/. Accessed 20 Nov 2015
  7. Ben-Nun, T., Levy, E., Barak, A., Rubin, E.: Memory access patterns: the missing piece of the multi-GPU puzzle. In: Proceedings of the Supercomputing 2015, Newport Beach, pp. 19:1–19:12. ACM (2015)
  8. Berkeley Lab Checkpoint/Restart. http://ftg.lbl.gov/checkpoint. Accessed 20 Nov 2015
  9. Brightwell, R., Oldfield, R., Maccabe, A.B., Bernholdt, D.E.: Hobbes: composition and virtualization as the foundations of an extreme-scale OS/R. In: Proceedings of the ROSS’13, pp. 2:1–2:8. ACM (2013)
  10. Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Automated application-level checkpointing of MPI programs. ACM SIGPLAN Not. 38(10), 84–94 (2003)
  11. Burstedde, C., Ghattas, O., Gurnis, M., Isaac, T., Stadler, G., Warburton, T., Wilcox, L.: Extreme-scale AMR. In: Proceedings of the Supercomputing 2010, Tsukuba, pp. 1–12. ACM (2010)
  12. Cappello, F., Geist, A., Gropp, W., Kale, S., Kramer, B., Snir, M.: Toward exascale resilience: 2014 update. Supercomput. Front. Innov. 1(1), 5–28 (2014)
  13. Corradi, A., Leonardi, L., Zambonelli, F.: Diffusive load-balancing policies for dynamic applications. IEEE Concurr. 7(1), 22–31 (1999)
  14. Dongarra, J., et al.: The international exascale software project roadmap. Int. J. High Speed Comput. 25(1), 3–60 (2011)
  15. EXAHD – An Exa-Scalable Two-Level Sparse Grid Approach for Higher-Dimensional Problems in Plasma Physics and Beyond. http://ipvs.informatik.uni-stuttgart.de/SGS/EXAHD/index.php. Accessed 29 Nov 2015
  16. FFMK Website. http://ffmk.tudos.org. Accessed 20 Nov 2015
  17. Harlacher, D.F., Klimach, H., Roller, S., Siebert, C., Wolf, F.: Dynamic load balancing for unstructured meshes on space-filling curves. In: Proceedings of the IPDPSW 2012, pp. 1661–1669. IEEE (2012)
  18. Kale, L.V., Zheng, G.: Charm++ and AMPI: adaptive runtime strategies via migratable objects. In: Parashar, M., Li, X. (eds.) Advanced Computational Infrastructures for Parallel and Distributed Adaptive Applications, chap. 13, pp. 265–282. Wiley, Hoboken (2009)
  19. Kogge, P., Shalf, J.: Exascale computing trends: adjusting to the “New Normal” for computer architecture. Comput. Sci. Eng. 15(6), 16–26 (2013)
  20. Lackorzynski, A., Warg, A., Peter, M.: Generic virtualization with virtual processors. In: Proceedings of the 12th Real-Time Linux Workshop, Nairobi (2010)
  21. Lange, J., Pedretti, K., Hudson, T., Dinda, P., Cui, Z., Xia, L., Bridges, P., Gocke, A., Jaconette, S., Levenhagen, M., Brightwell, R.: Palacios and Kitten: new high performance operating systems for scalable virtualized and native supercomputing. In: Proceedings of the IPDPS 2010, Atlanta, pp. 1–12. IEEE (2010)
  22. Levy, E., Barak, A., Shiloh, A., Lieber, M., Weinhold, C., Härtig, H.: Overhead of a decentralized gossip algorithm on the performance of HPC applications. In: Proceedings of the ROSS’14, Munich, pp. 10:1–10:7. ACM (2014)
  23. Lieber, M., Grützun, V., Wolke, R., Müller, M.S., Nagel, W.E.: Highly scalable dynamic load balancing in the atmospheric modeling system COSMO-SPECS+FD4. In: Proceedings of the PARA 2010. Lecture Notes in Computer Science, vol. 7133, pp. 131–141. Springer, Berlin/New York (2012)
  24. Liedtke, J.: On micro-kernel construction. In: Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95), Copper Mountain Resort, pp. 237–250. ACM (1995)
  25. Lucas, R., et al.: Top ten exascale research challenges. DOE ASCAC subcommittee report. http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf (2014). Accessed 20 Nov 2015
  26. Milthorpe, J., Ganesh, V., Rendell, A.P., Grove, D.: X10 as a parallel language for scientific computation: practice and experience. In: Proceedings of the IPDPS 2011, Anchorage, pp. 1080–1088. IEEE (2011)
  27. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.: Detailed modeling, design, and evaluation of a scalable multi-level checkpointing system. Technical report LLNL-TR-440491, Lawrence Livermore National Laboratory (LLNL) (2010)
  28. MPI: A message-passing interface standard, version 3.1. http://www.mpi-forum.org/docs (2015). Accessed 20 Nov 2015
  29. MVAPICH: MPI over InfiniBand. http://mvapich.cse.ohio-state.edu/. Accessed 20 Nov 2015
  30. Open Source Molecular Dynamics. http://www.cp2k.org/. Accessed 20 Nov 2015
  31. Ouyang, X., Marcarelli, S., Rajachandrasekar, R., Panda, D.K.: RDMA-based job migration framework for MPI over InfiniBand. In: Proceedings of the IEEE CLUSTER 2010, Heraklion, pp. 116–125. IEEE (2010)
  32. Rajachandrasekar, R., Moody, A., Mohror, K., Panda, D.K.: A 1 PB/s file system to checkpoint three million MPI tasks. In: Proceedings of the HPDC’13, New York, pp. 143–154. ACM (2013)
  33. Roitzsch, M., Wachtler, S., Härtig, H.: Atlas: look-ahead scheduling using workload metrics. In: Proceedings of the RTAS 2013, Philadelphia, pp. 1–10. IEEE (2013)
  34. Sato, K., Maruyama, N., Mohror, K., Moody, A., Gamblin, T., de Supinski, B.R., Matsuoka, S.: Design and modeling of a non-blocking checkpointing system. In: Proceedings of the Supercomputing 2012, Venice, pp. 19:1–19:10. IEEE (2012)
  35. Sato, M., Fukazawa, G., Yoshinaga, K., Tsujita, Y., Hori, A., Namiki, M.: A hybrid operating system for a computing node with multi-core and many-core processors. Int. J. Adv. Comput. Sci. 3, 368–377 (2013)
  36. Wang, C., Mueller, F., Engelmann, C., Scott, S.L.: Proactive process-level live migration and back migration in HPC environments. J. Parallel Distrib. Comput. 72(2), 254–267 (2012)
  37. Wende, F., Steinke, T., Reinefeld, A.: The impact of process placement and oversubscription on application performance: a case study for exascale computing. Technical report 15-05, ZIB (2015)
  38. Winkel, M., Speck, R., Hübner, H., Arnold, L., Krause, R., Gibbon, P.: A massively parallel, multi-disciplinary Barnes-Hut tree code for extreme-scale N-body simulations. Comput. Phys. Commun. 183(4), 880–889 (2012)
  39. Wisniewski, R.W., Inglett, T., Keppel, P., Murty, R., Riesen, R.: mOS: an architecture for extreme-scale operating systems. In: Proceedings of the ROSS’14, Munich, pp. 2:1–2:8. ACM (2014)
  40. XtreemFS – a cloud file system. http://www.xtreemfs.org. Accessed 20 Nov 2015
  41. Xue, M., Droegemeier, K.K., Weber, D.: Numerical prediction of high-impact local weather: a driver for petascale computing. In: Bader, D.A. (ed.) Petascale Computing: Algorithms and Applications, pp. 103–124. Chapman & Hall/CRC, Boca Raton (2008)
  42. Zheng, F., Yu, H., Hantas, C., Wolf, M., Eisenhauer, G., Schwan, K., Abbasi, H., Klasky, S.: GoldRush: resource efficient in situ scientific data analytics using fine-grained interference aware execution. In: Proceedings of the Supercomputing 2013, Eugene, pp. 78:1–78:12. ACM (2013)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Carsten Weinhold (1)
  • Adam Lackorzynski (1)
  • Jan Bierbaum (1)
  • Martin Küttler (1)
  • Maksym Planeta (1)
  • Hermann Härtig (1)
  • Amnon Shiloh (2)
  • Ely Levy (2)
  • Tal Ben-Nun (2)
  • Amnon Barak (2)
  • Thomas Steinke (3)
  • Thorsten Schütt (3)
  • Jan Fajerski (3)
  • Alexander Reinefeld (3)
  • Matthias Lieber (4)
  • Wolfgang E. Nagel (4)
  1. Department of Computer Science, TU Dresden, Dresden, Germany
  2. Department of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel
  3. Zuse Institute Berlin, Berlin, Germany
  4. Center for Information Services and HPC, TU Dresden, Dresden, Germany
