Engineering Cross-Layer Fault Tolerance in Many-Core Systems

  • Rem Gensh
  • Alexander Romanovsky
  • Alex Yakovlev
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9274)

Abstract

Engineering modern many-core systems is challenging because of their scale and complexity. Their dependability cannot be ensured without understanding its interplay with performance and energy consumption. This calls for new structuring mechanisms that step away from the traditional ways in which systems are developed (such as strict layering, strong encapsulation, abstractions, and hiding). The paper reports on the initial steps of PhD work on development methods and tools for architecting cross-layer fault tolerance in many-core systems, in which error detection and error recovery are applied at several system layers in a concerted, coordinated fashion to ensure overall system efficiency.

Keywords

Error detection; Error recovery; Performance; Power consumption; Abstractions; Encapsulation

Acknowledgments

This work is supported by the EPSRC/UK PRiME project and by the School of Computing Science, Newcastle University (UK).

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Rem Gensh (1)
  • Alexander Romanovsky (1)
  • Alex Yakovlev (2)
  1. Centre for Software Reliability, Newcastle University, Newcastle upon Tyne, UK
  2. School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, UK
