Experiences with Fault-Injection in a Byzantine Fault-Tolerant Protocol

  • Rolando Martins
  • Rajeev Gandhi
  • Priya Narasimhan
  • Soila Pertet
  • António Casimiro
  • Diego Kreutz
  • Paulo Veríssimo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8275)

Abstract

Performance improvements in Byzantine fault-tolerant state machine replication algorithms have made them a viable option for critical high-performance systems. However, the proofs that support these algorithms are complex to construct and often rely on assumptions that may not hold in a particular implementation. Furthermore, the transition from theory to practice is difficult and can introduce subtle bugs that break the assumptions on which these algorithms depend. To address these issues, we have developed Hermes, a fault-injector framework that provides an infrastructure for injecting faults into a Byzantine fault-tolerant state machine. Our main goal with Hermes is to help practitioners debug their implementations of these algorithms and, at the same time, to increase the confidence of potential adopters (e.g., systems researchers and industry) by allowing them to test those implementations. In this paper, we discuss our experiences using Hermes to inject faults into BFT-SMaRt, a high-performance Byzantine fault-tolerant state machine replication library.

Keywords

Byzantine fault-injector, failure diagnosis, cloud-computing, Byzantine fault-tolerance, intrusion-tolerance

Copyright information

© IFIP International Federation for Information Processing 2013

Authors and Affiliations

  • Rolando Martins (1)
  • Rajeev Gandhi (1)
  • Priya Narasimhan (1)
  • Soila Pertet (1)
  • António Casimiro (2)
  • Diego Kreutz (2)
  • Paulo Veríssimo (2)

  1. Department of Electrical & Computer Engineering, Carnegie Mellon University, USA
  2. Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal
