Event-Driven Chaos Testing for Containerized Applications

  • Conference paper
High Performance Computing (ISC High Performance 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13999)

Abstract

As emerging systems grow ever more dynamic, testing infrastructures must evolve to better reveal how distributed systems deal with failures. Existing chaos tools often lack a comprehensive understanding of the system’s runtime and typically inject faults at random. While random testing helps uncover “shallow” bugs, testing deep failure paths requires precise, controlled fault injection at specific runtime conditions in distributed systems. This paper introduces Frisbee, an automated chaos testing platform for distributed applications on Kubernetes. Frisbee uses both static and dynamic runtime instrumentation to manage the dependency stack and perform testing actions. It achieves this by integrating the collection of runtime events from multiple sources with a scenario modeling language. This approach allows Frisbee to inject realistic software faults in a controlled manner while the target system runs. Moreover, because the method is based on runtime events, it ensures deterministic fault injection regardless of the specific system or workload involved. We demonstrate the practicality and relevance of Frisbee across various applications, including cloud-native databases and federated learning deployments.
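
The idea of tying fault injection to runtime events rather than to random timing can be illustrated with a minimal Python sketch. This is not Frisbee's scenario language or API: the `Event` and `Trigger` types, the `run_scenario` function, and the event and fault names are all hypothetical, chosen only to show a fault firing when a named runtime condition is observed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Event:
    """A runtime event from some source (hypothetical schema, not Frisbee's)."""
    source: str   # e.g. "orchestrator", "logs", "metrics"
    kind: str     # e.g. "replica_ready", "compaction_started"
    target: str   # service the event refers to


@dataclass
class Trigger:
    """Pairs a runtime condition with the fault to inject when it holds."""
    condition: Callable[[Event], bool]
    fault: str            # label only; a real tool would call an injector here
    fired: bool = False   # each trigger fires at most once


def run_scenario(events: Iterable[Event], triggers: List[Trigger]) -> List[str]:
    """Replay events in order and record which fault is injected at which event."""
    injected = []
    for index, event in enumerate(events):
        for trigger in triggers:
            if not trigger.fired and trigger.condition(event):
                trigger.fired = True
                injected.append(f"{trigger.fault}@{event.target}#{index}")
    return injected


# Hypothetical event stream merged from several sources.
events = [
    Event("orchestrator", "replica_ready", "db-0"),
    Event("orchestrator", "replica_ready", "db-1"),
    Event("logs", "compaction_started", "db-1"),
]

# Inject a partition exactly when a compaction begins, not at a random time.
triggers = [
    Trigger(lambda e: e.kind == "compaction_started", "network_partition"),
]

print(run_scenario(events, triggers))  # ['network_partition@db-1#2']
```

Because the trigger is keyed to the `compaction_started` event, the fault lands at the same point in the system's execution on every run, however long the preceding steps take in wall-clock time.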

Acknowledgement

We thankfully acknowledge the support of the European Commission and the Greek General Secretariat for Research and Innovation under the EuroHPC Programme through project EUPEX (GA-101033975). National contributions from the involved state members (including the Greek General Secretariat for Research and Innovation) match the EuroHPC funding.

Author information

Corresponding author

Correspondence to Fotis Nikolaidis.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nikolaidis, F., Chazapis, A., Marazakis, M., Bilas, A. (2023). Event-Driven Chaos Testing for Containerized Applications. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_12

  • DOI: https://doi.org/10.1007/978-3-031-40843-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40842-7

  • Online ISBN: 978-3-031-40843-4

  • eBook Packages: Computer Science (R0)
