Abstract
As emerging systems grow more dynamic, testing infrastructures must evolve to reveal how distributed systems deal with failures. Existing chaos tools often lack a comprehensive understanding of the system’s runtime and typically inject faults at random. While random testing helps uncover “shallow” bugs, testing deep failure paths requires precise, controlled fault injection at specific runtime conditions. This paper introduces Frisbee, an automated chaos testing platform for distributed applications on Kubernetes. Frisbee uses both static and dynamic runtime instrumentation to manage the dependency stack and perform testing actions. It achieves this by integrating the collection of runtime events from multiple sources with a scenario modeling language. This approach allows Frisbee to inject realistic software faults in a controlled manner while the target system runs. Moreover, because the method is based on runtime events, fault injection is deterministic regardless of the specific system or workload involved. We demonstrate the practicality and relevance of Frisbee across various applications, including cloud-native databases and federated learning deployments.
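The idea of triggering faults on runtime events rather than at random times can be illustrated with a minimal Python sketch. This is not Frisbee's scenario language or API; the `Event`, `Rule`, and `Scenario` types, the event names, and the fault labels are all hypothetical, and the "injection" only records a decision instead of acting on a cluster. The sketch assumes each fault fires once, the first time its condition holds.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    source: str   # hypothetical source label, e.g. "metrics" or "logs"
    name: str     # hypothetical event name, e.g. "rows_inserted"
    value: float


@dataclass
class Rule:
    """Inject `fault` the first time `condition` holds for an incoming event."""
    fault: str
    condition: Callable[[Event], bool]
    fired: bool = False


@dataclass
class Scenario:
    rules: List[Rule]
    injected: List[str] = field(default_factory=list)

    def on_event(self, event: Event) -> None:
        for rule in self.rules:
            if not rule.fired and rule.condition(event):
                rule.fired = True
                # A real platform would act on the target system here;
                # this sketch only records which event triggered the fault.
                self.injected.append(f"{rule.fault}@{event.source}:{event.name}")


def run(events: List[Event]) -> List[str]:
    """Replay an event stream through a fresh scenario and return the faults fired."""
    scenario = Scenario(rules=[
        Rule("kill-node", lambda e: e.name == "rows_inserted" and e.value >= 1000),
        Rule("partition-network", lambda e: e.name == "leader_elected"),
    ])
    for event in events:
        scenario.on_event(event)
    return scenario.injected


# Events merged from two hypothetical sources (metrics and logs).
stream = [
    Event("metrics", "rows_inserted", 400),
    Event("metrics", "rows_inserted", 1200),
    Event("logs", "leader_elected", 1),
    Event("metrics", "rows_inserted", 2000),
]

print(run(stream))
```

Because the triggers depend only on the observed events and not on wall-clock time, replaying the same event stream yields the same injection points, which is the sense in which such a scheme is deterministic.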
Acknowledgement
We thankfully acknowledge the support of the European Commission and the Greek General Secretariat for Research and Innovation under the EuroHPC Programme through project EUPEX (GA-101033975). National contributions from the involved state members (including the Greek General Secretariat for Research and Innovation) match the EuroHPC funding.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nikolaidis, F., Chazapis, A., Marazakis, M., Bilas, A. (2023). Event-Driven Chaos Testing for Containerized Applications. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_12
DOI: https://doi.org/10.1007/978-3-031-40843-4_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science, Computer Science (R0)