Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Critical aspects of technical debt often surface only at runtime and are difficult to measure statically. This is a particular challenge for cloud applications because of their highly distributed nature. Mature frameworks for collecting runtime data exist, but they need to be integrated.
In this paper, we report an experience from a project that implements a cloud application within Kubernetes on Azure. To analyze the runtime data of this software system, we instrumented our services with Zipkin for distributed tracing, with Prometheus and Grafana for analyzing metrics, and with fluentd, Elasticsearch, and Kibana for collecting, storing, and exploring log files. However, project team members did not use these runtime data until we provided simple, unified access to them through a chat bot.
We argue that collecting runtime data in a project is not sufficient to guarantee that the data are used: to be useful, the different data sources need simple, unified access that is integrated into tools team members commonly use.
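To illustrate what unified access through a chat bot could look like, the following is a minimal sketch, not the implementation used in the project. It maps a single chat command onto the query endpoint of the matching backend. The command syntax, the function name `build_query_url`, and the base URLs in `BACKENDS` are hypothetical; only the endpoint paths (Prometheus `/api/v1/query`, Zipkin `/api/v2/traces`, Elasticsearch `/_search`) are those of the respective public HTTP APIs.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster base URLs; the real addresses are not given in the paper.
BACKENDS = {
    "metrics": "http://prometheus:9090",
    "traces": "http://zipkin:9411",
    "logs": "http://elasticsearch:9200",
}


def build_query_url(command: str) -> str:
    """Translate a chat command such as 'traces checkout' into a backend query URL.

    The command format '<metrics|traces|logs> <argument>' is an assumption
    made for this sketch.
    """
    parts = command.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError("usage: <metrics|traces|logs> <argument>")
    kind, arg = parts[0].lower(), parts[1]

    if kind == "metrics":
        # Prometheus instant query; the argument is a PromQL expression.
        params = urlencode({"query": arg})
        return f"{BACKENDS['metrics']}/api/v1/query?{params}"
    if kind == "traces":
        # Zipkin v2 trace search; the argument is a service name.
        params = urlencode({"serviceName": arg, "limit": 10})
        return f"{BACKENDS['traces']}/api/v2/traces?{params}"
    if kind == "logs":
        # Elasticsearch URI search over all indices; the argument is a query string.
        params = urlencode({"q": arg, "size": 10})
        return f"{BACKENDS['logs']}/_search?{params}"
    raise ValueError(f"unknown data source: {kind}")


if __name__ == "__main__":
    for cmd in ("metrics up", "traces checkout", "logs error"):
        print(cmd, "->", build_query_url(cmd))
```

A bot built along these lines would fetch the resulting URL and post a summary back into the team's chat channel, so that traces, metrics, and logs are reachable from one place with one command syntax.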
Keywords: DevOps, Cloud, Monitoring, Runtime quality
We thank Robert Hoffmann from Deutsche Telekom for his support.