Making Runtime Data Useful for Incident Diagnosis: An Experience Report

  • Florian Lautenschlager
  • Marcus Ciolkowski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11271)


Critical aspects of technical debt often surface only at runtime and are difficult to measure statically. This is a particular challenge for cloud applications because of their highly distributed nature. Mature frameworks for collecting runtime data exist, but they need to be integrated.

In this paper, we report on our experience from a project that implements a cloud application running in Kubernetes on Azure. To analyze the runtime data of this software system, we instrumented our services with Zipkin for distributed tracing; with Prometheus and Grafana for analyzing metrics; and with fluentd, Elasticsearch and Kibana for collecting, storing and exploring log files. However, project team members did not use this runtime data until we provided simple, unified access to it through a chat bot.

We argue that collecting runtime data is not sufficient to guarantee that a project actually uses it: to be useful, the data must be reachable through simple, unified access to the different data sources, and that access should be integrated into tools that team members already use.
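To make the idea of unified access concrete, the following is a minimal sketch of how one chat command per data source could be mapped to a query against the matching backend. It is not the authors' bot: the command names, the `handle_message` function and the base URLs are hypothetical, and the sketch only builds query URLs for the public HTTP APIs of Prometheus, Zipkin and Elasticsearch without sending any request.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster base URLs; a real deployment will differ.
BACKENDS = {
    "metrics": "http://prometheus:9090",
    "traces": "http://zipkin:9411",
    "logs": "http://elasticsearch:9200",
}


def metrics_url(arg):
    # Prometheus HTTP API: instant query with a PromQL expression.
    return f"{BACKENDS['metrics']}/api/v1/query?" + urlencode({"query": arg})


def traces_url(arg):
    # Zipkin v2 API: most recent traces for a given service name.
    params = {"serviceName": arg, "limit": 10}
    return f"{BACKENDS['traces']}/api/v2/traces?" + urlencode(params)


def logs_url(arg):
    # Elasticsearch URI search with a query string across all indices.
    return f"{BACKENDS['logs']}/_search?" + urlencode({"q": arg, "size": 10})


# One chat command per data source (command names are assumptions).
COMMANDS = {"metrics": metrics_url, "traces": traces_url, "logs": logs_url}


def handle_message(text):
    """Map a chat message such as '/logs level:ERROR' to a backend query URL."""
    parts = text.strip().split(maxsplit=1)
    if len(parts) != 2 or not parts[0].startswith("/"):
        return "usage: /metrics <promql> | /traces <service> | /logs <query>"
    command, arg = parts[0][1:], parts[1]
    builder = COMMANDS.get(command)
    if builder is None:
        return f"unknown command: {command}"
    return builder(arg)
```

In such a design, the chat tool the team already uses becomes the single entry point, and each backend keeps its own query language behind a short command.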


DevOps · Cloud · Monitoring · Runtime quality



We thank Robert Hoffmann from Deutsche Telekom for his support.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. QAware GmbH, Munich, Germany
