On Observability and Monitoring of Distributed Systems – An Industry Interview Study

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11895)


The business success of companies depends heavily on the availability and performance of their client applications. Modern development paradigms such as DevOps and microservice architectural styles decouple applications into services with complex interactions and dependencies. Although these paradigms enable individual development cycles with reduced delivery times, they make the services in distributed systems harder to manage. One major challenge is observing and monitoring such distributed systems. This paper presents a qualitative study of the challenges and good practices in the field of observability and monitoring of distributed systems. In 28 semi-structured interviews with software professionals, we found increasing complexity and dynamics in that field. Observability in particular is becoming an essential prerequisite for ensuring stable services and the further development of client applications. However, the participants reported a discrepancy in awareness of the topic's importance, from both the management and the developer perspective. Besides technical challenges, we identified a strong need for an organizational concept that includes strategy, roles, and responsibilities. Our results support practitioners in developing and implementing systematic observability and monitoring for distributed systems.


Keywords: Monitoring · Observability · Distributed systems · Cloud · Industry



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Institute of Software Technology, University of Stuttgart, Stuttgart, Germany
  2. Fraunhofer Institute for Industrial Engineering IAO, Stuttgart, Germany
