
Decreasing the Management Burden in Multi-tier Systems Through Partial Correlation-Based Monitoring

Journal of Network and Systems Management

Abstract

Modern web applications often consist of hundreds of services distributed across different servers or tiers. On one hand, this architecture provides easy abstraction and modularity for software development and reuse. On the other hand, it makes the behavior of the system difficult to predict, as each tier has its own functionality, configuration, and demand for computing resources. Anomaly detection is therefore an important aspect of managing and operating multi-tier web systems. To track their operation and aid in analyzing their behavior, web systems expose numerous metrics in every tier. However, collecting and analyzing all available metrics reduces system performance because of the non-negligible overhead in communication, storage, and processing. Another concern is the nature of the workload of these systems, which may fluctuate widely over time. One approach to supporting anomaly detection in web systems is to use stable correlations among monitoring metrics. This approach, called correlation-based monitoring, requires no deep understanding of the system internals or metric semantics, and does not depend on the availability of data about faults. In addition, because only the metrics involved in stable correlations are periodically collected, the monitoring overhead is reduced. Stable correlations also have the desirable property of holding for long periods of time before becoming invalid due to workload fluctuations. The challenge, however, is to identify the stable correlations. In this work, we address this challenge by proposing three novel strategies based on partial correlation, a statistical tool commonly employed to summarize the relevant information of complex systems. We evaluate our strategies using traces obtained from an e-commerce web transaction benchmark deployed in our testbed. Results show that our best strategy allows the construction of a monitoring network with fewer metrics than a state-of-the-art solution while achieving larger fault coverage. They also show that the correlations are reasonably stable, and that the models can be applied for sufficiently long periods of time (at least 50 times the training time) before they become invalid.
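For readers unfamiliar with partial correlation, the following is a minimal sketch of the textbook first-order formula, which measures how strongly two metrics move together once the linear influence of a third is removed. It is not one of the three strategies proposed in the article, whose details are not given in the abstract. The metric names (`load`, `web_cpu`, `db_cpu`) and all values are made up for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical samples: both CPU metrics are driven by the same workload.
# The noise terms are hand-picked to be uncorrelated with the load and
# with each other, so the effect is easy to see.
load = [1, 2, 3, 4, 5, 6, 7, 8]
noise_web = [1, -1, -1, 1, 1, -1, -1, 1]
noise_db = [1, -1, 1, -1, -1, 1, -1, 1]
web_cpu = [2 * l + e for l, e in zip(load, noise_web)]
db_cpu = [3 * l + e for l, e in zip(load, noise_db)]

print(f"plain correlation:   {pearson(web_cpu, db_cpu):.3f}")
print(f"partial correlation: {abs(partial_corr(web_cpu, db_cpu, load)):.3f}")
```

In this toy data the two CPU metrics are strongly correlated (about 0.97), yet their partial correlation given the load is essentially zero: the apparent link between them is explained entirely by the shared workload. This is the kind of distinction that makes partial correlation useful when deciding which metric pairs carry independent information.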

Author information

Corresponding author

Correspondence to Sand L. Correa.

About this article

Cite this article

Pinno, O.J.A., Correa, S.L., dos Santos, A.L. et al. Decreasing the Management Burden in Multi-tier Systems Through Partial Correlation-Based Monitoring. J Netw Syst Manage 25, 612–642 (2017). https://doi.org/10.1007/s10922-017-9402-7

  • DOI: https://doi.org/10.1007/s10922-017-9402-7
