Abstract
Modern web applications often consist of hundreds of services distributed across different servers or tiers. On one hand, this architecture provides easy abstraction and modularity for software development and reuse. On the other hand, it makes the behavior of the system difficult to predict, as each tier has its own functionality, configuration, and demands for computing resources. Anomaly detection is therefore an important aspect of the management and operation of multi-tier web systems. To track their operation and aid the analysis of their behavior, web systems expose numerous metrics in all tiers. However, collecting and analyzing all available metrics reduces system performance because of the non-negligible overhead on communication, storage, and processing. Another concern is the workload of these systems, which may fluctuate widely over time. One approach to supporting anomaly detection in web systems is to use stable correlations among monitoring metrics. This approach, called correlation-based monitoring, requires no deep understanding of the system internals or metric semantics, and it does not depend on the existence of data about faults. In addition, because only the metrics involved in stable correlations are periodically collected, the monitoring overhead is reduced. Stable correlations also have the desirable property of holding for long periods of time before becoming invalid due to workload fluctuations. The challenge, however, is to identify the stable correlations. In this work, we address this challenge by proposing three novel strategies based on partial correlation, a statistical tool commonly employed to summarize the relevant information of complex systems. We evaluate our strategies using traces obtained from an e-commerce web transaction benchmark deployed in our testbed.
Results show that our best strategy allows the construction of a monitoring network with fewer metrics than a state-of-the-art solution while achieving greater fault coverage. They also show that the correlations are reasonably stable, and that the models can be applied for sufficiently long periods of time (at least 50 times the training time) before they become invalid.
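To make the central statistical tool concrete, the following is a minimal sketch of the textbook first-order partial correlation between two metrics after controlling for a third. It is not one of the three strategies proposed in the article; the metric names (`load`, `cpu`, `db`) and the sample values are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x: Sequence[float], y: Sequence[float],
                 z: Sequence[float]) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical samples: both metrics are driven by the same request load,
# plus small, mutually uncorrelated disturbances.
load = [1, 2, 3, 4, 5, 6, 7, 8]                  # requests per second
noise_cpu = [1, -1, 1, -1, 1, -1, 1, -1]
noise_db = [1, 1, -1, -1, 1, 1, -1, -1]
cpu = [2 * l + 0.5 * n for l, n in zip(load, noise_cpu)]  # CPU usage
db = [3 * l + 0.5 * n for l, n in zip(load, noise_db)]    # DB queries

print("plain correlation cpu/db:  ", round(pearson(cpu, db), 3))
print("partial correlation | load:", round(partial_corr(cpu, db, load), 3))
```

In this made-up data the plain correlation between `cpu` and `db` is close to 1, while the partial correlation given `load` is close to 0, which suggests that the apparent link between the two metrics is explained by the shared workload rather than by a direct dependency.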
Cite this article
Pinno, O.J.A., Correa, S.L., dos Santos, A.L. et al. Decreasing the Management Burden in Multi-tier Systems Through Partial Correlation-Based Monitoring. J Netw Syst Manage 25, 612–642 (2017). https://doi.org/10.1007/s10922-017-9402-7
DOI: https://doi.org/10.1007/s10922-017-9402-7