VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications

Wang, Chengwei; Rayan, Infantdani Abel; Eisenhauer, Greg; Schwan, Karsten; Talwar, Vanish; Wolf, Matthew; Huneycutt, Chad

doi:10.1007/978-3-642-35170-9_7

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7662)

Included in the following conference series:

ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing

Abstract

Data-intensive infrastructures are increasingly used for on-line processing of live data to guide operations and decision making. VScope is a flexible monitoring and analysis middleware for troubleshooting such large-scale, time-sensitive, multi-tier applications. With VScope, lightweight anomaly detection and interaction tracking methods can be run continuously throughout an application's execution. The runtime events generated by these methods can then initiate more detailed and heavier-weight analyses, which are dynamically deployed where they are most likely to be fruitful for root cause diagnosis and mitigation. We comprehensively evaluate the VScope prototype in a virtualized data center environment with over 1000 virtual machines (VMs), and apply VScope to a representative on-line log processing application. Experimental results show that VScope can deploy and operate a variety of on-line analytics functions and metrics within a few seconds at large scale. Compared to traditional logging approaches, VScope-based troubleshooting has substantially lower perturbation and generates much smaller log data volumes. It can also resolve complex cross-tier or cross-software-level issues that cannot be solved by application-level or per-tier mechanisms alone.
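
The two-stage idea in the abstract (a cheap detector that runs continuously, whose events trigger a heavier analysis only where needed) can be illustrated with a minimal sketch. This is not VScope's API: the names `AnomalyEvent`, `lightweight_detect`, and `troubleshoot`, the k-sigma threshold rule, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Dict, List


@dataclass
class AnomalyEvent:
    """Hypothetical runtime event emitted by the cheap detector."""
    node: str
    value: float


def lightweight_detect(samples: Dict[str, List[float]], k: float = 3.0) -> List[AnomalyEvent]:
    """Cheap check (assumed rule): flag a node whose latest sample lies more
    than k standard deviations from the mean of its own earlier samples."""
    events = []
    for node, values in samples.items():
        history, latest = values[:-1], values[-1]
        if len(history) < 2:
            continue  # not enough history to estimate a baseline
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0 and abs(latest - mu) > k * sigma:
            events.append(AnomalyEvent(node, latest))
    return events


def troubleshoot(samples: Dict[str, List[float]],
                 heavy_analysis: Callable[[str], str]) -> Dict[str, str]:
    """Run the expensive analysis only on nodes flagged by the cheap detector."""
    return {e.node: heavy_analysis(e.node) for e in lightweight_detect(samples)}


# Hypothetical per-VM latency samples; only vm-2 shows a spike.
samples = {
    "vm-1": [10, 11, 10, 11, 10],
    "vm-2": [10, 11, 10, 11, 50],
}
report = troubleshoot(samples, lambda node: f"detailed trace collected on {node}")
print(report)
```

In this sketch the heavy analysis is invoked once, for `vm-2` only, which mirrors the abstract's claim that detailed analyses are deployed selectively rather than everywhere.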

Author information

Authors and Affiliations

Georgia Institute of Technology, US
Chengwei Wang, Greg Eisenhauer, Karsten Schwan, Matthew Wolf & Chad Huneycutt
HP Labs., US
Vanish Talwar
Riot Games, US
Infantdani Abel Rayan

Editor information

Editors and Affiliations

Electrical and Computer Engineering Department, Carnegie Mellon University, 4720 Forbes Avenue, 15213, Pittsburgh, PA, USA
Priya Narasimhan
Department of Computer Engineering and Informatics, University of Patras, University Campus, 26504, Rio, Greece
Peter Triantafillou

About this paper

Cite this paper

Wang, C. et al. (2012). VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications. In: Narasimhan, P., Triantafillou, P. (eds) Middleware 2012. Lecture Notes in Computer Science, vol 7662. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35170-9_7

DOI: https://doi.org/10.1007/978-3-642-35170-9_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35169-3
Online ISBN: 978-3-642-35170-9
eBook Packages: Computer Science, Computer Science (R0)
