A Big Data analyzer for large trace logs

Balliu, Alkida; Olivetti, Dennis; Babaoglu, Ozalp; Marzolla, Moreno; Sîrbu, Alina

doi:10.1007/s00607-015-0480-7

Published: 28 December 2015

Computing, volume 98, pages 1225–1249 (2016)

Alkida Balliu¹, Dennis Olivetti¹, Ozalp Babaoglu², Moreno Marzolla² & Alina Sîrbu²

Abstract

Current-generation Internet-based services are typically hosted in large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors, including computing hardware, multiple layers of intricate software, networking and storage devices, and electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques that exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents the Big Data analyzer (BiDAl), a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly available traces from Google data clusters, with the goal of building a realistic model of a complex data center.
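To make the idea of pairing an analysis language with a storage backend more concrete, the following is a minimal sketch of one such pairing, SQL over SQLite, written in Python with the standard `sqlite3` module. It is not BiDAl's code or API: BiDAl itself is a Java tool, and the table name `task_events`, its four columns, the helper functions, and the sample rows are all hypothetical, synthetic stand-ins for a real trace.

```python
import sqlite3

# Hypothetical, simplified trace rows: (timestamp, job_id, task_index, event_type).
# The values are synthetic and only illustrate the workflow; they are not
# taken from any real trace.
SAMPLE_EVENTS = [
    (0, 1, 0, "SUBMIT"),
    (5, 1, 0, "SCHEDULE"),
    (40, 1, 0, "FINISH"),
    (2, 2, 0, "SUBMIT"),
    (9, 2, 0, "SCHEDULE"),
    (30, 2, 0, "FAIL"),
    (3, 2, 1, "SUBMIT"),
    (12, 2, 1, "SCHEDULE"),
    (50, 2, 1, "FINISH"),
]


def load_events(rows):
    """Load trace rows into an in-memory SQLite database (the storage backend)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_events ("
        "ts INTEGER, job_id INTEGER, task_index INTEGER, event_type TEXT)"
    )
    conn.executemany("INSERT INTO task_events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def count_events_by_type(conn):
    """Count how many events of each type appear in the trace."""
    cur = conn.execute(
        "SELECT event_type, COUNT(*) FROM task_events "
        "GROUP BY event_type ORDER BY event_type"
    )
    return dict(cur.fetchall())


def mean_queue_delay(conn):
    """Average time between a task's SUBMIT and SCHEDULE events."""
    cur = conn.execute(
        """
        SELECT AVG(s.ts - q.ts)
        FROM task_events AS q
        JOIN task_events AS s
          ON q.job_id = s.job_id AND q.task_index = s.task_index
        WHERE q.event_type = 'SUBMIT' AND s.event_type = 'SCHEDULE'
        """
    )
    return cur.fetchone()[0]


if __name__ == "__main__":
    conn = load_events(SAMPLE_EVENTS)
    print(count_events_by_type(conn))
    print(mean_queue_delay(conn))
```

In this sketch the storage step (`load_events`) is kept separate from the analysis queries, so the same queries could in principle be pointed at a different backend. That separation is an assumption made for illustration, not a description of how BiDAl is implemented.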

Author information

Authors and Affiliations

Gran Sasso Science Institute (GSSI), L’Aquila, Italy
Alkida Balliu & Dennis Olivetti
Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
Ozalp Babaoglu, Moreno Marzolla & Alina Sîrbu

Corresponding author

Correspondence to Alina Sîrbu.

About this article

Cite this article

Balliu, A., Olivetti, D., Babaoglu, O. et al. A Big Data analyzer for large trace logs. Computing 98, 1225–1249 (2016). https://doi.org/10.1007/s00607-015-0480-7

Received: 04 August 2015
Accepted: 10 December 2015
Published: 28 December 2015
Issue Date: December 2016
DOI: https://doi.org/10.1007/s00607-015-0480-7
