Abstract
Current generation of Internet-based services are typically hosted on large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors including computing hardware, multiple layers of intricate software, networking and storage devices, electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques to exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents Big Data analyzer (BiDAl), a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly-available traces from Google data clusters, with the goal of building a realistic model of a complex data center.
Similar content being viewed by others
References
Abdul-Rahman OA, Aida K (2014) Towards understanding the usage behavior of Google cloud users: the mice and elephants phenomenon. In: 2014 IEEE 6th international conference on cloud computing technology and science (CloudCom). IEEE, pp 272–277. doi:10.1109/CloudCom.2014.75
Abraham L, Allen J, Barykin O, Borkar V, Chopra B, Gerea C, Merl D, Metzler J, Reiss D, Subramanian S, Wiener JL, Zed O (2013) Scuba: diving into data at facebook. Proc VLDB Endow 6(11):1057–1067. doi:10.14778/2536222.2536231
Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD’15). ACM, New York, pp 1383–1394. doi:10.1145/2723372.2742797
Breitgand D, Dubitzky Z, Epstein A, Feder O, Glikson A, Shapira I, Toffetti G (2014) An adaptive utilization accelerator for virtualized environments. In: 2014 IEEE international conference on cloud engineering (IC2E). IEEE, pp 165–174. doi:10.1109/IC2E.2014.63
Caglar F, Gokhale A (2014) iOverbook: intelligent resource-overbooking to support soft real-time applications in the cloud. In: Proceedings of the 2014 IEEE international conference on cloud computing (CLOUD’14). IEEE Computer Society, Washington, DC, pp 538–545. doi:10.1109/CLOUD.2014.78
Calheiros RN, Ranjan R, Beloglazov A, De Rose CAF, Buyya R (2011) Cloudsim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw Pract Exp 41(1):23–50. doi:10.1002/spe.995
Chen Y, Alspaugh S, Katz RH (2012) Design insights for MapReduce from diverse production workloads. Tech. Rep. UCB/EECS-2012-17, EECS Department, University of California, Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-17.html. Accessed Dec 2015
Chen Y, Ganapathi A, Griffith R, Katz RH (2011) The case for evaluating MapReduce performance using workload suites. In: 2011 IEEE 19th annual international symposium on modelling, analysis, and simulation of computer and telecommunication systems, pp 390–399. doi:10.1109/MASCOTS.2011.12
Dean J, Ghemawat S (2010) Mapreduce: a flexible data processing tool. Commun ACM 53(1):72–77. doi:10.1145/1629175.1629198
Di S, Kondo D, Cirne W (2012) Characterization and comparison of Google cloud load versus grids. In: 2012 IEEE international conference on cluster computing (CLUSTER), Beijing, pp 230–238. doi:10.1109/CLUSTER.2012.35
Di S, Kondo D, Cirne W (2012) Host load prediction in a Google compute cloud with a bayesian model. In: Proceedings of the international conference on high performance computing, networking, storage and analysis. IEEE Computer Society Press, USA, pp 1–11. doi:10.1109/SC.2012.68
Di S, Robert Y, Vivien F, Kondo D, Wang CL, Cappello F (2013) Optimization of cloud task processing with checkpoint-restart mechanism. In: 2013 international conference for high performance computing, networking, storage and analysis (SC). IEEE, pp 1–12
Gamma E, Helm R, Johnson R, Vlissides J (1994) Design patterns: elements of reusable object-oriented software. Addison-Wesley Professional, Boston
Gibbons JD, Chakraborti S (2010) Nonparametric statistical inference. Chapman and Hall/CRC, London
Guan Q, Fu S (2013) Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In: Proceedings of the 2013 IEEE 32nd international symposium on reliable distributed systems (SRDS’13). IEEE Computer Society, Washington, DC, pp 205–214. doi:10.1109/SRDS.2013.29
Gupta SKS, Banerjee A, Abbasi Z, Varsamopoulos G, Jonas M, Ferguson J, Gilbert RR, Mukherjee T (2014) Gdcsim: a simulator for green data center design and analysis. ACM Trans Model Comput Simul 24(1):3:1–3:27. doi:10.1145/2553083
Iglesias JO, Murphy L, De Cauwer M, Mehta D, O’Sullivan B (2014) A methodology for online consolidation of tasks through more accurate resource estimations. In: Proceedings of the 2014 IEEE/ACM 7th international conference on utility and cloud computing (UCC’14). IEEE Computer Society, Washington, DC, pp 89–98. doi:10.1109/UCC.2014.17
Isard M (2007) Autopilot: automatic data center management. SIGOPS Oper Syst Rev 41(2):60–67. doi:10.1145/1243418.1243426
Javadi B, Kondo D, Iosup A, Epema D (2013) The failure trace archive: enabling the comparison of failure measurements and models of distributed systems. J Parallel Distrib Comput 73(8):1208–1223. doi:10.1016/j.jpdc.2013.04.002
Kavulya S, Tan J, Gandhi R, Narasimhan P (2010) An analysis of traces from a production MapReduce cluster. In: Proceedings of the 2010 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGRID’10). IEEE Computer Society, Washington, DC, pp 94–103. doi:10.1109/CCGRID.2010.112
Kornacker M, Behm A, Bittorf V, Bobrovytsky T, Ching C, Choi A, Erickson J, Grund M, Hecht D, Jacobs M, Joshi I, Kuff L, Kumar D, Leblang A, Li N, Pandis I, Robinson H, Rorke D, Rus S, Russell J, Tsirogiannis D, Wanderman-Milne S, Yoder M (2015) Impala: a modern, open-source SQL engine for Hadoop. In: CIDR 2015, seventh biennial conference on innovative data systems research, Asilomar
Liu Z, Cho S (2012) Characterizing machines and workloads on a Google cluster. In: 2012 41st international conference on parallel processing workshops (ICPPW), pp 397–403. doi:10.1109/ICPPW.2012.57
Mishra AK, Hellerstein JL, Cirne W, Das CR (2010) Towards characterizing cloud backend workloads: insights from Google compute clusters. SIGMETRICS Perform Eval Rev 37(4):34–41. doi:10.1145/1773394.1773400
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data (SIGMOD’08). ACM, New York, pp 1099–1110. doi:10.1145/1376616.1376726
R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org. Accessed Dec 2015
Reiss C, Tumanov A, Ganger GR, Katz RH, Kozuch MA (2012) Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In: Proceedings of the third ACM symposium on cloud computing (SoCC’12). ACM, New York, pp 7:1–7:13. doi:10.1145/2391229.2391236
Reiss C, Wilkes J, Hellerstein JL (2011) Google cluster-usage traces: format \(+\) schema. Technical report, Google Inc., Mountain View. http://code.google.com/p/googleclusterdata/wiki/TraceVersion2. Accessed 20 March 2012
Salfner F, Lenk M, Malek M (2010) A survey of online failure prediction methods. ACM Comput Surv 42(3):10:1–10:42. doi:10.1145/1670679.1670680
Schwarzkopf M, Konwinski A, Abd-El-Malek M, Wilkes J (2013) Omega: flexible, scalable schedulers for large compute clusters. In: Proceedings of the 8th ACM European conference on computer systems (EuroSys’13). ACM, New York, pp 351–364. doi:10.1145/2465351.2465386
Sharma B, Chudnovsky V, Hellerstein JL, Rifaat R, Das CR (2011) Modeling and synthesizing task placement constraints in Google compute clusters. In: Proceedings of the 2nd ACM symposium on cloud computing (SOCC’11). ACM, New York, pp 3:1–3:14. doi:10.1145/2038916.2038919
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST’10). IEEE Computer Society, USA, pp 1–10. doi:10.1109/MSST.2010.5496972
Sîrbu A, Babaoglu O (2015) Towards data-driven autonomics in data centers. In: IEEE international conference on cloud and autonomic computing (ICCAC). IEEE
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Anthony S, Liu H, Murthy R (2010) Hive—a petabyte scale data warehouse using Hadoop. In: Proceedings of the 26th international conference on data engineering (ICDE), Long Beach, pp 996–1005. doi:10.1109/ICDE.2010.5447738
Thusoo A, Shao Z, Anthony S, Borthakur D, Jain N, Sen Sarma J, Murthy R, Liu H (2010) Data warehousing and analytics infrastructure at facebook. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data (SIGMOD’10). ACM, New York, pp 1013–1020. doi:10.1145/1807167.1807278
Varga A et al (2001) The OMNeT++ discrete event simulation system. In: Proceedings of the European simulation multiconference (ESM’01), Prague
Verma A, Pedrosa L, Korupolu M, Oppenheimer D, Tune E, Wilkes J (2015) Large-scale cluster management at Google with borg. In: Proceedings of the tenth European conference on computer systems (EuroSys’15). ACM, New York, pp 18:1–18:17. doi:10.1145/2741948.2741964
Wang G, Butt AR, Monti H, Gupta K (2011) Towards synthesizing realistic workload traces for studying the Hadoop ecosystem. In: Proceedings of the 2011 IEEE 19th annual international symposium on modelling, analysis, and simulation of computer and telecommunication systems (MASCOTS’11). IEEE Computer Society, Washington, DC, pp 400–408. doi:10.1109/MASCOTS.2011.59
Wang G, Butt AR, Pandey P, Gupta K (2009) A simulation approach to evaluating design decisions in MapReduce setups. In: IEEE international symposium on modeling, analysis simulation of computer and telecommunication systems (MASCOTS’09), pp 1–11 (2009). doi:10.1109/MASCOT.2009.5366973
Wilkes J (2011) More Google cluster data. Google research blog. http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html. Accessed Dec 2015
Wolski R, Brevik J (2014) Using parametric models to represent private cloud workloads. IEEE Trans Serv Comput 7(4):714–725. doi:10.1109/TSC.2013.48
Zhang Q, Hellerstein JL, Boutaba R (2011) Characterizing task usage shapes in Google’s compute clusters. In: Proceedings of the 5th international workshop on large scale distributed systems and middleware
Zhang Q, Zhani MF, Boutaba R, Hellerstein JL (2014) Dynamic heterogeneity-aware resource provisioning in the cloud. IEEE Trans Cloud Comput 2(1):14–28. doi:10.1109/TCC.2014.2306427
Zhang X, Tune E, Hagmann R, Jnagal R, Gokhale V, Wilkes J (2013) CPI2: CPU performance isolation for shared compute clusters. In: Proceedings of the 8th ACM European conference on computer systems (EuroSys’13). ACM, New York, pp 379–391. doi:10.1145/2465351.2465388
Zhao W, Peng Y, Xie F, Dai Z (2012) Modeling and simulation of cloud computing: a review. In: 2012 IEEE Asia Pacific cloud computing congress (APCloudCC), pp 20–24. doi:10.1109/APCloudCC.2012.6486505
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Balliu, A., Olivetti, D., Babaoglu, O. et al. A Big Data analyzer for large trace logs. Computing 98, 1225–1249 (2016). https://doi.org/10.1007/s00607-015-0480-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00607-015-0480-7