A Big Data analyzer for large trace logs

Abstract

The current generation of Internet-based services is typically hosted in large data centers that take the form of warehouse-size structures housing tens of thousands of servers. The continued availability of a modern data center is the result of a complex orchestration among many internal and external actors, including computing hardware, multiple layers of intricate software, networking and storage devices, and electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs, which are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques that exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents Big Data analyzer (BiDAl), a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched, so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture, so it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly available traces from Google data clusters, with the goal of building a realistic model of a complex data center.

Corresponding author

Correspondence to Alina Sîrbu.

Cite this article

Balliu, A., Olivetti, D., Babaoglu, O. et al. A Big Data analyzer for large trace logs. Computing 98, 1225–1249 (2016). https://doi.org/10.1007/s00607-015-0480-7

Keywords

  • Big Data
  • Log analysis
  • Workload characterization
  • Google cluster trace
  • Model
  • Simulation

Mathematics Subject Classification

  • 68N01
  • 68P20
  • 68U20