Volume 98, Issue 12, pp 1225–1249

A Big Data analyzer for large trace logs

  • Alkida Balliu
  • Dennis Olivetti
  • Ozalp Babaoglu
  • Moreno Marzolla
  • Alina Sîrbu


The current generation of Internet-based services is typically hosted in large data centers that take the form of warehouse-size structures housing tens of thousands of servers. The continued availability of a modern data center is the result of a complex orchestration among many internal and external actors, including computing hardware, multiple layers of intricate software, networking and storage devices, and electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques that exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents Big Data analyzer (BiDAl), a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly available traces from Google data clusters, with the goal of building a realistic model of a complex data center.


Big Data, Log analysis, Workload characterization, Google cluster trace, Model, Simulation

Mathematics Subject Classification

68N01 68P20 68U20 



Copyright information

© Springer-Verlag Wien 2015

Authors and Affiliations

  • Alkida Balliu (1)
  • Dennis Olivetti (1)
  • Ozalp Babaoglu (2)
  • Moreno Marzolla (2)
  • Alina Sîrbu (2)

  1. Gran Sasso Science Institute (GSSI), L’Aquila, Italy
  2. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
