ALOJA: A Benchmarking and Predictive Platform for Big Data Performance Analysis

  • Nicolas Poggi
  • Josep Ll. Berral
  • David Carrera
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10044)


The main goals of the ALOJA research project from BSC-MSR are to explore and automate the characterization of the cost-effectiveness of Big Data deployments. Over its first year, the project has produced an open source benchmarking platform, an online public repository of results with over 42,000 Hadoop job runs, and web-based analytic tools to gather insights about a system’s cost-performance (ALOJA’s Web application, tools, and sources are available online). This article describes the evolution of the project’s focus and research lines over more than a year of continuously benchmarking Hadoop under different configuration and deployment options, presents results, and discusses the technical and market-based motivation for these changes. During this time, ALOJA’s target has evolved from low-level profiling of the Hadoop runtime, through extensive benchmarking and evaluation of a large body of results via aggregation, to its current use of Predictive Analytics (PA) techniques. Modeling benchmark executions allows us to estimate the results of new or untested configurations or hardware set-ups automatically by learning from past observations, saving benchmarking time and costs.





This work is partially supported by the BSC-Microsoft Research Centre, the Spanish Ministry of Education (TIN2012-34557), the MINECO Severo Ochoa Research program (SEV-2011-0067), and the Generalitat de Catalunya (2014-SGR-1051).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Nicolas Poggi (1)
  • Josep Ll. Berral (1)
  • David Carrera (1)

  1. Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC-BarcelonaTech), Barcelona, Spain
