Scheduling MapReduce Jobs in HPC Clusters

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7484)


MapReduce (MR) has become a de facto standard for large-scale data analysis. It has also attracted the attention of the HPC community due to its simplicity, efficiency, and highly scalable parallel model. However, MR implementations present some issues that may complicate their execution in existing HPC clusters, especially concerning job submission. While MR requires no strict parameters to submit a job, in a typical HPC cluster users must specify the number of nodes and the amount of time required to complete the job execution. This paper presents the MR Job Adaptor, a component that optimizes the scheduling of MR jobs alongside HPC jobs in an HPC cluster. Experiments performed with real-world HPC and MapReduce workloads show that the MR Job Adaptor can properly transform MR jobs to be scheduled in an HPC cluster, minimizing job turnaround time and exploiting unused resources in the cluster.
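The core mismatch the abstract describes is that an MR job carries no node-count or walltime request, while an HPC resource manager demands both. The following sketch illustrates one way such a translation could work; it is not the paper's actual algorithm, and all names and parameters (task counts, average task durations, slots per node) are illustrative assumptions.

```python
# Hypothetical sketch of an MR-to-HPC job translation (NOT the paper's
# algorithm): derive a (nodes, walltime) request, which an HPC resource
# manager requires, from an MR job's own parameters, assuming uniform
# task durations and a fixed number of task slots per node.
import math

def adapt_mr_job(num_map_tasks, num_reduce_tasks,
                 avg_map_secs, avg_reduce_secs,
                 slots_per_node, free_nodes):
    """Translate an MR job into a (nodes, walltime_secs) HPC request.

    Caps the allocation at the cluster's currently free nodes, so the
    MR job only exploits otherwise unused resources.
    """
    # Never ask for more nodes than are free, nor more than the map
    # phase can occupy at once.
    nodes = max(1, min(free_nodes,
                       math.ceil(num_map_tasks / slots_per_node)))
    slots = nodes * slots_per_node
    # Map and reduce tasks run in waves over the available slots.
    map_waves = math.ceil(num_map_tasks / slots)
    reduce_waves = math.ceil(num_reduce_tasks / slots)
    walltime = map_waves * avg_map_secs + reduce_waves * avg_reduce_secs
    return nodes, math.ceil(walltime)

# Example: 64 map tasks and 8 reduce tasks on a cluster with 8 slots
# per node and only 4 nodes currently free.
nodes, walltime = adapt_mr_job(num_map_tasks=64, num_reduce_tasks=8,
                               avg_map_secs=30, avg_reduce_secs=60,
                               slots_per_node=8, free_nodes=4)
```

With 4 free nodes the job gets 32 slots, so the 64 map tasks run in two waves (60 s) and the 8 reduce tasks in one (60 s), yielding a 120-second walltime request. A real adaptor would additionally search over candidate node counts to minimize turnaround time, as the paper's evaluation targets.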


Keywords: Turnaround Time · Resource Management System · Free Slot · Unused Resource · Mixed Workload



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Faculty of Informatics, PUCRS, Brazil
