Architecture for the Execution of Tasks in Apache Spark in Heterogeneous Environments

  • Estefania Serrano
  • Javier Garcia Blas
  • Jesus Carretero
  • Monica Abella
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10104)


Recent advances in computing platforms and the ease of migration to the Cloud Computing paradigm have driven the migration of scientific applications to task-based distributed computing frameworks. However, many of these applications have already been optimized for execution on specific hardware accelerators such as GPUs. In this work, we present an architecture design that aims to facilitate the execution of traditional HPC applications in Big Data environments. We show that the larger memory capacity, the automatic task partitioning, and the higher computational power of these environments lead to a convergence towards a highly distributed execution model. Moreover, we present a study of the viability of our proposal through the use of GPUs inside the Apache Spark infrastructure. The proposed architecture is evaluated with a real medical imaging application. The evaluation results demonstrate that our approach obtains execution times competitive with the original application.


Keywords: Computed Tomography (CT) · GPU · scheduling · MapReduce



This work has been partially supported by the Spanish MINISTERIO DE ECONOMÍA Y COMPETITIVIDAD under the project grant TIN2016-79637-P TOWARDS UNIFICATION OF HPC AND BIG DATA PARADIGMS, and by the NECRA project (RTC-2014-3028-1). We also want to thank NVIDIA for providing the Tesla K40 device with which we performed the experiments.


  1.
  2. RPyC - Transparent, Symmetric Distributed Computing – RPyC.
  3. Blas, J.G., Abella, M., Isaila, F., Carretero, J., Desco, M.: Surfing the optimization space of a multiple-GPU parallel implementation of a X-ray tomography reconstruction algorithm. J. Syst. Softw. 95, 166–175 (2014)
  4. Boubela, R.N., Kalcher, K., Huf, W., Našel, C., Moser, E.: Big data approaches for the analysis of large-scale fMRI data using Apache Spark and GPU processing: a demonstration on resting-state fMRI data from the Human Connectome Project. Front. Neurosci. 9, Article no. 492 (2015)
  5. Caino-Lores, S., Fernandez, A.G., Garcia-Carballeira, F., Perez, J.C.: A cloudification methodology for multidimensional analysis: implementation and application to a railway power simulator. Simul. Model. Pract. Theory 55, 46–62 (2015)
  6. Cao, L., Juan, P., Zhang, Y.: Real-time deconvolution with GPU and Spark for big imaging data analysis. In: Wang, G., Zomaya, A., Perez, G.M., Li, K. (eds.) ICA3PP 2015. LNCS, vol. 9530, pp. 240–250. Springer, Cham (2015). doi: 10.1007/978-3-319-27137-8_19
  7. Feldkamp, L., Davis, L., Kress, J.: Practical cone-beam algorithm. JOSA A 1(6), 612–619 (1984)
  8. Grossman, M., Breternitz, M., Sarkar, V.: HadoopCL: MapReduce on distributed heterogeneous platforms through seamless integration of Hadoop and OpenCL. In: 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & Ph.D. Forum (IPDPSW), pp. 1918–1927. IEEE (2013)
  9. He, W., Cui, H., Lu, B., Zhao, J., Li, S., Ruan, G., Xue, J., Feng, X., Yang, W., Yan, Y.: Hadoop+: modeling and evaluating the heterogeneity for MapReduce applications in heterogeneous clusters. In: Proceedings of the 29th ACM International Conference on Supercomputing, pp. 143–153. ACM (2015)
  10. Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., Fasih, A.: PyCUDA and PyOpenCL: a scripting-based approach to GPU run-time code generation. Parallel Comput. 38(3), 157–174 (2012)
  11. Li, P., Luo, Y., Zhang, N., Cao, Y.: HeteroSpark: a heterogeneous CPU/GPU Spark platform for machine learning algorithms. In: 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 347–348. IEEE (2015)
  12. Mader, K.: Scaling Up Fast: Real-time Image Processing and Analytics Using Spark.
  13. Meng, B., Pratx, G., Xing, L.: Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment. Med. Phys. 38(12), 6603–6609 (2011)
  14. Okur, S., Radoi, C., Lin, Y.: Hadoop+Aparapi: making heterogeneous MapReduce programming easier
  15. Sweeney, C., Liu, L., Arietta, S., Lawrence, J.: HIPI: a Hadoop image processing interface for image-based MapReduce tasks. University of Virginia (2011)
  16. Wallskog Pappas, A.: Migration of legacy applications to the cloud - a review on methodology and tools for migration to the cloud. Bachelor thesis, University of Umeå (2014)
  17. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Association (2012)
  18. Zheng, H.X., Wu, J.M.: Accelerate K-means algorithm by using GPU in the Hadoop framework. In: Chen, Y., Balke, W.-T., Xu, J., Xu, W., Jin, P., Lin, X., Tang, T., Hwang, E. (eds.) WAIM 2014. LNCS, vol. 8597, pp. 177–186. Springer, Cham (2014). doi: 10.1007/978-3-319-11538-2_17

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Estefania Serrano (1)
  • Javier Garcia Blas (1)
  • Jesus Carretero (1)
  • Monica Abella (2, 3)
  1. Computer Architecture and Technology Area, Univ. Carlos III, Madrid, Spain
  2. Bioengineering and Aerospace Engineering Department, Univ. Carlos III, Madrid, Spain
  3. Instituto de Investigacion Sanitaria Gregorio Marañon (IiSGM), Madrid, Spain
