Architecture for the Execution of Tasks in Apache Spark in Heterogeneous Environments
The limitations of current computing platforms and the ease of migration to the Cloud Computing paradigm have led to the migration of scientific applications to task-based distributed computing frameworks. However, many of these applications have already been optimized for execution on specific hardware accelerators such as GPUs. In this work, we present an architecture design that aims to facilitate the execution of traditional HPC-based applications in Big Data environments. We show that the larger memory capacity, automatic task partitioning, and higher computational power lead to convergence toward a highly distributed new execution model. Moreover, we present a study of the viability of our proposal through the use of GPUs inside the Apache Spark infrastructure. We evaluate the proposed architecture with a real medical imaging application. The evaluation results demonstrate that our approach obtains execution times competitive with those of the original application.
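The general pattern the abstract describes, running hardware-optimized kernels over automatically partitioned data inside a Spark-style framework, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`gpu_reconstruct`, `map_partitions`) are assumptions, and the GPU call is stubbed with a CPU fallback in place of a real CUDA invocation.

```python
# Sketch of the paper's general pattern: a Spark-style mapPartitions over
# CT projection data, where each partition would be handed to a
# GPU-optimized kernel. All names are illustrative assumptions.

def gpu_reconstruct(projections):
    # Placeholder for a GPU kernel invocation (e.g. via PyCUDA);
    # here each projection row is simply summed on the CPU.
    return [sum(row) for row in projections]

def map_partitions(data, n_parts, fn):
    # Minimal stand-in for Spark's RDD.mapPartitions: split the data set
    # into n_parts chunks and apply fn to each chunk independently.
    size = (len(data) + n_parts - 1) // n_parts
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    return [fn(p) for p in parts]

projections = [[1, 2], [3, 4], [5, 6], [7, 8]]
result = map_partitions(projections, 2, gpu_reconstruct)
print(result)  # [[3, 7], [11, 15]]
```

In a real deployment, `map_partitions` would be Spark's `RDD.mapPartitions`, so each partition's kernel runs on whichever executor, CPU-only or GPU-equipped, the scheduler assigns it to.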
Keywords: Computed Tomography (CT), GPU scheduling, MapReduce
This work has been partially supported by the Spanish MINISTERIO DE ECONOMÍA Y COMPETITIVIDAD under the project grant TIN2016-79637-P TOWARDS UNIFICATION OF HPC AND BIG DATA PARADIGMS, and by the NECRA project RTC-2014-3028-1. We also thank NVidia for providing the Tesla K40 device with which we performed the experiments.