Optimum Parallelism in Spark Framework on Hadoop YARN for Maximum Cluster Resource Utilization
Spark is a widely used distributed computing framework for in-memory parallel processing. It splits jobs into tasks and runs them on executors deployed on the nodes of a cluster. Each executor is a dedicated JVM with its own allocation of CPU cores and memory, and one or more executors are deployed on each node depending on resource availability. The number of tasks depends on the number of partitions of the input data. Tasks run as independent threads inside an executor, and, depending on the number of CPU cores allocated to the executor, each task is allocated one or more cores. The performance advantage of distributed computing on Spark depends on the parallelism configured at three levels: node level, executor level, and task level. The parallelism at each of these levels should be configured to fully utilize the available computing resources. This paper recommends an optimum parallelism configuration for Apache Spark deployed on a Hadoop YARN cluster. The recommendations are based on experiments conducted to evaluate how the performance of Spark applications depends on the parallelism at each of these levels. For the evaluation, a CPU-intensive job and an I/O-intensive job are used, and performance is measured while varying the parallelism at each of the three levels. The results presented in this paper help Spark users select the parallelism at each level that maximizes the performance of Spark jobs through maximum resource utilization.
Keywords: Distributed computing, Apache Spark, Hadoop YARN, SparkBench, Spark configuration, Multi-level parallelism, Resource optimization
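The relationship between the three parallelism levels can be illustrated with a small calculation. The following is a minimal sketch, not the configuration recommended in this paper: it estimates how many executors fit on a node and how many tasks can run concurrently for a given choice of `spark.executor.cores`, `spark.executor.memory`, and `spark.task.cpus`. The names `NodeResources`, `executor_container_mb`, and `task_slots` are hypothetical helpers introduced for illustration. The sketch assumes Spark's default executor memory overhead of max(384 MB, 10% of executor memory) and ignores YARN's rounding of container sizes and the resources taken by the application master.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResources:
    """Resources that YARN may allocate on one worker node (hypothetical helper)."""
    cores: int
    memory_mb: int


def executor_container_mb(executor_memory_mb: int) -> int:
    """Executor heap plus the assumed default overhead: max(384 MB, 10% of heap)."""
    return executor_memory_mb + max(384, executor_memory_mb // 10)


def task_slots(node: NodeResources, num_nodes: int, executor_cores: int,
               executor_memory_mb: int, task_cpus: int = 1) -> dict:
    """Estimate parallelism at node, executor, and cluster level.

    executor_cores, executor_memory_mb, and task_cpus correspond to
    spark.executor.cores, spark.executor.memory, and spark.task.cpus.
    """
    container_mb = executor_container_mb(executor_memory_mb)
    # Node level: executors per node are limited by both cores and memory.
    executors_per_node = min(node.cores // executor_cores,
                             node.memory_mb // container_mb)
    # Executor level: concurrent tasks (threads) per executor.
    tasks_per_executor = executor_cores // task_cpus
    return {
        "executors_per_node": executors_per_node,
        "tasks_per_executor": tasks_per_executor,
        "cluster_task_slots": num_nodes * executors_per_node * tasks_per_executor,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 4 nodes with 16 cores and 64 GB each.
    node = NodeResources(cores=16, memory_mb=65536)
    print(task_slots(node, num_nodes=4, executor_cores=4, executor_memory_mb=8192))
```

With these illustrative numbers, cores are the limiting resource on each node. Raising `executor_memory_mb` far enough makes memory the limit instead, which is the kind of trade-off a parallelism configuration has to balance.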