Abstract
Running deep learning models on a single computer is often resource-intensive and time-consuming. Training on big data typically requires high-performance GPUs, and even with them, training on large datasets can take days or even months. This paper presents an affordable way to train such models within a reasonable time. We propose a system well suited to distributing large-scale deep learning models across commodity hardware. Our approach builds distributed computing clusters using only open-source software, and these clusters can deliver performance comparable to High-Performance Computing clusters even without GPUs. We create a Hadoop cluster by connecting servers over SSH, which links the machines and enables continuous data transfer between them. We then set up Apache Spark on the Hadoop cluster and run BigDL on top of Spark. BigDL is a high-performance Spark library that helps us scale to massive datasets. It lets us run large deep learning models from a local Jupyter Notebook and simplifies cluster computing and resource management. This environment delivers computation up to 70% faster than execution on a single machine, and it can be scaled further for model training, data throughput, hyperparameter search, and resource utilization.
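The abstract itself contains no code. To illustrate the data-parallel idea behind Spark-based training, the sketch below simulates synchronous gradient averaging in plain Python. It does not use Spark, Hadoop, or BigDL, and it is not the authors' implementation. The function names (`partition`, `local_gradient`, `train`), the toy linear model, and all hyperparameters are assumptions made for this example.

```python
import random


def partition(data, num_workers):
    """Split the dataset into shards, one per simulated worker (round-robin)."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    return [s for s in shards if s]  # drop empty shards for tiny datasets


def local_gradient(shard, w, b):
    """Gradient of mean squared error for y = w*x + b on one worker's shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb, n


def train(data, num_workers=4, lr=0.1, epochs=500):
    """Synchronous data-parallel gradient descent over simulated workers."""
    shards = partition(data, num_workers)
    total = len(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # On a real cluster each shard's gradient would be computed in parallel.
        grads = [local_gradient(s, w, b) for s in shards]
        # Averaging weighted by shard size reproduces the full-batch gradient.
        gw = sum(g[0] * g[2] for g in grads) / total
        gb = sum(g[1] * g[2] for g in grads) / total
        w -= lr * gw
        b -= lr * gb
    return w, b


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical noise-free data drawn from y = 3x + 1.
    data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(200))]
    print(train(data, num_workers=4))
```

Because the shard gradients are weighted by shard size before averaging, the update matches single-machine full-batch gradient descent up to floating-point rounding. The worker count therefore affects only how the work is divided, not the result.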
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ahmad, J., Navin, T.E., Al Awsaf, F., Arafat, M.Y., Hossain, M.S., Islam, M.M. (2023). Distributed Training of Large-Scale Deep Learning Models in Commodity Hardware. In: Suma, V., Lorenz, P., Baig, Z. (eds) Inventive Systems and Control. Lecture Notes in Networks and Systems, vol 672. Springer, Singapore. https://doi.org/10.1007/978-981-99-1624-5_52
DOI: https://doi.org/10.1007/978-981-99-1624-5_52
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1623-8
Online ISBN: 978-981-99-1624-5
eBook Packages: Intelligent Technologies and Robotics (R0)