Distributed Training of Large-Scale Deep Learning Models in Commodity Hardware

  • Conference paper
  • First Online:
  • Proceedings: Inventive Systems and Control

Abstract

Running deep learning models on a single computer is often resource-intensive and time-consuming. Deep learning models require high-performance GPUs to train on big data, and even with such GPUs, training on large datasets can take days or even months. This paper presents an affordable solution for training models within a reasonable time. We propose a system for distributing large-scale deep learning models across commodity hardware. Our approach builds distributed computing clusters from open-source software alone, delivering performance comparable to high-performance computing (HPC) clusters even in the absence of GPUs. Hadoop clusters are created by connecting servers over an SSH network, which interconnects the machines and enables continuous data transfer between them. We then set up Apache Spark on the Hadoop cluster and run BigDL on top of Spark; BigDL is a high-performance distributed deep learning library for Spark that scales to massive datasets, lets us run large deep learning models locally from a Jupyter Notebook, and simplifies cluster computing and resource management. This environment delivers computation up to 70% faster than single-machine execution, with the option to scale for model training, data throughput, hyperparameter search, and resource utilization.
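To make the training pipeline concrete, the sketch below shows how a small convolutional network could be trained with BigDL's Python API on a running Spark/Hadoop cluster of the kind described above. It is a minimal illustration rather than the authors' actual code: the module paths follow the classic BigDL 0.x Python API (they differ in newer releases such as bigdl-dllib), and the synthetic RDD of random samples stands in for real training data read from HDFS.

    # Minimal sketch (not the authors' code): train a LeNet-style CNN with
    # BigDL on an existing Spark/Hadoop cluster. Module paths follow the
    # classic BigDL 0.x Python API and may differ in newer releases such as
    # bigdl-dllib. The random RDD below is a stand-in for data read from HDFS.
    import numpy as np
    from pyspark import SparkContext
    from bigdl.util.common import create_spark_conf, init_engine, Sample
    from bigdl.nn.layer import (Sequential, Reshape, SpatialConvolution,
                                SpatialMaxPooling, Tanh, Linear, LogSoftMax)
    from bigdl.nn.criterion import ClassNLLCriterion
    from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

    sc = SparkContext(appName="bigdl-commodity-cluster",
                      conf=create_spark_conf())
    init_engine()  # initialise the BigDL engine on the driver and executors

    # Synthetic 28x28 grayscale samples; ClassNLLCriterion uses 1-based labels.
    train_rdd = sc.parallelize(range(2048)).map(
        lambda _: Sample.from_ndarray(
            np.random.rand(28, 28).astype("float32"),
            np.array([float(np.random.randint(1, 11))])))

    model = Sequential()
    model.add(Reshape([1, 28, 28]))
    model.add(SpatialConvolution(1, 6, 5, 5))   # 28x28 -> 24x24
    model.add(Tanh())
    model.add(SpatialMaxPooling(2, 2, 2, 2))    # 24x24 -> 12x12
    model.add(SpatialConvolution(6, 12, 5, 5))  # 12x12 -> 8x8
    model.add(Tanh())
    model.add(SpatialMaxPooling(2, 2, 2, 2))    # 8x8 -> 4x4
    model.add(Reshape([12 * 4 * 4]))
    model.add(Linear(12 * 4 * 4, 100))
    model.add(Tanh())
    model.add(Linear(100, 10))
    model.add(LogSoftMax())

    # Distributed synchronous mini-batch SGD across the Spark executors;
    # batch_size must be divisible by the total number of executor cores.
    optimizer = Optimizer(model=model,
                          training_rdd=train_rdd,
                          criterion=ClassNLLCriterion(),
                          optim_method=SGD(learningrate=0.01),
                          end_trigger=MaxEpoch(5),
                          batch_size=256)
    trained_model = optimizer.optimize()

In practice, such a script is launched through spark-submit (or a Jupyter kernel configured the same way) with the BigDL jar and Python package distributed to the workers, so that every node of the SSH-connected Hadoop cluster participates in the distributed training.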


Author information

Correspondence to Md. Motaharul Islam.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ahmad, J., Navin, T.E., Al Awsaf, F., Arafat, M.Y., Hossain, M.S., Islam, M.M. (2023). Distributed Training of Large-Scale Deep Learning Models in Commodity Hardware. In: Suma, V., Lorenz, P., Baig, Z. (eds) Inventive Systems and Control. Lecture Notes in Networks and Systems, vol 672. Springer, Singapore. https://doi.org/10.1007/978-981-99-1624-5_52
