A distributed parallel training method of deep belief networks

Abstract

It is well known that the efficient training of deep neural networks plays a vital role in many successful applications. Training on a single computer, however, becomes impractical when models are large, especially when powerful computing resources such as high-performance clusters are available. In this paper, we present a distributed parallel computing framework for training deep belief networks (DBNs) that exploits the power of a high-performance cluster (i.e., a system consisting of many computers). Motivated by the greedy layer-wise learning algorithm of DBNs, the whole training process is divided layer by layer and distributed to different machines. At the same time, rough representations are exploited to parallelize the training process. Experiments on several large-scale real datasets show that the proposed algorithms significantly accelerate the training of DBNs while achieving better or competitive prediction accuracy compared with the original algorithm.
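
As a rough illustration of the layer-by-layer decomposition described above, the following sketch pretrains a DBN as a stack of binary restricted Boltzmann machines (RBMs) using one-step contrastive divergence (CD-1). Each layer is trained only on the representations produced by the layer below it, so every per-layer training job is a natural unit of work to dispatch to a different machine. This is a minimal conceptual sketch under these assumptions, not the framework proposed in the paper: the helper names (train_rbm, pretrain_dbn), layer sizes, and hyperparameters are illustrative, and neither the cluster scheduling nor the rough-representation parallelization is reproduced here.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    # Train one binary RBM with CD-1; returns weights and biases.
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)   # visible biases
    c = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(v0 @ W + c)                         # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
        v1 = sigmoid(h0 @ W.T + b)                        # reconstruction
        ph1 = sigmoid(v1 @ W + c)                         # negative phase
        n = v0.shape[0]
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    # Greedy layer-wise pretraining: layer k+1 only needs the hidden
    # representations of layer k, so each call to train_rbm is a unit of
    # work that could be dispatched to a separate machine.
    reps, params = data, []
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(reps, n_hidden)
        params.append((W, b, c))
        reps = sigmoid(reps @ W + c)   # propagate data to the next layer
    return params

# Toy usage: 200 random binary vectors of length 64, a 64-32-16 DBN.
X = (rng.random((200, 64)) < 0.5).astype(float)
params = pretrain_dbn(X, layer_sizes=[32, 16])
print([W.shape for W, _, _ in params])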

Notes

  1.

    Tying all weight matrices together means that the weight matrices of all layers in the DBN are constrained to be equal. Taking the DBN shown in Fig. 1 as an example, tying all weight matrices to \(\mathbf {W}^1\) means setting \(\mathbf {W}^2=\mathbf {W}^3=\mathbf {W}^1\).

  2.

    http://yann.lecun.com/exdb/mnist/.

  3.

    http://www.cs.nyu.edu/~ylclab/data/norb-v1.0-small/.

  4.

    http://qwone.com/~jason/20Newsgroups/.

  5.

    The authors are grateful to an anonymous reviewer for providing this insight.

References

  1. Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127

    MATH  Article  Google Scholar 

  2. Bengio Y, Lamblin P, Popovici D, Larochelle H et al (2007) Greedy layer-wise training of deep networks. Adv Neural Inf Process Syst 19:153–160

    Google Scholar 

  3. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828

    Article  Google Scholar 

  4. Bishop CM (1995) Training with noise is equivalent to Tikhonov regularization. Neural Comput 7(1):108–116

    Article  Google Scholar 

  5. Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: Tenth international workshop on frontiers in handwriting recognition, Suvisoft

  6. Chilimbi T, Suzue Y, Apacible J, Kalyanaraman K (2014) Project adam: building an efficient and scalable deep learning training system. In: Usenix conference on operating systems design and implementation, pp 571–582

  7. Coates A, Ng AY, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. J Mach Learn Res 15:215–223

    Google Scholar 

  8. Coates A, Huval B, Wang T, Wu DJ, Catanzaro B, Andrew N (2013) Deep learning with COTS HPC systems. In: International conference on machine learning, pp 1337–1345

  9. Cui H, Zhang H, Ganger GR, Gibbons PB, Xing EP (2016) GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In: Proceedings of the eleventh European conference on computer systems, ACM, p 4

  10. Dahl GE, Yu D, Deng L, Acero A (2012) Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process 20(1):30–42

    Article  Google Scholar 

  11. Dan CC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep big simple neural nets excel on handwritten digit recognition. Corr 22(12):3207–3220

    Google Scholar 

  12. Dean J, Corrado GS, Monga R, Chen K, Devin M, Le QV, Mao MZ, Ranzato A, Senior A, Tucker P (2012) Large scale distributed deep networks. In: Advances in neural information processing systems, pp 1232–1240

  13. Fischer A, Igel C (2014) Training restricted Boltzmann machines: an introduction. Pattern Recogn 47(1):25–39

    MATH  Article  Google Scholar 

  14. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. J Mach Learn Res 9:249–256

    Google Scholar 

  15. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  16. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800

    MATH  Article  Google Scholar 

  17. Hinton GE (2007) Learning multiple layers of representation. Trends Cognit Sci 11(10):428–434

    Article  Google Scholar 

  18. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

    MathSciNet  MATH  Article  Google Scholar 

  19. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554

    MathSciNet  MATH  Article  Google Scholar 

  20. Hinton G, Deng L, Yu D, Dahl GE, Mohamed A, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Saiainath TN (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process Mag 29(6):82–97

    Article  Google Scholar 

  21. Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst 6(2):107–116

    MathSciNet  MATH  Article  Google Scholar 

  22. Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies

  23. Khumoyun A, Cui Y, Hanku L (2016) Spark based distributed deep learning framework for big data applications. In: International conference on information science and communications technologies (ICISCT), IEEE, pp 1–5

  24. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105

  25. Larochelle H, Mandel M, Pascanu R, Bengio Y (2012) Learning algorithms for the classification restricted Boltzmann machine. J Mach Learn Res 13(3):643–669

    MathSciNet  MATH  Google Scholar 

  26. Le QV, Ngiam J, Coates A, Lahiri A, Prochnow B, Ng AY (2011) On optimization methods for deep learning. In: International conference on machine learning, pp 67–05

  27. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

    Article  Google Scholar 

  28. LeCun Y, Huang FJ, Bottou L (2004) Learning methods for generic object recognition with invariance to pose and lighting. IEEE Comput Soc Conf Comput Vis Pattern Recognit 2:97–104

    Google Scholar 

  29. Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A (2016) MLlib: machine learning in Apache Spark. J Mach Learn Res 17(1):1235–1241

    MathSciNet  MATH  Google Scholar 

  30. Mohamed A, Dahl G, Hinton G (2009) Deep belief networks for phone recognition. In: Nips workshop on deep learning for speech recognition and related applications, Vancouver, Canada, vol 1, p 39

  31. Moritz P, Nishihara R, Stoica I, Jordan MI (2015) Sparknet: training deep networks in Spark. arXiv preprint arXiv:1511.06051

  32. Oh KS, Jung K (2004) GPU implementation of neural networks. Pattern Recognit 37(6):1311–1314

    MATH  Article  Google Scholar 

  33. Ouyang W, Zeng X, Wang X, Qiu S, Luo P, Tian Y, Li H, Yang S, Wang Z, Li H, Wang K, Yan J, Loy CC, Tang X (2017) DeepID-Net: object detection with deformable part based convolutional neural networks. IEEE Trans Pattern Anal Mach Intell 39(7):1320–1334

    Article  Google Scholar 

  34. Poole B, Sohl-Dickstein J, Ganguli S (2014) Analyzing noise in autoencoders and deep networks. arXiv preprint arXiv:1406.1831

  35. Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149

    Article  Google Scholar 

  36. Rifai S, Glorot X, Bengio Y, Vincent P (2011) Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250

  37. Salakhutdinov R (2009) Learning deep generative models. PhD thesis, University of Toronto

  38. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117

    Article  Google Scholar 

  39. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  40. Smolensky P (1986) Information processing in dynamical systems: Foundations of harmony theory. vol 1. MIT Press, Cambridge, MA, USA, chap 6, pp 194–281

  41. Szegedy C, Liu W, Jia Y, Sermanet P (2015) Going deeper with convolutions. In: IEEE conference on computer vision and pattern recognition, pp 1–9

  42. Teh YW, Welling M, Osindero S, Hinton GE (2003) Energy-based models for sparse overcomplete representations. J Mach Learn Res 4(12):1235–1260

    MathSciNet  MATH  Google Scholar 

  43. Tesauro G (1992) Practical issues in temporal difference learning. Mach Learn 8(3–4):257–277

    MATH  Google Scholar 

  44. Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning, ACM, pp 1096–1103

  45. Wei J, He J, Chen K, Zhou Y, Tang Z (2017) Collaborative filtering and deep learning based recommendation system for cold start items. Expert Syst Appl 69:29–39

    Article  Google Scholar 

  46. Williams CK, Agakov FV (2002) An analysis of contrastive divergence learning in Gaussian Boltzmann machines. Institute for Adaptive and Neural Computation

  47. Yuille AL (2005) The convergence of contrastive divergences. In: Advances in neural information processing systems, pp 1593–1600

Download references

Acknowledgements

The authors are very grateful to the editor and reviewers for their valuable comments, which greatly helped to improve the paper. This work is supported by the National Basic Research Program of China (973 Program, No. 2013CB329404), the National Natural Science Foundation of China (Nos. 61572393, 11501049, 11131006, 11671317) and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

Author information

Corresponding author

Correspondence to Jiangshe Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by V. Loia.

Appendix A: Derivative of the log-likelihood

The derivative of the log-likelihood with respect to the model parameters \(\varvec{\theta }\) can be obtained from Eq. 2:

$$\begin{aligned} \begin{aligned} \frac{\partial \log P(\mathbf {v}_0;\varvec{\theta })}{\partial \varvec{\theta }}&=\frac{\partial \log Z_{\mathbf {v}_0}(\varvec{\theta })}{\partial \varvec{\theta }} - \frac{\partial \log Z(\varvec{\theta })}{\partial \varvec{\theta }},\\ Z_{\mathbf {v}_0}(\varvec{\theta })&= \sum _\mathbf {h}\exp (-E(\mathbf {v}_0,\mathbf {h})). \end{aligned} \end{aligned}$$
(8)

The first term in Eq. 8 is

$$\begin{aligned} \begin{aligned} \frac{\partial \log Z_{\mathbf {v}_0}(\varvec{\theta })}{\partial \varvec{\theta }}&= \frac{1}{Z_{\mathbf {v}_0}(\varvec{\theta })} \sum _\mathbf {h}\frac{\partial \exp (-E(\mathbf {v}_0,\mathbf {h}))}{\partial \varvec{\theta }}\\&= - \frac{1}{Z_{\mathbf {v}_0}(\varvec{\theta })} \sum _\mathbf {h}\left( \exp (-E(\mathbf {v}_0,\mathbf {h})) \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} \right) \\&= - \sum _\mathbf {h}\left( \frac{\exp (-E(\mathbf {v}_0,\mathbf {h}))}{Z_{\mathbf {v}_0}(\varvec{\theta })} \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} \right) \\&= - \sum _\mathbf {h}\left( P(\mathbf {h}|\mathbf {v}_0) \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }}\right) \\&= - \mathbb {E}_{P(\mathbf {h}|\mathbf {v}_0)} \left[ \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} \right] , \end{aligned} \end{aligned}$$
(9)

where \(P(\mathbf {h}|\mathbf {v}_0)\) is defined in Eq. 4. The second term in Eq. 8 is

$$\begin{aligned} \begin{aligned} \frac{\partial \log Z(\varvec{\theta })}{\partial \varvec{\theta }}&= \frac{1}{Z(\varvec{\theta })} \sum _{\mathbf {h},\mathbf {v}} \frac{\partial \exp (-E(\mathbf {v},\mathbf {h}))}{\partial \varvec{\theta }}\\&=- \frac{1}{Z(\varvec{\theta })} \sum _{\mathbf {h},\mathbf {v}} \left( \exp (-E(\mathbf {v},\mathbf {h})) \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right) \\&=- \sum _{\mathbf {h},\mathbf {v}} \left( \frac{\exp (-E(\mathbf {v},\mathbf {h}))}{Z(\varvec{\theta })} \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right) \\&=- \sum _{\mathbf {h},\mathbf {v}} \left( P(\mathbf {h},\mathbf {v}) \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right) \\&=- \mathbb {E}_{P(\mathbf {v},\mathbf {h})} \left[ \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right] . \end{aligned} \end{aligned}$$
(10)
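
Combining Eqs. 9 and 10 gives the well-known two-expectation form of the gradient. For the standard binary RBM energy \(E(\mathbf {v},\mathbf {h})=-\mathbf {v}^\top \mathbf {W}\mathbf {h}-\mathbf {b}^\top \mathbf {v}-\mathbf {c}^\top \mathbf {h}\) (assumed here for concreteness), the gradient with respect to the weights reduces to a difference between a data-dependent and a model-dependent correlation:

$$\begin{aligned} \frac{\partial \log P(\mathbf {v}_0;\varvec{\theta })}{\partial \varvec{\theta }}&= - \mathbb {E}_{P(\mathbf {h}|\mathbf {v}_0)} \left[ \frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} \right] + \mathbb {E}_{P(\mathbf {v},\mathbf {h})} \left[ \frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }} \right],\\ \frac{\partial \log P(\mathbf {v}_0;\varvec{\theta })}{\partial \mathbf {W}}&= \mathbb {E}_{P(\mathbf {h}|\mathbf {v}_0)} \left[ \mathbf {v}_0 \mathbf {h}^\top \right] - \mathbb {E}_{P(\mathbf {v},\mathbf {h})} \left[ \mathbf {v}\mathbf {h}^\top \right]. \end{aligned}$$

The first expectation is tractable because \(P(\mathbf {h}|\mathbf {v}_0)\) factorizes over the hidden units, whereas the expectation under the model distribution \(P(\mathbf {v},\mathbf {h})\) is intractable in general and is approximated in practice by a few steps of Gibbs sampling, as in contrastive divergence learning.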

Cite this article

Shi, G., Zhang, J., Zhang, C. et al. A distributed parallel training method of deep belief networks. Soft Comput 24, 13357–13368 (2020). https://doi.org/10.1007/s00500-020-04754-6

Keywords

  • Distributed computing
  • Model parallelism
  • Deep belief network
  • Restricted Boltzmann machine
  • Rough representation