## Abstract

Efficient training of deep neural networks is now widely recognized as vital to many successful applications. A single machine is often insufficient for this task, especially when models are large and high-performance computing resources are available. In this paper, we present a distributed parallel computing framework for training deep belief networks (DBNs) that harnesses the power of high-performance clusters (i.e., systems consisting of many computers). Motivated by the greedy layer-wise learning algorithm of DBNs, the whole training process is divided layer by layer and distributed across different machines. At the same time, rough representations are exploited to parallelize the training process. Experiments on several large-scale real datasets show that the proposed algorithms significantly accelerate the training of DBNs while achieving better or competitive prediction accuracy compared with the original algorithm.
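To make the greedy layer-wise scheme concrete, the following sketch trains a stack of RBMs one layer at a time with CD-1. This is a minimal single-machine illustration in NumPy, not the paper's distributed implementation, and the function names (`train_rbm`, `pretrain_dbn`) are hypothetical. The property the framework exploits is visible in the structure: each RBM needs only the representation produced by the layer below it, so successive layers can be assigned to different machines and trained in a pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=5):
    """Train one RBM layer with CD-1; return its parameters and the
    hidden representation it produces for the next layer."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_hid)                    # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_vis)                  # one Gibbs step
        p_h1 = sigmoid(p_v1 @ W + b_hid)                  # negative phase
        # CD-1 update: data-dependent term minus reconstruction term
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
        b_vis += lr * (v0 - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_hid, sigmoid(data @ W + b_hid)

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: each layer consumes only the
    output of the layer below, so layers can run on separate machines."""
    weights, rep = [], data
    for n_hidden in layer_sizes:
        W, b_hid, rep = train_rbm(rep, n_hidden)
        weights.append((W, b_hid))
    return weights

data = (rng.random((64, 20)) < 0.5).astype(float)
weights = pretrain_dbn(data, [16, 8])
```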


## Notes

- 1.
Tying all weight matrices together means that the weight matrix of every layer in the DBN is constrained to be equal. Taking the DBN shown in Fig. 1 as an example, tying all weight matrices to \(\mathbf {W}^1\) means setting \(\mathbf {W}^2=\mathbf {W}^3=\mathbf {W}^1\).

- 5.
The authors are grateful to an anonymous reviewer for providing this insight.


## Acknowledgements

The authors are very grateful to the editor and reviewers for their valuable comments, which greatly helped to improve the paper. This work is supported by the National Basic Research Program of China (973 Program, No. 2013CB329404), the National Natural Science Foundation of China (Nos. 61572393, 11501049, 11131006, 11671317), and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).


## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest.

### Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by V. Loia.

## Appendices

### A. Derivative of the log-likelihood

The derivative of the log-likelihood with respect to the model parameters \(\varvec{\theta }\) can be obtained from Eq. 2:

$$\frac{\partial \log P(\mathbf {v}_0)}{\partial \varvec{\theta }} = -\sum _{\mathbf {h}} P(\mathbf {h}|\mathbf {v}_0)\,\frac{\partial E(\mathbf {v}_0,\mathbf {h})}{\partial \varvec{\theta }} + \sum _{\mathbf {v},\mathbf {h}} P(\mathbf {v},\mathbf {h})\,\frac{\partial E(\mathbf {v},\mathbf {h})}{\partial \varvec{\theta }}. \qquad (8)$$

The first term in Eq. 8 is the expectation of the energy gradient under the conditional distribution \(P(\mathbf {h}|\mathbf {v}_0)\), which is defined in Eq. 4 and can be computed exactly; for a weight \(W_{ij}\) this term evaluates to \(P(h_j=1|\mathbf {v}_0)\,v_{0,i}\). The second term in Eq. 8 is the corresponding expectation under the model distribution \(P(\mathbf {v},\mathbf {h})\), which is intractable in general and is therefore approximated by Gibbs sampling, as in contrastive divergence.
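The two terms in Eq. 8 can be checked numerically on an RBM small enough to enumerate exactly. The NumPy sketch below is illustrative only, not code from the paper: the positive term is the expectation under \(P(\mathbf {h}|\mathbf {v}_0)\), and the negative term is the exact model expectation obtained by summing over all joint states, where contrastive divergence would instead use Gibbs sampling.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A binary RBM small enough to evaluate both terms of the gradient exactly.
rng = np.random.default_rng(1)
n_vis, n_hid = 3, 2
W = rng.standard_normal((n_vis, n_hid))
b = np.zeros(n_vis)  # visible biases
c = np.zeros(n_hid)  # hidden biases

def energy(v, h):
    return -(v @ W @ h + b @ v + c @ h)

v0 = np.array([1.0, 0.0, 1.0])

# First term of Eq. 8: expectation under P(h | v0); for the weights it
# reduces to P(h_j = 1 | v0) * v0_i.
positive = np.outer(v0, sigmoid(v0 @ W + c))

# Second term of Eq. 8: expectation under the model distribution P(v, h),
# obtained here by brute-force enumeration of all joint states.
states = [np.array(s, dtype=float)
          for s in itertools.product([0, 1], repeat=n_vis + n_hid)]
unnorm = np.array([np.exp(-energy(s[:n_vis], s[n_vis:])) for s in states])
probs = unnorm / unnorm.sum()
negative = sum(p * np.outer(s[:n_vis], s[n_vis:])
               for p, s in zip(probs, states))

grad_W = positive - negative  # gradient of log P(v0) with respect to W
```

Because the model expectation is exact here, `grad_W` agrees with a finite-difference estimate of the log-likelihood gradient, which is a useful sanity check when implementing the derivation above.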


## About this article

### Cite this article

Shi, G., Zhang, J., Zhang, C. *et al.* A distributed parallel training method of deep belief networks.
*Soft Comput* **24**, 13357–13368 (2020). https://doi.org/10.1007/s00500-020-04754-6


### Keywords

- Distributed computing
- Model parallelism
- Deep belief network
- Restricted Boltzmann machine
- Rough representation