Abstract
Inspired by the sparse responses of biological neural systems, this paper presents an approach that strengthens response sparsity in deep learning. First, an unsupervised sparse pre-training process is performed, from which a sparse deep network begins to take shape. To prevent all the connections of the network from being readjusted during the subsequent fine-tuning process, regularization terms that strengthen sparse responsiveness are added to the loss function of the fine-tuning process. More importantly, unified and concise residual formulae for updating the network are derived, which ensure that the backpropagation algorithm performs successfully. These residual formulae significantly improve existing sparse fine-tuning methods, such as that of the sparse autoencoder by Andrew Ng. In this way, the sparse structure obtained in pre-training can be maintained, and sparse abstract features of the data can be extracted effectively. Numerical experiments show that with this sparsity-strengthened learning method, the sparse deep neural network achieves the best classification performance among several classical classifiers; moreover, its sparse learning ability and time complexity are both better than those of traditional deep learning methods.
References
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
LeCun Y, Bengio Y, Hinton GE (2015) Deep learning. Nature 521(7553):436–444
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14(4):481–487
Morris G, Nevet A, Bergman H (2003) Anatomical funneling, sparse connectivity and redundancy reduction in the neural networks of the basal ganglia. J Physiol Paris 97(4–6):581–589
Ji N, Zhang J, Zhang C et al (2014) Enhancing performance of restricted Boltzmann machines via log-sum regularization. Knowl Based Syst 63:82–96
Banino A, Barry C et al (2018) Vector-based navigation using grid-like representations in artificial agents. Nature. https://doi.org/10.1038/s41586-018-0102-6
Zhang H, Wang S, Xu X et al (2018) Tree2Vector: learning a vectorial representation for tree-structured data. IEEE Trans Neural Netw Learn Syst 29:1–15
Zhang H, Wang S, Zhao M et al (2018) Locality reconstruction models for book representation. IEEE Trans Knowl Data Eng 30:873–1886
Barlow HB (1972) Single units and sensation: a neuron doctrine for perceptual psychology? Perception 1(4):371–394
Nair V, Hinton GE (2009) 3D object recognition with deep belief nets. In: International conference on neural information processing systems, pp 1339–1347
Lee H, Ekanadham C, Ng AY (2008) Sparse deep belief net model for visual area V2. Adv Neural Inf Process Syst 20:873–880
Lee H, Grosse R, Ranganath R et al (2011) Unsupervised learning of hierarchical representations with convolutional deep belief networks. Commun ACM 54(10):95–103
Ranzato MA, Poultney C, Chopra S, LeCun Y (2006) Efficient learning of sparse representations with an energy-based model. Adv Neural Inf Process Syst 19:1137–1144
Thom M, Palm G (2013) Sparse activity and sparse connectivity in supervised learning. J Mach Learn Res 14(1):1091–1143
Wan W, Mabu S, Shimada K et al (2009) Enhancing the generalization ability of neural networks through controlling the hidden layers. Appl Soft Comput 9(1):404–414
Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput 7(2):219–269
Williams PM (1995) Bayesian regularization and pruning using a laplace prior. Neural Comput 7(1):117–143
Weigend AS, Rumelhart DE, Huberman BA (1990) Generalization by weight elimination with application to forecasting. In: Advances in neural information processing systems, pp 875–882
Nowlan SJ, Hinton GE (1992) Simplifying neural networks by soft weight-sharing. Neural Comput 4(4):473–493
Zhang J, Ji N, Liu J et al (2015) Enhancing performance of the backpropagation algorithm via sparse response regularization. Neurocomputing 153:20–40
Ng A (2011) Sparse autoencoder. CS294A Lecture Notes for Stanford University
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Bengio Y, Lamblin P, Popovici D, Larochelle H (2006) Greedy layer-wise training of deep networks. In: Proceedings of the advances in neural information processing systems, vol 19, pp 153–160
Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14:1771–1800
Hinton GE (2010) A practical guide to training restricted Boltzmann machines. Momentum 9(1):599–619
Fischer A, Igel C (2014) Training restricted Boltzmann machines: an introduction. Pattern Recognit 47(1):25–39
Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
Xiao H, Rasul K, Vollgraf R (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747v1
van der Maaten L, Hinton GE (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Acknowledgements
This research was funded by NSFC Nos. 11471006 and 11101327, the Fundamental Research Funds for the Central Universities (No. xjj2017126), the Science and Technology Project of Xi’an (No. 201809164CX5JC6) and the HPC Platform of Xi’an Jiaotong University.
Ethics declarations
Conflict of interest
The authors declare that they have no financial or other relationships that might lead to a conflict of interest regarding the present article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: The derivation of the KL divergence derivatives for the parameters in Sect. 2.2
here \(\sigma ^{(q)}_{j}=\sigma (\sum ^{N_v}_{i=1}v^{(q)}_{i}W_{ij}+\beta _j)=\frac{1}{1+e^{-\sum ^{N_v}_{i=1}v^{(q)}_{i}W_{ij}-\beta _j}}\).
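As a small numerical illustration (not the authors' code), the KL sparsity penalty commonly used in sparse autoencoders and its derivative with respect to each mean hidden activation can be sketched in NumPy. Here `rho` is the target sparsity, `rho_hat` is the mean of the sigmoid activations \(\sigma ^{(q)}_{j}\) over samples, and the function names are hypothetical:

```python
import numpy as np

def kl_penalty(rho, rho_hat):
    """KL divergence between target sparsity rho and mean activations rho_hat."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def kl_grad(rho, rho_hat):
    """Derivative of the KL penalty with respect to each mean activation."""
    return -rho / rho_hat + (1 - rho) / (1 - rho_hat)

def sigma(x):
    """The sigmoid used for sigma_j^{(q)} above."""
    return 1.0 / (1.0 + np.exp(-x))

# Mean activation of each hidden unit over a small made-up batch of samples
pre_act = np.array([[0.2, -1.0], [1.5, 0.3]])  # pre-activations for 2 samples
rho_hat = sigma(pre_act).mean(axis=0)
print(kl_penalty(0.05, rho_hat), kl_grad(0.05, rho_hat))
```

The penalty is zero and its gradient vanishes exactly when every mean activation equals the target sparsity, which is the sanity check used below.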
Appendix 2: The derivation of updating formula for the traditional BP in Sect. 3.1
For the traditional BP, the total error of the network in the backpropagation process, i.e., the loss function, is \(J(W)=\frac{1}{2N}\sum ^{N}_{q=1}\sum ^{n_L}_{j=1}(a^{(L)}_{qj}-y_{qj})^2,\)
where N is the training sample size, \(y_{qj}\) is the target output of the j-th neuron in the output layer for the q-th sample, and \(a^{(L)}_{qj}\) is its actual output. For simplicity, we first give the parameter updating formula for one sample. Consider \(J(W)=\frac{1}{2}\sum ^{n_L}_{j=1}(a^{(L)}_{j}-y_j)^2\) as the error of the network for one sample. Let \(\eta _1\) be the learning rate and \(W^{(l)}_{ij}\) be the connection weight between the i-th node in the l-th layer and the j-th node in the \((l+1)\)-th layer (\(1\le i\le n_l+1\), \(1\le j\le n_{l+1}\)); then we have the following update formula for the network parameters: \(W^{(l)}_{ij}=W^{(l)}_{ij}-\eta _1\frac{\partial J(W)}{\partial W^{(l)}_{ij}}=W^{(l)}_{ij}-\eta _1 a^{(l)}_{i}\delta ^{(l+1)}_{j},\)
where \(\delta ^{(l+1)}_{j}=\frac{\partial J(W)}{\partial z^{(l+1)}_{j}}\) is the residual of the j-th node in the \((l+1)\)-th layer. For the L-th layer, i.e., the output layer, the residual of the j-th node is \(\delta ^{(L)}_{j}=\frac{\partial J(W)}{\partial z^{(L)}_{j}}=(a^{(L)}_{j}-y_j)f^{'}(z^{(L)}_{j}).\)
Suppose \(a^{(L)}=(a^{(L)}_{1},\ldots ,a^{(L)}_{n_L})^{\mathrm{T}}\), \(y=(y_{1},\ldots ,y_{n_L})^{\mathrm{T}}\) and \(f^{'}(z^{(L)})=(f^{'}(z^{(L)}_{1}),\ldots ,f^{'}(z^{(L)}_{n_L}))^{\mathrm{T}}\); thus the residual vector of the L-th layer is \(\delta ^{(L)}=(a^{(L)}-y)\,{_{\cdot }*}\,f^{'}(z^{(L)}),\)
where \(_{\cdot }*\) is the Hadamard product operator, defined as the element-wise product of two vectors or matrices of the same size.
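For concreteness, the Hadamard product simply multiplies corresponding entries; a minimal NumPy check (illustrative values only):

```python
import numpy as np

# Element-wise (Hadamard) product of two vectors of the same length
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a * b)  # [ 4. 10. 18.]
```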
The residual of the j-th node for the \((L-1)\)-th layer is \(\delta ^{(L-1)}_{j}=(\sum ^{n_{L}}_{k=1}W^{(L-1)}_{jk}\delta ^{(L)}_{k})f^{'}(z^{(L-1)}_{j})\).
More generally, the residual of the j-th node for the l-th layer \((l=L-1,\ldots ,2,1)\) is \(\delta ^{(l)}_{j}=(\sum ^{n_{l+1}}_{k=1}W^{(l)}_{jk}\delta ^{(l+1)}_{k})f^{'}(z^{(l)}_{j})\); thus the vector form of the residual for the l-th layer is \(\delta ^{(l)}=({\bar{W}}^{(l)}\cdot \delta ^{(l+1)})\,{_{\cdot }*}\,f^{'}(z^{(l)}),\)
where \(\cdot\) is the matrix product and \({\bar{W}}^{(l)}\) is the first \(n_l\) rows of \(W^{(l)}\). Let \(\Delta {W^{(l)}}\_{J}=(a^{(l)})^{\mathrm{T}}\cdot \delta ^{(l+1)},\) in which \(\delta ^{(l+1)}\) is defined by (20)–(21) (\(l=L-1,\ldots ,2,1\)).
For the N-sample case, by (22), we have \(\Delta {W^{(l)}_q}\_{J}=(a^{(l)})^{\mathrm{T}}_q \cdot \delta ^{(l+1)}_q\) for each sample \(q=1,2,\ldots ,N\). Thus, the update formula for the network parameters in matrix form is \(W^{(l)}=W^{(l)}-\frac{\eta _1}{N}\sum ^{N}_{q=1}\Delta {W^{(l)}_q}\_{J}.\)
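The matrix-form residual recursion and weight update above can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions (squared-error loss, sigmoid activations, biases omitted for brevity), not the authors' code; the function names and layer sizes are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, x):
    """Forward pass; returns the activations a^{(l)} of every layer."""
    a_list = [x]
    for Wl in W:
        a_list.append(sigmoid(a_list[-1] @ Wl))
    return a_list

def backprop_step(W, x, y, eta):
    """One update of every W^{(l)} following the matrix formulas above."""
    a = forward(W, x)
    N = x.shape[0]
    # Output-layer residual: delta^{(L)} = (a^{(L)} - y) .* f'(z^{(L)}),
    # where f'(z) = a (1 - a) for the sigmoid.
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])
    for l in reversed(range(len(W))):
        # Delta W^{(l)}_J = (a^{(l)})^T . delta^{(l+1)}, averaged over N samples
        grad = a[l].T @ delta / N
        if l > 0:
            # delta^{(l)} = (W^{(l)} . delta^{(l+1)}) .* f'(z^{(l)})
            delta = (delta @ W[l].T) * a[l] * (1 - a[l])
        W[l] = W[l] - eta * grad
    return W
```

Repeated calls on a small synthetic set should drive the squared error down, which is a quick sanity check of the residual formulae.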
Appendix 3: AR results of SRS-DNN with different parameter sets
Table 6 lists different sparse parameter sets for the KL and \(L_1\) sparsity penalty terms in the RBM and in BP, i.e., \(\lambda _2\), \(\lambda _3\), \(\tau _1\) and \(\tau _2\), together with the corresponding classification accuracy rates. Based on these results, in this paper we set \(\lambda _2\), \(\lambda _3\), \(\tau _1\) and \(\tau _2\) to 0.005, 0.0001, 0.0001 and 0.0002, respectively.
About this article
Cite this article
Qiao, C., Gao, B. & Shi, Y. SRS-DNN: a deep neural network with strengthening response sparsity. Neural Comput & Applic 32, 8127–8142 (2020). https://doi.org/10.1007/s00521-019-04309-3