Abstract
This paper proposes a new neural network, called the "sparse semi-autoencoder," to overcome the vanishing information problem inherent in multi-layered neural networks. The vanishing information problem is the natural tendency of multi-layered networks to lose information contained in input patterns and in training errors, including the reduction in information caused by constraints such as sparse regularization. To overcome this problem, two methods are proposed: enhancement of input information by semi-autoencoders, and separation of error minimization from sparse regularization by soft pruning. First, we enhance the information in input patterns to prevent it from decreasing as it passes through multiple layers. This enhancement is realized by a new architecture called the "semi-autoencoder," in which the input patterns are fed to all hidden layers so that the original input information is retained as much as possible. Second, the information reduction performed by sparse regularization is separated from the information acquisition performed by error minimization. Sparse regularization is commonly applied in training autoencoders, and it naturally decreases information by restricting the information capacity. This penalty-based reduction tends to eliminate even necessary and important information, because many parameters must be tuned to harmonize the penalties with error minimization. We therefore introduce a new method of soft pruning, in which the information acquisition of error minimization and the information reduction of sparse regularization are applied separately, without the drastic changes in connection weights caused by conventional pruning methods. Together, information enhancement and soft pruning aim to retain the original information, and in particular the necessary and important information, by enabling a flexible compromise between information acquisition and reduction. The method was applied to an artificial data set, an eye-tracking data set, and a rebel forces participation data set. On the artificial data set, we demonstrated that soft pruning increased the selectivity of connection weights, yielding sparse weights that could be naturally interpreted. On the real eye-tracking data set, the present method outperformed conventional methods, including ensemble methods, in terms of generalization, and the final results could be interpreted in accordance with the conventional eye-tracking theory of the choice process. Finally, the rebel data set showed the possibility of detailed interpretation of the relations between inputs and outputs, although it also revealed a limitation: the selectivity obtained by soft pruning could not be increased.
Notes
The eye-tracking data set was created by the co-author, Haruhiko Takeuchi, and can be obtained from his personal Web page: http://www2.accsnet.ne.jp/~amx08582/.
The rebel forces participation data set is discussed thoroughly in A. Oyefusi's paper, available at: http://journals.sagepub.com/doi/abs/10.1177/0022343308091360.
References
Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertainty Fuzziness Knowledge Based Syst 6(02):107–116
Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: AISTATS, vol 15, p 275
Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
Bengio Y, Lamblin P, Popovici D, Larochelle H, et al (2007) Greedy layer-wise training of deep networks. Adv Neural Inf Proces Syst 19:153–160
Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review 5(1):3–55
Shannon CE (1951) Prediction and entropy of printed English. Bell Syst Tech J 30(1):50–64
Abramson N (1963) Information theory and coding. McGraw-Hill, New York
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In: European conference on computer vision. Springer, pp 630–645
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, pp 4278–4284
Hanson SJ, Pratt LY (1989) Comparing biases for minimal network construction with back-propagation. In: Advances in neural information processing systems, pp 177–185
LeCun Y, Denker JS, Solla SA (1990) Optimal brain damage. In: Advances in neural information processing systems, pp 598–605
Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl-Based Syst 8(6):373–389
Benítez JM, Castro JL, Requena I (1997) Are artificial neural networks black boxes? IEEE Trans Neural Netw 8(5):1156–1164
Srinivas S, Babu RV (2015) Data-free parameter pruning for deep neural networks. arXiv:1507.06149
Olshausen BA, Field DJ (1997) Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Res 37(23):3311–3325
Lee H, Ekanadham C, Ng AY (2008) Sparse deep belief net model for visual area V2. In: Advances in neural information processing systems, pp 873–880
Nair V, Hinton GE (2009) 3D object recognition with deep belief nets. In: Advances in neural information processing systems, pp 1339–1347
Ng A (2011) Sparse autoencoder. CS294A lecture notes, vol 72
Zhang X, Dou H, Ju T, Xu J, Zhang S (2016) Fusing heterogeneous features from stacked sparse autoencoder for histopathological image analysis. IEEE Journal of Biomedical and Health Informatics 20(5):1377–1383
Xu J, Xiang L, Hang R, Wu J (2014) Stacked sparse autoencoder (SSAE) based framework for nuclei patch classification on breast cancer histopathology. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI). IEEE, pp 999–1002
Tao C, Pan H, Li Y, Zou Z (2015) Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geosci Remote Sens Lett 12(12):2438–2442
Deng J, Zhang Z, Marchi E, Schuller B (2013) Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: 2013 Humaine association conference on affective computing and intelligent interaction. IEEE, pp 511–516
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Hinton GE (2012) A practical guide to training restricted Boltzmann machines. In: Neural networks: tricks of the trade. Springer, pp 599–619
Ciresan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural Comput 22(12):3207–3220
Makhzani A, Frey B (2013) K-sparse autoencoders. arXiv:1312.5663
Cheng Y, Wang D, Zhou P, Zhang T (2017) A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282
Rudy J, Ding W, Im DJ, Taylor GW (2014) Neural network regularization via robust weight factorization. arXiv:1412.6630
Bach F, Jenatton R, Mairal J, Obozinski G (2012) Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4(1):1–106
Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. ACM, pp 1096–1103
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(Dec):3371–3408
Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Castellano G, Fanelli AM (1999) Variable selection using neural-network models. Neurocomputing 31:1–13
Oliveira GG, Pedrollo OC, Castro NM (2015) Simplifying artificial neural network models of river basin behaviour by an automated procedure for input variable selection. Eng Appl Artif Intell 40:47–61
Olden JD, Joy MK, Death RG (2004) An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol Model 178(3):389–397
Papadokonstantakis S, Lygeros A, Jacobsson SP (2005) Comparison of recent methods for inference of variable influence in neural networks. Neural Netw 19:500–513
Gómez-Carracedo M, Andrade J, Carrera G, Aires-de Sousa J, Carlosena A, Prada D (2010) Combining Kohonen neural networks and variable selection by classification trees to cluster road soil samples. Chemometr Intell Lab Syst 102(1):20–34
May R, Dandy G, Maier H (2011) Review of input variable selection methods for artificial neural networks. In: Suzuki K (ed) Artificial neural networks-methodological advances and biomedical applications, InTech, pp 19–44
Oyefusi A (2008) Oil and the probability of rebel participation among youths in the Niger Delta of Nigeria. J Peace Res 45(4):539–555
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Friedman J, Hastie T, Tibshirani R, et al (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28(2):337–407
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
Schapire RE, Freund Y, Bartlett P, Lee WS, et al (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat 26(5):1651–1686
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2008) RUSBoost: improving classification performance when training data is skewed. In: 19th International conference on pattern recognition, 2008. ICPR 2008. IEEE, pp 1–4
Warmuth MK, Liao J, Rätsch G (2006) Totally corrective boosting algorithms that maximize the margin. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 1001–1008
Dietterich TG (2000) Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer, pp 1–15
Riedl R, Brandstätter E, Roithmayr F (2008) Identifying decision strategies: a process-and outcome-based classification method. Behav Res Methods 40(3):795–807
Glaholt MG, Reingold EM (2011) Eye movement monitoring as a process tracing methodology in decision making research. Journal of Neuroscience, Psychology, and Economics 4(2):125–146
Gere A, Danner L, de Antoni N, Kovács S, Dürrschmid K, Sipos L (2016) Visual attention accompanying food decision process: an alternative approach to choose the best models. Food Qual Prefer 51:1–7
Russo JE, Leclerc F (1994) An eye-fixation analysis of choice processes for consumer nondurables. J Consum Res 21(2):274–290
Acknowledgments
We are very grateful to the editor and two reviewers for valuable comments on the paper. This research was supported by the Japan Society for the Promotion of Science under Grant-in-Aid for Scientific Research 16K00339.
Appendix: Soft pruning by layer-wise selectivity increase
We briefly explain here how connection weights are modified to acquire strong selectivity. The selectivity of connection weights is realized by multiplying the weights by their normalized importance at fixed intervals over N learning steps, as shown in Fig. 11. In the first learning step (Fig. 11a), ordinary learning is performed for T epochs. From the second step on, the normalized importance of the connection weights is computed, and the weights are multiplied by their importance (Fig. 11b). The effect of the importance values on the weights naturally diminishes as the number of epochs increases; thus, whenever the number of epochs exceeds T, the normalized importance is computed and multiplied again, and so on. The value of T was fixed to 100 in all the experiments, because we repeated the learning step at most ten times, amounting to a total of 1,000 epochs, which corresponds to the default number of learning epochs in the MATLAB neural network package. As mentioned above, we tried to use the default parameter values as much as possible for easy reproduction of the present results.
We now present how the selectivity is controlled step by step; the superscript s for input patterns is omitted for simplicity. In the first epoch of the first step, denoted by the superscript (1, 1), the output from the j-th hidden neuron is computed.
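A standard form, with sigmoid activation $f$, inputs $x_i$, and connection weights $w_{ji}^{(1,1)}$ from the $i$-th input to the $j$-th hidden neuron (this specific notation is our assumption), is

$$
v_j^{(1,1)} = f\!\left( \sum_{i=1}^{L} w_{ji}^{(1,1)} x_i + b_j^{(1,1)} \right),
$$

where $L$ is the number of input units and $b_j^{(1,1)}$ is the bias.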
Then, the output of the output neuron (the reconstructed input) is computed.
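Assuming decoding weights $w_{ij}^{\prime(1,1)}$ and $M$ hidden neurons (again our notation), a standard form is

$$
\hat{x}_i^{(1,1)} = f\!\left( \sum_{j=1}^{M} w_{ij}^{\prime(1,1)} v_j^{(1,1)} + b_i^{\prime(1,1)} \right).
$$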
The error is then computed.
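A conventional choice is the squared reconstruction error over the $L$ input units (our assumed form):

$$
E^{(1,1)} = \frac{1}{2} \sum_{i=1}^{L} \left( x_i - \hat{x}_i^{(1,1)} \right)^2 .
$$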
Then, the number of epochs is increased up to the T-th epoch, denoted by (T, 1), in the first step. For the final T-th epoch, the importance in the first step is computed.
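One natural definition, consistent with the weight-multiplication scheme described above (the specific form is our assumption), is the absolute magnitude of each connection weight:

$$
u_{ji}^{(T,1)} = \left| w_{ji}^{(T,1)} \right| .
$$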
The normalized importance is then obtained.
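Under the same assumption, dividing by the maximum importance so that the values lie in $[0, 1]$:

$$
\phi_{ji}^{(T,1)} = \frac{u_{ji}^{(T,1)}}{\max_{j',i'} u_{j'i'}^{(T,1)}} .
$$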
Then, the importance in the second step in Fig. 11b is used to modify the connection weights of the first step in Fig. 11a.
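A multiplicative update consistent with this description (again in our notation) is

$$
w_{ji}^{(1,2)} = \phi_{ji}^{(T,1)} \, w_{ji}^{(T,1)} ,
$$

where the superscript $(1, 2)$ denotes the first epoch of the second step.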
In this way, the connection weights are gradually modified to reflect their importance. As can be seen in Fig. 11a–c, the number of strong connection weights diminishes step by step. Finally, in the final T-th epoch of the N-th step in Fig. 11c, the number of strong connection weights reaches its minimum.
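To make the procedure concrete, the following is a minimal sketch of the soft-pruning loop, not the authors' implementation: it assumes a single-hidden-layer sigmoid autoencoder trained by plain batch gradient descent, with absolute weight magnitude as the importance measure; all function and variable names are ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def soft_prune_train(X, n_hidden, n_steps=10, T=100, lr=0.01, seed=0):
    """Sketch of soft pruning by layer-wise selectivity increase.

    The autoencoder is trained for T epochs per step; at each step
    boundary the encoder weights are multiplied by their normalized
    importance, so weak connections shrink gradually instead of being
    pruned outright.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_hidden, n_in))   # encoder weights w_ji
    V = rng.normal(0.0, 0.1, (n_in, n_hidden))   # decoder weights w'_ij

    for step in range(n_steps):                  # N learning steps
        for epoch in range(T):                   # ordinary error minimization
            H = sigmoid(X @ W.T)                 # hidden outputs v_j
            X_hat = sigmoid(H @ V.T)             # reconstructed inputs
            d_out = (X_hat - X) * X_hat * (1.0 - X_hat)
            d_hid = (d_out @ V) * H * (1.0 - H)
            V -= lr * d_out.T @ H                # gradient of squared error
            W -= lr * d_hid.T @ X
        importance = np.abs(W)                   # importance u = |w|
        W = W * (importance / importance.max())  # soft pruning: w <- phi * w
    return W, V

# Usage: weights, _ = soft_prune_train(np.random.rand(200, 10), n_hidden=5)
```

Because the multiplication is repeated at every step boundary, weights whose normalized importance stays below 1 decay toward zero over the N steps, which is what produces the increasing selectivity without the abrupt weight removal of conventional pruning.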
Cite this article
Kamimura, R., Takeuchi, H. Sparse semi-autoencoders to solve the vanishing information problem in multi-layered neural networks. Appl Intell 49, 2522–2545 (2019). https://doi.org/10.1007/s10489-018-1393-x