# Sparse semi-autoencoders to solve the vanishing information problem in multi-layered neural networks


## Abstract

The present paper proposes a new neural network, the “sparse semi-autoencoder”, to overcome the vanishing information problem inherent in multi-layered neural networks. The vanishing information problem is the natural tendency of multi-layered networks to lose information contained in input patterns and in training errors, including the reduction of information caused by constraints such as sparse regularization. To overcome this problem, two methods are proposed: input information enhancement by semi-autoencoders, and the separation of error minimization from sparse regularization by soft pruning. First, we enhance the information in input patterns to prevent it from decreasing as it passes through multiple layers. This enhancement is realized in a new architecture called the “semi-autoencoder”, in which the input patterns are fed to all hidden layers so that the original input information is preserved as much as possible. Second, the information reduction caused by sparse regularization is separated from the process of information acquisition through error minimization. Sparse regularization is commonly applied when training autoencoders, and it naturally reduces information by restricting information capacity. Because many parameters must be tuned to harmonize the penalties with error minimization, this reduction tends to eliminate even necessary and important information. We therefore introduce soft pruning, in which the information acquisition of error minimization and the information reduction of sparse regularization are applied separately, without the drastic changes in connection weights caused by conventional pruning methods. Together, information enhancement and soft pruning aim to preserve the original information, and in particular the necessary and important information, by allowing a flexible compromise between information acquisition and information reduction. The method was applied to an artificial data set, an eye-tracking data set, and a rebel-forces participation data set. With the artificial data set, we demonstrated that soft pruning increased the selectivity of connection weights, yielding sparse weights that could be interpreted naturally. When applied to the real eye-tracking data set, the method outperformed conventional methods, including ensemble methods, in terms of generalization, and the final results could be interpreted in accordance with the conventional eye-tracking theory of the choice process. Finally, the rebel data set showed that detailed relations between inputs and outputs could be interpreted. However, the method also showed a limitation: the selectivity obtained by soft pruning could not be increased further.
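As a rough, hypothetical sketch of the two ideas summarized above (re-injecting the input into every hidden layer, and keeping sparsity separate from error minimization), the PyTorch-style code below shows one possible reading of the approach. The layer sizes, activation functions, and the `shrink`/`threshold` heuristic in `soft_prune_step` are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn

class SemiAutoencoder(nn.Module):
    """Sketch of a semi-autoencoder: every hidden layer receives the original
    input alongside the previous layer's activations, so input information is
    not lost as the signal passes through the stack (information enhancement)."""
    def __init__(self, n_in, n_hidden, n_layers, n_out):
        super().__init__()
        self.first = nn.Linear(n_in, n_hidden)
        # subsequent layers see [previous activations, original input]
        self.hidden = nn.ModuleList(
            [nn.Linear(n_hidden + n_in, n_hidden) for _ in range(n_layers - 1)])
        self.out = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        h = torch.sigmoid(self.first(x))
        for layer in self.hidden:
            # re-inject the raw input at every hidden layer
            h = torch.sigmoid(layer(torch.cat([h, x], dim=1)))
        return self.out(h)

def soft_prune_step(model, shrink=0.99, threshold=1e-2):
    """Illustrative soft-pruning step: small-magnitude weights are gently
    shrunk toward zero in a step separate from error minimization, rather
    than being cut abruptly or penalized inside the training loss."""
    with torch.no_grad():
        for p in model.parameters():
            small = p.abs() < threshold
            p[small] *= shrink

# Toy usage: alternate information acquisition (error minimization)
# with information reduction (soft pruning) on random data.
model = SemiAutoencoder(n_in=10, n_hidden=8, n_layers=3, n_out=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 10), torch.randn(64, 1)
for epoch in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    soft_prune_step(model)
```

The point of the separation is that the sparsity step never enters the gradient computation, so error minimization and information reduction do not have to be balanced through penalty coefficients within a single loss.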

## Keywords

Multi-layered neural networks · Autoencoder · Semi-autoencoder · Sparsity · Soft pruning · Information augmentation · Generalization · Interpretation · Vanishing information

## Notes

### Acknowledgments

We are very grateful to the editor and two reviewers for their valuable comments on the paper. This research was supported by the Japan Society for the Promotion of Science under Grant-in-Aid for Scientific Research 16K00339.
