Sparse semi-autoencoders to solve the vanishing information problem in multi-layered neural networks

Kamimura, Ryotaro; Takeuchi, Haruhiko

doi:10.1007/s10489-018-1393-x

Published: 24 January 2019

Applied Intelligence, volume 49, pages 2522–2545 (2019)

Abstract

This paper proposes a new neural network, the “sparse semi-autoencoder”, to overcome the vanishing information problem inherent in multi-layered neural networks. The vanishing information problem is the natural tendency of multi-layered neural networks to lose information in input patterns as well as in training errors; it also includes the natural reduction in information due to constraints such as sparse regularization. To overcome this problem, two methods are proposed: input information enhancement by semi-autoencoders, and the separation of error minimization from sparse regularization by soft pruning. First, information in input patterns is enhanced to prevent it from decreasing as it passes through multiple layers. This enhancement is realized by a new architecture called the “semi-autoencoder”, in which information in input patterns is supplied to all hidden layers so that the original input information is kept as much as possible. Second, information reduction by sparse regularization is separated from information acquisition by error minimization. Sparse regularization is usually applied in training autoencoders, and it naturally decreases information by restricting the information capacity. Information reduction through penalties tends to eliminate even necessary and important information, because many parameters are needed to harmonize the penalties with error minimization. Thus, we introduce a new method of soft pruning, in which information acquisition by error minimization and information reduction by sparse regularization are applied separately, without the drastic change in connection weights that occurs in pruning methods.
Together, information enhancement and soft pruning aim to keep the original information as much as possible, and in particular to keep necessary and important information, by allowing a flexible compromise between information acquisition and reduction. The method was applied to an artificial data set, an eye-tracking data set, and a rebel forces participation data set. With the artificial data set, we demonstrated that soft pruning increased the selectivity of connection weights, giving sparse weights, and that the final weights could be naturally interpreted. When applied to the real eye-tracking data set, the present method outperformed conventional methods, including ensemble methods, in terms of generalization. In addition, for the eye-tracking data set, the final results could be interpreted according to the conventional eye-tracking theory of the choice process. Finally, the rebel data set showed the possibility of detailed interpretation of relations between inputs and outputs. However, it was also found that the method had a limitation: the selectivity by soft pruning could not be increased.

Notes

The eye-tracking data set was created by the co-author, Haruhiko Takeuchi, and can be obtained from his personal Web page: http://www2.accsnet.ne.jp/~amx08582/.
The rebel forces participation data set is discussed thoroughly in A. Oyefusi’s paper, available at: http://journals.sagepub.com/doi/abs/10.1177/0022343308091360.

References

Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertainty Fuzziness Knowledge Based Syst 6(02):107–116
Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Aistats, vol 15, p 275
Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
Bengio Y, Lamblin P, Popovici D, Larochelle H, et al (2007) Greedy layer-wise training of deep networks. Adv Neural Inf Proces Syst 19:153–160
Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review 5(1):3–55
Shannon CE (1951) Prediction and entropy of printed english. Bell Syst Tech J 30(1):50–64
Abramson N (1963) Information theory and coding. McGraw-Hill, New York
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In: European conference on computer vision. Springer, pp 630–645
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI, pp 4278–4284
Hanson SJ, Pratt LY (1989) Comparing biases for minimal network construction with back-propagation. In: Advances in neural information processing systems, pp 177–185
Lecun Y, Denker JS, Solla SA (1990) Optimal brain damage. In: Advances in neural information processing systems, pp 598–605
Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl-Based Syst 8(6):373–389
Benítez JM, Castro JL, Requena I (1997) Are artificial neural networks black boxes? IEEE Trans Neural Netw 8(5):1156–1164
Srinivas S, Babu RV (2015) Data-free parameter pruning for deep neural networks. arXiv:1507.06149
Olshausen BA, Field DJ (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Res 37(23):3311–3325
Lee H, Ekanadham C, Ng AY (2008) Sparse deep belief net model for visual area v2. In: Advances in neural information processing systems, pp 873–880
Nair V, Hinton GE (2009) 3d object recognition with deep belief nets. In: Advances in neural information processing systems, pp 1339–1347
Ng A (2011) Sparse autoencoder, vol. 72 of CS294a lecture notes
Zhang X, Dou H, Ju T, Xu J, Zhang S (2016) Fusing heterogeneous features from stacked sparse autoencoder for histopathological image analysis. IEEE Journal of Biomedical and Health Informatics 20(5):1377–1383
Xu J, Xiang L, Hang R, Wu J (2014) Stacked sparse autoencoder (ssae) based framework for nuclei patch classification on breast cancer histopathology. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI). IEEE, pp 999–1002
Tao C, Pan H, Li Y, Zou Z (2015) Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geosci Remote Sens Lett 12(12):2438–2442
Deng J, Zhang Z, Marchi E, Schuller B (2013) Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: 2013 Humaine association conference on affective computing and intelligent interaction. IEEE, pp 511–516
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Hinton GE (2012) A practical guide to training restricted boltzmann machines. In: Neural networks: tricks of the trade. Springer, pp 599–619
Ciresan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural Comput 22(12):3207–3220
Makhzani A, Frey B (2013) K-sparse autoencoders. arXiv:1312.5663
Cheng Y, Wang D, Zhou P, Zhang T (2017) A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282
Rudy J, Ding W, Im DJ, Taylor GW (2014) Neural network regularization via robust weight factorization. arXiv:1412.6630
Bach F, Jenatton R, Mairal J, Obozinski G (2012) Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning 4(1):1–106
Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. ACM, pp 1096–1103
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(Dec):3371–3408
Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Castellano G, Fanelli AM (1999) Variable selection using neural-network models. Neurocomputing 31:1–13
Oliveira GG, Pedrollo OC, Castro NM (2015) Simplifying artificial neural network models of river basin behaviour by an automated procedure for input variable selection. Eng Appl Artif Intell 40:47–61
Olden JD, Joy MK, Death RG (2004) An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol Model 178(3):389–397
Papadokonstantakis S, Lygeros A, Jacobsson SP (2005) Comparison of recent methods for inference of variable influence in neural networks. Neural Netw 19:500–513
Gómez-Carracedo M, Andrade J, Carrera G, Aires-de Sousa J, Carlosena A, Prada D (2010) Combining kohonen neural networks and variable selection by classification trees to cluster road soil samples. Chemometr Intell Lab Syst 102(1):20–34
May R, Dandy G, Maier H (2011) Review of input variable selection methods for artificial neural networks. In: Suzuki K (ed) Artificial neural networks-methodological advances and biomedical applications, InTech, pp 19–44
Oyefusi A (2008) Oil and the probability of rebel participation among youths in the Niger delta of nigeria. J Peace Res 45(4):539–555
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Friedman J, Hastie T, Tibshirani R, et al (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28(2):337–407
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
Schapire RE, Freund Y, Bartlett P, Lee WS, et al (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat 26(5):1651–1686
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2008) Rusboost: Improving classification performance when training data is skewed. In: 19th International conference on pattern recognition, 2008. ICPR 2008. IEEE, pp 1–4
Warmuth MK, Liao J, Rätsch G (2006) Totally corrective boosting algorithms that maximize the margin. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 1001–1008
Dietterich TG (2000) Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer, pp 1–15
Riedl R, Brandstätter E, Roithmayr F (2008) Identifying decision strategies: a process-and outcome-based classification method. Behav Res Methods 40(3):795–807
Glaholt MG, Reingold EM (2011) Eye movement monitoring as a process tracing methodology in decision making research. Journal of Neuroscience, Psychology, and Economics 4(2):125–146
Gere A, Danner L, de Antoni N, Kovács S, Dürrschmid K, Sipos L (2016) Visual attention accompanying food decision process: an alternative approach to choose the best models. Food Qual Prefer 51:1–7
Russo JE, Leclerc F (1994) An eye-fixation analysis of choice processes for consumer nondurables. J Consum Res 21(2):274–290

Acknowledgments

We are very grateful to the editor and the two reviewers for their valuable comments on the paper. This research is supported by the Japan Society for the Promotion of Science under the Grants-in-Aid for Scientific Research, grant 16K00339.

Author information

Authors and Affiliations

IT Education Center, Tokai University, 4-1-1 Kitakaname, Hiratsuka, Kanagawa, 259-1292, Japan
Ryotaro Kamimura
Human Informatics Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), 1-1-1 Higashi, Tsukuba, 305-8566, Japan
Haruhiko Takeuchi

Corresponding author

Correspondence to Ryotaro Kamimura.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Soft pruning by layer-wise selectivity increase

We briefly explain here how connection weights are modified to have strong selectivity. Selectivity is realized by multiplying the connection weights by their normalized importance over a fixed number of learning steps (N steps), as shown in Fig. 11. In the first learning step (Fig. 11a), ordinary learning is performed for T epochs. Then, from the second step on, the normalized importance of the connection weights is computed, and the weights are multiplied by their importance (Fig. 11b). The effect of the importance values on the weights naturally diminishes as the number of epochs increases. Thus, once T epochs have elapsed within a step, the normalized importance is computed again and multiplied into the weights, and so on. The value of T was fixed at 100 in all the experiments, because the learning step was repeated at most ten times, amounting to a total of 1,000 epochs, which corresponds to the default number of learning epochs in the MATLAB neural network package. As mentioned above, we tried to use the default parameter values as much as possible for easy reproduction of the present results.
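The schedule described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions: the function name `soft_pruning_schedule`, the callbacks `train` and `reweight`, and the two stand-in helpers are hypothetical and do not come from the paper, whose experiments used the MATLAB neural network package. The sketch shows the control flow only: ordinary training in the first step, then a multiplication by importance before each later block of T epochs.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def soft_pruning_schedule(
    weights: Matrix,
    train: Callable[[Matrix, int], Matrix],
    reweight: Callable[[Matrix], Matrix],
    n_steps: int = 10,           # N in the text (at most ten steps)
    epochs_per_step: int = 100,  # T in the text
) -> Tuple[Matrix, int]:
    """Alternate blocks of training with importance-based reweighting."""
    total_epochs = 0
    for step in range(1, n_steps + 1):
        if step > 1:
            # From the second step on, multiply the weights by their
            # normalized importance before training continues.
            weights = reweight(weights)
        weights = train(weights, epochs_per_step)
        total_epochs += epochs_per_step
    return weights, total_epochs


# Hypothetical stand-ins so the sketch runs without a real optimizer.
def no_training(weights: Matrix, epochs: int) -> Matrix:
    """Placeholder for T epochs of error minimization; returns weights unchanged."""
    return weights


def scale_by_relative_magnitude(weights: Matrix) -> Matrix:
    """Multiply each weight by |w| divided by the largest |w| in the layer."""
    u_max = max(abs(w) for row in weights for w in row)
    if u_max == 0.0:
        return [row[:] for row in weights]
    return [[(abs(w) / u_max) * w for w in row] for row in weights]


final_weights, total = soft_pruning_schedule(
    [[2.0, 1.0]], no_training, scale_by_relative_magnitude, n_steps=3
)
# final_weights → [[2.0, 0.125]], total → 300
```

With the default arguments the loop runs 10 steps of 100 epochs, that is, the 1,000 epochs mentioned in the text. Because the placeholder performs no training, the weaker weight shrinks at every reweighting; in the actual method, error minimization between reweightings counteracts part of this effect.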

We now show how the selectivity is controlled step by step; the superscript s for input patterns is omitted for simplicity. In the first epoch of the first step, denoted by the superscript (1, 1), the output from the jth hidden neuron is

$$ {~}^{(1,1)}v_{j}^{(2)} = \text{tansig} \left( \sum\limits_{k = 1}^{n_{1}} {~}^{(1,1)}w_{j k}^{(2)} x_{k} \right). \tag{19} $$

Then, the output from the kth output neuron, which corresponds to the kth input, is computed by

$$ {~}^{(1,1)}o_{k}^{(1)} = \sum\limits_{j = 1}^{n_{2}} {~}^{(1,1)}w_{k j}^{(1)} {~}^{(1,1)}v_{j}^{(2)}. \tag{20} $$

The error is computed by

$$ {~}^{(1,1)}E = \sum\limits_{k = 1}^{n_{1}} \left( x_{k} - {~}^{(1,1)}o_{k}^{(1)} \right)^{2}. \tag{21} $$
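As a rough illustration of Eqs. (19) to (21), the following Python sketch computes the hidden outputs, the reconstructed inputs, and the squared error for a single input pattern. The function names, the variable names `w_enc` and `w_dec`, and all numerical values are assumptions made for this example; only the formulas come from the text. MATLAB's `tansig` is mathematically equivalent to the hyperbolic tangent, so `math.tanh` is used here.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def tansig(a: float) -> float:
    """Hyperbolic tangent sigmoid; mathematically the same as tanh."""
    return math.tanh(a)


def forward(x: Vector, w_enc: Matrix, w_dec: Matrix) -> Tuple[Vector, Vector]:
    """Return hidden outputs v (Eq. 19) and reconstructed inputs o (Eq. 20)."""
    # v_j = tansig(sum_k w_jk * x_k), one row of w_enc per hidden neuron
    v = [tansig(sum(w_jk * x_k for w_jk, x_k in zip(row, x))) for row in w_enc]
    # o_k = sum_j w_kj * v_j, one row of w_dec per output neuron
    o = [sum(w_kj * v_j for w_kj, v_j in zip(row, v)) for row in w_dec]
    return v, o


def reconstruction_error(x: Vector, o: Vector) -> float:
    """Sum of squared differences between inputs and outputs (Eq. 21)."""
    return sum((x_k - o_k) ** 2 for x_k, o_k in zip(x, o))


# Illustrative values only: n1 = 3 inputs and n2 = 2 hidden neurons.
x = [0.5, -1.0, 0.25]
w_enc = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]   # n2 rows, n1 columns
w_dec = [[0.2, 0.1], [-0.3, 0.5], [0.0, 0.7]]  # n1 rows, n2 columns
v, o = forward(x, w_enc, w_dec)
error = reconstruction_error(x, o)
```

The error here is for one pattern, matching the text's convention of omitting the pattern superscript s; summing over patterns would be a straightforward extension.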

Training then continues up to the Tth epoch of the first step, denoted by (T, 1). At this final epoch, the importance of each connection weight in the first step is computed as its absolute value:

$$ {~}^{(T,1)}u_{j k}^{(2)} = \left| {~}^{(T,1)}w_{j k}^{(2)} \right|. \tag{22} $$

The normalized importance is

$$ {~}^{(T,1)}z_{j k}^{(2)} = \frac{{~}^{(T,1)}u_{j k}^{(2)}}{{~}^{(T,1)}u_{\max}^{(2)}}. \tag{23} $$

Then, at the beginning of the second step (Fig. 11b), this normalized importance is used to modify the connection weights obtained at the end of the first step (Fig. 11a):

$$ {~}^{(1,2)}w_{j k}^{(2)} = {~}^{(T,1)}z_{j k}^{(2)} {~}^{(T,1)}w_{j k}^{(2)}. \tag{24} $$
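Eqs. (22) to (24) amount to an element-wise rescaling of the weight matrix, which the short Python sketch below illustrates. The function names and the example weights are hypothetical. The sketch also assumes that the maximum importance in Eq. (23) is taken over all connection weights of the layer, which is how the unindexed symbol is read here.

```python
from typing import List

Matrix = List[List[float]]


def normalized_importance(weights: Matrix) -> Matrix:
    """z_jk = |w_jk| / max |w| (Eqs. 22 and 23), max assumed over the layer."""
    u = [[abs(w) for w in row] for row in weights]
    u_max = max(max(row) for row in u)
    if u_max == 0.0:
        raise ValueError("all weights are zero, so importance is undefined")
    return [[u_jk / u_max for u_jk in row] for row in u]


def reweight(weights: Matrix) -> Matrix:
    """Initial weights of the next step: z_jk * w_jk (Eq. 24)."""
    z = normalized_importance(weights)
    return [
        [z_jk * w_jk for z_jk, w_jk in zip(z_row, w_row)]
        for z_row, w_row in zip(z, weights)
    ]


# Illustrative weights at the end of the first step (not from the paper).
w_end_of_step1 = [[2.0, -1.0], [0.5, 0.0]]
w_start_of_step2 = reweight(w_end_of_step1)
# → [[2.0, -0.5], [0.125, 0.0]]
```

The largest weight is left unchanged because its normalized importance is 1, while weaker weights shrink in proportion to their relative magnitude and keep their sign. No weight is set to zero outright, which is one way to read the "soft" in soft pruning.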

In this way, the connection weights are gradually modified to reflect their importance. As can be seen in Fig. 11a–c, the number of strong connection weights gradually diminishes. Finally, at the Tth epoch of the Nth step (Fig. 11c3), the number of strong connection weights reaches its final value.

About this article

Cite this article

Kamimura, R., Takeuchi, H. Sparse semi-autoencoders to solve the vanishing information problem in multi-layered neural networks. Appl Intell 49, 2522–2545 (2019). https://doi.org/10.1007/s10489-018-1393-x

Published: 24 January 2019
Issue Date: 15 July 2019
DOI: https://doi.org/10.1007/s10489-018-1393-x
