Sparse semi-autoencoders to solve the vanishing information problem in multi-layered neural networks

Applied Intelligence

Abstract

This paper proposes a new neural network, the “sparse semi-autoencoder,” to overcome the vanishing information problem inherent in multi-layered neural networks. The vanishing information problem refers to the natural tendency of multi-layered neural networks to lose information contained in input patterns and in training errors; it also includes the reduction in information caused by constraints such as sparse regularization. To overcome this problem, two methods are proposed: input information enhancement by semi-autoencoders, and the separation of error minimization from sparse regularization by soft pruning. First, information in input patterns is enhanced to prevent it from decreasing as it passes through multiple layers. The enhancement is realized by a new architecture, the “semi-autoencoder,” in which input patterns are supplied to all hidden layers so that the original input information is retained as much as possible. Second, information reduction by sparse regularization is separated from information acquisition by error minimization. Sparse regularization is usually applied in training autoencoders, and it naturally decreases information by restricting information capacity. Because many parameters must be tuned to balance the penalties against error minimization, this penalty-based information reduction tends to eliminate even necessary and important information. We therefore introduce soft pruning, in which information acquisition by error minimization and information reduction by sparse regularization are applied separately, without the drastic changes in connection weights that occur in pruning methods.
Together, information enhancement and soft pruning aim to retain as much of the original information as possible, and in particular to retain necessary and important information, by allowing a flexible compromise between information acquisition and reduction. The method was applied to an artificial data set, an eye-tracking data set, and a rebel forces participation data set. With the artificial data set, we demonstrated that soft pruning increased the selectivity of connection weights, yielding sparse weights, and that the final weights could be interpreted naturally. When applied to the real eye-tracking data set, the method outperformed conventional methods, including ensemble methods, in terms of generalization. In addition, for the eye-tracking data set, the final results could be interpreted in accordance with the conventional eye-tracking theory of the choice process. Finally, the rebel data set showed the possibility of detailed interpretation of the relations between inputs and outputs. However, the method was also found to have a limitation: the selectivity could not be increased by the soft pruning.

Notes

  1. The eye-tracking data set was made by the co-author, Haruhiko Takeuchi, and can be obtained from his personal Web page: http://www2.accsnet.ne.jp/~amx08582/.

  2. The rebel forces participation data set is discussed thoroughly in A. Oyefusi’s paper, available at: http://journals.sagepub.com/doi/abs/10.1177/0022343308091360.

References

  1. Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertainty Fuzziness Knowledge Based Syst 6(02):107–116

  2. Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166

  3. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256

  4. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117

  5. Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Aistats, vol 15, p 275

  6. Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554

  7. Bengio Y, Lamblin P, Popovici D, Larochelle H, et al (2007) Greedy layer-wise training of deep networks. Adv Neural Inf Proces Syst 19:153–160

  8. Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review 5(1):3–55

  9. Shannon CE (1951) Prediction and entropy of printed english. Bell Syst Tech J 30(1):50–64

  10. Abramson N (1963) Information theory and coding. McGraw-Hill, New York

  11. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  12. He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In: European conference on computer vision. Springer, pp 630–645

  13. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI, pp 4278–4284

  14. Hanson SJ, Pratt LY (1989) Comparing biases for minimal network construction with back-propagation. In: Advances in neural information processing systems, pp 177–185

  15. LeCun Y, Denker JS, Solla SA (1990) Optimal brain damage. In: Advances in neural information processing systems, pp 598–605

  16. Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl-Based Syst 8(6):373–389

  17. Benítez JM, Castro JL, Requena I (1997) Are artificial neural networks black boxes? IEEE Trans Neural Netw 8(5):1156–1164

  18. Srinivas S, Babu RV (2015) Data-free parameter pruning for deep neural networks. arXiv:1507.06149

  19. Olshausen BA, Field DJ (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Res 37(23):3311–3325

  20. Lee H, Ekanadham C, Ng AY (2008) Sparse deep belief net model for visual area v2. In: Advances in neural information processing systems, pp 873–880

  21. Nair V, Hinton GE (2009) 3d object recognition with deep belief nets. In: Advances in neural information processing systems, pp 1339–1347

  22. Ng A (2011) Sparse autoencoder, vol. 72 of CS294a lecture notes

  23. Zhang X, Dou H, Ju T, Xu J, Zhang S (2016) Fusing heterogeneous features from stacked sparse autoencoder for histopathological image analysis. IEEE Journal of Biomedical and Health Informatics 20 (5):1377–1383

  24. Xu J, Xiang L, Hang R, Wu J (2014) Stacked sparse autoencoder (ssae) based framework for nuclei patch classification on breast cancer histopathology. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI). IEEE, pp 999–1002

  25. Tao C, Pan H, Li Y, Zou Z (2015) Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geosci Remote Sens Lett 12(12):2438–2442

  26. Deng J, Zhang Z, Marchi E, Schuller B (2013) Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: 2013 Humaine association conference on affective computing and intelligent interaction. IEEE, pp 511–516

  27. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

  28. Hinton GE (2012) A practical guide to training restricted boltzmann machines. In: Neural networks: tricks of the trade. Springer, pp 599–619

  29. Ciresan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural Comput 22(12):3207–3220

  30. Makhzani A, Frey B (2013) K-sparse autoencoders. arXiv:1312.5663

  31. Cheng Y, Wang D, Zhou P, Zhang T (2017) A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282

  32. Rudy J, Ding W, Im DJ, Taylor GW (2014) Neural network regularization via robust weight factorization. arXiv:1412.6630

  33. Bach F, Jenatton R, Mairal J, Obozinski G (2012) Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning 4(1):1–106

  34. Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. ACM, pp 1096–1103

  35. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(Dec):3371–3408

  36. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580

  37. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

  38. Castellano G, Fanelli AM (1999) Variable selection using neural-network models. Neurocomputing 31:1–13

  39. Oliveira GG, Pedrollo OC, Castro NM (2015) Simplifying artificial neural network models of river basin behaviour by an automated procedure for input variable selection. Eng Appl Artif Intell 40:47–61

  40. Olden JD, Joy MK, Death RG (2004) An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol Model 178(3):389–397

  41. Papadokonstantakis S, Lygeros A, Jacobsson SP (2005) Comparison of recent methods for inference of variable influence in neural networks. Neural Netw 19:500–513

  42. Gómez-Carracedo M, Andrade J, Carrera G, Aires-de Sousa J, Carlosena A, Prada D (2010) Combining kohonen neural networks and variable selection by classification trees to cluster road soil samples. Chemometr Intell Lab Syst 102(1):20–34

  43. May R, Dandy G, Maier H (2011) Review of input variable selection methods for artificial neural networks. In: Suzuki K (ed) Artificial neural networks-methodological advances and biomedical applications, InTech, pp 19–44

  44. Oyefusi A (2008) Oil and the probability of rebel participation among youths in the Niger Delta of Nigeria. J Peace Res 45(4):539–555

  45. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  46. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  47. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232

  48. Friedman J, Hastie T, Tibshirani R, et al (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28(2):337–407

  49. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844

  50. Schapire RE, Freund Y, Bartlett P, Lee WS, et al (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat 26(5):1651–1686

  51. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336

  52. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2008) Rusboost: Improving classification performance when training data is skewed. In: 19th International conference on pattern recognition, 2008. ICPR 2008. IEEE, pp 1–4

  53. Warmuth MK, Liao J, Rätsch G (2006) Totally corrective boosting algorithms that maximize the margin. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 1001–1008

  54. Dietterich TG (2000) Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer, pp 1–15

  55. Riedl R, Brandstätter E, Roithmayr F (2008) Identifying decision strategies: a process-and outcome-based classification method. Behav Res Methods 40(3):795–807

  56. Glaholt MG, Reingold EM (2011) Eye movement monitoring as a process tracing methodology in decision making research. Journal of Neuroscience, Psychology, and Economics 4(2):125–146

  57. Gere A, Danner L, de Antoni N, Kovács S, Dürrschmid K, Sipos L (2016) Visual attention accompanying food decision process: an alternative approach to choose the best models. Food Qual Prefer 51:1–7

  58. Russo JE, Leclerc F (1994) An eye-fixation analysis of choice processes for consumer nondurables. J Consum Res 21(2):274–290

Acknowledgments

We are very grateful to an editor and two reviewers for their valuable comments on the paper. This research was supported by the Japan Society for the Promotion of Science under the Grants-in-Aid for Scientific Research, grant 16K00339.

Corresponding author

Correspondence to Ryotaro Kamimura.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Soft pruning by layer-wise selectivity increase

We briefly explain here how connection weights are modified to have strong selectivity. Selectivity is realized by multiplying the connection weights by their normalized importance over a fixed number of learning steps (N steps), as shown in Fig. 11. In the first learning step (Fig. 11a), ordinary learning is performed for T epochs. From the second step on (Fig. 11b), the normalized importance of the connection weights is computed, and the weights are multiplied by it. The effect of the importance values on the weights naturally diminishes as the number of epochs increases. Thus, once T epochs have elapsed within a step, the normalized importance is computed and applied again, and so on. The value of T was fixed at 100 in all experiments because the learning step was repeated at most ten times, giving a total of 1,000 epochs, which corresponds to the default number of learning epochs in the Matlab neural network package. As mentioned above, we tried to use the default parameter values as much as possible so that the present results can be reproduced easily.

Fig. 11 Weight modification taking selectivity into account, from the first step (a) to the final Nth step (c), with T epochs in each step. Solid and dotted lines represent positive and negative weights, respectively

We now describe how the selectivity is controlled step by step; the superscript s indexing input patterns is omitted for simplicity. In the first epoch of the first step, denoted by the superscript (1, 1), the output from the jth hidden neuron is

$$ {~}^{(1,1)}v_{j}^{(2)} = \text{tansig} \left( \sum\limits_{k = 1}^{n_{1}} {~}^{(1,1)}w_{j k}^{(2)} x_{k} \right). $$
(19)

Then, the output from the kth output neuron, which reconstructs the kth input, is computed by

$$ {~}^{(1,1)}o_{k}^{(1)} = \sum\limits_{j = 1}^{n_{2}} {~}^{(1,1)}w_{k j}^{(1)} {~}^{(1,1)}v_{j}^{(2)}. $$
(20)

The error is computed by

$$ {~}^{(1,1)}E = \sum\limits_{k = 1}^{n_{1}} \left( x_{k} - {~}^{(1,1)}o_{k}^{(1)} \right)^{2}. $$
(21)
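
As a minimal sketch of Eqs. (19)–(21), the forward pass and reconstruction error for one input pattern can be written in NumPy. The function name `forward`, the array names `W2` and `W1`, and the layer sizes are illustrative assumptions, not taken from the paper. `np.tanh` is used because Matlab's `tansig` is mathematically equivalent to the hyperbolic tangent.

```python
import numpy as np

def forward(x, W2, W1):
    """Forward pass of a single-hidden-layer autoencoder, Eqs. (19)-(21).

    x  : input pattern, shape (n1,)
    W2 : input-to-hidden weights w_jk^(2), shape (n2, n1)
    W1 : hidden-to-output weights w_kj^(1), shape (n1, n2)
    """
    v = np.tanh(W2 @ x)        # Eq. (19): hidden outputs (tansig == tanh)
    o = W1 @ v                 # Eq. (20): linear reconstruction of the input
    E = np.sum((x - o) ** 2)   # Eq. (21): squared reconstruction error
    return v, o, E

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
n1, n2 = 4, 3
x = rng.normal(size=n1)
W2 = rng.normal(scale=0.1, size=(n2, n1))
W1 = rng.normal(scale=0.1, size=(n1, n2))
v, o, E = forward(x, W2, W1)
```

The output layer is linear here because Eq. (20) applies no activation function to the weighted sum.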

The number of epochs is then increased up to the Tth epoch of the first step, denoted by (T, 1). At this final Tth epoch, the importance for the first step is computed by

$$ {~}^{(T,1)}u_{j k}^{(2)}= \left| {~}^{(T,1)}w_{j k}^{(2)} \right|. $$
(22)

The normalized importance is

$$ {~}^{(T,1)}z_{j k}^{(2)} = \frac{{~}^{(T,1)}u_{j k}^{(2)}}{{~}^{(T,1)}u_{\max}^{(2)}}. $$
(23)

Then, at the beginning of the second step (Fig. 11b), the normalized importance is used to modify the connection weights obtained at the end of the first step (Fig. 11a):

$$ {~}^{(1,2)}w_{j k}^{(2)}= {~}^{(T,1)}z_{j k}^{(2)} {~}^{(T,1)}w_{j k}^{(2)}. $$
(24)
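
A small numerical sketch of Eqs. (22)–(24) may help make the rescaling concrete. The function name `soft_prune` and the example matrix are hypothetical, and taking the maximum over all weights of the layer is an assumption about how the maximum importance is defined.

```python
import numpy as np

def soft_prune(W):
    """Rescale weights by their normalized importance, Eqs. (22)-(24)."""
    u = np.abs(W)          # Eq. (22): importance is the absolute weight
    u_max = u.max()        # assumption: maximum taken over the whole layer
    if u_max == 0.0:
        return W.copy()    # all-zero layer: nothing to rescale
    z = u / u_max          # Eq. (23): normalized importance in [0, 1]
    return z * W           # Eq. (24): weights carried into the next step

# Hypothetical weights at the end of the first step.
W = np.array([[0.8, -0.4],
              [0.2, -0.1]])
W_new = soft_prune(W)
# W_new is [[0.8, -0.2], [0.05, -0.0125]]
```

The largest weight is left unchanged, while each smaller weight shrinks in proportion to its relative magnitude and keeps its sign. No weight is set exactly to zero, which is what distinguishes this rescaling from hard pruning.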

The connection weights are thus gradually modified to reflect their importance. As can be seen in Fig. 11a–c, the number of strong connection weights gradually diminishes. Finally, at the Tth epoch of the Nth step in Fig. 11c3, the number of strong connection weights reaches its final value.
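
The whole procedure of N steps with T epochs each can be sketched as the loop below. This is an illustration under stated assumptions rather than the implementation used in the experiments: plain batch gradient descent on the mean squared reconstruction error stands in for the Matlab package's default training, the rescaling is applied to the input-to-hidden weights only, and the names `train_soft_pruning`, `lr`, and `seed` are hypothetical.

```python
import numpy as np

def train_soft_pruning(X, n_hidden, n_steps=10, n_epochs=100, lr=0.01, seed=0):
    """Soft pruning sketch: n_steps learning steps of n_epochs epochs each.

    X : input patterns, shape (S, n_in)
    Returns the input-to-hidden weights W2 and hidden-to-output weights W1.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))
    W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
    for step in range(n_steps):
        if step > 0:
            # From the second step on, multiply the weights by their
            # normalized importance, Eqs. (22)-(24).
            u = np.abs(W2)
            W2 = (u / u.max()) * W2
        for _ in range(n_epochs):
            V = np.tanh(X @ W2.T)               # hidden outputs, Eq. (19)
            O = V @ W1.T                        # reconstructions, Eq. (20)
            D = O - X                           # reconstruction residuals
            # Gradients of the mean squared reconstruction error
            # (assumption: batch gradient descent, not the Matlab default).
            gW1 = 2.0 * D.T @ V / len(X)
            gV = 2.0 * D @ W1 / len(X)
            gW2 = (gV * (1.0 - V ** 2)).T @ X   # backprop through tanh
            W1 = W1 - lr * gW1
            W2 = W2 - lr * gW2
    return W2, W1

# Hypothetical data and a short schedule, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
W2, W1 = train_soft_pruning(X, n_hidden=3, n_steps=3, n_epochs=20)
```

With the settings reported above, `n_steps=10` and `n_epochs=100` would give the 1,000 epochs in total. Because ordinary learning resumes after each rescaling, the effect of the importance values fades within a step and is renewed at the start of the next one.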

Cite this article

Kamimura, R., Takeuchi, H. Sparse semi-autoencoders to solve the vanishing information problem in multi-layered neural networks. Appl Intell 49, 2522–2545 (2019). https://doi.org/10.1007/s10489-018-1393-x

  • DOI: https://doi.org/10.1007/s10489-018-1393-x
