Entropy-based pruning method for convolutional neural networks

  • Cheonghwan Hur
  • Sanggil Kang


Various compression approaches, including pruning techniques, have been developed to reduce the computational complexity of neural networks. Most pruning techniques determine the threshold for pruning weights or input features from a statistical analysis of the weight values after training is complete. Their compression performance is limited because they do not take into account the contribution of the weights to the output during training. To solve this problem, we propose an entropy-based pruning technique that determines the threshold by considering the average amount of information that the weights convey to the output during training. In the experiment section, we demonstrate and analyze our method on a convolutional neural network image classifier trained on the Modified National Institute of Standards and Technology (MNIST) image data. The experimental results show that our technique improves compression performance by more than 28% overall compared with a well-known pruning technique, and improves pruning speed by 14%.
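The idea of deriving a pruning threshold from the entropy of the weights can be illustrated with a minimal sketch. The abstract does not state the exact rule, so everything below is an assumption made for illustration: the function names `histogram_entropy`, `entropy_threshold`, and `prune`, the `scale` and `bins` parameters, and in particular the choice of the square root of the entropy power as the threshold are hypothetical and are not the authors' method. The sketch also works on a single snapshot of one layer's weights, whereas the approach described here evaluates the information during training.

```python
import numpy as np

def histogram_entropy(weights, bins=64):
    """Estimate the differential entropy (in nats) of the weight values."""
    counts, edges = np.histogram(weights, bins=bins)
    width = edges[1] - edges[0]
    p = counts[counts > 0] / counts.sum()
    # -sum p_i * log(p_i / width) approximates -integral f(w) log f(w) dw
    return float(-np.sum(p * np.log(p / width)))

def entropy_threshold(weights, scale=1.0, bins=64):
    """Hypothetical rule (an assumption): threshold = scale * sqrt(entropy power).

    The entropy power exp(2H) / (2*pi*e) equals the variance for a Gaussian,
    so for Gaussian-like weights this is roughly scale * standard deviation.
    """
    h = histogram_entropy(weights, bins)
    return scale * float(np.sqrt(np.exp(2.0 * h) / (2.0 * np.pi * np.e)))

def prune(weights, threshold):
    """Zero every weight whose magnitude is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Synthetic Gaussian weights standing in for one trained layer.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(100, 100))

t = entropy_threshold(w, scale=1.0)
pruned, mask = prune(w, t)
sparsity = 1.0 - mask.mean()
print(f"threshold={t:.4f}, sparsity={sparsity:.2%}")
```

In a training loop, one plausible use would be to recompute the threshold every few epochs and reapply the mask, so that the pruning decision follows the weight distribution as it evolves rather than being fixed after training.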


Convolutional neural network, Gaussian, Entropy, Pruning, Threshold, Weight



This work was supported by an Inha University Research Grant.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Engineering, Inha University, Incheon, Republic of Korea
