Deep and Sparse Learning in Speech and Language Processing: An Overview

  • Dong Wang
  • Qiang Zhou
  • Amir Hussain
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10023)


Large-scale deep neural models, e.g., deep neural networks (DNNs) and recurrent neural networks (RNNs), have demonstrated significant success in solving various challenging tasks in speech and language processing (SLP), including speech recognition, speech synthesis, document classification and question answering. This growing impact corroborates the neurobiological evidence for layer-wise deep processing in the human brain. Sparse coding representations have achieved similar success in SLP, particularly in signal processing, pointing to sparsity as another important neurobiological characteristic. Recent research in these two directions has led to increasing cross-fertilisation of ideas, so a unified Sparse Deep or Deep Sparse learning framework warrants much attention. This paper provides an overview of the growing interest in this unified framework and outlines future research possibilities in this multi-disciplinary area.


Keywords: Deep learning, Sparse coding, Speech processing, Language processing



This research was supported by the RSE-NSFC joint project (No. 61411130162), the National Science Foundation of China (NSFC) under Project No. 61371136, the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1, and the MESTDC PhD Foundation Project No. 20130002120011. It was also supported by Huilan Ltd., Tongfang Corp., and FreeNeb.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. CSLT, RIIT, Tsinghua University, Beijing, China
  2. Tsinghua National Lab for Information Science and Technology, Beijing, China
  3. University of Stirling, Scotland, UK
