
Multimedia Tools and Applications

Volume 77, Issue 20, pp 27231–27267

Training deep neural networks with non-uniform frame-level cost function for automatic speech recognition

  • Aldonso Becerra
  • J. Ismael de la Rosa
  • Efrén González
  • A. David Pedroza
  • N. Iracemi Escalante

Abstract

The aim of this paper is to present two new variations of the frame-level cost function for training a deep neural network, in order to achieve better word error rates in speech recognition. Optimization methods and the loss functions they minimize are fundamental to neural network training, so improving them is a salient research objective, and this paper addresses part of that problem. The first proposed framework is based on the concept of extropy, the complementary dual of an uncertainty measure. The conventional cross-entropy function is mapped to a non-uniform loss function based on its corresponding extropy, emphasizing frames whose assignment to specific senones is ambiguous. The second proposal fuses this mapped cross-entropy function with the idea of boosted cross-entropy, which emphasizes frames with low target posterior probability. The proposed approaches were evaluated on a personalized mid-vocabulary speaker-independent voice corpus. This dataset is employed for the recognition of digit strings and personal name lists in Spanish from the northern central part of Mexico on a connected-words phone dialing task. Relative word error rate improvements of 12.3% and 10.7% are obtained with the two proposed approaches, respectively, with regard to the conventional, well-established cross-entropy objective function.
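The two weighting ideas sketched in the abstract can be illustrated with a small numerical example. The snippet below is a simplified sketch, not the paper's exact formulation: the functions `extropy`, `boosted_weight`, and the multiplicative combination in `non_uniform_loss` are illustrative assumptions, using the extropy definition of Lad et al. (2015) as a per-frame ambiguity weight on the standard cross-entropy.

```python
import math

def cross_entropy(post, target):
    # Standard frame-level cross-entropy: negative log posterior of the target senone.
    return -math.log(post[target])

def extropy(post):
    # Extropy of a discrete distribution (Lad et al., 2015):
    #   J(p) = -sum_i (1 - p_i) * log(1 - p_i)
    # It is largest for ambiguous (near-uniform) posteriors, so here it serves
    # as an illustrative per-frame ambiguity weight.
    return -sum((1.0 - p) * math.log(max(1.0 - p, 1e-12)) for p in post)

def boosted_weight(post, target, beta=1.0):
    # Boosting factor that grows as the target posterior shrinks,
    # emphasizing poorly classified frames (the "boosted" idea).
    return (1.0 - post[target]) ** beta

def non_uniform_loss(post, target, mode="extropy"):
    # Illustrative non-uniform frame-level loss: cross-entropy scaled by a
    # per-frame weight derived from ambiguity (extropy) or from
    # misclassification (boosting). The multiplicative form is an assumption.
    w = extropy(post) if mode == "extropy" else boosted_weight(post, target)
    return w * cross_entropy(post, target)

# A confidently classified frame versus an ambiguous one (target senone = 0):
confident = [0.97, 0.01, 0.01, 0.01]
ambiguous = [0.30, 0.30, 0.20, 0.20]
```

Under this sketch, the ambiguous frame receives a much larger weighted loss than the confident frame, which is the qualitative behavior the non-uniform cost functions are designed to produce.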

Keywords

Speech recognition · Neural networks · Deep learning · Cross-entropy · Extropy · Frame-level loss function

Notes

Acknowledgments

The first author acknowledges all the support given by the Universidad Autónoma de Zacatecas (UAZ) during the years 2014–2017 toward his PhD studies. He also acknowledges the support given by CONACyT during his postgraduate studies.

Compliance with Ethical Standards

Conflict of interests

The authors declare that they have no conflict of interest.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas, México
  2. Department of Basic Sciences, Instituto Tecnológico de Pabellón de Arteaga, Pabellón de Arteaga, México
