Multimedia Tools and Applications, Volume 74, Issue 17, pp 6769–6795

Histogram equalization of contextual statistics of speech features for robust speech recognition

Article

Abstract

Recent years have seen a flurry of research aimed at developing novel robustness methods for automatic speech recognition (ASR). Among them, histogram equalization (HEQ) of speech features is one of the most prominent and successful lines of research, owing to its neat formulation and remarkable performance. In this paper, we adopt an effective modeling framework for joint equalization of spatial-temporal contextual statistics of speech features. Within this framework, we explore various combinations of simple differencing and averaging operations to capture the contextual relationships of feature vector components, both between different dimensions and between consecutive speech frames, in the HEQ process. Furthermore, several variants of HEQ are investigated and integrated into the proposed framework to efficiently compensate for the effects of noise interference on the feature vector components. In addition, the utility of the methods derived from this framework and of several existing robustness methods is analyzed and compared extensively. All experiments were carried out on the Aurora-2 database and task, and were further verified on the Aurora-4 database and task. Experimental results suggest that the proposed methods offer substantial improvements over the baseline system and achieve performance competitive with or better than some existing noise robustness methods in speech recognition.
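
As a rough illustration of the idea described in the abstract, the sketch below (Python, standard library only) equalizes one feature dimension to a standard normal reference using a rank-based empirical CDF. It also forms difference and average streams between consecutive frames, equalizes each, and recombines them. The function names, the standard normal reference, the two-frame context, and the recombination step are all assumptions made for illustration. None of them is taken from the paper, and the authors' formulation may differ.

```python
from statistics import NormalDist


def heq_to_gaussian(values):
    """Rank-based histogram equalization of one feature dimension to N(0, 1).

    Hypothetical helper: maps each value through the empirical CDF of the
    sequence and then through the inverse CDF of a standard normal.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    reference = NormalDist()  # assumed reference distribution
    equalized = [0.0] * n
    for rank, idx in enumerate(order):
        # Empirical CDF estimate, kept strictly inside (0, 1)
        cdf = (rank + 0.5) / n
        equalized[idx] = reference.inv_cdf(cdf)
    return equalized


def temporal_contexts(values):
    """Return (difference, average) streams over consecutive frames.

    The first frame is paired with itself at the boundary, so that
    values[t] == average[t] + difference[t] holds for every t.
    """
    previous = [values[0]] + values[:-1]
    difference = [(x - p) / 2.0 for x, p in zip(values, previous)]
    average = [(x + p) / 2.0 for x, p in zip(values, previous)]
    return difference, average


def equalize_with_temporal_context(values):
    """Equalize the difference and average streams separately, then recombine.

    Illustrative assumption only; not the paper's exact procedure.
    """
    difference, average = temporal_contexts(values)
    diff_eq = heq_to_gaussian(difference)
    avg_eq = heq_to_gaussian(average)
    return [a + d for a, d in zip(avg_eq, diff_eq)]
```

The same differencing and averaging could in principle be applied across feature dimensions within a frame rather than across frames; the sketch covers only the temporal case to stay short.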

Keywords

Automatic speech recognition · Noise robustness · Histogram equalization · Feature contextual statistics

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer Science & Information Engineering, National Taiwan Normal University, Taipei, Taiwan
  2. Department of Electrical Engineering, National Chi Nan University, Nantou, Taiwan
