Circuits, Systems, and Signal Processing, Volume 36, Issue 9, pp 3731–3760

Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition

  • Jose A. Gonzalez
  • Angel M. Gómez
  • Antonio M. Peinado
  • Ning Ma
  • Jon Barker

Abstract

An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One such distortion model, reported to achieve a good trade-off between accuracy and simplicity, is the masking model. Under this model, the speech distortion caused by environmental noise is treated as a spectral mask, so that each noisy speech feature is either reliable (speech is not masked by noise) or unreliable (speech is masked by noise). In this paper, we present a detailed overview of this model and its applications to noise robust ASR. First, using the masking model, we derive a spectral reconstruction technique aimed at enhancing the noisy speech features. Two problems must be solved in order to perform spectral reconstruction using the masking model: (1) mask estimation, i.e., determining the reliability of the noisy features, and (2) feature imputation, i.e., estimating the speech values of the unreliable features. Unlike missing data imputation techniques, which treat the two problems as independent, our technique addresses them jointly by exploiting a priori knowledge of the speech and noise sources in the form of a statistical model. Second, we propose an algorithm for estimating the noise model required by the feature enhancement technique. The proposed algorithm fits a Gaussian mixture model to the noise by iteratively maximising the likelihood of the noisy speech signal, so that noise can be estimated even during speech-dominated frames. A comprehensive set of experiments carried out on the Aurora-2 and Aurora-4 databases shows that the proposed method achieves significant improvements over the baseline system and over other similar missing data imputation techniques.
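As a rough illustration of the two sub-problems named in the abstract, the sketch below assumes the common log-max form of the masking model, in which each noisy log-spectral feature is approximated by the larger of the speech and noise values. The function names, the toy numbers, and the crude bounded fill are hypothetical and are not taken from the paper; in particular, the fill step is only a stand-in for the statistical estimator the paper derives.

```python
def masking_model(clean_log, noise_log):
    """Assumed log-max approximation: noisy feature = max(speech, noise)."""
    noisy = [max(x, n) for x, n in zip(clean_log, noise_log)]
    # Oracle mask: True (reliable) where speech dominates the noise.
    mask = [x > n for x, n in zip(clean_log, noise_log)]
    return noisy, mask


def bounded_fill(noisy, mask, prior_mean):
    """Toy imputation (hypothetical): keep reliable features as observed.

    For unreliable features the masking model implies speech <= observation,
    so a prior guess is capped by the observed value.
    """
    return [y if m else min(p, y) for y, m, p in zip(noisy, mask, prior_mean)]


# Made-up log-spectral values for one frame with four channels.
clean = [5.0, 2.0, 7.5, 1.0]
noise = [3.0, 4.0, 6.0, 4.5]
prior = [4.0, 3.0, 6.0, 5.0]  # hypothetical speech prior means

noisy, mask = masking_model(clean, noise)
filled = bounded_fill(noisy, mask, prior)
print(noisy)   # [5.0, 4.0, 7.5, 4.5]
print(mask)    # [True, False, True, False]
print(filled)  # [5.0, 3.0, 7.5, 4.5]
```

In practice the mask is not known in advance. The oracle mask here only shows what mask estimation tries to recover, and the cap on unreliable features shows why the observation acts as an upper bound during imputation.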

Keywords

  • Speech recognition
  • Noise robustness
  • Feature compensation
  • Noise model estimation
  • Missing data imputation

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Department of Computer Science, University of Sheffield, Sheffield, UK
  2. Department of Signal Theory, Telematics and Communications, Granada, Spain
