Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition

Circuits, Systems, and Signal Processing

Abstract

An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One such distortion model, reported to achieve a good trade-off between accuracy and simplicity, is the masking model. Under this model, speech distortion caused by environmental noise is seen as a spectral mask and, as a result, noisy speech features can be either reliable (speech is not masked by noise) or unreliable (speech is masked). In this paper, we present a detailed overview of this model and its applications to noise robust ASR. First, using the masking model, we derive a spectral reconstruction technique aimed at enhancing the noisy speech features. Two problems must be solved in order to perform spectral reconstruction using the masking model: (1) mask estimation, i.e. determining the reliability of the noisy features, and (2) feature imputation, i.e. estimating speech for the unreliable features. Unlike missing data imputation techniques, which treat the two problems as independent, our technique addresses them jointly by exploiting a priori knowledge of the speech and noise sources in the form of a statistical model. Second, we propose an algorithm for estimating the noise model required by the feature enhancement technique. The proposed algorithm fits a Gaussian mixture model to the noise by iteratively maximising the likelihood of the noisy speech signal, so that noise can be estimated even during speech-dominated frames. A comprehensive set of experiments carried out on the Aurora-2 and Aurora-4 databases shows that the proposed method achieves significant improvements over the baseline system and other similar missing data imputation techniques.
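To make the joint treatment of mask estimation and imputation concrete, the following is a minimal sketch for a single log-spectral bin. It assumes the masking model takes the log-max form y = max(x, n) and, as a simplification, uses one Gaussian prior each for speech and noise rather than the mixture models described in the abstract. The function names and parameters (`reconstruct_bin`, `mu_x`, `sigma_x`, `mu_n`, `sigma_n`) are hypothetical and are not taken from the paper.

```python
import math


def norm_pdf(v, mu, sigma):
    """Gaussian density N(v; mu, sigma^2)."""
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def norm_cdf(v, mu, sigma):
    """Gaussian cumulative distribution P(V <= v)."""
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))


def reconstruct_bin(y, mu_x, sigma_x, mu_n, sigma_n):
    """Return (reliability, x_hat) for one bin, assuming y = max(x, n).

    Assumption: single Gaussian priors for speech x and noise n.
    """
    # Case 1: speech dominates, so x = y and the noise lies below y.
    speech_dominates = norm_pdf(y, mu_x, sigma_x) * norm_cdf(y, mu_n, sigma_n)
    # Case 2: noise dominates, so n = y and the speech lies below y.
    noise_dominates = norm_pdf(y, mu_n, sigma_n) * norm_cdf(y, mu_x, sigma_x)
    # Soft mask: posterior probability that the observation is reliable.
    reliability = speech_dominates / (speech_dominates + noise_dominates)

    # Mean of the speech prior truncated to (-inf, y], used when masked.
    truncated_mean = mu_x - sigma_x ** 2 * norm_pdf(y, mu_x, sigma_x) / norm_cdf(
        y, mu_x, sigma_x
    )

    # MMSE estimate: keep y when reliable, impute from the prior otherwise.
    x_hat = reliability * y + (1.0 - reliability) * truncated_mean
    return reliability, x_hat


if __name__ == "__main__":
    # Observation close to the speech prior and far above the noise prior.
    print(reconstruct_bin(10.0, mu_x=10.0, sigma_x=1.0, mu_n=0.0, sigma_n=1.0))
    # Observation close to the noise prior and far above the speech prior.
    print(reconstruct_bin(10.0, mu_x=0.0, sigma_x=1.0, mu_n=10.0, sigma_n=1.0))
```

In this sketch the soft mask and the imputed value come out of the same posterior, so the estimate never exceeds the observation y. A mixture-based version would, under the same assumptions, repeat this computation per pair of speech and noise components and weight the results by the component posteriors.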

Notes

  1. According to (2), the power spectrum of the clean speech and noise signals at a given frequency band f can exceed that of the noisy speech signal if \(\cos \theta _f < 0\), and thus, the difference \({\varvec{y}}-\max ({\varvec{x}},{\varvec{n}})\) can be negative.

  2. Besides GMMs, other generative models can also be used for modelling these distributions. In particular, spectral reconstruction can benefit from more complex speech priors such as hidden Markov models (HMMs) combined with language models, as is usually done in automatic speech recognition. These priors are expected to provide more accurate estimates of the posterior distribution \(p({\varvec{x}}|{\varvec{y}})\), thus leading to better clean speech estimates.
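The sign argument in Note 1 can be checked numerically. Equation (2) is not reproduced on this page, so the sketch below assumes it is the usual phase-sensitive relation in the power domain, |Y|² = |X|² + |N|² + 2|X||N| cos θ, with the log of each power standing in for the features. The function names are hypothetical.

```python
import math


def noisy_power(speech_power, noise_power, cos_theta):
    """Noisy power under the assumed phase-sensitive relation."""
    cross_term = 2.0 * math.sqrt(speech_power * noise_power) * cos_theta
    return speech_power + noise_power + cross_term


def log_residual(speech_power, noise_power, cos_theta):
    """Log-domain difference y - max(x, n) for one frequency band."""
    y = math.log(noisy_power(speech_power, noise_power, cos_theta))
    x = math.log(speech_power)
    n = math.log(noise_power)
    return y - max(x, n)


if __name__ == "__main__":
    # Destructive interference (cos(theta) < 0): the residual can be negative.
    print(log_residual(4.0, 1.0, -0.5))
    # Uncorrelated phases (cos(theta) = 0): the residual is positive.
    print(log_residual(4.0, 1.0, 0.0))
```

With speech power 4, noise power 1 and cos θ = -0.5, the noisy power is 3, which is below the speech power, so the residual is negative as the note describes.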

Acknowledgements

This work was supported by the Spanish MINECO (Ministerio de Economía y Competitividad)/FEDER Project TEC2013-46690-P.

Author information

Correspondence to Jose A. Gonzalez.

About this article

Cite this article

Gonzalez, J.A., Gómez, A.M., Peinado, A.M. et al. Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition. Circuits Syst Signal Process 36, 3731–3760 (2017). https://doi.org/10.1007/s00034-016-0480-7

  • DOI: https://doi.org/10.1007/s00034-016-0480-7
