Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition

Gonzalez, Jose A.; Gómez, Angel M.; Peinado, Antonio M.; Ma, Ning; Barker, Jon

doi:10.1007/s00034-016-0480-7

Published: 06 January 2017

Volume 36, pages 3731–3760, (2017)

Circuits, Systems, and Signal Processing

Abstract

An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One such model, reported to achieve a good trade-off between accuracy and simplicity, is the masking model. Under this model, the speech distortion caused by environmental noise is treated as a spectral mask, so each noisy speech feature is either reliable (speech is not masked by noise) or unreliable (speech is masked). In this paper, we present a detailed overview of this model and its applications to noise robust ASR. First, using the masking model, we derive a spectral reconstruction technique for enhancing the noisy speech features. Spectral reconstruction under the masking model requires solving two problems: (1) mask estimation, i.e. determining the reliability of the noisy features, and (2) feature imputation, i.e. estimating the speech for the unreliable features. Unlike missing data imputation techniques, which treat the two problems as independent, our technique addresses them jointly by exploiting a priori knowledge of the speech and noise sources in the form of a statistical model. Second, we propose an algorithm for estimating the noise model required by the feature enhancement technique. The proposed algorithm fits a Gaussian mixture model to the noise by iteratively maximising the likelihood of the noisy speech signal, so that noise can be estimated even during speech-dominated frames. A comprehensive set of experiments carried out on the Aurora-2 and Aurora-4 databases shows that the proposed method achieves significant improvements over the baseline system and over other similar missing data imputation techniques.
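To make the joint treatment of mask estimation and imputation concrete, the following is a minimal sketch for a single log-spectral feature. It is not the authors' algorithm: it assumes the log-max form of the masking model, y = max(x, n), and replaces the speech and noise models with one Gaussian each rather than mixtures. The function names and the numeric values are hypothetical, chosen only for illustration.

```python
import math


def normal_pdf(z, mean, std):
    """Density of a univariate Gaussian at z."""
    u = (z - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))


def normal_cdf(z, mean, std):
    """Cumulative distribution of a univariate Gaussian at z."""
    return 0.5 * (1.0 + math.erf((z - mean) / (std * math.sqrt(2.0))))


def reconstruct(y, speech_mean, speech_std, noise_mean, noise_std):
    """Return (reliability, estimate) for one noisy feature y = max(x, n).

    Assumption: speech x and noise n are independent scalar Gaussians.
    The sketch does not guard against numerical underflow for extreme y.
    """
    # Evidence that speech dominates: x equals y and the noise lies below y.
    reliable = normal_pdf(y, speech_mean, speech_std) * normal_cdf(y, noise_mean, noise_std)
    # Evidence that noise dominates: n equals y and the speech lies below y.
    masked = normal_pdf(y, noise_mean, noise_std) * normal_cdf(y, speech_mean, speech_std)
    # Mask estimation: posterior probability that the feature is reliable.
    reliability = reliable / (reliable + masked)

    # Imputation: mean of the speech prior truncated to x <= y.
    alpha = (y - speech_mean) / speech_std
    truncated_mean = speech_mean - speech_std * (
        normal_pdf(alpha, 0.0, 1.0) / normal_cdf(alpha, 0.0, 1.0)
    )

    # MMSE estimate: keep y when reliable, fall back to the bounded mean otherwise.
    estimate = reliability * y + (1.0 - reliability) * truncated_mean
    return reliability, estimate


if __name__ == "__main__":
    # Hypothetical log-energy values for the speech and noise models.
    for y in (5.0, 8.0, 12.0):
        r, x_hat = reconstruct(y, speech_mean=10.0, speech_std=2.0,
                               noise_mean=5.0, noise_std=1.0)
        print(f"y={y:5.1f}  reliability={r:.3f}  estimate={x_hat:.3f}")
```

Both quantities come from the same speech and noise statistics: the reliability acts as a soft mask, and the estimate never exceeds the observation because the masked speech is bounded above by y.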

Notes

According to (2), the power spectrum of the clean speech and noise signals at a given frequency band f can exceed that of the noisy speech signal if \(\cos \theta _f < 0\), and thus, the difference \({\varvec{y}}-\max ({\varvec{x}},{\varvec{n}})\) can be negative.
Besides GMMs, other generative models can also be used to model these distributions. In particular, spectral reconstruction can benefit from more complex speech priors such as hidden Markov models (HMMs) combined with language models, as is usually done in automatic speech recognition. These priors are expected to provide more accurate estimates of the posterior distribution \(p({\varvec{x}}|{\varvec{y}})\), thus leading to better clean speech estimates.
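The first note above refers to equation (2), which is not reproduced on this page. Assuming it is the standard phase-sensitive relation between the power spectra of noisy speech, clean speech and noise, with \(\theta _f\) the phase difference between speech and noise in band f (the upper-case magnitude symbols are notation introduced here for illustration), the claim can be checked as follows:

\[
|Y_f|^2 = |X_f|^2 + |N_f|^2 + 2\,|X_f|\,|N_f|\cos \theta _f .
\]

Taking \(|X_f| \ge |N_f|\) without loss of generality, the noisy power falls below the larger of the two source powers when

\[
|Y_f|^2 < |X_f|^2 \iff |N_f|^2 + 2\,|X_f|\,|N_f|\cos \theta _f < 0 \iff \cos \theta _f < -\frac{|N_f|}{2\,|X_f|} .
\]

For example, with \(|X_f| = 2\), \(|N_f| = 1\) and \(\cos \theta _f = -1\), the relation gives \(|Y_f|^2 = 4 + 1 - 4 = 1 < 4 = |X_f|^2\), so in the log-spectral domain the corresponding component of \({\varvec{y}}-\max ({\varvec{x}},{\varvec{n}})\) equals \(\log 1 - \log 4 < 0\).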

Acknowledgements

This work was supported by the Spanish MINECO (Ministerio de Economía y Competitividad)/FEDER Project TEC2013-46690-P.

Author information

Authors and Affiliations

Department of Computer Science, University of Sheffield, Sheffield, UK
Jose A. Gonzalez, Ning Ma & Jon Barker
Department of Signal Theory, Telematics and Communications, Granada, Spain
Angel M. Gómez & Antonio M. Peinado

Corresponding author

Correspondence to Jose A. Gonzalez.

About this article

Cite this article

Gonzalez, J.A., Gómez, A.M., Peinado, A.M. et al. Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition. Circuits Syst Signal Process 36, 3731–3760 (2017). https://doi.org/10.1007/s00034-016-0480-7

Received: 05 January 2016
Revised: 13 December 2016
Accepted: 20 December 2016
Published: 06 January 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s00034-016-0480-7
