Abstract
We investigate the use of deep neural networks and deep recurrent neural networks for the separation and recognition of speech in challenging acoustic environments. Mask-prediction networks have recently received considerable interest for speech separation and speech enhancement problems in which the background signals are nonstationary and difficult to model. Initial signal-level enhancement with deep neural networks has also been shown to benefit noise-robust speech recognition in such environments. We consider several loss functions for training the networks and illustrate the differences among them. We compare the performance of deep architectures with conventional statistical techniques as well as with variants of nonnegative matrix factorization (NMF), and show that deep-learning-based techniques achieve markedly superior results on this problem.
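As a minimal numerical sketch of the mask-prediction targets and losses discussed in the abstract (this is illustrative NumPy code under common definitions from the masking literature, not the chapter's implementation; the function names are our own):

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Ideal ratio mask: speech magnitude over the sum of source
    magnitudes in each time-frequency bin, bounded in [0, 1]."""
    return speech_mag / (speech_mag + noise_mag + eps)

def phase_sensitive_target(speech_mag, mix_mag, phase_diff, eps=1e-8):
    """Phase-sensitive mask target: |S|/|Y| * cos(theta_S - theta_Y),
    which accounts for the phase mismatch between clean speech and
    the noisy mixture."""
    return (speech_mag / (mix_mag + eps)) * np.cos(phase_diff)

def magnitude_domain_loss(pred_mask, mix_mag, target_mag):
    """Squared error between the masked mixture magnitude and the
    clean target magnitude, i.e. training *through* the mask rather
    than regressing the mask values directly."""
    return np.mean((pred_mask * mix_mag - target_mag) ** 2)
```

In practice the mask would be the output of a DNN or LSTM applied to features of the noisy spectrogram; the loss above simply shows how the signal-approximation objective differs from a direct mask-approximation objective.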
This work was largely completed when the first author was on sabbatical leave at MERL from his faculty position at Sabanci University, Istanbul.
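For reference, the NMF baselines mentioned in the abstract follow the multiplicative-update scheme of Lee and Seung (2001), factorizing a nonnegative mixture spectrogram and deriving a soft mask from per-source reconstructions. The sketch below is a generic illustration of that scheme, not the specific (sparse or discriminative) NMF variants evaluated in the chapter:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-8, seed=0):
    """Multiplicative-update NMF minimizing squared Euclidean error:
    V ~= W @ H with elementwise-nonnegative factors."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H

def soft_mask(W_speech, H_speech, W_noise, H_noise, eps=1e-8):
    """Wiener-like ratio mask built from the two source reconstructions;
    multiplying it with the mixture spectrogram yields the speech estimate."""
    S = W_speech @ H_speech
    N = W_noise @ H_noise
    return S / (S + N + eps)
```

In the supervised separation setting, the basis matrices for speech and noise are typically learned on isolated training material and held fixed, with only the activations updated on the noisy mixture.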
References
Benesty, J., Makino, S., Chen, J.: Speech Enhancement. Springer Science & Business Media, New York (2005)
Cohen, I.: Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator. IEEE Signal Process. Lett. 9(4), 113–116 (2002)
Cohen, I., Berdugo, B.: Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process. Lett. 9(1), 12–15 (2002)
Ephraim, Y., Malah, D.: Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32(6), 1109–1121 (1984)
Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985)
Erdogan, H., Hershey, J.R., Watanabe, S., Le Roux, J.: Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane (2015)
Gemmeke, J.F., Virtanen, T., Hurmalainen, A.: Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 19(7), 2067–2080 (2011)
Grais, E.M., Erdogan, H.: Single channel speech music separation using nonnegative matrix factorization and spectral masks. In: Proceedings of the International Conference on Digital Signal Processing (DSP), pp. 1–6 (2011)
Grais, E.M., Sen, M.U., Erdogan, H.: Deep neural networks for single channel source separation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence (2014)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Hu, Y., Loizou, P.C.: Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 16(1), 229–238 (2008)
Huang, P.S., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Deep learning for monaural speech separation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, pp. 1581–1585 (2014)
Kubichek, R.: Mel-cepstral distance measure for objective speech quality assessment. In: IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1, pp. 125–128 (1993)
Le Roux, J., Vincent, E., Mizuno, Y., Kameoka, H., Ono, N., Sagayama, S.: Consistent Wiener filtering: generalized time–frequency masking respecting spectrogram consistency. In: Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pp. 89–96 (2010)
Le Roux, J., Weninger, F.J., Hershey, J.R.: Sparse NMF – half-baked or well done? Technical Report, TR2015-023, Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA (2015)
Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems (NIPS), pp. 556–562 (2001)
Loizou, P.C.: Speech Enhancement: Theory and Practice. CRC Press, Boca Raton, FL (2013)
Lu, X., Tsao, Y., Matsuda, S., Hori, C.: Speech enhancement based on deep denoising autoencoder. In: Proceedings of the Interspeech, Lyon, pp. 3444–3448 (2013)
Maas, A.L., O’Neil, T.M., Hannun, A.Y., Ng, A.Y.: Recurrent neural network feature enhancement: the 2nd CHiME challenge. In: Proceedings of the CHiME Workshop on Machine Listening in Multisource Environments, Vancouver, pp. 79–80 (2013)
Narayanan, A., Wang, D.: Ideal ratio mask estimation using deep neural networks for robust speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, pp. 7092–7096 (2013)
Paul, D.B., Baker, J.M.: The design for the Wall Street Journal-based CSR corpus. In: Proceedings of the Workshop on Speech and Natural Language, pp. 357–362 (1992)
Schmidt, M.N., Olsson, R.K.: Single-channel speech separation using sparse non-negative matrix factorization. In: Proceedings of the Interspeech, Pittsburgh, PA, pp. 1652–1655 (2006)
Smaragdis, P.: Convolutive speech bases and their application to supervised speech separation. IEEE Trans. Audio Speech Lang. Process. 15(1), 1–14 (2007)
Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J.: An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011)
Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14(4), 1462–1469 (2006)
Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., Matassoni, M.: The second “CHiME” speech separation and recognition challenge: datasets, tasks and baselines. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, pp. 126–130 (2013)
Virtanen, T.: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 15(3), 1066–1074 (2007)
Wang, Z.Q., Wang, D.: A joint training framework for robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(4), 796–806 (2016)
Wang, Y., Narayanan, A., Wang, D.: On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1849–1858 (2014)
Weninger, F., Geiger, J., Wöllmer, M., Schuller, B., Rigoll, G.: The Munich feature enhancement approach to the 2013 CHiME challenge using BLSTM recurrent neural networks. In: Proceedings of the 2nd CHiME Speech Separation and Recognition Challenge held in conjunction with ICASSP 2013, Vancouver, pp. 86–90 (2013)
Weninger, F., Hershey, J.R., Le Roux, J., Schuller, B.: Discriminatively trained recurrent neural networks for single-channel speech separation. In: Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 577–581 (2014)
Weninger, F., Le Roux, J., Hershey, J., Watanabe, S.: Discriminative NMF and its application to single-channel source separation. In: Proceedings of the Interspeech, Singapore (2014)
Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J.R., Schuller, B.: Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In: Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA) (2015)
Williamson, D.S., Wang, Y., Wang, D.: Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(3), 483–492 (2016)
Xu, Y., Du, J., Dai, L.R., Lee, C.H.: An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2014)
© 2017 Springer International Publishing AG
Cite this chapter
Erdogan, H., Hershey, J.R., Watanabe, S., Le Roux, J. (2017). Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_7
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0