Skip to main content

Improving Automatic Speech Recognition Containing Additive Noise Using Deep Denoising Autoencoders of LSTM Networks

Part of the Lecture Notes in Computer Science book series (LNAI,volume 9811)

Abstract

Automatic speech recognition systems (ASR) suffer from performance degradation under noisy conditions. Recent work, using deep neural networks to denoise spectral input features for robust ASR, have proved to be successful. In particular, Long Short-Term Memory (LSTM) autoencoders have outperformed other state of the art denoising systems when applied to the mfcc’s of a speech signal. In this paper we also consider denoising LSTM autoencoders (DLSTMA), but instead use three different DLSTMAs and apply each to the mfcc’s, fundamental frequency, and energy features, respectively. Results are given using several kinds of additive noise at different intensity levels, and show how this collection of DLSTMA’s improves the performance of the ASR in comparison with the LSTM autoencoder.

Keywords

  • LSTM
  • Deep learning
  • Denoising autoencoders

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-43958-7_42
  • Chapter length: 8 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   89.00
Price excludes VAT (USA)
  • ISBN: 978-3-319-43958-7
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   119.99
Price excludes VAT (USA)
Fig. 1.
Fig. 2.
Fig. 3.

References

  1. Weninger, F., Watanabe, S., Tachioka, Y., Schuller, B.: Deep recurrent de-noising auto-encoder and blind de-reverberation for reverberated speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4623–4627. IEEE (2014)

    Google Scholar 

  2. Bagchi, D., Mandel, M.I., Wang, Z., He, Y., Plummer, A., Fosler-Lussier, E.: Combining spectral feature mapping and multi-channel model-based source separation for noise-robust automatic speech recognition. In: Proceedings of IEEE ASRU (2015)

    Google Scholar 

  3. Kalinli, O., Seltzer, M.L., Droppo, J., Acero, A.: Noise adaptive training for robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 18(8), 1889–1901 (2010)

    CrossRef  Google Scholar 

  4. Ishii, T., Komiyama, H., Shinozaki, T., Horiuchi, Y., Kuroiwa, S.: Reverberant speech recognition based on denoising autoencoder. In: INTERSPEECH, pp. 3512–3516 (2013)

    Google Scholar 

  5. Zhang, Z., Wang, L., Kai, A., Yamada, T., Li, W., Iwahashi, M.: Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification. EURASIP J. Audio Speech Music Process. 2015(1), 1–13 (2015)

    CrossRef  Google Scholar 

  6. Delcroix, M., Yoshioka, T., Ogawa, A., Kubo, Y., Fujimoto, M., Ito, N., Nakamura, A.: Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge. In: Proceedings of REVERB Workshop (2014)

    Google Scholar 

  7. Kawase, T., Niwa, K., Hioka, Y., Kobayashi, K.: Selection of optimal array noise reduction parameter set for accurate speech recognition in various noisy environments. In: Western Pacific Acoustics Conference (2015)

    Google Scholar 

  8. Zhao, M., Wang, D., Zhang, Z., Zhang, X.: Music removal by denoising autoencoder in speech recognition. In: APSIPA 2015 (2015)

    Google Scholar 

  9. Seltzer, M.L., Yu, D., Wang, Y.: An investigation of deep neural networks for noise robust speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7398–7402 (2013)

    Google Scholar 

  10. Du, J., Wang, Q., Gao, T., Xu, Y., Dai, L.R., Lee, C.H.: Robust speech recognition with speech enhanced deep neural networks. In: INTERSPEECH, pp. 616–620 (2014)

    Google Scholar 

  11. Han, K., He, Y., Bagchi, D., Fosler-Lussier, E., Wang, D.: Deep neural network based spectral feature mapping for robust speech recognition. In: INTERSPEECH, pp. 2484–2488 (2015)

    Google Scholar 

  12. Maas, A.L., Le, Q.V., O’Neil, T.M., Vinyals, O., Nguyen, P., Ng, A.Y.: Recurrent neural networks for noise reduction in robust ASR. In: INTERSPEECH, pp. 22–25 (2012)

    Google Scholar 

  13. Deng, L., Li, J., Huang, J.T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y.: Recent advances in deep learning for speech research at Microsoft. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8604–8608 (2013)

    Google Scholar 

  14. Geiger, J.T., Weninger, F., Gemmeke, J.F., Wollmer, M., Schuller, B., Rigoll, G.: Memory-enhanced neural networks and NMF for robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 22(6), 1037–1046 (2014)

    CrossRef  Google Scholar 

  15. Zen, H., Sak, H.: Unidirectional long short-term memory recurrent neural network with recurrent output layer for lowlatency speech synthesis. In: Submitted to ICASSP (2015)

    Google Scholar 

  16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

    CrossRef  Google Scholar 

  17. Graves, A., Navdeep, J., Abdel-Rahman, M.: Hybrid speech recognition with deep bidirectional LSTM. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (2013)

    Google Scholar 

  18. Graves, A., Fernández, S., Schmidhuber, J.: Bidirectional LSTM networks for improved phoneme classification and recognition. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 799–804. Springer, Heidelberg (2005)

    Google Scholar 

  19. Fan, Y., Qian, Y., Xie, F.L., Soong, F.K.: TTS synthesis with bidirectional LSTM based recurrent neural networks. In: Interspeech, pp. 1964–1968 (2014)

    Google Scholar 

  20. Feng, X., Zhang, Y., Glass, J.: Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1759–1763 (2014)

    Google Scholar 

  21. Kominek, J., Black, A.W.: The CMU Arctic speech databases. In: Fifth ISCA Workshop on Speech Synthesis (2004)

    Google Scholar 

  22. Speechmatics. https://www.speechmatics.com

  23. Erro, D., Sainz, I., Navas, E., Hernaez, I.: Improved HNM-based vocoder for statistical synthesizers. In: INTERSPEECH, pp. 1809–1812 (2011)

    Google Scholar 

Download references

Acknowledgements

This work was supported by the SEP and CONACyT under the Program SEP-CONACyT, CB-2012-01, No. 182432, in Mexico, as well as the University of Costa Rica in Costa Rica.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marvin Coto-Jiménez .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Coto-Jiménez, M., Goddard-Close, J., Martínez-Licona, F. (2016). Improving Automatic Speech Recognition Containing Additive Noise Using Deep Denoising Autoencoders of LSTM Networks. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science(), vol 9811. Springer, Cham. https://doi.org/10.1007/978-3-319-43958-7_42

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-43958-7_42

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43957-0

  • Online ISBN: 978-3-319-43958-7

  • eBook Packages: Computer ScienceComputer Science (R0)