Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR

  • Felix Weninger
  • Hakan Erdogan
  • Shinji Watanabe
  • Emmanuel Vincent
  • Jonathan Le Roux
  • John R. Hershey
  • Björn Schuller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9237)

Abstract

We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used ‘naïvely’ as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76 % average word error rate (WER), which is, to our knowledge, the best score reported on this task to date.
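For illustration, the following is a minimal sketch (not the authors' implementation) of the kind of LSTM enhancement front-end the abstract describes: an LSTM estimates a time-frequency mask from noisy magnitude spectra and is trained with a speech reconstruction (signal approximation) objective. It assumes PyTorch; all layer sizes and the helper names `LSTMEnhancer` and `signal_approximation_loss` are illustrative.

```python
# Minimal sketch of an LSTM mask-based speech enhancement front-end,
# trained with a signal-approximation (speech reconstruction) objective.
# Hypothetical configuration; not the authors' code.
import torch
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    def __init__(self, n_freq=201, hidden=256, layers=2):
        super().__init__()
        # LSTM over frames of the noisy magnitude spectrogram
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):
        # noisy_mag: (batch, time, freq) noisy magnitude spectrogram
        h, _ = self.lstm(noisy_mag)
        m = torch.sigmoid(self.mask(h))   # time-frequency mask in [0, 1]
        return m * noisy_mag              # estimated clean magnitude

def signal_approximation_loss(est_mag, clean_mag):
    # MSE between the masked noisy magnitude and the clean reference,
    # i.e. a speech reconstruction objective as described in the abstract.
    return torch.mean((est_mag - clean_mag) ** 2)

if __name__ == "__main__":
    model = LSTMEnhancer()
    noisy = torch.rand(4, 100, 201)       # dummy batch: 4 utterances, 100 frames
    clean = torch.rand(4, 100, 201)
    loss = signal_approximation_loss(model(noisy), clean)
    loss.backward()                       # gradients flow through mask and LSTM
    print(float(loss))
```

In such a setup the enhanced magnitudes (or features derived from them) would then be passed to the ASR back-end, either directly as front-end processing or fused with the original noisy features at the feature level, as the abstract proposes.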

Keywords

Recurrent Neural Network · Automatic Speech Recognition · Source Separation · Deep Neural Network · Speech Enhancement

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Felix Weninger (1)
  • Hakan Erdogan (2, 3)
  • Shinji Watanabe (2)
  • Emmanuel Vincent (4)
  • Jonathan Le Roux (2)
  • John R. Hershey (2)
  • Björn Schuller (5)
  1. Machine Intelligence and Signal Processing Group, TUM, Munich, Germany
  2. Mitsubishi Electric Research Laboratories, Cambridge, USA
  3. Sabanci University, Istanbul, Turkey
  4. Inria, Villers-lès-Nancy, France
  5. Department of Computing, Imperial College London, London, UK