
Noise and Speech Estimation as Auxiliary Tasks for Robust Speech Recognition

  • Gueorgui Pironkov
  • Stéphane Dupont
  • Sean U. N. Wood
  • Thierry Dutoit
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10583)

Abstract

Noise that degrades speech remains a major problem for automatic speech recognition. An interesting approach to tackling this problem is multi-task learning, where an efficient auxiliary task is clean-speech generation. This auxiliary task is trained alongside the main speech recognition task, with the goal of improving the main task's results. In this paper, we investigate this idea further by generating features extracted directly from the audio file containing only the noise, instead of the clean speech. After demonstrating that an improvement can be obtained with this multi-task learning auxiliary task, we also show that using both the noise and clean-speech estimation auxiliary tasks leads to a 4% relative word error rate improvement over classic single-task learning on the CHiME4 dataset.
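A minimal sketch of the multi-task setup described above, assuming a simple feed-forward acoustic model in PyTorch: a shared trunk feeds one main senone-classification head and two auxiliary regression heads that estimate clean-speech and noise features. The layer sizes, senone count, and loss weights (lambda_clean, lambda_noise) are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    """Shared trunk with a main head (senone classification) and two
    auxiliary heads (clean-speech and noise feature estimation).
    Dimensions are illustrative placeholders."""
    def __init__(self, feat_dim=40, hidden_dim=1024, num_senones=2000):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.asr_head = nn.Linear(hidden_dim, num_senones)   # main task
        self.clean_head = nn.Linear(hidden_dim, feat_dim)    # auxiliary: clean-speech features
        self.noise_head = nn.Linear(hidden_dim, feat_dim)    # auxiliary: noise features

    def forward(self, noisy_feats):
        h = self.shared(noisy_feats)
        return self.asr_head(h), self.clean_head(h), self.noise_head(h)

def multitask_loss(model, noisy_feats, senone_targets, clean_feats, noise_feats,
                   lambda_clean=0.1, lambda_noise=0.1):
    """Weighted sum of the main cross-entropy loss and the two auxiliary
    mean-squared-error losses; the lambda weights are hypothetical."""
    asr_out, clean_out, noise_out = model(noisy_feats)
    loss = nn.functional.cross_entropy(asr_out, senone_targets)
    loss = loss + lambda_clean * nn.functional.mse_loss(clean_out, clean_feats)
    loss = loss + lambda_noise * nn.functional.mse_loss(noise_out, noise_feats)
    return loss
```

At test time only the main head is used; the auxiliary heads act as regularizers that push the shared layers toward noise-aware representations.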

Keywords

Speech recognition · Multi-task learning · Robust ASR · Noise estimation · CHiME4

Notes

Acknowledgments

This work has been partly funded by the Walloon Region of Belgium through the SPW-DGO6 Wallinov Program no. 1610152.


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Gueorgui Pironkov (1)
  • Stéphane Dupont (1)
  • Sean U. N. Wood (2)
  • Thierry Dutoit (1)

  1. Circuit Theory and Signal Processing Lab, University of Mons, Mons, Belgium
  2. NECOTIS, Department of Electrical and Computer Engineering, University of Sherbrooke, QC, Canada
