Speech Enhancement for Speaker Recognition Using Deep Recurrent Neural Networks

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


This paper describes a speech denoising system based on long short-term memory (LSTM) neural networks. The network architecture is designed to perform speech enhancement in the spectrogram magnitude domain. Audio resynthesis is carried out via the inverse short-time Fourier transform, retaining the original phase. Objective quality is assessed by the root mean square error between clean and denoised audio signals on the CHiME corpus, and by the speaker verification rate on the RSR2015 corpus. The proposed system demonstrates improved results on both metrics.
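The processing pipeline described in the abstract — enhance the STFT magnitude, then resynthesize with the noisy signal's original phase via the inverse STFT — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `denoise_magnitude` callable stands in for the paper's LSTM regression model, and the frame length, hop size, and Hann window are assumed parameters.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Short-time Fourier transform with a Hann analysis window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len=512, hop=128):
    """Inverse STFT via weighted overlap-add (cf. Crochiere, 1980)."""
    window = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * window
    out = np.zeros((spec.shape[0] - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
        norm[i * hop:i * hop + frame_len] += window ** 2
    # Normalize by the summed squared window to undo analysis/synthesis windowing.
    return out / np.maximum(norm, 1e-8)

def enhance(noisy, denoise_magnitude):
    """Denoise in the magnitude domain while keeping the original phase."""
    spec = stft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    # Placeholder for the paper's LSTM network, which maps noisy
    # magnitude spectrograms to clean ones.
    clean_mag = denoise_magnitude(mag)
    return istft(clean_mag * np.exp(1j * phase))
```

With an identity `denoise_magnitude`, the pipeline reconstructs the input almost exactly away from the signal edges, confirming that any residual distortion comes from the magnitude model rather than the analysis/synthesis chain.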


Keywords: Noise suppression · Denoising · Speech restoration · LSTM · Neural networks · Speaker verification


References

  1. Lukin, A., Todd, J.: Suppression of musical noise artifacts in audio noise reduction by adaptive 2D filtering. In: Audio Engineering Society Convention 123 (2007)
  2. Valin, J.-M.: Speex: a free codec for free speech. In: Conference (2006)
  3. Liu, D., Smaragdis, P., Kim, M.: Experiments on deep learning for speech denoising. In: 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), Singapore, pp. 2685–2689 (2014)
  4. Feng, X., Zhang, Y., Glass, J.: Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1759–1763. IEEE Press, Italy (2014)
  5. Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Roux, J., Hershey, J.R., Schuller, B.: Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In: 12th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Liberec, Czech Republic (2015)
  6. Sun, L., Kang, S., Li, K., Meng, H.: Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Queensland, Australia (2015)
  7. Mimura, M., Sakai, S., Kawahara, T.: Speech dereverberation using long short-term memory. In: 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dresden, Germany, pp. 2435–2439 (2015)
  8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  9. Xu, Y., Du, J., Dai, L.-R.: A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 7–19 (2015)
  10. Keras: Deep Learning library for Theano and TensorFlow.
  11. Hinton, G., Deng, L., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for acoustic modeling in speech recognition. IEEE Sig. Process. Mag. 29(6), 82–97 (2012)
  12. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
  13. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: 32nd International Conference on Machine Learning (ICML), Lille, France (2015)
  14. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
  15. Crochiere, R.: A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust. Speech Sig. Process. ASSP-28, 99–102 (1980)
  16. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations (ICLR), San Diego (2015)
  17. Christensen, H., Barker, J., Ma, N., Green, P.: The CHiME corpus: a resource and a challenge for computational hearing in multisource environments. In: 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), Chiba, Japan (2010)
  18. Larcher, A., Lee, K.A., Ma, B., Li, H.: RSR2015: database for text-dependent speaker verification using multiple pass-phrases. In: 13th Annual Conference of the International Speech Communication Association (INTERSPEECH), Portland, Oregon, USA (2012)
  19. Ferras, M., Madikeri, S., Motlicek, P., Dey, S., Bourlard, H.: A large-scale open-source acoustic simulator for speaker recognition. IEEE Sig. Process. Lett. 23(4), 527–531 (2016)
  20. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19, 788–798 (2011)
  21. Prince, S.J., Elder, J.H.: Probabilistic linear discriminant analysis for inferences about identity. In: 11th International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, pp. 1–8 (2007)
  22. Testarium. Research tool.
  23. Denoising examples.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. ASM Solutions LLC, Moscow, Russia
  2. Lomonosov Moscow State University, Moscow, Russia
  3. Master Synthesis LLC, Moscow, Russia
