Improved Speaker Adaptation by Combining I-vector and fMLLR with Deep Bottleneck Networks

  • Thai Son Nguyen
  • Kevin Kilgour
  • Matthias Sperber
  • Alex Waibel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)

Abstract

This paper investigates how deep bottleneck neural networks can be used to combine the benefits of both i-vectors and speaker-adaptive feature transformations. We show how a GMM-based speech recognizer can be greatly improved by applying a feature-space maximum likelihood linear regression (fMLLR) transformation to the outputs of a deep bottleneck neural network trained on a concatenation of regular Mel filterbank features and speaker i-vectors. Adding the i-vectors reduces the word error rate of the GMM system by 3–7% compared to an identical system without i-vectors. We also examine Deep Neural Network (DNN) systems trained on various combinations of i-vectors, fMLLR-transformed bottleneck features, and other feature-space transformations. The best approach results in speaker-adapted DNNs that show a 15–19% relative improvement over a strong speaker-independent DNN baseline.
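As a minimal sketch of the two adaptation steps the abstract describes (not the authors' implementation), the NumPy snippet below appends a speaker i-vector to every filterbank frame before a bottleneck network and then applies fMLLR, a per-speaker affine transform x' = Ax + b, to the bottleneck outputs. The layer weights, the i-vector extractor, and the maximum-likelihood estimate of (A, b) are assumed to come from a trained system; the helper names extract_bottleneck and apply_fmllr are hypothetical.

```python
import numpy as np

def extract_bottleneck(fbank, ivector, layers):
    """Forward a [T x D] filterbank matrix, with the speaker i-vector
    appended to every frame, through pre-trained feed-forward layers
    and return the bottleneck-layer activations."""
    T = fbank.shape[0]
    x = np.hstack([fbank, np.tile(ivector, (T, 1))])  # frame-wise concatenation
    for W, b in layers:           # (weight, bias) pairs up to the bottleneck
        x = np.tanh(x @ W + b)    # tanh hidden units assumed for illustration
    return x

def apply_fmllr(features, A, b):
    """fMLLR: per-speaker affine transform x' = Ax + b, with (A, b)
    estimated by maximum likelihood against the GMM (Gales, 1998)."""
    return features @ A.T + b

# Toy usage: 300 frames of 40-dim log-Mel features, a 100-dim i-vector,
# and a single layer mapping the 140-dim input to a 42-dim bottleneck.
rng = np.random.default_rng(0)
fbank = rng.standard_normal((300, 40))
ivec = rng.standard_normal(100)
layers = [(0.1 * rng.standard_normal((140, 42)), np.zeros(42))]
bn = extract_bottleneck(fbank, ivec, layers)
A, b = np.eye(42), np.zeros(42)  # identity stand-in for the ML estimate
adapted = apply_fmllr(bn, A, b)
print(adapted.shape)  # (300, 42): speaker-adapted bottleneck features
```

In the paper's setup these adapted bottleneck features would then serve as inputs to the GMM or DNN acoustic model.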

Keywords

DNN · fMLLR · I-vector · Bottleneck extraction


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Thai Son Nguyen
  • Kevin Kilgour
  • Matthias Sperber
  • Alex Waibel

  1. Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
