Multimedia Tools and Applications

Volume 75, Issue 9, pp 5109–5124

Distant-talking accent recognition by combining GMM and DNN

  • Khomdet Phapatanaburi
  • Longbiao Wang
  • Ryota Sakagami
  • Zhaofeng Zhang
  • Ximin Li
  • Masahiro Iwahashi


Recently, automatic accent recognition has received increasing attention. However, few studies have focused on accent recognition in distant-talking environments, which is important for improving distant-talking speech recognition performance with non-native accents. In this paper, we apply Gaussian Mixture Models (GMM) and a Deep Neural Network (DNN) to identify speaker accent in reverberant environments. We also propose combining the likelihoods of these two approaches. In a reverberant environment, the accent recognition rate improved from 90.7 % with the GMM to 93.0 % with the DNN. The combination of GMM and DNN achieved a recognition rate of 97.5 %, outperforming both individual models because the two approaches are complementary. The relative error reductions are 73.1 % over the GMM-based method and 64.3 % over the DNN-based method, respectively.
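The combination described above can be illustrated with a minimal score-level fusion sketch. This is a hypothetical toy example, not the paper's implementation: it uses single diagonal Gaussians in place of trained per-accent GMMs, a fixed stand-in vector for the DNN's accent posteriors, and an assumed interpolation weight `alpha`.

```python
# Hypothetical sketch of score-level fusion of a GMM stream and a DNN stream
# for accent classification. All names, data, and the weight alpha are
# illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy two-accent data: 2-D "features" per frame (stand-ins for MFCCs).
X0 = rng.normal(loc=-1.0, size=(200, 2))
X1 = rng.normal(loc=+1.0, size=(200, 2))

def fit_gaussian(X):
    # Single diagonal Gaussian per accent, standing in for a full GMM.
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(X, mean, var):
    # Utterance-level score: sum of per-frame diagonal-Gaussian log-densities.
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var))

models = [fit_gaussian(X0), fit_gaussian(X1)]

def classify(utterance, dnn_logpost, alpha=0.5):
    # GMM stream: per-accent log-likelihoods, normalised to log-posteriors.
    gmm_scores = np.array([log_likelihood(utterance, m, v) for m, v in models])
    gmm_logpost = gmm_scores - np.logaddexp.reduce(gmm_scores)
    # Fuse the GMM and DNN log-posteriors with interpolation weight alpha.
    fused = alpha * gmm_logpost + (1 - alpha) * dnn_logpost
    return int(np.argmax(fused))

test = rng.normal(loc=+1.0, size=(50, 2))   # frames drawn near accent 1
dnn_logpost = np.log(np.array([0.3, 0.7]))  # pretend DNN accent posteriors
print(classify(test, dnn_logpost))          # prints 1
```

In a real system each stream would be an actual GMM (e.g. EM-trained with many mixtures) and a DNN trained on accent labels; the fusion step, a weighted sum of the two streams' log-scores, stays the same.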


Keywords: Accent recognition · GMM · DNN · Distant-talking speech · Machine learning



This work was supported by JSPS KAKENHI Grant Number 15K16020 and a research grant from the Telecommunications Advancement Foundation (TAF), Japan.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Khomdet Phapatanaburi (1)
  • Longbiao Wang (1)
  • Ryota Sakagami (1)
  • Zhaofeng Zhang (1)
  • Ximin Li (2)
  • Masahiro Iwahashi (1)
  1. Nagaoka University of Technology, Nagaoka, Japan
  2. Xiamen Kuaishangtong Information Technology Co., Ltd, Xiamen, China
