
Distant-talking accent recognition by combining GMM and DNN

Published in: Multimedia Tools and Applications


Recently, automatic accent recognition has received increasing attention. However, little research has focused on accent recognition in distant-talking environments, which is important for improving distant-talking speech recognition performance with non-native accents. In this paper, we apply Gaussian Mixture Models (GMM) and a Deep Neural Network (DNN) to identify speaker accent in reverberant environments. We also propose combining the likelihoods of these two approaches. In a reverberant environment, the accent recognition rate improved from 90.7 % with the GMM to 93.0 % with the DNN. The combination of GMM and DNN achieved a recognition rate of 97.5 %, outperforming either model alone because the two approaches are complementary. The relative error reduction is 73.1 % over the GMM-based method and 64.3 % over the DNN-based method, respectively.
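The abstract describes score-level combination of GMM and DNN likelihoods. The paper's exact fusion formula is not given here, so the sketch below shows one common way to combine the two model scores: normalize each model's per-class score vector and take a weighted sum, then pick the accent class with the highest fused score. The weight `alpha` and the normalization step are assumptions for illustration, not the authors' method.

```python
import numpy as np

def fuse_scores(gmm_loglik, dnn_logpost, alpha=0.5):
    """Linearly interpolate GMM log-likelihoods and DNN log-posteriors,
    one score per accent class. Each vector is standardized first so the
    two models contribute on a comparable scale."""
    gmm = np.asarray(gmm_loglik, dtype=float)
    dnn = np.asarray(dnn_logpost, dtype=float)
    gmm = (gmm - gmm.mean()) / (gmm.std() + 1e-12)
    dnn = (dnn - dnn.mean()) / (dnn.std() + 1e-12)
    return alpha * gmm + (1.0 - alpha) * dnn

# Toy example with three accent classes: hypothetical GMM log-likelihoods
# and DNN log-posteriors for one utterance.
combined = fuse_scores([-120.0, -118.5, -125.0], [-1.2, -0.3, -2.1], alpha=0.5)
predicted_accent = int(np.argmax(combined))  # index of the winning accent class
```

Here both models favor class 1, so the fused decision is class 1 as well; in practice `alpha` would be tuned on a development set.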






Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 15K16020 and a research grant from the Telecommunications Advancement Foundation (TAF), Japan.

Author information


Corresponding author

Correspondence to Longbiao Wang.


About this article


Cite this article

Phapatanaburi, K., Wang, L., Sakagami, R. et al. Distant-talking accent recognition by combining GMM and DNN. Multimed Tools Appl 75, 5109–5124 (2016).
