Multimedia Tools and Applications, Volume 78, Issue 21, pp. 30749–30767

Adaptive recognition of different accents conversations based on convolutional neural network

  • Jiang Zhong
  • Pan Zhang
  • Xue Li


In this paper, an adaptive recognition method for conversations with different accents is proposed based on a Convolutional Neural Network (CNN). It addresses dialogue speech recognition problems involving multiple accents in the call-center environment. For the first time, the Mel-Frequency Cepstral Coefficients (MFCC) feature and the spectrogram feature are combined as the input of the CNN to train the speakers' voice feature model and to estimate speaker change points. Then, an accent classification method based on a weighted fusion feature is proposed, and the iFly speech recognition system is introduced to build accent-specific dialogue recognition models based on speaker segmentation. In the experiments, a real database of dialogue recordings from the insurance sales and real-estate sales industries is used as the dataset. Comparative experiments show that the word error rate of speech recognition after speaker segmentation and accent classification is reduced by 20% compared to the original word error rate.
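The combined-feature idea described above — extracting MFCCs and a spectrogram from the same audio and stacking them frame-wise as one input matrix — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the frame length, hop size, FFT size, and mel filter count are illustrative choices.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping Hamming-windowed frames."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hamming(frame_len)

def power_spectrogram(frames, n_fft):
    """Per-frame power spectrum via the real FFT."""
    return np.abs(np.fft.rfft(frames, n_fft)) ** 2

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-scale filters mapping FFT bins to mel bands."""
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(0, hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(power_spec, fb, n_coeffs=13):
    """MFCCs: log mel-filterbank energies followed by a DCT-II."""
    log_mel = np.log(power_spec @ fb.T + 1e-10)
    n = fb.shape[0]
    dct = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n_coeffs)[None, :])
    return log_mel @ dct

# Synthetic one-second 440 Hz tone standing in for call-center audio.
sr, frame_len, hop, n_fft = 16000, 400, 160, 512
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)

frames = frame_signal(sig, frame_len, hop)
pspec = power_spectrogram(frames, n_fft)          # (n_frames, n_fft//2 + 1)
m = mfcc(pspec, mel_filterbank(26, n_fft, sr))    # (n_frames, 13)

# Combined feature: MFCCs and log-spectrogram stacked along the feature axis.
combined = np.hstack([m, np.log(pspec + 1e-10)])
print(combined.shape)  # → (98, 270), i.e. (n_frames, 13 + n_fft//2 + 1)
```

Each row of `combined` is one time frame; a sequence of such rows forms the two-dimensional input a CNN could consume, analogous to the paper's joint MFCC-plus-spectrogram input.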


Keywords: Combined feature, Speaker segmentation, Accent classification, Speech recognition



The authors acknowledge the National Key Research and Development Program of China (Grant No. 2017YFB1402400), the National High Technology Research and Development Program of China (Grant No. 2015AA015308), the Social Undertakings and Livelihood Security Science and Technology Innovation Funds of CQ CSTC (Grant No. cstc2017shmsA0641), and the National Natural Science Foundation of China (Grant No. 61762025).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. College of Computer Science, Chongqing University, Chongqing, China
  2. Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing University, Chongqing, China
  3. China United Network Communications Co., Ltd., Xi'an Branch, Xi'an, China
  4. School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, Australia
