The Journal of Supercomputing, Volume 72, Issue 5, pp 1757–1769

Speaker diarization method of telemarketer and client for improving speech dictation performance

  • Dahae Jung
  • Min-Kyoung Bae
  • Man Yong Choi
  • Eui Chul Lee
  • Jinoo Joung


Financial institutions employ speech dictation systems that convert recorded conversations between a telemarketer and a client into text. Dictation is necessary for detecting incomplete sales, in which a telemarketer fails to provide important sales information to a client. However, manual dictation takes too much time and effort, so automatic speech dictation systems are being adopted as an alternative. We suggest that, in such an automatic system, speaker diarization be performed prior to speech recognition. In this paper, we propose a diarization method based on pitch detection, which is well suited to the given condition in which two speakers, a telemarketer and a client, converse in a telephone recording. The method is based on an average short-time spectral feature and an unsupervised learning scheme. In the experiments, actual telephone recordings made for insurance contracts were used. We obtained an average Diarization Error Rate (DER) of about 6%.
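The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes an autocorrelation pitch estimator and a one-dimensional two-cluster split as stand-ins for the pitch detection and unsupervised learning steps, and it uses a simplified label error rate in place of the full DER, which is time-weighted and also counts missed speech and false alarms. All function names and parameter values here are hypothetical.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the pitch (Hz) of one frame from its autocorrelation peak."""
    frame = frame - np.mean(frame)
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / lag

def two_means_1d(values, iterations=20):
    """Unsupervised split of scalar features into two clusters (labels 0/1)."""
    values = np.asarray(values, dtype=float)
    centers = np.array([values.min(), values.max()])
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iterations):
        # Assign each value to the nearer of the two centers.
        closer_to_1 = np.abs(values - centers[1]) < np.abs(values - centers[0])
        labels = closer_to_1.astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels

def label_error_rate(reference, hypothesis):
    """Fraction of mislabeled segments under the better of the two label mappings."""
    reference = np.asarray(reference)
    hypothesis = np.asarray(hypothesis)
    direct = np.mean(reference != hypothesis)
    swapped = np.mean(reference != 1 - hypothesis)
    return float(min(direct, swapped))

# Synthetic demo: six 40 ms "segments", each a pure tone at an assumed pitch.
sample_rate = 8000  # assumed telephone-band sampling rate
t = np.arange(int(0.04 * sample_rate)) / sample_rate
true_f0 = [120, 125, 210, 118, 220, 215]  # Hz, made-up values for two speakers
reference = [0, 0, 1, 0, 1, 1]            # made-up ground-truth speaker labels
pitches = [estimate_pitch(np.sin(2 * np.pi * f * t), sample_rate) for f in true_f0]
hypothesis = two_means_1d(pitches)
error = label_error_rate(reference, hypothesis)
```

Taking the minimum over the two label mappings reflects that an unsupervised method cannot know which cluster is the telemarketer and which is the client; only the partition is scored.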


Keywords: Speaker diarization, Telemarketing, Unsupervised learning, Speech dictation



This work was supported by the National Agenda Project (NAP) of the Korea Research Council of Fundamental Science & Technology. It was also supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2015-H8501-15-1014) supervised by the IITP (Institute for Information & communications Technology Promotion).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Dahae Jung (1)
  • Min-Kyoung Bae (1)
  • Man Yong Choi (2)
  • Eui Chul Lee (1)
  • Jinoo Joung (1, corresponding author)

  1. Department of Computer Science, Sangmyung University, Jongno-Gu, Republic of Korea
  2. Safety Measurement Center, Korea Research Institute of Standards and Science, Yuseong-gu, Republic of Korea
