Incomplete-Data-Driven Speaker Segmentation for Diarization Application: A Help-Training Approach

  • Farshad Teimoori
  • Farbod Razzazi


This paper presents a new segmentation method for the diarization application. The method is built on a support vector regression (SVR)-based discriminative engine that is primarily responsible for estimating the most probable change points. This engine is aided by a generative classifier in a help-training approach. Since no pre-labeled training samples are available in a segmentation task, the proposed model-based segmentation method offers a solution to this obstacle. The introduced iterative method assumes that the initial frames of a given segment belong to the associated speaker. This hypothesis allows the SVR engine to be initialized in the first iteration. In subsequent iterations, the discriminative regression block, in conjunction with the generative classifier, tags the remaining frames with advantageous (positive) and disadvantageous (negative) labels. These newly labeled frames form the working set used to update the associated speaker model. In addition to the proposed segmentation method, a new strategy is introduced to estimate inserted and deleted change points. The evaluation section, beyond the common experimental assessment, aims to provide a comprehensive insight into the statistical aspects of choosing training samples. Finally, a comparison of the proposed segmentation and diarization system with a similar method shows an approximately 22.95% improvement in performance.
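The iterative scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the LS-SVR engine is approximated by scikit-learn's KernelRidge with an RBF kernel, the generative helper by a GaussianMixture, and all data, thresholds, and variable names are invented for the example.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D "frames": speaker A around (0, 0), speaker B around (3, 3).
frames = np.vstack([rng.normal(0.0, 0.5, (40, 2)),
                    rng.normal(3.0, 0.5, (40, 2))])

# The abstract's hypothesis: the initial frames of the segment belong to the
# associated speaker (+1); the segment's tail serves as tentative -1 frames.
labeled = list(range(10)) + list(range(70, 80))
y = np.array([1.0] * 10 + [-1.0] * 10)
unlabeled = list(range(10, 70))

for _ in range(5):  # iterative help-training loop
    X_lab = frames[labeled]
    svr = KernelRidge(kernel="rbf", gamma=0.5).fit(X_lab, y)  # LS-SVR stand-in
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X_lab)
    # Map each generative component to the label it mostly covers.
    comp = gmm.predict(X_lab)
    comp_sign = np.array([np.sign(y[comp == c].sum()) for c in range(2)])
    if not unlabeled:
        break
    scores = svr.predict(frames[unlabeled])
    helper = comp_sign[gmm.predict(frames[unlabeled])]
    # Help-training: accept a frame only when the discriminative and
    # generative engines agree and the regression score is confident.
    accept = (np.sign(scores) == helper) & (np.abs(scores) > 0.3)
    if not accept.any():
        break
    y = np.concatenate([y, np.sign(scores[accept])])
    newly = {unlabeled[i] for i in np.flatnonzero(accept)}
    labeled += sorted(newly)
    unlabeled = [i for i in unlabeled if i not in newly]

# A change point is suggested where the final predicted labels flip sign.
pred = np.sign(svr.predict(frames))
change_point = int(np.argmax(pred[:-1] * pred[1:] < 0)) + 1
print(change_point)
```

On this toy data the sign flip lands at the true speaker boundary (frame 40); the paper's actual engine, features, and confidence criteria differ, and this sketch only conveys the co-labeling structure of the loop.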


Speaker diarization · Unsupervised segmentation · LS-SVR · Online speech segmentation · Help-training approach



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
