International Journal of Speech Technology, Volume 19, Issue 4, pp 945–963

Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering

  • V. Subba Ramaiah
  • R. Rajeswara Rao


Speaker diarization is the process of determining “who spoke when?”, that is, assigning speaker labels to the time regions in which each speaker was active. In previous work, a model-based speaker diarization system was developed that used tangential weighted Mel frequency cepstral coefficients as the feature parameters for voice activity detection and the Lion optimization algorithm for clustering the audio stream into speaker groups. In this paper, a speaker diarization system is proposed using multiple kernel weighted Mel frequency cepstral coefficient (MKMFCC) parameterization and Wu-and-Li Index (WLI)-fuzzy clustering. First, MKMFCC, which uses multiple kernels such as the tangential and exponential kernels to weight the MFCCs, is proposed for feature parameterization. Second, a clustering algorithm called WLI-fuzzy clustering is proposed for grouping the segments that belong to the same speaker. The proposed system is evaluated on the publicly available ELSDSR corpus, using audio with seven different speakers. Performance is analysed using the diarization error rate, F-measure and false alarm rate. The results show that the proposed system performs better at tracking the active speakers among multiple speakers, with improved tracking accuracy.
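As an illustration of the evaluation measures named above, the following is a minimal sketch of a frame-level diarization error rate and false alarm rate. It is not the authors' evaluation code. It assumes the common convention in which missed speech, false alarm and speaker confusion are each normalised by the number of reference speech frames, and it assumes that hypothesis speaker labels have already been mapped onto the reference labels. The function name `frame_error_rates` and the use of `None` to mark non-speech frames are hypothetical choices made for this example.

```python
def frame_error_rates(reference, hypothesis):
    """Frame-level diarization error components (illustrative sketch).

    reference, hypothesis: equal-length sequences of per-frame labels.
    None marks a non-speech frame; any other value is a speaker label.
    Assumption: hypothesis labels are already mapped to reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have equal length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1          # speech labelled as non-speech
            elif hyp != ref:
                confusion += 1       # speech assigned to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # non-speech labelled as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    # Assumption: every component is normalised by reference speech frames.
    return {
        "missed": missed / speech,
        "false_alarm": false_alarm / speech,
        "confusion": confusion / speech,
        "der": (missed + false_alarm + confusion) / speech,
    }


# Toy example with two speakers, "A" and "B", over eight frames.
ref = ["A", "A", "B", "B", None, None, "A", "A"]
hyp = ["A", "A", "B", "A", None, "B", None, "A"]
rates = frame_error_rates(ref, hyp)
print(rates["der"])  # 0.5: one missed, one false alarm, one confusion over 6 speech frames
```

In practice, scoring tools also search for the best mapping between hypothesis and reference speakers and may ignore a short collar around segment boundaries; both steps are omitted here for brevity.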


WLI-fuzzy clustering; Multiple kernel; Bayesian Inference criterion; Voice activity detection; i-Vector extraction



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Mahatma Gandhi Institute of Technology, Hyderabad, India
  2. JNTUK, Kakinada, India
