Pattern Analysis and Applications

, Volume 21, Issue 1, pp 193–204 | Cite as

A feature selection-based speaker clustering method for paralinguistic tasks

  • Gábor Gosztolya
  • László Tóth
Short Paper


In recent years, computational paralinguistics has emerged as a new topic within speech technology. It concerns extracting non-linguistic information from speech (such as emotions, the level of conflict, whether the speaker is drunk). It was shown recently that many methods applied here can be assisted by speaker clustering; for example, the features extracted from the utterances could be normalized speaker-wise instead of using a global method. In this paper, we propose a speaker clustering algorithm based on standard clustering approaches like K-means and feature selection. By applying this speaker clustering technique in two paralinguistic tasks, we were able to significantly improve the accuracy scores of several machine learning methods, and we also obtained an insight into what features could be efficiently used to separate the different speakers.


Computational paralinguistics Clustering Speaker clustering Feature selection Classifier combination Support-vector machines Deep neural networks AdaBoost.MH 



This publication is supported by the European Union and co-funded by the European Social Fund. Project title: Telemedicine-oriented research activities in the fields of mathematics, informatics and medical sciences. Project number: TÁMOP-4.2.2.A-11/1/KONV-2012-0073.


  1. 1.
    Ajmera J, Wooters C (2003) A robust speaker clustering algorithm. In: Proceedings of ASRU, pp 411–416Google Scholar
  2. 2.
    Benbouzid D, Busa-Fekete R, Casagrande N, Collin FD, Kégl B (2012) MultiBoost: a multi-purpose boosting package. J Mach Learn Res 13:549–553MATHGoogle Scholar
  3. 3.
    Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New YorkCrossRefMATHGoogle Scholar
  4. 4.
    Bradley P, Fayyad UM (1998) Refining initial points for K-means clustering. In: Proceedings of ICML, Madison, WI, USA, pp 91–99Google Scholar
  5. 5.
    Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Models Methods Appl Sci 1(4):300–307Google Scholar
  6. 6.
    Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:1–27CrossRefGoogle Scholar
  7. 7.
    Dehak N, Kenny PJ, Dehak R, Dumouchel P, Ouellet P (2010) Front end factor analysis for speaker verification. IEEE transactions on audio, speech and language processing, pp 788–798Google Scholar
  8. 8.
    Dupuy G, Meignier S, Deléglise P, Estève Y (2014) Recent improvements on ILP-based clustering for broadcast news speaker diarization. In: Proceedings of Odyssey, pp 187–193Google Scholar
  9. 9.
    Eyben F, Weninger F, Schuller B (2013) Affect recognition in real-life acoustic conditions - A new perspective on feature selection. In: Proceedings of Interspeech, Lyon, France, pp 2044–2048Google Scholar
  10. 10.
    Eyben F, Wöllmer M, Schuller B (2010) Opensmile: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of ACM multimedia, pp 1459–1462Google Scholar
  11. 11.
    Felföldi L, Kocsor A, Tóth L (2003) Classifier combination in speech recognition. Period Polytech Electr Eng 47(1):125–140MATHGoogle Scholar
  12. 12.
    Fred AL, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850CrossRefGoogle Scholar
  13. 13.
    Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier networks. In: Proceedings of AISTATS, pp 315–323Google Scholar
  14. 14.
    Gosztolya G (2014) Is AdaBoost competitive for phoneme classification? In: Proceedings of CINTI (IEEE), Budapest, Hungary, pp 61–66Google Scholar
  15. 15.
    Gosztolya G (2015) Conflict intensity estimation from speech using greedy forward-backward feature selection. In: Proceedings of Interspeech, Dresden, Germany, pp 1339–1344Google Scholar
  16. 16.
    Gosztolya G, Busa-Fekete R, Tóth L (2013) Detecting autism, emotions and social signals using AdaBoost. In: Proceedings of Interspeech, Lyon, France, pp. 220–224Google Scholar
  17. 17.
    Gosztolya G, Dombi J (2014) Applying representative uninorms for phonetic classifier combination. In: Proceedings of MDAI, Tokyo, Japan, pp 182–191Google Scholar
  18. 18.
    Gosztolya G, Grósz T, Busa-Fekete R, Tóth L (2014) Detecting the intensity of cognitive and physical load using AdaBoost and deep rectifier neural networks. In: Proceedings of Interspeech, Singapore, pp 452–456Google Scholar
  19. 19.
    Gosztolya G, Grósz T, Busa-Fekete R, Tóth L (2016) Determining native language and deception using phonetic features and classifier combination. In: Proceedings of Interspeech, p. acceptedGoogle Scholar
  20. 20.
    Gosztolya G, Kocsor A (2005) A hierarchical evaluation methodology in speech recognition. Acta Cybern 17(2):213–224MathSciNetMATHGoogle Scholar
  21. 21.
    Gosztolya G, Szilágyi L (2015) Application of fuzzy and possibilistic \(c\)-means clustering models in blind speaker clustering. Acta Polytechnica Hungarica 12(7):41–56Google Scholar
  22. 22.
    Grósz T, Busa-Fekete R, Gosztolya G, Tóth L (2015) Assessing the degree of Nativeness and Parkinson’s condition using Gaussian Processes and Deep Rectifier Neural Networks. In: Proceedings of Interspeech, pp 1339–1343Google Scholar
  23. 23.
    Guan N, Tao D, Luo Z, Yuan B (2012) NeNMF: an optimal gradient method for nonnegative matrix factorization. IEEE Trans Signal Process 60(6):2882–2898MathSciNetCrossRefGoogle Scholar
  24. 24.
    Gupta R, Audhkhasi K, Lee S, Narayanan SS (2013) Speech paralinguistic event detection using probabilistic time-series smoothing and masking. In: Proceedings of Interspeech, pp 173–177Google Scholar
  25. 25.
    Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18CrossRefGoogle Scholar
  26. 26.
    Han KJ, Narayanan SS (2008) Agglomerative hierarchical speaker clustering using incremental Gaussian mixture cluster modeling. In: Proceedings of Interspeech, pp 20–23Google Scholar
  27. 27.
    Hand D, Mannila H, Smyth P (2001) Principles of data mining. MIT Press, CambridgeGoogle Scholar
  28. 28.
    Hantke S, Weninger F, Kurle R, Ringeval F, Batliner A, Mousa AED, Schuller B (2016) I hear you eat and speak: automatic recognition of Eating Condition and food type, use-cases, and impact on ASR performance. PLoS One 1–24Google Scholar
  29. 29.
    Kaya H, Özkaptan T, Salah AA, Gürgen F (2014) Canonical correlation analysis and local fisher discriminant analysis based multi-view acoustic feature reduction for physical load prediction. In: Proceedings of Interspeech, Singapore, pp 442–446Google Scholar
  30. 30.
    Manning C, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, CambridgeCrossRefMATHGoogle Scholar
  31. 31.
    Neuberger T, Beke A (2013) Automatic laughter detection in spontaneous speech using GMM–SVM method. In: Proceedings of TSD, pp 113–120Google Scholar
  32. 32.
    Plessis B, Sicsu A, Heutte L, Menu E, Lecolinet E, Debon O, Moreau JV (1993) A multi-classifier combination strategy for the recognition of handwritten cursive words. In: Proceedings of ICDAR, pp 642–645Google Scholar
  33. 33.
    Räsänen O, Pohjalainen J (2013) Random subset feature selection in automatic recognition of developmental disorders, affective states, and level of conflict from speech. In: Proceedings of Interspeech, Lyon, France, pp 210–214Google Scholar
  34. 34.
    Schapire R, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336CrossRefMATHGoogle Scholar
  35. 35.
    Schölkopf B, Platt J, Shawe-Taylor J, Smola A, Williamson R (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471CrossRefMATHGoogle Scholar
  36. 36.
    Schuller B, Steidl S, Batliner A, Epps J, Eyben F, Ringeval F, Marchi E, Zhang Y (2014) The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load. In: Proceedings of Interspeech, pp 427–431Google Scholar
  37. 37.
    Schuller B, Steidl S, Batliner A, Hantke S, Hönig F, Orozco-Arroyave JR, Nöth E, Zhang Y, Weninger F (2015) The INTERSPEECH 2015 computational paralinguistics challenge: Nativeness, Parkinson’s & Eating Condition. In: Proceedings of Interspeech, pp 478–482Google Scholar
  38. 38.
    Schuller B, Steidl S, Batliner A, Vinciarelli A, Scherer K, Ringeval F, Chetouani M, Weninger F, Eyben F, Marchi E, Salamin H, Polychroniou A, Valente F, Kim S (2013) The Interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In: Proceedings of Interspeech, Lyon, France, pp 148–152Google Scholar
  39. 39.
    Sculley D (2010) Web-scale k-means clustering. In: Proceedings of WWW, Raleigh, North Carolina, USA, pp 1177–1178Google Scholar
  40. 40.
    van Segbroeck M, Travadi R, Vaz C, Kim J, Black MP, Potamianos A, Narayanan SS (2014) Classification of Cognitive Load from speech using an i-vector framework. In: Proceedings of Interspeech, Singapore, pp 671–675Google Scholar
  41. 41.
    Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 28(1):1409–1438Google Scholar
  42. 42.
    Steinhaus H (1956) Sur la division des corp materiels en parties. Bull Acad Pol Sci C1 III. (IV):801–804MathSciNetMATHGoogle Scholar
  43. 43.
    Stroop JR (1935) Studies of interference in serial verbal reactions. J Exp Psychol 18(6):643–662CrossRefGoogle Scholar
  44. 44.
    Szilágyi L, Szilágyi SM (2014) Generalization rules for the suppressed fuzzy \(c\)-means clustering algorithm. Neurocomputing 139:298–309CrossRefGoogle Scholar
  45. 45.
    Todd SC, Tóth MT, Busa-Fekete R (2009) A MATLAB program for cluster analysis using graph theory. Comput Geosci 36(6):1205–1213CrossRefGoogle Scholar
  46. 46.
    Tóth L (2014) Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition. In: Proceedings of ICASSP, pp 190–194Google Scholar
  47. 47.
    Tóth SL, Sztahó D, Vicsi K (2012) Speech emotion perception by human and machine. In: Proceedings of COST action, Patras, Greece, pp 213–224Google Scholar
  48. 48.
    Yap TF (2012) Speech production under Cognitive Load: effects and classification. Ph.D. thesis, University of New South WalesGoogle Scholar
  49. 49.
    Yu K, Jiang X, Bunke H (2012) Partially supervised speaker clustering. IEEE Trans Pattern Anal Mach Intell 34(5):959–971CrossRefGoogle Scholar

Copyright information

© Springer-Verlag London 2017

Authors and Affiliations

  1. 1.MTA-SZTE Research Group on Artificial Intelligence of the Hungarian Academy of SciencesUniversity of SzegedSzegedHungary
  2. 2.Institute of Informatics, University of SzegedSzegedHungary

Personalised recommendations