Segment-level probabilistic sequence kernel and segment-level pyramid match kernel based extreme learning machine for classification of varying length patterns of speech

  • Shikha Gupta
  • Ahmed Karanath
  • Kansul Mahrifa
  • A. D. Dileep
  • Veena Thenkanidiyoor


In this work, we address some issues in the classification of varying length patterns of speech, represented as sets of continuous-valued feature vectors, using kernel methods. Kernels designed for varying length patterns are called dynamic kernels. We propose two dynamic kernels, namely the segment-level pyramid match kernel (SLPMK) and the segment-level probabilistic sequence kernel (SLPSK), for classification of long duration speech, represented as varying length sets of feature vectors, using an extreme learning machine (ELM). SLPMK and SLPSK are designed by partitioning the speech signal into increasingly finer segments and matching the corresponding segments. SLPSK is built upon a set of Gaussian basis functions, where half of the basis functions contain class-specific information while the other half capture the common characteristics of the speech utterances of all classes. Training a support vector machine (SVM) is computationally intensive: the cost is at least quadratic in the number of training examples, which makes it difficult to handle very large datasets with traditional SVMs. To reduce the training time of the classifier, we propose to use a simpler algorithm, namely ELM. ELM refers to a wider class of generalized single hidden layer feedforward networks (SLFNs) whose hidden layer need not be tuned. In this work, we explore kernel-based ELM in order to exploit dynamic kernels. We study the performance of ELM-based classifiers using the proposed SLPSK and SLPMK for speech emotion recognition and speaker identification tasks, and compare them with other kernels for varying length patterns. Experimental studies show that the proposed ELM-based approach offers a 10–12% relative improvement over the baseline approach, and a 3–9% relative improvement over ELMs/SVMs using other state-of-the-art dynamic kernels.
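As an illustration of how a dynamic kernel can be plugged into a kernel-based ELM, the following is a minimal sketch and not the authors' implementation. It assumes the standard closed-form kernel ELM solution, beta = (K + I/C)^-1 T, with regularization constant C. The function `toy_dynamic_kernel` is a hypothetical stand-in that compares segment means over increasingly finer segments; it is not SLPMK or SLPSK, whose exact definitions are not given in this abstract. All function names, parameter values, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def segment_means(X, levels=3):
    """Mean vector of every segment at every level (level l has 2**l segments)."""
    feats = []
    for l in range(levels):
        for seg in np.array_split(X, 2 ** l, axis=0):
            feats.append(seg.mean(axis=0))
    return np.stack(feats)  # shape: (2**levels - 1, d)

def toy_dynamic_kernel(X, Y, levels=3, gamma=0.5):
    """Hypothetical stand-in kernel: sum of Gaussian kernels between
    corresponding segment means of two varying length patterns."""
    A, B = segment_means(X, levels), segment_means(Y, levels)
    return float(np.exp(-gamma * np.sum((A - B) ** 2, axis=1)).sum())

def gram(sets_a, sets_b):
    """Kernel matrix between two lists of varying length patterns."""
    return np.array([[toy_dynamic_kernel(a, b) for b in sets_b] for a in sets_a])

def kelm_fit(K, labels, n_classes, C=10.0):
    """Closed-form kernel ELM training: solve (K + I/C) beta = T."""
    T = -np.ones((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0  # +1 for the true class, -1 elsewhere
    return np.linalg.solve(K + np.eye(len(labels)) / C, T)

def kelm_predict(K_test_train, beta):
    """Assign each test pattern to the class with the largest output."""
    return np.argmax(K_test_train @ beta, axis=1)

# Synthetic data (assumption): 4-dimensional feature vectors, 20 to 40 frames
# per pattern, two classes whose frames are centred at 0 and at 3.
rng = np.random.default_rng(0)

def make_set(label):
    n_frames = int(rng.integers(20, 41))
    return rng.normal(loc=3.0 * label, scale=1.0, size=(n_frames, 4))

y_train = np.array([0, 1] * 10)
y_test = np.array([0, 1] * 5)
train_sets = [make_set(c) for c in y_train]
test_sets = [make_set(c) for c in y_test]

K_train = gram(train_sets, train_sets)
beta = kelm_fit(K_train, y_train, n_classes=2)
pred = kelm_predict(gram(test_sets, train_sets), beta)
print("toy test accuracy:", float(np.mean(pred == y_test)))
```

Training reduces to a single linear solve on the N x N kernel matrix, with no iterative optimization over hidden-layer parameters. Any other dynamic kernel can be substituted by replacing `toy_dynamic_kernel`.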


Varying length patterns; Extreme learning machine; Segment level probabilistic sequence kernel; Segment level pyramid match kernel; Speech emotion recognition; Speaker identification




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Kamand, India
  2. Department of Computer Science and Engineering, National Institute of Technology Goa, Ponda, India
