Speech Recognition Combining MFCCs and Image Features

  • Stamatis Karlos
  • Nikos Fazakis
  • Katerina Karanikola
  • Sotiris Kotsiantis
  • Kyriakos Sgarbas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)


Automatic speech recognition (ASR) is a well-known problem at the intersection of Natural Language Processing (NLP), Digital Signal Processing (DSP) and Machine Learning (ML). This work presents a robust supervised classification model (MFCCs + autocor + SVM) built on features extracted from solo speech signals. Mel Frequency Cepstral Coefficients (MFCCs) are combined with Content Based Image Retrieval (CBIR) features extracted from the spectrogram produced by each frame of the speech signal. The improvement in classification accuracy obtained with these extended feature vectors, compared with using MFCCs alone, is examined with several classifiers in three scenarios with different numbers of speakers.
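The idea of extending a per-frame MFCC vector with an image descriptor of the same frame's spectrogram can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation. It assumes that "autocor" refers to an autocorrelogram-style descriptor, and it replaces the colour descriptor with a simplified grayscale version that looks only at horizontal and vertical neighbours. All function names, frame sizes, filter counts and quantisation levels are hypothetical choices made for the example.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of a Hamming-windowed frame."""
    n = len(frame)
    win = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
           for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(win))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(win))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # Filter edges expressed as fractional FFT-bin positions.
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_fft // 2 + 1):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    """Log mel-filterbank energies followed by an unnormalised DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
             for row in bank]
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_e))
            for c in range(n_coeffs)]

def spectrogram_image(frame, sub_len=32, hop=16, n_levels=4):
    """Short-time log-power spectrogram of one frame, quantised to gray levels."""
    cols = [power_spectrum(frame[s:s + sub_len])
            for s in range(0, len(frame) - sub_len + 1, hop)]
    logs = [[math.log(p + 1e-10) for p in col] for col in cols]
    lo = min(min(c) for c in logs)
    scale = (max(max(c) for c in logs) - lo) or 1.0
    return [[min(int((v - lo) / scale * n_levels), n_levels - 1) for v in col]
            for col in logs]

def autocorrelogram(image, n_levels, distances=(1, 2)):
    """For each distance d and level v: fraction of in-bounds neighbours at
    horizontal/vertical offset d from a level-v pixel that are also level v."""
    rows, cols = len(image), len(image[0])
    feats = []
    for d in distances:
        same, total = [0] * n_levels, [0] * n_levels
        for r in range(rows):
            for c in range(cols):
                v = image[r][c]
                for dr, dc in ((0, d), (d, 0), (0, -d), (-d, 0)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total[v] += 1
                        same[v] += image[rr][cc] == v
        feats.extend(s / t if t else 0.0 for s, t in zip(same, total))
    return feats

def combined_features(frame, sample_rate, n_levels=4):
    """Extended feature vector: MFCCs followed by the image descriptor."""
    image = spectrogram_image(frame, n_levels=n_levels)
    return mfcc(frame, sample_rate) + autocorrelogram(image, n_levels)
```

With the defaults above, a 256-sample frame yields 13 MFCCs plus 8 autocorrelogram values (4 levels at 2 distances). In a pipeline like the one described, such vectors would be computed for every frame and passed, together with speaker labels, to a classifier such as an SVM.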


Keywords: ASR, MFCCs, Supervised model, Feature extraction, CBIR features



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Stamatis Karlos (1)
  • Nikos Fazakis (1)
  • Katerina Karanikola (1)
  • Sotiris Kotsiantis (1)
  • Kyriakos Sgarbas (1)
  1. University of Patras, Patras, Greece
