
Convolutional support vector machines for speech recognition

  • Published in: International Journal of Speech Technology

Abstract

Convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in automatic speech recognition. Most CNNs make predictions with a softmax activation function and are trained by minimizing the cross-entropy loss. This paper proposes a new deep architecture that combines two heterogeneous classification techniques: CNNs and support vector machines (SVMs). In the proposed model, features are learned by the convolutional layers and classified by SVMs. The last layer of the CNN, i.e. the softmax layer, is replaced by SVMs to deal efficiently with high-dimensional features. The model can be interpreted as a special form of structured SVM and is named the convolutional support vector machine (CSVM). Instead of training each component separately, the parameters of the CNN and the SVMs are trained jointly using frame-level max-margin, sequence-level max-margin, and state-level minimum Bayes risk criteria. The performance of the CSVM is evaluated on the TIMIT and Wall Street Journal datasets for phone recognition. By combining the strengths of CNNs and SVMs, the CSVM improves results by 13.33% and 2.31% over a baseline CNN and segmental recurrent neural networks, respectively.
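As a rough sketch of the core idea, the snippet below passes convolutional features to a linear classifier trained with a multiclass (Crammer-Singer style) hinge loss in place of softmax cross-entropy. All names, shapes, and the random weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_features(frames, kernel):
    """Valid 1-D convolution over the time axis, followed by ReLU."""
    T, K = len(frames), len(kernel)
    out = np.array([frames[t:t + K] @ kernel for t in range(T - K + 1)])
    return np.maximum(out, 0.0)  # ReLU keeps features non-negative

def multiclass_hinge(scores, y, margin=1.0):
    """Crammer-Singer hinge: largest margin violation by a wrong class."""
    margins = margin + scores - scores[y]
    margins[y] = 0.0  # the true class incurs no penalty against itself
    return max(0.0, float(margins.max()))

frames = rng.standard_normal(20)           # toy per-frame acoustic values
kernel = rng.standard_normal(5)            # one (here random) conv filter
feats = conv1d_features(frames, kernel)    # learned features, shape (16,)
W = rng.standard_normal((3, feats.size))   # SVM weights, 3 phone classes
scores = W @ feats                         # per-class SVM scores
loss = multiclass_hinge(scores, y=0)       # frame-level max-margin loss
print(loss)
```

In joint training, the subgradient of this hinge loss would flow back through `W` into the convolutional parameters, which is what distinguishes the CSVM from training the CNN and SVM stages separately.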




Author information

Corresponding author

Correspondence to Vishal Passricha.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Passricha, V., Aggarwal, R.K. Convolutional support vector machines for speech recognition. Int J Speech Technol 22, 601–609 (2019). https://doi.org/10.1007/s10772-018-09584-4
