
Convolutional support vector machines for speech recognition

  • Published in: International Journal of Speech Technology

Abstract

Convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in automatic speech recognition. Most CNNs make predictions with a softmax activation function and are trained by minimizing the cross-entropy loss. This paper proposes a new deep architecture that combines two heterogeneous classification techniques: CNNs and support vector machines (SVMs). In the proposed model, features are learned by the convolutional layers and classified by SVMs. The last layer of the CNN, i.e. the softmax layer, is replaced by SVMs to deal efficiently with high-dimensional features. The model can be interpreted as a special form of structured SVM and is named the convolutional support vector machine (CSVM). Instead of training each component separately, the parameters of the CNN and the SVMs are trained jointly using frame-level max-margin, sequence-level max-margin, and state-level minimum Bayes risk criteria. The performance of the CSVM is evaluated on the TIMIT and Wall Street Journal datasets for phone recognition. By combining the strengths of CNNs and SVMs, the CSVM improves results by 13.33% and 2.31% over a baseline CNN and segmental recurrent neural networks, respectively.
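As a rough sketch of the core idea, the snippet below passes convolutional features to a linear classifier trained with a multiclass (Crammer-Singer style) hinge loss in place of softmax cross-entropy. All names, shapes, and the random weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_features(frames, kernel):
    """Valid 1-D convolution over the time axis, followed by ReLU."""
    T, K = len(frames), len(kernel)
    out = np.array([frames[t:t + K] @ kernel for t in range(T - K + 1)])
    return np.maximum(out, 0.0)  # ReLU keeps features non-negative

def multiclass_hinge(scores, y, margin=1.0):
    """Crammer-Singer hinge: largest margin violation by a wrong class."""
    margins = margin + scores - scores[y]
    margins[y] = 0.0  # the true class incurs no penalty against itself
    return max(0.0, float(margins.max()))

frames = rng.standard_normal(20)           # toy per-frame acoustic values
kernel = rng.standard_normal(5)            # one (here random) conv filter
feats = conv1d_features(frames, kernel)    # learned features, shape (16,)
W = rng.standard_normal((3, feats.size))   # SVM weights, 3 phone classes
scores = W @ feats                         # per-class SVM scores
loss = multiclass_hinge(scores, y=0)       # frame-level max-margin loss
print(loss)
```

In joint training, the subgradient of this hinge loss would flow back through `W` into the convolutional parameters, which is what distinguishes the CSVM from training the CNN and SVM stages separately.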




Author information

Corresponding author

Correspondence to Vishal Passricha.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Passricha, V., Aggarwal, R.K. Convolutional support vector machines for speech recognition. Int J Speech Technol 22, 601–609 (2019). https://doi.org/10.1007/s10772-018-09584-4
