Abstract
Convolutional neural networks (CNNs) have demonstrated state-of-the-art performance on automatic speech recognition. Most CNNs employ a softmax activation function for prediction and are trained by minimizing the cross-entropy loss. This paper proposes a new deep architecture in which two heterogeneous classification techniques, CNNs and support vector machines (SVMs), are combined. In the proposed model, features are learned by the convolutional layers and classified by SVMs: the last layer of the CNN, i.e. the softmax layer, is replaced by SVMs to deal efficiently with high-dimensional features. The model can be interpreted as a special form of structured SVM and is named the convolutional support vector machine (CSVM). Instead of training each component separately, the parameters of the CNN and the SVMs are trained jointly using frame-level max-margin, sequence-level max-margin, and state-level minimum Bayes risk criteria. The performance of the CSVM is evaluated on the TIMIT and Wall Street Journal datasets for phone recognition. By combining the strengths of CNNs and SVMs, the CSVM improves the result by 13.33% over a baseline CNN and by 2.31% over segmental recurrent neural networks.
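As a rough illustration of the frame-level max-margin criterion mentioned above, the sketch below computes the standard Crammer–Singer multiclass hinge loss over a batch of frame-level SVM scores. This is a minimal NumPy sketch of the generic loss, not the authors' implementation; the function name, margin parameter, and toy inputs are illustrative assumptions.

```python
import numpy as np

def multiclass_hinge_loss(scores, labels, margin=1.0):
    """Crammer-Singer multiclass hinge loss averaged over frames.

    scores: (n_frames, n_classes) raw outputs of the SVM layer
    labels: (n_frames,) integer indices of the true class per frame
    """
    n = scores.shape[0]
    correct = scores[np.arange(n), labels]        # score of the true class
    margins = scores - correct[:, None] + margin  # s_j - s_y + margin
    margins[np.arange(n), labels] = 0.0           # exclude the true class
    # Per frame: largest margin violation (or 0 if none); then batch mean.
    return np.maximum(0.0, margins).max(axis=1).mean()

# Toy example: 2 frames, 3 classes
scores = np.array([[2.0, 0.5, -1.0],
                   [0.0, 3.0,  2.5]])
labels = np.array([0, 1])
loss = multiclass_hinge_loss(scores, labels)
# Frame 0 satisfies the margin (loss 0); frame 1 is violated by 0.5,
# so the batch loss is 0.25.
```

Because this loss is piecewise linear in the scores, it is subdifferentiable, which is what allows the joint training of the convolutional layers and the SVM layer by (sub)gradient methods described in the paper.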
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Passricha, V., Aggarwal, R.K. Convolutional support vector machines for speech recognition. Int J Speech Technol 22, 601–609 (2019). https://doi.org/10.1007/s10772-018-09584-4