Abstract
In recent years, pre-trained convolutional neural networks (CNNs) have outperformed traditional automatic speech recognition systems based on Gaussian mixture models (GMMs) on a variety of large-vocabulary benchmarks. This work presents Odia keyword recognition using a CNN approach. The names of different parts of the human body in spoken Odia are taken as the keywords for recognition. MFCC features, together with a spectrogram representation of the voiced keywords, serve as the input to the CNN. The proposed CNN model is trained and implemented using the Python framework Keras with TensorFlow as the backend. Various performance metrics are used to compare the proposed model with a fully connected deep neural network (DNN) model. A number of experiments are conducted with varying numbers of epochs and split ratios (training-validation-testing), and the results report the accuracy, loss, and other performance metrics of both models. The proposed model is observed to outperform the DNN model. Further, the average recognition accuracy of the proposed CNN model is compared with that of other approaches, namely hidden Markov model (HMM) and support vector machine (SVM) models, and it shows a superior average recognition rate relative to the considered state-of-the-art approaches.
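The training-validation-testing split and the accuracy metric mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library. The 70/15/15 ratio, the file names, the `keyword_*` labels, and the helper names `split_dataset` and `accuracy` are all assumptions made for illustration; the abstract does not state the actual ratios or data layout used in the experiments.

```python
import random


def split_dataset(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle samples and split them into train/validation/test subsets.

    The ratios are hypothetical; the paper varies the split ratio across
    experiments without listing the values in the abstract.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical dataset: 100 clips spread over 5 placeholder keyword classes.
samples = [(f"clip_{i}.wav", f"keyword_{i % 5}") for i in range(100)]
train, val, test = split_dataset(samples)
```

Changing the `ratios` argument, for example to `(0.8, 0.1, 0.1)`, is one way to rerun the same experiment under a different split, in the spirit of the varied split ratios described in the abstract.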
Cite this article
Mohanty, P., Nayak, A.K. CNN based keyword spotting: An application for context based voiced Odia words. Int. j. inf. tecnol. 14, 3647–3658 (2022). https://doi.org/10.1007/s41870-022-00992-z
DOI: https://doi.org/10.1007/s41870-022-00992-z