Abstract
The growing need for intelligent systems that enhance human-machine interaction in many areas of life has become an important research topic. Speech is the primary means of human communication, yet deep learning-based speech recognition systems still perform poorly when trained on small datasets. To address this problem, this paper proposes an automated speech recognition system based on a convolutional neural network (CNN) and the Mel-frequency cepstral coefficient (MFCC) algorithm. First, data augmentation is applied to double the size of the dataset. Then, MFCCs are extracted from all audio files. Next, a CNN model is trained on the MFCCs, using early stopping and dropout to prevent overfitting. Finally, the best-performing model is selected. Simulation results on the Kaggle TensorFlow Speech Recognition Challenge dataset, which contains 20 basic command words, show that the selected model achieves 88.35% accuracy.
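Two of the steps named in the abstract, doubling the dataset through augmentation and stopping training early, can be illustrated with a minimal sketch. The abstract does not say which augmentation method the authors used, so additive Gaussian noise is assumed here purely for illustration. The function names (`augment_with_noise`, `double_dataset`, `early_stopping_epoch`) and the `noise_level` and `patience` values are hypothetical and are not taken from the paper.

```python
import random


def augment_with_noise(waveform, noise_level=0.005, rng=None):
    """Return a noisy copy of a waveform given as a list of floats.

    Assumption: additive Gaussian noise stands in for the paper's
    (unspecified) augmentation technique.
    """
    rng = rng or random.Random(0)
    return [s + noise_level * rng.gauss(0.0, 1.0) for s in waveform]


def double_dataset(samples, labels, noise_level=0.005, seed=0):
    """Append one augmented copy of every clip, doubling the dataset."""
    rng = random.Random(seed)
    noisy = [augment_with_noise(w, noise_level, rng) for w in samples]
    return samples + noisy, labels + labels


def early_stopping_epoch(val_losses, patience=3):
    """Simulate patience-based early stopping on a validation-loss history.

    Returns (best_epoch, stop_epoch): the epoch with the lowest validation
    loss seen so far, and the epoch at which training would halt because
    the loss failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    # Patience was never exhausted: training ran through every epoch.
    return best_epoch, len(val_losses) - 1
```

In a real pipeline the augmented waveforms would be converted to MFCCs before being passed to the CNN, and the weights from `best_epoch` would be restored as the selected model.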
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Aswad, A., Alghannam, E., Zhang, Q. (2023). Developing MFCC-CNN Based Voice Recognition System with Data Augmentation and Overfitting Solving Techniques. In: Hu, Z., Ye, Z., He, M. (eds) Advances in Artificial Systems for Medicine and Education VI. AIMEE 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 159. Springer, Cham. https://doi.org/10.1007/978-3-031-24468-1_11
DOI: https://doi.org/10.1007/978-3-031-24468-1_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24467-4
Online ISBN: 978-3-031-24468-1
eBook Packages: Intelligent Technologies and Robotics (R0)