Developing MFCC-CNN Based Voice Recognition System with Data Augmentation and Overfitting Solving Techniques

Aswad, Ali; Alghannam, Essa; Zhang, Qingying

doi:10.1007/978-3-031-24468-1_11

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 159)

Included in the following conference series:

International Conference of Artificial Intelligence, Medical Engineering, Education

Abstract

The need for intelligent systems that improve human-machine interaction in many areas of life continues to grow. Speech is the primary means of human communication, yet deep learning-based speech recognition systems still perform poorly when trained on small datasets. To address this problem, this paper proposes an automated speech recognition system based on a convolutional neural network (CNN) and the mel-frequency cepstral coefficient (MFCC) algorithm. First, data augmentation is applied to double the size of the dataset. Then, MFCCs are extracted from all audio files. Next, a CNN model is trained on the MFCCs, with early stopping and dropout used to prevent overfitting. Finally, the model with the highest performance is selected. Simulation results on the Kaggle TensorFlow Speech Recognition Challenge dataset, which contains 20 basic command words, show that the selected model achieved 88.35% accuracy.
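Two of the steps named in the abstract, doubling the dataset through augmentation and stopping training early, can be illustrated with a minimal sketch. The abstract does not say which augmentation was used or how early stopping was configured, so additive Gaussian noise, the `noise_level` value, the `patience` value, and all function names below are assumptions chosen for illustration, not the authors' implementation.

```python
import random


def augment_with_noise(samples, noise_level=0.005, seed=0):
    """Return a noisy copy of one waveform (a list of floats).

    Additive Gaussian noise is an assumed augmentation; the paper's
    actual technique is not stated in the abstract.
    """
    rng = random.Random(seed)
    return [s + noise_level * rng.gauss(0.0, 1.0) for s in samples]


def double_dataset(dataset, noise_level=0.005):
    """Return the original (waveform, label) pairs plus one augmented copy of each."""
    augmented = [
        (augment_with_noise(waveform, noise_level, seed=i), label)
        for i, (waveform, label) in enumerate(dataset)
    ]
    return dataset + augmented


def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch seen before training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; later epochs are never examined.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch
```

For example, `double_dataset` applied to two labelled clips returns four, and `early_stopping_epoch` applied to a recorded list of per-epoch validation losses reports which epoch's weights a patience-based rule would keep. In a real pipeline the MFCC extraction and CNN training would sit between these two steps.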

Author information

Authors and Affiliations

Department of Mechatronics Engineering, Tishreen University, Lattakia, Syria
Ali Aswad & Essa Alghannam
School of Engineering, Manara University, Lattakia, Syria
Essa Alghannam
Wuhan University of Technology, Wuhan, China
Essa Alghannam & Qingying Zhang
Royal Institute of Technology, Stockholm, Sweden
Qingying Zhang

Corresponding author

Correspondence to Essa Alghannam.

Editor information

Editors and Affiliations

International Center of Informatics and Computer Science, Faculty of Applied Mathematics, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine
Zhengbing Hu
School of Computer Science, Hubei University of Technology, Wuhan, China
Zhiwei Ye
Halmos College of Arts and Sciences, Nova Southeastern University, Fort Lauderdale, FL, USA
Matthew He

About this paper

Cite this paper

Aswad, A., Alghannam, E., Zhang, Q. (2023). Developing MFCC-CNN Based Voice Recognition System with Data Augmentation and Overfitting Solving Techniques. In: Hu, Z., Ye, Z., He, M. (eds) Advances in Artificial Systems for Medicine and Education VI. AIMEE 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 159. Springer, Cham. https://doi.org/10.1007/978-3-031-24468-1_11

DOI: https://doi.org/10.1007/978-3-031-24468-1_11
Published: 21 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24467-4
Online ISBN: 978-3-031-24468-1
eBook Packages: Intelligent Technologies and Robotics (R0)