Developing MFCC-CNN Based Voice Recognition System with Data Augmentation and Overfitting Solving Techniques

  • Conference paper
Advances in Artificial Systems for Medicine and Education VI (AIMEE 2022)

Abstract

Intelligent systems that enhance human-machine interaction in many areas of life are increasingly in demand. Speech is the most widely used means of communication, yet deep learning-based speech recognition systems still perform poorly when trained on small datasets. To address this problem, this paper proposes an automatic speech recognition system based on a convolutional neural network (CNN) and Mel-frequency cepstral coefficients (MFCCs). First, data augmentation is applied to double the size of the dataset. Then, MFCCs are extracted from all the audio files. Next, a CNN model is trained on the MFCCs, with early stopping and dropout used to prevent overfitting. Finally, the best-performing model is selected. Simulation results on the Kaggle TensorFlow Speech Recognition Challenge dataset, which contains 20 basic command words, show that the selected model achieves 88.35% accuracy.
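Two of the steps named in the abstract, doubling the dataset through augmentation and stopping training early, can be illustrated with a minimal sketch. The abstract does not say which augmentation the authors used or how early stopping was configured, so everything here is an assumption for illustration: additive Gaussian noise as the augmentation, a patience-based stopping rule, and the hypothetical names `augment_with_noise`, `double_dataset`, `early_stopping_epoch`, `noise_std`, and `patience`.

```python
import random


def augment_with_noise(waveform, noise_std=0.005, rng=None):
    """Return a noisy copy of a waveform given as a list of float samples.

    Additive Gaussian noise is an assumed augmentation, not necessarily
    the one used in the paper.
    """
    rng = rng or random.Random(0)
    return [sample + rng.gauss(0.0, noise_std) for sample in waveform]


def double_dataset(samples, noise_std=0.005, seed=0):
    """Return the original (waveform, label) pairs plus one noisy copy of each."""
    rng = random.Random(seed)
    augmented = [
        (augment_with_noise(waveform, noise_std, rng), label)
        for waveform, label in samples
    ]
    return samples + augmented


def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the model from `best_epoch` is kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


if __name__ == "__main__":
    dataset = [([0.0, 0.1, -0.1], "yes"), ([0.2, 0.3, 0.1], "no")]
    doubled = double_dataset(dataset)
    print(len(dataset), "->", len(doubled), "examples")

    losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.5]
    print(early_stopping_epoch(losses, patience=3))
```

In this toy run the loss stops improving after epoch 2, so training halts at epoch 5 and the epoch-2 weights would be kept. In a real pipeline the same rule is usually supplied by the training framework rather than written by hand.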

Author information

Corresponding author

Correspondence to Essa Alghannam.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Aswad, A., Alghannam, E., Zhang, Q. (2023). Developing MFCC-CNN Based Voice Recognition System with Data Augmentation and Overfitting Solving Techniques. In: Hu, Z., Ye, Z., He, M. (eds) Advances in Artificial Systems for Medicine and Education VI. AIMEE 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 159. Springer, Cham. https://doi.org/10.1007/978-3-031-24468-1_11
