Experimental Evaluation of CNN Architecture for Speech Recognition

  • Md Amaan Haque
  • Abhishek Verma
  • John Sahaya Rani Alex
  • Nithya Venkatesan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1045)


Abstract

Deep learning is now widely used in signal and information processing. Among deep learning algorithms, the Convolution Neural Network (CNN) is widely used for image recognition and classification because of its architecture, high accuracy, and efficiency. This paper proposes a method that applies a CNN to audio samples rather than to the image samples on which CNNs are usually trained. Each one-dimensional audio sample is converted into two-dimensional data: a matrix whose dimensions are the number of Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the audio sample and the number of windows used in the extraction. The proposed CNN model is evaluated on the TIDIGITS corpus. The paper analyzes different convolution-layer architectures, each with a different number of feature maps. Of the architectures discussed, the three-layer convolution architecture achieved the highest accuracy, 97.46%.
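The conversion from a one-dimensional audio sample to a two-dimensional MFCC matrix can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the frame length, hop size, number of mel filters, or number of coefficients, so the values below (`frame_len=256`, `hop=128`, `n_filters=20`, `n_coeffs=13`) and all function names are assumptions chosen for illustration. A synthetic sine tone stands in for a TIDIGITS recording. The sketch uses only the Python standard library and a naive DFT, so it is suitable for short signals only.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, mapped to DFT bin indices
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges]
    bank = [[0.0] * n_bins for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising slope
            bank[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):      # falling slope
            bank[m - 1][k] = (right - k) / (right - center)
    return bank


def power_spectrum(frame):
    """Naive DFT power spectrum (non-negative frequencies only)."""
    n = len(frame)
    twiddle = [cmath.exp(-2j * math.pi * t / n) for t in range(n)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[i] * twiddle[(k * i) % n] for i in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mfcc_matrix(signal, sample_rate, frame_len=256, hop=128,
                n_filters=20, n_coeffs=13):
    """Return a matrix with n_coeffs rows and one column per window."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
               for i in range(frame_len)]
    n_windows = 1 + (len(signal) - frame_len) // hop
    columns = []
    for w in range(n_windows):
        start = w * hop
        frame = [signal[start + i] * hamming[i] for i in range(frame_len)]
        power = power_spectrum(frame)
        # Log mel-band energies; the small constant avoids log(0)
        log_energy = [math.log(sum(f * p for f, p in zip(row, power)) + 1e-10)
                      for row in bank]
        # Unnormalized DCT-II of the log energies gives the cepstral coefficients
        coeffs = [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                      for j, e in enumerate(log_energy))
                  for c in range(n_coeffs)]
        columns.append(coeffs)
    # Transpose so that rows are coefficients and columns are windows
    return [[columns[w][c] for w in range(n_windows)] for c in range(n_coeffs)]


# Synthetic 0.25 s, 440 Hz tone at an assumed 8 kHz sampling rate
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(2000)]
features = mfcc_matrix(signal, sample_rate)
print(len(features), len(features[0]))  # 13 14
```

The resulting 13 x 14 matrix plays the role an image would play in a conventional CNN: one axis indexes the cepstral coefficients and the other indexes the analysis windows, so two-dimensional convolution kernels can slide over both.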


Keywords

Convolution Neural Networks (CNN), Deep Neural Networks (DNN), Kernel, Mel-Frequency Cepstral Coefficients (MFCC), Speech Recognition (SR)



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Md Amaan Haque (1)
  • Abhishek Verma (1)
  • John Sahaya Rani Alex (1), corresponding author
  • Nithya Venkatesan (1)

  1. Vellore Institute of Technology, Chennai, India
