An acoustic model and linguistic analysis for Malayalam disyllabic words: a low resource language

Published in the International Journal of Speech Technology.
Abstract

Automatic Speech Recognition (ASR) has received considerable attention in recent years. Despite recent advances, however, the potential of features extracted directly from raw speech remains under-exploited. This paper proposes an ASR system for Malayalam speech data using spectrogram images and a Convolutional Neural Network (CNN). Voicegram/spectrogram images are generated from the sound files and fed into the CNN, whose topology consists of a set of convolutional and fully connected layers followed by a Softmax layer for classification. The proposed model achieves an accuracy of 93.33%, indicating that spectrogram-image-based approaches yield promising results in speech recognition. An analysis of the acoustic characteristics of the Malayalam disyllabic words selected for the ASR system is also conducted, covering formant analysis, voice onset time and spectral moments for 4000 tokens produced by 20 speakers. Finally, the CNN model is compared against multiple classifiers trained on these acoustic features, demonstrating the advantage of deep neural networks over hand-crafted acoustic features.
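The pipeline outlined in the abstract — spectrogram images fed to a CNN, alongside spectral-moment analysis of the tokens — can be sketched in a few lines of NumPy. This is a minimal illustration only: the frame length, hop size and FFT size below are generic assumptions, not the settings used in the paper, and the paper's actual CNN classification step is omitted.

```python
import numpy as np

def spectrogram(signal, fs, frame_len=400, hop=160, n_fft=512):
    """Magnitude spectrogram via a Hann-windowed short-time FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # (frames, bins)
    return spec.T  # (freq bins, time frames): an image-like layout for a CNN

def spectral_moments(spec, fs, n_fft=512):
    """First two spectral moments (centroid and spread) per frame."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    power = spec.T ** 2
    p = power / (power.sum(axis=1, keepdims=True) + 1e-12)  # normalise
    centroid = (p * freqs).sum(axis=1)                      # 1st moment
    spread = np.sqrt((p * (freqs - centroid[:, None]) ** 2).sum(axis=1))
    return centroid, spread

# Example: a pure 440 Hz tone sampled at 16 kHz; its spectral
# centroid should sit close to 440 Hz in every frame.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(tone, fs)
centroid, spread = spectral_moments(spec, fs)
```

In the paper's setting, the `spec` arrays would be rendered as images and classified by the convolutional network, while the moment statistics feed the comparison classifiers.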



Acknowledgements

This research is supported by the Kerala State Council for Science, Technology and Environment (KSCSTE). I thank KSCSTE for funding the project under the Back-to-lab scheme.

Author information

Corresponding author

Correspondence to K. R. Lekshmi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

The pulmonic consonants of Malayalam are shown in IPA notation in the Table below.

Appendix B

The pulmonic consonants of Malayalam are shown in the Table below.

Appendix C

The words used for conducting the experiment are detailed with IPA notation in the Table below.


About this article


Cite this article

Lekshmi, K.R., Sherly, E. An acoustic model and linguistic analysis for Malayalam disyllabic words: a low resource language. Int J Speech Technol 24, 483–495 (2021). https://doi.org/10.1007/s10772-021-09807-1

