
Dravidian language classification from speech signal using spectral and prosodic features

Published in: International Journal of Speech Technology

Abstract

An interesting aspect of the Dravidian languages is their commonality: a shared script heritage, similar vocabulary, and a common root language, which makes distinguishing them from speech alone challenging. In this work, an attempt is made to classify four closely related Dravidian languages using cepstral coefficients and prosodic features. Speech in these languages was recorded in various environments to build a database. While cepstral coefficients alone identify the language with a fair degree of accuracy, adding prosodic features to them improves language identification performance. Legendre polynomial fitting and principal component analysis (PCA) are applied to the feature vectors to reduce dimensionality, which also lowers time complexity. In the experiments conducted, combining cepstral coefficients with prosodic features yields a language identification rate of around 87%, about 18% above a baseline system using Mel-frequency cepstral coefficients (MFCCs). The results indicate that temporal variation and prosody are important factors to consider for language identification tasks.
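The two dimensionality-reduction steps mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the F0 contour and the feature matrix below are synthetic stand-ins (the actual features come from the recorded Dravidian speech), and the polynomial order and number of retained components are assumed values, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical F0 (pitch) contour over voiced frames, mapped to [-1, 1],
# the natural domain of the Legendre polynomials.
t = np.linspace(-1.0, 1.0, 200)
f0 = 120 + 15 * np.sin(2 * np.pi * t) + rng.normal(0, 2, t.size)

# Step 1: Legendre polynomial fitting. A contour of arbitrary length is
# summarised by a fixed number of Legendre coefficients (order + 1 values).
order = 4  # assumed order for illustration
leg_coeffs = np.polynomial.legendre.legfit(t, f0, deg=order)

# Step 2: PCA on a matrix of per-utterance feature vectors
# (rows = utterances, columns = features), via SVD of the centred data.
def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)            # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T               # coordinates in the k-dim subspace

X = rng.normal(size=(50, order + 1))   # stand-in feature matrix
X_red = pca_reduce(X, k=2)
```

After these steps each utterance is represented by a short, fixed-length vector, which is what keeps the time complexity of the downstream classifier manageable.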


Notes

  1. The terms 'pitch' and 'F0' are used interchangeably in this article.


Author information

Corresponding author: Y. V. Srinivasa Murthy.


Cite this article

Koolagudi, S.G., Bharadwaj, A., Srinivasa Murthy, Y.V. et al. Dravidian language classification from speech signal using spectral and prosodic features. Int J Speech Technol 20, 1005–1016 (2017). https://doi.org/10.1007/s10772-017-9466-5
