Abstract
An interesting aspect of the Dravidian languages is the commonality they share through script, vocabulary, and a common root language. In this work, an attempt is made to classify four Dravidian languages using cepstral coefficients and prosodic features. Speech in these languages was recorded in various environments to build the database. It is demonstrated that while cepstral coefficients alone identify the language with a fair degree of accuracy, adding prosodic features to them improves language identification performance. Legendre polynomial fitting and principal component analysis (PCA) are applied to the feature vectors to reduce dimensionality, which also lowers the time complexity. In the experiments conducted, a language identification rate of around 87% is obtained using both cepstral coefficients and prosodic features, about 18% above the baseline system using Mel-frequency cepstral coefficients (MFCCs). The results indicate that temporal variations and prosody are important factors to consider in language identification tasks.
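The two dimensionality-reduction steps named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, contour length, and polynomial order are hypothetical, and PCA is computed here via an SVD of the mean-centred data.

```python
import numpy as np

def legendre_coeffs(contour, order=5):
    """Fit a Legendre polynomial to a prosodic contour (e.g. F0 over a
    segment) and return its coefficients as a fixed-length shape summary."""
    x = np.linspace(-1.0, 1.0, len(contour))
    return np.polynomial.legendre.legfit(x, contour, order)

def pca_reduce(features, n_components):
    """Project feature vectors (one per row) onto the top principal
    components, obtained from the SVD of the mean-centred data matrix."""
    centred = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

# Hypothetical data: 200 utterance-level feature vectors of dimension 39
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 39))
reduced = pca_reduce(feats, 10)
print(reduced.shape)  # (200, 10)
```

A contour of any length is thus mapped to `order + 1` Legendre coefficients, and the concatenated cepstral-plus-prosodic vectors are compressed by PCA before classification, which is what reduces the time complexity mentioned above.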
Notes
The terms 'pitch' and 'F0' are used interchangeably in this article.
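For illustration, a per-frame F0 value can be estimated with a simple autocorrelation method. This is a generic sketch, not the pitch tracker used in the article; the sampling rate, frame length, and pitch-lag bounds are assumptions.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=50.0, fmax=400.0):
    """Crude F0 estimate for one voiced frame: find the strongest
    autocorrelation peak inside the plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

# Synthetic 120 Hz tone at 16 kHz: the estimate should land near 120 Hz
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
f0 = estimate_f0(np.sin(2 * np.pi * 120.0 * t), fs)
print(round(f0, 1))
```

Real systems add voicing detection and contour smoothing on top of such a frame-level estimate, since autocorrelation alone is prone to octave errors on noisy speech.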
Cite this article
Koolagudi, S.G., Bharadwaj, A., Srinivasa Murthy, Y.V. et al. Dravidian language classification from speech signal using spectral and prosodic features. Int J Speech Technol 20, 1005–1016 (2017). https://doi.org/10.1007/s10772-017-9466-5