Optimal prosodic feature extraction and classification in parametric excitation source information for Indian language identification using neural network based Q-learning algorithm

  • Himanish Shekhar Das
  • Pinki Roy

Abstract

Automatic language identification (LID) systems are widely used in real-world multilingual speech applications. Speech production relies on the vocal tract, and the excitation source information associated with it is explored here for the LID task. In this paper, the LID system uses sub-segmental, segmental and supra-segmental features derived from the Linear Prediction residual of the speech signal, which represent the excitation source information of speech in various native languages. The glottal flow derivative of the speech signal is obtained through the iterative adaptive inverse filtering (IAIF) method. In addition, the prosodic features of the speech signal are extracted using the short time Fourier transform (STFT), chosen for its ability to process non-stationary signals. A deep neural network based Q-learning (DNNQL) algorithm is then employed to identify the class label for a specific language. The proposed approach is validated experimentally on a recorded database of Indian languages. The proposed LID system performs well compared to other machine learning based approaches, with 97.3% accuracy.
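As a rough illustration of the Linear Prediction residual mentioned in the abstract, the sketch below estimates predictor coefficients with the autocorrelation method (Levinson-Durbin recursion) and inverse-filters a frame to obtain the residual. The function names `lpc_coefficients` and `lp_residual`, the prediction order, and the synthetic second-order test signal are assumptions made for illustration; they are not taken from the paper, and the IAIF, STFT and DNNQL stages are not shown.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Estimate A(z) = 1 + a1 z^-1 + ... + ap z^-p with the autocorrelation
    method, solved by the Levinson-Durbin recursion."""
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy of the order-0 predictor
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining coefficients at 0
        # Reflection coefficient for order i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_residual(frame, order=10):
    """Return (a, e), where e[n] = x[n] + sum_k a[k] * x[n-k] is the LP residual."""
    a = lpc_coefficients(frame, order)
    return a, np.convolve(frame, a)[: len(frame)]


# Hypothetical test signal: white noise shaped by a known second-order
# all-pole filter, standing in for one frame of speech.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
signal = np.zeros_like(noise)
for t in range(len(noise)):
    signal[t] = noise[t]
    if t >= 1:
        signal[t] += 1.3 * signal[t - 1]
    if t >= 2:
        signal[t] -= 0.6 * signal[t - 2]

a, residual = lp_residual(signal, order=2)
```

On real speech the same inverse filtering would typically be applied to short, windowed frames; the frame length and prediction order suited to sub-segmental, segmental and supra-segmental analysis are not given in the abstract, so they are left open here.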

Keywords

Automatic language identification (LID); Prosodic feature; Iterative adaptive inverse filtering (IAIF); Short time Fourier transform (STFT); Neural network based Q-learning (NNQL)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, National Institute of Technology Silchar, Silchar, India
