Recognition of Spoken Languages from Acoustic Speech Signals Using Fourier Parameters

Published in: Circuits, Systems, and Signal Processing

Abstract

Spoken language identification (LID), or spoken language recognition (LR), is the task of recognizing the language spoken in a speech utterance. In this paper, a new Fourier parameter (FP) model is proposed for speaker-independent spoken language recognition. The performance of the proposed FP features is analyzed and compared with that of the conventional mel-frequency cepstral coefficient (MFCC) features. Two multilingual databases, the Indian Institute of Technology Kharagpur Multilingual Indian Language Speech Corpus (IITKGP-MLILSC) and the Oriental Language Recognition Speech Corpus (AP18-OLR), are used to extract the FP and MFCC features. Spoken LID/LR models are developed from the extracted FP and MFCC features using three classifiers: support vector machines, feed-forward artificial neural networks, and deep neural networks. Experimental results show that the proposed FP features can effectively recognize different languages from speech signals, and that they yield significantly better recognition performance than MFCC features. Recognition performance improves further when the MFCC and FP features are combined.
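As a rough illustration of the front end described above, the following MATLAB sketch extracts frame-level MFCC features and frame-level Fourier magnitudes from a speech file and concatenates the two streams. It is a minimal sketch only: the file name, the 25 ms/10 ms framing, and the use of the first 20 DFT magnitudes as a stand-in for the FP features are illustrative assumptions; the paper's exact FP definition is given in the full text.

% Minimal feature-extraction sketch (not the paper's exact FP recipe).
[x, fs] = audioread('utterance.wav');   % hypothetical mono speech file
x = x(:, 1);

% Conventional MFCC features, one row per frame (Audio Toolbox);
% the default is 13 coefficients plus appended log energy.
mfccFeat = mfcc(x, fs);

% Frame the signal: 25 ms windows with a 10 ms hop (assumed values).
winLen = round(0.025 * fs);
hopLen = round(0.010 * fs);
frames = buffer(x, winLen, winLen - hopLen, 'nodelay');

% Fourier-magnitude proxy for the FP features: magnitudes of the first
% H DFT bins of each Hamming-windowed frame (H = 20 is assumed).
H = 20;
spec = abs(fft(frames .* hamming(winLen)));
fpFeat = spec(2:H + 1, :).';            % skip DC; one row per frame

% Concatenate the two streams frame-by-frame; the frame counts may
% differ slightly because mfcc uses its own default framing.
n = min(size(mfccFeat, 1), size(fpFeat, 1));
combined = [mfccFeat(1:n, :), fpFeat(1:n, :)];

Frame-level vectors like these would then be pooled or fed as sequences to the SVM, ANN, and DNN classifiers sketched after the notes below.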



Notes

  1. The terms ‘system,’ ‘model,’ and ‘classifier’ are used interchangeably in this article.

  2. The terms ‘corpus’ and ‘database’ are used interchangeably in this article.

  3. http://www.iitkgp.ac.in.

  4. http://www.olrchallenge.org.

  5. http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OLR_Challenge_2018.

  6. http://en.speechocean.com.

  7. http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OLR_Challenge_2016.

  8. National Natural Science Foundation of China, http://www.nsfc.gov.cn.

  9. Multilingual Minorlingual Automatic Speech Recognition, http://m2asr.cslt.org.

  10. http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OLR_Challenge_2017.

  11. The terms ‘frame’ and ‘segment’ are used interchangeably in this article.

  12. The terms ‘target’ and ‘class’ are used interchangeably in this article.

  13. https://in.mathworks.com/help/stats/relieff.html.

  14. The terms ‘attribute’ and ‘feature’ are used interchangeably in this article.

  15. The user-defined parameter k in ReliefF feature selection refers to the number of nearest neighbors; in this paper, k is set to 10 (see the feature-selection/SVM sketch following these notes).

  16. For a q-class classification problem, SVM with OVO configuration uses \(\frac{q(q-1)}{2}\) binary learners.

  17. For a q-class classification problem, SVM with OVA configuration uses q binary learners.

  18. \(\hbox {tansig}(n) = \frac{2}{1+e^{-2n}} - 1\) (see the ANN sketch following these notes).

  19. \(\hbox {logsig}(n) = \frac{1}{1 + e^{-n}}\).

  20. \(\hbox {elliotsig}(n)=\frac{0.5~n}{1+\left| n\right| } + 0.5\).

  21. \(\hbox {softmax}(n_{i}) = \frac{e^{n_{i}}}{\sum _{j} e^{n_{j}}}\).

  22. \(\hbox {tanh}(n) = \frac{e^{n}-e^{-n}}{e^{n}+e^{-n}}\).

  23. \(\sigma \left( x\right) = \left( 1+e^{-x}\right) ^{-1}\).

  24. http://in.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.lstmlayer.html.

  25. http://in.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.fullyconnectedlayer.html.

  26. https://in.mathworks.com/help/deeplearning/ref/trainingoptions.html (a minimal LSTM sequence-classification sketch follows these notes).

  27. The percentage improvement \(I_{\%}\) in recognition accuracy is computed as \(I_{\%}=\frac{A_{1}-A_{2}}{A_{2}}\times 100\), where \(A_{1}\) and \(A_{2}\) are recognition accuracies in percent and \(A_{1}>A_{2}\); for example, \(A_{1}=55\) and \(A_{2}=50\) give \(I_{\%}=10\%\).
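To make notes 15–17 concrete, the following MATLAB sketch applies ReliefF with k = 10 and trains multiclass SVMs under both the OVO and OVA codings, using the relieff, templateSVM, and fitcecoc functions from the Statistics and Machine Learning Toolbox. The data sizes, the Gaussian kernel, and the number of retained features are illustrative assumptions, not values from the paper.

% ReliefF feature selection with k = 10 nearest neighbors (note 15).
rng(0);
X = randn(200, 40);                     % N-by-d feature matrix (synthetic)
y = randi(6, 200, 1);                   % labels for 6 hypothetical languages

[ranked, weights] = relieff(X, y, 10);  % k = 10, as in note 15
Xsel = X(:, ranked(1:20));              % keep the 20 top-ranked features

% One-vs-one coding (note 16): q = 6 classes -> q(q-1)/2 = 15 binary learners.
t = templateSVM('KernelFunction', 'gaussian');
mdlOVO = fitcecoc(Xsel, y, 'Coding', 'onevsone', 'Learners', t);

% One-vs-all coding (note 17): q = 6 binary learners.
mdlOVA = fitcecoc(Xsel, y, 'Coding', 'onevsall', 'Learners', t);

predicted = predict(mdlOVO, Xsel(1:5, :));  % labels for 5 sample utterances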
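Notes 18–23 list the transfer functions considered for the feed-forward ANN classifier. A minimal MATLAB sketch of such a network follows, assuming an illustrative hidden-layer size of 20 and the scaled-conjugate-gradient trainer; the paper's actual topology and settings appear in the full text.

% Pattern-recognition network with a tansig hidden layer and a softmax
% output layer (notes 18 and 21); both are patternnet defaults.
X = randn(40, 200);                     % d-by-N features (columns = samples)
y = randi(6, 200, 1);                   % 6 hypothetical language classes
T = full(ind2vec(y.', 6));              % q-by-N one-hot target matrix

net = patternnet(20, 'trainscg');       % 20 hidden neurons, SCG training
net.layers{1}.transferFcn = 'tansig';   % note 18
net.layers{2}.transferFcn = 'softmax';  % note 21

net = train(net, X, T);
scores = net(X);                        % q-by-N class posteriors
[~, predicted] = max(scores, [], 1);    % predicted class indices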
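Notes 24–26 point to MATLAB's lstmLayer, fullyConnectedLayer, and trainingOptions documentation. A minimal sequence-classification network assembled from those building blocks might look as follows; every size, option value, and the synthetic training data are assumptions for illustration only.

% Minimal LSTM sequence classifier (notes 24-26); all sizes assumed.
numFeatures = 40;                       % per-frame feature dimension
numHiddenUnits = 100;                   % illustrative LSTM width
numClasses = 6;                         % hypothetical language count

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits, 'OutputMode', 'last')  % one label per utterance
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...   % ADAM optimizer
    'MaxEpochs', 30, ...
    'MiniBatchSize', 32, ...
    'Shuffle', 'every-epoch');

% XTrain: cell array of d-by-T frame-level feature sequences;
% YTrain: categorical utterance-level language labels (synthetic here).
XTrain = arrayfun(@(~) randn(numFeatures, 120), 1:200, 'UniformOutput', false).';
YTrain = categorical(randi(numClasses, 200, 1));

net = trainNetwork(XTrain, YTrain, layers, options);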


Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments on earlier drafts of this manuscript; their comments and suggestions helped us improve the quality of the final manuscript. The authors express their appreciation to Prof. Dr. K. Sreenivasa Rao and his research team for sharing the IITKGP-MLILSC database during the course of this research, and to Dr. Zhiyuan Tang for sharing the AP18-OLR (AP16-OL7 and AP17-OL3) multilingual database. The authors would also like to thank MathWorks®, Inc., for providing the MATLAB® tool and NCH®, Inc., for providing the WavePad® Sound Editor tool. Any correspondence should be made to N. S. Sai Srinivas.

Author information


Corresponding author

Correspondence to N. S. Sai Srinivas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported under the ‘Special Manpower Development Program for Chip to System Design (SMDP-C2SD)’ project, funded by the Ministry of Electronics and Information Technology (MeitY), Government of India (GOI), vide Sanction Order and Grant No. 9(1)/2014-MDD (Vol. III).


Cite this article

Srinivas, N.S.S., Sugan, N., Kar, N. et al. Recognition of Spoken Languages from Acoustic Speech Signals Using Fourier Parameters. Circuits Syst Signal Process 38, 5018–5067 (2019). https://doi.org/10.1007/s00034-019-01100-6

