Abstract
Spoken language identification (LID), also called spoken language recognition (LR), is the task of recognizing the language spoken in a speech utterance. In this paper, a new Fourier parameter (FP) model is proposed for speaker-independent spoken language recognition. The performance of the proposed FP features is analyzed and compared with that of the conventional mel-frequency cepstral coefficient (MFCC) features. Two multilingual databases, namely the Indian Institute of Technology Kharagpur Multilingual Indian Language Speech Corpus (IITKGP-MLILSC) and the Oriental Language Recognition Speech Corpus (AP18-OLR), are used to extract FP and MFCC features. Spoken LID/LR models are developed from the extracted FP and MFCC features using three classifiers: support vector machines, feed-forward artificial neural networks, and deep neural networks. Experimental results show that the proposed FP features can effectively recognize different languages from speech signals and that they significantly improve recognition performance over MFCC features. The performance is further enhanced when MFCC and FP features are combined.
Notes
The terms ‘system,’ ‘model,’ and ‘classifier’ are interchangeably used in this article.
The terms ‘corpus’ and ‘database’ are interchangeably used in this article.
National Natural Science Foundation of China, http://www.nsfc.gov.cn.
Multilingual Minorlingual Automatic Speech Recognition, http://m2asr.cslt.org.
The terms ‘frame’ and ‘segment’ are interchangeably used in this article.
The terms ‘target’ and ‘class’ are interchangeably used in this article.
The terms ‘attribute’ and ‘feature’ are interchangeably used in this article.
The user-defined parameter k in ReliefF feature selection refers to k-nearest neighbors. In this paper, the value of k is set to 10.
For a q-class classification problem, SVM with OVO configuration uses \(\frac{q(q-1)}{2}\) binary learners.
For a q-class classification problem, SVM with OVA configuration uses q binary learners.
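The binary-learner counts for the two SVM configurations above can be sketched as follows; the function names are illustrative, not from the paper:

```python
# Number of binary SVM learners needed for a q-class problem
# under the one-vs-one (OVO) and one-vs-all (OVA) configurations.

def ovo_learners(q: int) -> int:
    # OVO trains one binary learner per unordered pair of classes.
    return q * (q - 1) // 2

def ova_learners(q: int) -> int:
    # OVA trains one binary learner per class (class vs. rest).
    return q
```

For example, a 10-language recognition task would require 45 OVO learners but only 10 OVA learners.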
\(\hbox {tansig}(n) = \frac{2}{1+e^{-2n}}-1\).
\(\hbox {logsig}(n) = \frac{1}{(1 + e^{-n})}\).
\(\hbox {elliotsig}(n)=\frac{0.5~n}{1+\left| n\right| } + 0.5\).
\(\hbox {softmax}(n) = \frac{e^{n}}{\sum e^{n}}\).
tanh.
\(\sigma \left( x\right) = \left( 1+e^{-x}\right) ^{-1}\).
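The activation functions listed in the notes above can be sketched in NumPy as follows (a minimal illustration, assuming vector inputs; the max-shift in softmax is a standard numerical-stability trick, not part of the formula as stated):

```python
import numpy as np

def tansig(n):
    # Hyperbolic-tangent sigmoid, range (-1, 1); equivalent to np.tanh(n).
    return 2.0 / (1.0 + np.exp(-2.0 * n)) - 1.0

def logsig(n):
    # Logistic sigmoid, range (0, 1).
    return 1.0 / (1.0 + np.exp(-n))

def elliotsig(n):
    # Elliot sigmoid scaled to (0, 1).
    return 0.5 * n / (1.0 + np.abs(n)) + 0.5

def softmax(n):
    # Normalizes a vector into a probability distribution summing to 1.
    e = np.exp(n - np.max(n))  # shift for numerical stability
    return e / e.sum()
```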
The percentage improvement \(I_{\%}\) in recognition accuracy is computed as \(I_{\%}=\frac{A_{1}-A_{2}}{A_{2}}\times 100\%\), where \(A_{1}\) and \(A_{2}\) are recognition accuracies in percentages and \(A_{1}>A_{2}\).
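The percentage-improvement formula above is straightforward to compute; the function name below is illustrative:

```python
def pct_improvement(a1: float, a2: float) -> float:
    """Percentage improvement of accuracy a1 over a2 (requires a1 > a2)."""
    if a1 <= a2:
        raise ValueError("a1 must exceed a2")
    return (a1 - a2) / a2 * 100.0
```

For instance, improving accuracy from 80% to 90% is a 12.5% relative improvement.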
Acknowledgements
The authors would like to thank all anonymous reviewers for providing their valuable comments on earlier drafts of this manuscript. Their comments and suggestions were very useful and helped us improve the quality of the final manuscript. The authors express their appreciation to Prof. Dr. K. Sreenivasa Rao and his research team for sharing the IITKGP-MLILSC database with us during the course of this research. The authors also express their appreciation to Dr. Zhiyuan Tang for sharing the AP18-OLR (AP16-OL7 and AP17-OL3) multilingual database with us during the course of this research. The authors would also like to thank MathWorks®, Inc., for providing the MATLAB® tool and NCH®, Inc., for providing the WavePad® Sound Editor tool. Any correspondence should be made to N. S. Sai Srinivas.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work has been supported under the 'Special Manpower Development Program for Chip to System Design (SMDP-C2SD) Project,' funded by the Ministry of Electronics and Information Technology (MeitY), Government of India (GOI), vide Sanction Order and Grant No. 9(1)/2014-MDD(Vol. III).
About this article
Cite this article
Srinivas, N.S.S., Sugan, N., Kar, N. et al. Recognition of Spoken Languages from Acoustic Speech Signals Using Fourier Parameters. Circuits Syst Signal Process 38, 5018–5067 (2019). https://doi.org/10.1007/s00034-019-01100-6
Keywords
- AP18-OLR database
- AP16-OL7 database
- AP17-OL3 database
- Artificial neural networks (ANN)
- Deep neural networks (DNN)
- Fourier parameters (FP)
- IITKGP-MLILSC database
- Indian languages
- Language identification (LID)
- Language recognition (LR)
- Long short-term memory networks (LSTM)
- Mel-frequency cepstral coefficients (MFCC)
- Oriental languages
- Recurrent neural networks (RNN)
- ReliefF feature selection
- Speech signal processing
- Supervised learning and classification
- Support vector machines (SVM)