Abstract
In this work, the performance of a Multilingual Phone Recognition System (Multi-PRS) is improved using articulatory features (AFs). Four Indian languages – Kannada, Telugu, Bengali and Odia – are used for developing the Multi-PRS. The transcription is derived using the International Phonetic Alphabet (IPA). The Multi-PRS is trained using hidden Markov models and state-of-the-art deep neural networks (DNNs). AFs for five AF groups – place, manner, roundness, frontness and height – are predicted from Mel-frequency cepstral coefficients (MFCCs) using DNNs. Oracle AFs, derived from the ground-truth IPA transcriptions, are used to set an upper bound on the performance realizable by the predicted AFs, and the performances of predicted and oracle AFs are compared. In addition to the AFs, phone posteriors are explored to further boost the performance of the Multi-PRS. Multi-task learning is explored to improve the prediction accuracy of AFs and thereby reduce the Phone Error Rates (PERs) of Multi-PRSs. Fusion of AFs is done using two approaches: (i) lattice re-scoring and (ii) AFs as tandem features. We show that oracle AFs, fused with MFCCs at the feature level, offer a remarkably low PER of 10.4%, a 24.7% absolute reduction compared with the baseline Multi-PRS using MFCCs alone. The best-performing system using predicted AFs shows a 3.2% absolute (9.1% relative) reduction in PER compared with the baseline Multi-PRS. The best performance is obtained using the tandem approach for fusion of the various AFs and phone posteriors.
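The tandem approach mentioned above appends classifier posteriors to the acoustic features frame by frame before acoustic-model training. A minimal sketch of that fusion step is shown below; the function name, feature dimensions, and the log-compression step are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def tandem_features(mfcc: np.ndarray, af_posteriors: np.ndarray) -> np.ndarray:
    """Concatenate per-frame MFCCs with DNN-predicted AF posteriors.

    mfcc:          (num_frames, mfcc_dim) acoustic features
    af_posteriors: (num_frames, af_dim) stacked posteriors over the AF groups
    """
    assert mfcc.shape[0] == af_posteriors.shape[0], "frame counts must match"
    # Log-compress and floor the posteriors; a common tandem preprocessing
    # step that reduces their skewed dynamic range before GMM/DNN training.
    log_post = np.log(np.maximum(af_posteriors, 1e-10))
    return np.hstack([mfcc, log_post])

# Toy usage: 100 frames of 39-dim MFCCs fused with 20 stacked AF posteriors.
mfcc = np.random.randn(100, 39)
af = np.random.rand(100, 20)
feats = tandem_features(mfcc, af)
print(feats.shape)  # (100, 59)
```

In practice the fused features would typically be decorrelated and dimensionality-reduced (e.g. with PCA or an LDA transform) before being fed to the recognizer.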
Acknowledgements
We thank Prof. B Yegnanarayana, Prof. K Sri Rama Murthy and Prof. R Kumaraswamy for providing Telugu and Kannada datasets.
Cite this article
Manjunath, K.E., Jayagopi, D.B., Rao, K.S. et al. Articulatory-feature-based methods for performance improvement of Multilingual Phone Recognition Systems using Indian languages. Sādhanā 45, 190 (2020). https://doi.org/10.1007/s12046-020-01428-9