
Articulatory-feature-based methods for performance improvement of Multilingual Phone Recognition Systems using Indian languages


Abstract

In this work, the performance of a Multilingual Phone Recognition System (Multi-PRS) is improved using articulatory features (AFs). Four Indian languages – Kannada, Telugu, Bengali and Odia – are used to develop the Multi-PRS. The transcription is derived using the International Phonetic Alphabet (IPA). The Multi-PRS is trained using hidden Markov models and state-of-the-art Deep Neural Networks (DNNs). AFs for five AF groups – place, manner, roundness, frontness and height – are predicted from Mel-frequency cepstral coefficients (MFCCs) using DNNs. Oracle AFs, derived from the ground-truth IPA transcriptions, are used to set the best performance realizable by the predicted AFs, and the performances of predicted and oracle AFs are compared. In addition to the AFs, phone posteriors are explored to further boost the performance of the Multi-PRS. Multi-task learning is explored to improve the prediction accuracy of AFs and thereby reduce the Phone Error Rates (PERs) of the Multi-PRS. Fusion of AFs is carried out using two approaches: (i) lattice re-scoring and (ii) AFs as tandem features. We show that fusing oracle AFs with MFCCs yields a remarkably low PER of 10.4%, a 24.7% absolute reduction compared with the baseline Multi-PRS using MFCCs alone. The best performing system using predicted AFs shows a 3.2% absolute (9.1% relative) reduction in PER compared with the baseline Multi-PRS. The best performance is obtained using the tandem approach for fusion of the various AFs and phone posteriors.
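The tandem approach mentioned above lends itself to a compact illustration: per-frame AF posteriors from the five group classifiers are appended to the acoustic features before the recognizer sees them. The sketch below is a minimal rendering of that idea in Python/NumPy, not the paper's implementation: the AF group class counts, the single affine layer standing in for each per-group DNN, and the random weights are all illustrative assumptions.

```python
import numpy as np

# Illustrative class counts for the five AF groups used in the paper
# (place, manner, roundness, frontness, height); the exact cardinalities
# here are assumptions, not the paper's configuration.
AF_GROUPS = {"place": 10, "manner": 6, "roundness": 3, "frontness": 4, "height": 5}

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def af_posteriors(mfcc, w, b):
    """Stand-in for one AF-group DNN: a single affine layer plus softmax.
    In the paper, each group has its own trained DNN operating on MFCCs."""
    return softmax(mfcc @ w + b)

def tandem_features(mfcc, rng):
    """Append per-frame AF posteriors from all five groups to the MFCCs."""
    parts = [mfcc]
    for k in AF_GROUPS.values():
        w = rng.standard_normal((mfcc.shape[1], k))  # random weights, illustration only
        parts.append(af_posteriors(mfcc, w, np.zeros(k)))
    return np.hstack(parts)

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 39))  # 100 frames of 39-dim MFCCs (13 + deltas + delta-deltas)
feats = tandem_features(mfcc, rng)
print(feats.shape)                     # (100, 67): 39 MFCCs + 28 AF posteriors
```

In practice, tandem systems usually take log posteriors and decorrelate them (for example with PCA) before appending them to the acoustics; the sketch skips that step for brevity.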




Acknowledgements

We thank Prof. B Yegnanarayana, Prof. K Sri Rama Murthy and Prof. R Kumaraswamy for providing Telugu and Kannada datasets.

Author information


Corresponding author

Correspondence to K E Manjunath.


About this article


Cite this article

Manjunath, K.E., Jayagopi, D.B., Rao, K.S. et al. Articulatory-feature-based methods for performance improvement of Multilingual Phone Recognition Systems using Indian languages. Sādhanā 45, 190 (2020). https://doi.org/10.1007/s12046-020-01428-9

