Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11507)

Abstract

A recent trend in language learning is gamification, i.e. the application of game-design elements and game principles in non-game contexts. A key component therein is the detection of mispronunciations by means of automatic speech recognition. However, constraints such as quiet environments and the use of close-talking microphones hinder the applicability of conventional speech recognizers for language learning games.

In this work, we propose to use multi-modal—specifically audio-visual—speech recognition as an alternative for detecting mispronunciations in acoustically noisy or otherwise challenging environments. We examine a hybrid speech recognizer structure, using either feed-forward or bidirectional long short-term memory (BiLSTM) networks. There are several options for integrating the two modalities. Here, we compare early fusion, i.e. the use of one joint audio-visual network, with a turbo-decoding approach that combines contributions from separate acoustic and visual models. We evaluate the performance of these topologies in detecting some common phoneme mispronunciations, namely errors in manner of articulation (MoA) and in place of articulation (PoA). It is shown that our novel architecture, using deep neural network acoustic and visual submodels in conjunction with turbo-decoding, is very well suited for the task of mispronunciation detection, and that the visual modality contributes strongly to achieving noise-robust performance.
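To illustrate the distinction between the two fusion strategies described above, the following toy sketch contrasts early fusion (one joint network over concatenated audio-visual features) with a turbo-style decision fusion in which separate acoustic and visual streams iteratively exchange their phoneme posteriors. All names, dimensions, and the linear "networks" here are illustrative assumptions, not the paper's actual models; a real system would use trained DNN/BiLSTM submodels and a full turbo decoder over HMM state sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PHONES = 5          # toy phoneme inventory (assumption)
A_DIM, V_DIM = 13, 8  # e.g. MFCC-like audio and lip-shape visual features

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# --- Early fusion: one joint model sees the concatenated [audio; video] ---
W_joint = rng.normal(size=(A_DIM + V_DIM, N_PHONES))

def early_fusion_posteriors(audio, video):
    return softmax(np.concatenate([audio, video], axis=-1) @ W_joint)

# --- Turbo-style decision fusion: separate streams exchange posteriors ---
W_a = rng.normal(size=(A_DIM, N_PHONES))
W_v = rng.normal(size=(V_DIM, N_PHONES))

def turbo_style_posteriors(audio, video, iters=3, weight_a=0.6):
    p_a, p_v = softmax(audio @ W_a), softmax(video @ W_v)
    p = softmax(weight_a * np.log(p_a) + (1 - weight_a) * np.log(p_v))
    for _ in range(iters):
        # each stream re-scores using the current joint belief as a prior
        p_a_new = softmax(np.log(p_a) + np.log(p))
        p_v_new = softmax(np.log(p_v) + np.log(p))
        p = softmax(weight_a * np.log(p_a_new)
                    + (1 - weight_a) * np.log(p_v_new))
    return p

audio = rng.normal(size=A_DIM)
video = rng.normal(size=V_DIM)
print(early_fusion_posteriors(audio, video).round(3))
print(turbo_style_posteriors(audio, video).round(3))
```

The practical difference: early fusion requires jointly trained parameters over both modalities, while decision-level fusion keeps the streams independent and can reweight them (here via `weight_a`) when one modality, e.g. audio under noise, becomes unreliable.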

This project has received funding from the European Regional Development Fund (ERDF).



Author information

Correspondence to Mahdie Karbasi.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Karbasi, M., Zeiler, S., Freiwald, J., Kolossa, D. (2019). Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2019. Lecture Notes in Computer Science, vol 11507. Springer, Cham. https://doi.org/10.1007/978-3-030-20518-8_54

  • DOI: https://doi.org/10.1007/978-3-030-20518-8_54

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20517-1

  • Online ISBN: 978-3-030-20518-8

  • eBook Packages: Computer Science (R0)
