Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11507)

Abstract

A recent trend in language learning is gamification, i.e. the application of game-design elements and game principles in non-game contexts. A key component therein is the detection of mispronunciations by means of automatic speech recognition. However, constraints such as quiet environments and the use of close-talking microphones hinder the applicability of conventional speech recognizers for language learning games.

In this work, we propose to use multi-modal—specifically audio-visual—speech recognition as an alternative for detecting mispronunciations in acoustically noisy or otherwise challenging environments. We examine a hybrid speech recognizer structure, using either feed-forward or bidirectional long short-term memory (BiLSTM) networks. There are several options for integrating the two modalities. Here, we compare early fusion, i.e. the use of one joint audio-visual network, with a turbo-decoding approach that combines contributions from separate acoustic and visual models. We evaluate the performance of these topologies in detecting some common phoneme mispronunciations, namely errors in manner of articulation (MoA) and in place of articulation (PoA). It is shown that our novel architecture, using deep neural network acoustic and visual submodels in conjunction with turbo-decoding, is very well suited for the task of mispronunciation detection, and that the visual modality contributes strongly to achieving noise-robust performance.
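To illustrate the distinction between the two fusion strategies described above, the following toy sketch contrasts early fusion (one joint network over concatenated audio-visual features) with a turbo-style decision fusion in which separate acoustic and visual streams iteratively exchange their phoneme posteriors. All names, dimensions, and the linear "networks" here are illustrative assumptions, not the paper's actual models; a real system would use trained DNN/BiLSTM submodels and a full turbo decoder over HMM state sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PHONES = 5          # toy phoneme inventory (assumption)
A_DIM, V_DIM = 13, 8  # e.g. MFCC-like audio and lip-shape visual features

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# --- Early fusion: one joint model sees the concatenated [audio; video] ---
W_joint = rng.normal(size=(A_DIM + V_DIM, N_PHONES))

def early_fusion_posteriors(audio, video):
    return softmax(np.concatenate([audio, video], axis=-1) @ W_joint)

# --- Turbo-style decision fusion: separate streams exchange posteriors ---
W_a = rng.normal(size=(A_DIM, N_PHONES))
W_v = rng.normal(size=(V_DIM, N_PHONES))

def turbo_style_posteriors(audio, video, iters=3, weight_a=0.6):
    p_a, p_v = softmax(audio @ W_a), softmax(video @ W_v)
    p = softmax(weight_a * np.log(p_a) + (1 - weight_a) * np.log(p_v))
    for _ in range(iters):
        # each stream re-scores using the current joint belief as a prior
        p_a_new = softmax(np.log(p_a) + np.log(p))
        p_v_new = softmax(np.log(p_v) + np.log(p))
        p = softmax(weight_a * np.log(p_a_new)
                    + (1 - weight_a) * np.log(p_v_new))
    return p

audio = rng.normal(size=A_DIM)
video = rng.normal(size=V_DIM)
print(early_fusion_posteriors(audio, video).round(3))
print(turbo_style_posteriors(audio, video).round(3))
```

The practical difference: early fusion requires jointly trained parameters over both modalities, while decision-level fusion keeps the streams independent and can reweight them (here via `weight_a`) when one modality, e.g. audio under noise, becomes unreliable.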

This project has received funding from the European Regional Development Fund (ERDF).



Author information

Correspondence to Mahdie Karbasi.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Karbasi, M., Zeiler, S., Freiwald, J., Kolossa, D. (2019). Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2019. Lecture Notes in Computer Science, vol 11507. Springer, Cham. https://doi.org/10.1007/978-3-030-20518-8_54

  • DOI: https://doi.org/10.1007/978-3-030-20518-8_54

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20517-1

  • Online ISBN: 978-3-030-20518-8

  • eBook Packages: Computer Science (R0)
