Combining Articulatory Features with End-to-End Learning in Speech Recognition

Qu, Leyuan; Weber, Cornelius; Lakomkin, Egor; Twiefel, Johannes; Wermter, Stefan

doi:10.1007/978-3-030-01424-7_49

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11141)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

End-to-end neural networks have shown promising results in large vocabulary continuous speech recognition (LVCSR). However, integrating domain knowledge into such systems is challenging. In particular, articulatory features (AFs), which are inspired by the human speech production mechanism, can help in speech recognition. This paper presents two approaches for incorporating this domain knowledge into end-to-end training: (a) fine-tuning networks, which reuse the hidden-layer representations of AF extractors as input for the automatic speech recognition (ASR) task; and (b) progressive networks, which incorporate articulatory knowledge through lateral connections from AF extractors. We evaluate the proposed approaches on the Wall Street Journal speech corpus and test on the standard eval92 evaluation set. The results show that both fine-tuning and progressive networks can integrate articulatory information into end-to-end learning and outperform previous systems.
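
To make the difference between the two approaches concrete, the following is a minimal single-frame sketch in plain Python. It is not the authors' implementation: the abstract does not specify layer types, sizes, or the training objective, so every name here (AFExtractor, finetune_forward, progressive_forward, W_asr, U_lat, W_out), the single hidden layer, the ReLU activation, and the random weights are assumptions made purely for illustration. The sketch only shows where the AF extractor's hidden representation enters the ASR network in each case.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def add(a, b):
    """Element-wise sum of two vectors of equal length."""
    return [p + q for p, q in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class AFExtractor:
    """Hypothetical articulatory-feature column with one hidden layer."""

    def __init__(self, n_in, n_hidden, rng):
        self.W = rand_matrix(n_hidden, n_in, rng)

    def hidden(self, x):
        # Hidden-layer representation that both approaches reuse.
        return relu(matvec(self.W, x))


def finetune_forward(af, W_out, x):
    """Approach (a): the AF hidden representation is the ASR input."""
    return matvec(W_out, af.hidden(x))


def progressive_forward(af, W_asr, U_lat, W_out, x):
    """Approach (b): the ASR column keeps its own hidden layer and
    adds a lateral connection (U_lat) from the AF hidden layer."""
    h_asr = relu(add(matvec(W_asr, x), matvec(U_lat, af.hidden(x))))
    return matvec(W_out, h_asr)


rng = random.Random(0)
n_in, n_hidden, n_out = 4, 3, 5          # toy sizes, chosen arbitrarily
af = AFExtractor(n_in, n_hidden, rng)
W_asr = rand_matrix(n_hidden, n_in, rng)
U_lat = rand_matrix(n_hidden, n_hidden, rng)
W_out = rand_matrix(n_out, n_hidden, rng)

x = [0.1, -0.2, 0.3, 0.4]                # one acoustic feature frame (made up)
logits_a = finetune_forward(af, W_out, x)
logits_b = progressive_forward(af, W_asr, U_lat, W_out, x)
```

In (a) the ASR output layer sees only what the AF extractor computed, whereas in (b) the ASR column processes the input itself and receives the AF representation as an additional signal. Setting U_lat to zeros reduces (b) to a plain ASR column with no articulatory input.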

Notes

1. http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Acknowledgements

The authors gratefully acknowledge partial support from the China Scholarship Council (CSC), the German Research Foundation DFG under project CML (TRR 169), and the European Union under project SECURE (No. 642667).

Author information

Authors and Affiliations

Department of Informatics, University of Hamburg, Vogt-Koelln-Str. 30, 22527, Hamburg, Germany
Leyuan Qu, Cornelius Weber, Egor Lakomkin, Johannes Twiefel & Stefan Wermter

Corresponding author

Correspondence to Leyuan Qu.

Editor information

Editors and Affiliations

Czech Academy of Sciences, Prague 8, Czech Republic
Věra Kůrková
Open University of Cyprus, Latsia, Cyprus
Yannis Manolopoulos
CITEC Bielefeld University, Bielefeld, Germany
Barbara Hammer
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
University of Piraeus, Piraeus, Greece
Ilias Maglogiannis

About this paper

Cite this paper

Qu, L., Weber, C., Lakomkin, E., Twiefel, J., Wermter, S. (2018). Combining Articulatory Features with End-to-End Learning in Speech Recognition. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. ICANN 2018. Lecture Notes in Computer Science, vol 11141. Springer, Cham. https://doi.org/10.1007/978-3-030-01424-7_49

DOI: https://doi.org/10.1007/978-3-030-01424-7_49
Published: 27 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01423-0
Online ISBN: 978-3-030-01424-7
eBook Packages: Computer Science, Computer Science (R0)
