Grapheme-to-Phoneme Transduction for Cross-Language ASR

Hasegawa-Johnson, Mark; Rolston, Leanne; Goudeseune, Camille; Levow, Gina-Anne; Kirchhoff, Katrin

doi:10.1007/978-3-030-59430-5_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12379)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

Automatic speech recognition (ASR) can be deployed in a previously unknown language, in less than 24 h, given just three resources: an acoustic model trained on other languages, a set of language-model training data, and a grapheme-to-phoneme (G2P) transducer to connect them. The LanguageNet G2Ps were created with the goal of being small, fast, and easy to port to a previously unseen language. Data come from pronunciation lexicons if available, but if there are no pronunciation lexicons in the target language, then data are generated from minimal resources: from a Wikipedia description of the target language, or from a one-hour interview with a native speaker of the language. Using such methods, the LanguageNet G2Ps now include simple models in nearly 150 languages, with trained finite state transducers in 122 languages, 59 of which are sufficiently well-resourced to permit measurement of their phone error rates. This paper proposes a measure of the distance between the G2Ps in different languages, and demonstrates that agglomerative clustering of the LanguageNet languages bears some resemblance to a phylogeographic language family tree. The LanguageNet G2Ps proposed in this paper have already been applied in three cross-language ASRs, using both hybrid and end-to-end neural architectures, and further experiments are ongoing.
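
The clustering idea in the abstract can be illustrated with a minimal sketch. The abstract does not define the paper's distance measure, so the sketch below substitutes a stand-in: the Jaccard distance between sets of (grapheme, phoneme) pairs. The four toy languages, their mapping tables, and the function names `g2p_distance` and `agglomerative_cluster` are all hypothetical and chosen only for illustration. Average linkage is likewise an assumption, not necessarily the linkage the authors used.

```python
from itertools import combinations

# Toy grapheme-to-phoneme tables. These languages and mappings are
# hypothetical; they are not taken from the LanguageNet G2Ps.
TOY_G2P = {
    "lang_a": {"a": "a", "c": "k", "j": "x", "z": "θ"},
    "lang_b": {"a": "a", "c": "k", "j": "ʒ", "z": "z"},
    "lang_c": {"a": "a", "c": "ts", "j": "j", "z": "z"},
    "lang_d": {"a": "a", "c": "ts", "j": "j", "z": "ʒ"},
}


def g2p_distance(table_x, table_y):
    """Jaccard distance between two sets of (grapheme, phoneme) pairs.

    This is a stand-in measure, assumed for illustration only.
    """
    pairs_x, pairs_y = set(table_x.items()), set(table_y.items())
    union = pairs_x | pairs_y
    if not union:
        return 0.0
    return 1.0 - len(pairs_x & pairs_y) / len(union)


def agglomerative_cluster(tables):
    """Average-linkage agglomerative clustering.

    Returns the merges in the order they happen, each as
    (left_cluster, right_cluster, linkage_distance).
    """

    def linkage(pair):
        # Mean pairwise distance between members of the two clusters.
        left, right = pair
        dists = [g2p_distance(tables[x], tables[y]) for x in left for y in right]
        return sum(dists) / len(dists)

    # Start with every language in its own cluster.
    clusters = [(name,) for name in sorted(tables)]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters, then repeat.
        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right, linkage((left, right))))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


if __name__ == "__main__":
    for left, right, dist in agglomerative_cluster(TOY_G2P):
        print(f"merge {left} + {right} at distance {dist:.3f}")
```

Reading the merge list from first to last gives the tree from the leaves up: languages whose spelling-to-sound mappings overlap most are joined first, which is the sense in which such a tree can be compared with a language family tree.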

This research was supported by the DARPA LORELEI program. Conclusions and findings are those of the authors, and are not endorsed by DARPA.

Notes

1. The complete tree is at github.com/uiuc-sst/g2ps/blob/master/g2ppy/cluster/agglomerative_cluster_output_2020-07-18.txt.

Author information

Authors and Affiliations

University of Illinois, Champaign, USA
Mark Hasegawa-Johnson & Camille Goudeseune
University of Washington, Seattle, USA
Leanne Rolston & Gina-Anne Levow
Amazon Alexa, Seattle, USA
Katrin Kirchhoff

Corresponding author

Correspondence to Mark Hasegawa-Johnson.

Editor information

Editors and Affiliations

Cardiff University, Cardiff, UK
Luis Espinosa-Anke
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
Computer Science, Cardiff University, Cardiff, UK
Irena Spasić

Cite this paper

Hasegawa-Johnson, M., Rolston, L., Goudeseune, C., Levow, GA., Kirchhoff, K. (2020). Grapheme-to-Phoneme Transduction for Cross-Language ASR. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2020. Lecture Notes in Computer Science, vol 12379. Springer, Cham. https://doi.org/10.1007/978-3-030-59430-5_1

DOI: https://doi.org/10.1007/978-3-030-59430-5_1
Published: 26 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59429-9
Online ISBN: 978-3-030-59430-5
eBook Packages: Computer Science, Computer Science (R0)
