Abstract
Automatic speech recognition (ASR) can be deployed in a previously unknown language, in less than 24 h, given just three resources: an acoustic model trained on other languages, a set of language-model training data, and a grapheme-to-phoneme (G2P) transducer to connect them. The LanguageNet G2Ps were created with the goal of being small, fast, and easy to port to a previously unseen language. Data come from pronunciation lexicons if available, but if there are no pronunciation lexicons in the target language, then data are generated from minimal resources: from a Wikipedia description of the target language, or from a one-hour interview with a native speaker of the language. Using such methods, the LanguageNet G2Ps now include simple models in nearly 150 languages, with trained finite state transducers in 122 languages, 59 of which are sufficiently well-resourced to permit measurement of their phone error rates. This paper proposes a measure of the distance between the G2Ps in different languages, and demonstrates that agglomerative clustering of the LanguageNet languages bears some resemblance to a phylogeographic language family tree. The LanguageNet G2Ps proposed in this paper have already been applied in three cross-language ASRs, using both hybrid and end-to-end neural architectures, and further experiments are ongoing.
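As a rough illustration of the clustering step mentioned above, the following minimal sketch runs average-linkage agglomerative clustering over a pairwise distance between G2Ps. The abstract does not define the proposed distance measure, so the sketch substitutes a Jaccard distance between sets of (grapheme, phoneme) pairs; that distance, the names `jaccard_distance`, `agglomerative_cluster`, and `TOY_G2PS`, and the toy rule sets are all assumptions made for illustration, not the authors' method or data.

```python
from itertools import combinations

# Hypothetical toy G2P rule sets: each "language" is a set of
# (grapheme, phoneme) pairs. These are invented for illustration only.
TOY_G2PS = {
    "lang_a": {("a", "a"), ("b", "b"), ("c", "k")},
    "lang_b": {("a", "a"), ("b", "b"), ("c", "s")},
    "lang_c": {("x", "ks"), ("y", "j"), ("c", "k")},
}


def jaccard_distance(rules_a, rules_b):
    """Return 1 - |A & B| / |A | B| for two sets of (grapheme, phoneme) pairs.

    ASSUMPTION: a stand-in distance, not the measure proposed in the paper.
    """
    union = rules_a | rules_b
    if not union:
        return 0.0
    return 1.0 - len(rules_a & rules_b) / len(union)


def agglomerative_cluster(items, distance):
    """Average-linkage agglomerative clustering.

    items: dict mapping a name to an object accepted by `distance`.
    Returns the merges in order, each as (members_1, members_2, linkage).
    """
    names = sorted(items)
    # Precompute all pairwise distances between individual items.
    base = {
        frozenset((a, b)): distance(items[a], items[b])
        for a, b in combinations(names, 2)
    }
    clusters = [frozenset([name]) for name in names]
    merges = []

    def linkage(c1, c2):
        # Average distance over all cross-cluster item pairs.
        total = sum(base[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the closest pair of clusters (first pair wins on ties).
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((sorted(c1), sorted(c2), linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


if __name__ == "__main__":
    for left, right, dist in agglomerative_cluster(TOY_G2PS, jaccard_distance):
        print(f"merge {left} + {right} at distance {dist:.2f}")
```

The ordered list of merges is enough to draw a dendrogram, which is the kind of tree that could then be compared against a language family tree. Any symmetric distance function can be passed in place of `jaccard_distance`.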
This research was supported by the DARPA LORELEI program. Conclusions and findings are those of the authors, and are not endorsed by DARPA.
Notes
1. The complete tree is at github.com/uiuc-sst/g2ps/blob/master/g2ppy/cluster/agglomerative_cluster_output_2020-07-18.txt.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Hasegawa-Johnson, M., Rolston, L., Goudeseune, C., Levow, GA., Kirchhoff, K. (2020). Grapheme-to-Phoneme Transduction for Cross-Language ASR. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2020. Lecture Notes in Computer Science, vol 12379. Springer, Cham. https://doi.org/10.1007/978-3-030-59430-5_1
DOI: https://doi.org/10.1007/978-3-030-59430-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59429-9
Online ISBN: 978-3-030-59430-5
eBook Packages: Computer Science (R0)