Using Auto-Encoder BiLSTM Neural Network for Czech Grapheme-to-Phoneme Conversion

  • Conference paper
Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

A crucial part of almost all current TTS systems is grapheme-to-phoneme (G2P) conversion, i.e. the transcription of any input grapheme sequence into the correct sequence of phonemes in the given language. Unfortunately, preparing transcription rules and pronunciation dictionaries for new languages in TTS systems is not an easy process. For that reason, this paper focuses on the creation of an automatic G2P model based on neural networks (NN). In contrast to the majority of related work in the G2P field, which uses only separate words as input, we take a whole phrase as the input of our proposed NN model. In our opinion, this approach should lead to more precise phonetic transcription because the pronunciation of a word can depend on the surrounding words. The results of the trained G2P model are presented for the Czech language, where cross-word-boundary phenomena occur quite often, and they are compared to the rule-based approach.

Notes

  1. In SAMPA [25], the symbol J corresponds to the palatal nasal.

  2. Note: This accuracy was counted at the phoneme level and also includes the padding symbols "-" and the \(\texttt{<break>}\) words.

  3. This time, the accuracies were counted only for regular phonemes and words, without the padding symbol "-" and without phrase-break words.
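
Notes 2 and 3 describe two ways of counting accuracy: over every output symbol, or over regular symbols only. The following is a minimal sketch of that distinction, not the authors' evaluation code. It assumes the reference and predicted sequences are already aligned one-to-one; the function name `symbol_accuracy` and the toy sequences are hypothetical and not taken from the paper.

```python
from typing import Iterable, Sequence

PAD = "-"            # padding symbol named in the notes
BREAK = "<break>"    # phrase-break token named in the notes


def symbol_accuracy(
    reference: Sequence[str],
    predicted: Sequence[str],
    ignore: Iterable[str] = (),
) -> float:
    """Fraction of positions where the predicted symbol equals the reference.

    Positions whose reference symbol is in `ignore` are skipped.
    Assumption: both sequences are aligned one-to-one (same length).
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must be aligned to the same length")
    skip = set(ignore)
    pairs = [(r, p) for r, p in zip(reference, predicted) if r not in skip]
    if not pairs:
        raise ValueError("no symbols left to score")
    return sum(r == p for r, p in pairs) / len(pairs)


# Hypothetical toy data, not taken from the paper.
ref = ["a", "b", "-", "<break>", "c", "d", "-", "-"]
hyp = ["a", "x", "-", "<break>", "c", "d", "-", "-"]

acc_all = symbol_accuracy(ref, hyp)                    # counts padding and breaks
acc_regular = symbol_accuracy(ref, hyp, (PAD, BREAK))  # regular symbols only
```

In this toy case one error among eight symbols becomes one error among four once padding and break tokens are excluded, which illustrates why the two figures are not directly comparable.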

References

  1. Bičan, A.: Distribution and combinations of Czech consonants. Zeitschrift für Slawistik 56, 153–171 (2011)

  2. Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (2008)

  3. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) EMNLP, pp. 1724–1734. ACL (2014)

  4. Hanzlíček, Z., Vít, J., Tihelka, D.: LSTM-based speech segmentation for TTS synthesis. In: Ekštein, K. (ed.) TSD 2019. LNAI, vol. 11697, pp. 361–372. Springer, Heidelberg (2019)

  5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  6. Jiampojamarn, S., Cherry, C., Kondrak, G.: Joint processing and discriminative training for letter-to-phoneme conversion. In: Proceedings of ACL-08: HLT, pp. 905–913. Association for Computational Linguistics, Columbus (2008)

  7. Kučera, H.: The phonology of Czech, Slavistic printings and reprintings, vol. 30, ’s-Gravenhage, Mouton (1961)

  8. Machač, P., Skarnitzl, R.: Principles of phonetic segmentation. Edition erudica, Epocha (2009)

  9. Matoušek, J.: Building a New Czech text-to-speech system using triphone-based speech units. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2000. LNCS (LNAI), vol. 1902, pp. 223–228. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45323-7_38

  10. Matoušek, J., Tihelka, D.: Annotation errors detection in TTS corpora. In: Proceedings of INTERSPEECH 2013, Lyon, France, pp. 1511–1515 (2013)

  11. Matoušek, J., Tihelka, D., Šmídl, L.: On the impact of annotation errors on unit-selection speech synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 456–463. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_55

  12. Matoušek, J., Tihelka, D., Psutka, J.: Experiments with automatic segmentation for Czech speech synthesis. In: Matoušek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39398-6_41

  13. Matoušek, J., Tihelka, D., Romportl, J., Psutka, J.: Slovak unit-selection speech synthesis: creating a new Slovak voice within a Czech TTS system ARTIC. IAENG Int. J. Comput. Sci. 39, 147–154 (2012)

  14. Matoušek, J., Kala, J.: On modelling glottal stop in Czech text-to-speech synthesis. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 257–264. Springer, Heidelberg (2005). https://doi.org/10.1007/11551874_33

  15. Matoušek, J., Psutka, J.: ARTIC: a new Czech text-to-speech system using statistical approach to speech segment database construction. In: Interspeech 2000 - ICSLP, Beijing, China, vol. 4, pp. 612–615 (2000)

  16. Matoušek, J., Tihelka, D.: Slovak text-to-speech synthesis in ARTIC system. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 155–162. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30120-2_20

  17. Novak, J.R., Minematsu, N., Hirose, K.: Phonetisaurus: exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Lang. Eng. 22(6), 907–938 (2016)

  18. Palková, Z.: Fonetika a fonologie češtiny [Phonetics and phonology of Czech], 1st edn. Univerzita Karlova, Nakladatelství Karolinum, Praha (1994)

  19. Psutka, J., Müller, L., Matoušek, J., Radová, V.: Mluvíme s počítačem česky [Talking with Computer in Czech]. Academia, Praha (2006)

  20. Rao, K., Peng, F., Sak, H., Beaufays, F.: Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4225–4229 (2015)

  21. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings NIPS, Montreal, Canada, pp. 3104–3112 (2014)

  22. Tihelka, D., Hanzlíček, Z., Jůzová, M., Vít, J., Matoušek, J., Grůber, M.: Current state of text-to-speech system ARTIC: a decade of research on the field of speech technologies. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 369–378. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_40

  23. Wang, D., King, S.: Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process. Lett. 18(2), 122–125 (2011)

  24. Wang, Y., et al.: Tacotron: towards end-to-end speech synthesis (2017). https://arxiv.org/abs/1703.10135

  25. Wells, J.C.: SAMPA computer readable phonetic alphabet. In: Gibbon, D., Moore, R., Winski, R. (eds.) Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin (1997)

  26. Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). http://arxiv.org/abs/1609.08144

  27. Yao, K., Zweig, G.: Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. CoRR abs/1506.00196 (2015)

Acknowledgement

This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S, and by the grant of the University of West Bohemia, project No. SGS-2019-027.

Corresponding author

Correspondence to Markéta Jůzová.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Jůzová, M., Vít, J. (2019). Using Auto-Encoder BiLSTM Neural Network for Czech Grapheme-to-Phoneme Conversion. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science (LNAI), vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_8

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
