CNN-TDNN-Based Architecture for Speech Recognition Using Grapheme Models in Bilingual Czech-Slovak Task

Psutka, Josef V.; Švec, Jan; Pražák, Aleš

doi:10.1007/978-3-030-83527-9_45

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

Czech and Slovak are very similar languages, not only in their written form but also phonetically. This work aims to find a suitable way of combining the two languages to improve recognition results. We demonstrate this contribution on the Malach project. The Malach speech of Holocaust survivors is highly emotional and contains many disfluencies, heavy accents, age-related coarticulation, and non-speech events. Because of the nature of the corpus, it is very difficult to find other appropriate data for acoustic modeling, so combining the two languages can significantly increase the amount of training data. We discuss the differences between the phoneme-based and grapheme-based ways of combining Czech with Slovak. We also compare different deep neural network architectures (TDNN, TDNNF, CNN-TDNNF) and tune their topology. The proposed bilingual ASR approach provides a slight improvement over monolingual ASR systems, at both the phoneme level and the grapheme level.
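To make the grapheme-based combination more concrete, the following is a minimal sketch of how a shared grapheme lexicon for the two languages could be assembled. It is an illustration only, not the authors' pipeline: the function names (`graphemes`, `build_lexicon`, `merge_lexicons`, `inventory`), the tiny word lists, and the choice to treat each lowercase letter as one grapheme unit are all assumptions made for this example.

```python
import unicodedata
from typing import Dict, Iterable, List, Set


def graphemes(word: str) -> List[str]:
    """Split a word into grapheme units.

    Assumption: one unit per lowercase letter after NFC normalization,
    so that accented letters such as 'ř' stay a single unit.
    """
    normalized = unicodedata.normalize("NFC", word.lower())
    return [ch for ch in normalized if ch.isalpha()]


def build_lexicon(words: Iterable[str]) -> Dict[str, List[str]]:
    """Map each word to its grapheme sequence (the 'pronunciation')."""
    return {w: graphemes(w) for w in words}


def merge_lexicons(*lexicons: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Combine per-language lexicons into one bilingual lexicon."""
    merged: Dict[str, List[str]] = {}
    for lex in lexicons:
        merged.update(lex)
    return merged


def inventory(lexicon: Dict[str, List[str]]) -> Set[str]:
    """Collect the set of grapheme units used by a lexicon."""
    return {g for units in lexicon.values() for g in units}


# Hypothetical toy vocabularies, not taken from the Malach corpus.
czech = build_lexicon(["řeka", "dům"])
slovak = build_lexicon(["rieka", "dom", "ľad"])

bilingual = merge_lexicons(czech, slovak)
shared = inventory(czech) & inventory(slovak)
```

In this sketch, units in `shared` would be trained on data from both languages, while language-specific letters such as `ř` or `ľ` only see data from one of them. A real system would also need decisions this example skips, for instance how to handle digraphs such as `ch`.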

Acknowledgments

This paper was supported by the Technology Agency of the Czech Republic, project No. TN01000024.

Author information

Authors and Affiliations

Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
Josef V. Psutka & Jan Švec
NTIS - New Technologies for the Information Society, UWB, Pilsen, Czech Republic
Josef V. Psutka, Jan Švec & Aleš Pražák

Authors

Josef V. Psutka
Jan Švec
Aleš Pražák

Corresponding author

Correspondence to Josef V. Psutka.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
František Pártl
University of West Bohemia, Pilsen, Czech Republic
Miloslav Konopík

About this paper

Cite this paper

Psutka, J.V., Švec, J., Pražák, A. (2021). CNN-TDNN-Based Architecture for Speech Recognition Using Grapheme Models in Bilingual Czech-Slovak Task. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_45

DOI: https://doi.org/10.1007/978-3-030-83527-9_45
Published: 30 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science (R0)
