Adapting Lexical and Language Models for Transcription of Highly Spontaneous Spoken Czech
This paper addresses the automatic transcription of spontaneous conversations in Czech. Such speech is informal and contains many colloquial words. Building an appropriate lexicon and language model is difficult because the linguistic resources representing colloquial Czech are limited to several small corpora collected by the Institute of Czech National Corpus. To overcome this, we introduce transformations between the most frequent colloquial words and their counterparts in formal Czech. This allows us (a) to combine the small spoken corpora with much larger corpora of more formal text, (b) to optimize the recognizer's lexicon, and (c) to mitigate the data sparsity problem when estimating a probabilistic language model. We have applied this approach in the design of a system for transcribing spontaneous telephone conversations. Its most recent version achieves an accuracy of about 48%, and the proposed transformations together with corpus mixing contributed a 9% improvement over the baseline system.
Keywords: speech recognition, colloquial speech, language modeling
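The idea of mapping colloquial word forms to formal counterparts so that a small spoken corpus can be pooled with a larger formal one can be illustrated with a minimal sketch. It is not the authors' implementation. The mapping entries, the function names, the direction of the mapping (colloquial to formal), and the `spoken_weight` parameter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical colloquial -> formal pairs, for illustration only;
# the paper's actual transformation table is not reproduced here.
COLLOQUIAL_TO_FORMAL = {
    "bejt": "být",
    "dobrej": "dobrý",
    "vokno": "okno",
}


def normalize(tokens, mapping=COLLOQUIAL_TO_FORMAL):
    """Replace each known colloquial form with its formal counterpart."""
    return [mapping.get(token, token) for token in tokens]


def bigram_counts(sentences):
    """Count bigrams over tokenized sentences, with boundary markers."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


def mixed_bigram_counts(spoken, written, spoken_weight=1.0):
    """Pool bigram counts from a normalized spoken corpus and a formal one.

    spoken_weight is an assumed knob that scales the contribution of the
    (small) spoken corpus relative to the (large) formal corpus.
    """
    counts = Counter()
    spoken_counts = bigram_counts(normalize(s) for s in spoken)
    for bigram, n in spoken_counts.items():
        counts[bigram] += spoken_weight * n
    counts.update(bigram_counts(written))
    return counts


# Toy corpora: one spoken sentence with a colloquial form, one formal sentence.
spoken = [["to", "je", "dobrej", "den"]]
written = [["to", "je", "dobrý", "den"]]
counts = mixed_bigram_counts(spoken, written)
```

After normalization, both toy sentences contribute to the same bigram `("je", "dobrý")` instead of splitting the evidence across two surface forms, which is the sense in which such a mapping can reduce data sparsity.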