Adapting Lexical and Language Models for Transcription of Highly Spontaneous Spoken Czech

  • Jan Nouza
  • Jan Silovský
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6231)

Abstract

This paper addresses the automatic transcription of spontaneous conversations in Czech. Such speech is informal and contains many colloquial words. Creating an appropriate lexicon and language model is difficult because the linguistic resources representing colloquial Czech are limited to several small corpora collected by the Institute of Czech National Corpus. To overcome this, we introduce transformations between the most frequent colloquial words and their counterparts in formal Czech. This allows us a) to combine the small spoken corpora with much larger corpora of more formal texts, b) to optimize the recognizer’s lexicon, and c) to address the data sparsity problem when computing a probabilistic language model. We have applied this approach in the design of a system for transcribing spontaneous telephone conversations. Its latest version operates with an accuracy of about 48%, and the proposed transformations, together with corpus mixing, contributed a 9% improvement over the baseline system.
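As a rough illustration of the idea, and not the authors' actual implementation, the following Python sketch maps colloquial word forms to formal counterparts before pooling counts from a small spoken corpus and a larger formal one. The word pairs, the function names (`normalize`, `mixed_unigram_counts`, `unigram_probs`), the corpus weights, and the direction of the mapping are all assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter

# Hypothetical colloquial -> formal pairs, chosen only for illustration.
COLLOQUIAL_TO_FORMAL = {
    "dobrej": "dobrý",
    "bejt": "být",
    "vokno": "okno",
}


def normalize(tokens, mapping=COLLOQUIAL_TO_FORMAL):
    """Replace each colloquial token by its formal counterpart, if one is listed."""
    return [mapping.get(tok, tok) for tok in tokens]


def mixed_unigram_counts(spoken, formal, spoken_weight=1.0, formal_weight=1.0):
    """Pool weighted unigram counts from a spoken and a formal corpus.

    Spoken sentences are normalized first, so colloquial and formal variants
    of a word share one count instead of splitting the sparse data.
    The weights are an assumed, simple form of corpus mixing.
    """
    counts = Counter()
    for sentence in spoken:
        for tok in normalize(sentence):
            counts[tok] += spoken_weight
    for sentence in formal:
        for tok in sentence:
            counts[tok] += formal_weight
    return counts


def unigram_probs(counts):
    """Turn pooled counts into maximum-likelihood unigram probabilities."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


if __name__ == "__main__":
    spoken_corpus = [["dobrej", "den"]]
    formal_corpus = [["dobrý", "večer"]]
    counts = mixed_unigram_counts(spoken_corpus, formal_corpus)
    print(unigram_probs(counts))
```

A real system would use higher-order n-grams with smoothing and could map recognized formal forms back to colloquial pronunciations in the lexicon; the sketch only shows how the transformation lets both corpora contribute to the same word statistics.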

Keywords

Speech recognition, colloquial speech, language modeling

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jan Nouza (1)
  • Jan Silovský (1)
  1. Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Liberec, Czech Republic
