Digitisation and Automatic Alignment of the DIALOG Corpus: A Prosodically Annotated Corpus of Czech Television Debates

  • Nino Peterek
  • Petr Kaderka
  • Zdeňka Svobodová
  • Eva Havlová
  • Martin Havlík
  • Jana Klímová
  • Patricie Kubáčková
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4629)

Abstract

This article describes the development and automatic processing of the audio-visual DIALOG corpus, a prosodically annotated corpus of Czech television debates recorded and annotated at the Czech Language Institute of the Academy of Sciences of the Czech Republic. The corpus has recently grown to more than 400 four-hour VHS tapes and 375 transcribed TV debates. The digitisation process and automatic alignment described here provide an easily accessible, user-friendly research environment that supports the exploration, analysis, and modelling of Czech prosody. The project has been carried out in cooperation with the Institute of Formal and Applied Linguistics of the Faculty of Mathematics and Physics, Charles University, Prague. The first version of the DIALOG corpus (version 0.1, http://ujc.dialogy.cz) is currently available to the public; it includes 10 selected and revised hour-long talk shows.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Nino Peterek (1)
  • Petr Kaderka (2)
  • Zdeňka Svobodová (2)
  • Eva Havlová (2)
  • Martin Havlík (2)
  • Jana Klímová (2)
  • Patricie Kubáčková (2)

  1. Charles University, MFF, Prague, Institute of Formal and Applied Linguistics (ÚFAL)
  2. The Academy of Sciences of the Czech Republic, Czech Language Institute (ÚJČ)
