Digitisation and Automatic Alignment of the DIALOG Corpus: A Prosodically Annotated Corpus of Czech Television Debates

Peterek, Nino; Kaderka, Petr; Svobodová, Zdeňka; Havlová, Eva; Havlík, Martin; Klímová, Jana; Kubáčková, Patricie

doi:10.1007/978-3-540-74628-7_78

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4629)

Abstract

This article describes the development and automatic processing of the audio-visual DIALOG corpus, a prosodically annotated corpus of Czech television debates recorded and annotated at the Czech Language Institute of the Academy of Sciences of the Czech Republic. The collection has recently grown to more than 400 four-hour VHS tapes and 375 transcribed TV debates. The digitisation process and automatic alignment described here provide an easily accessible, user-friendly research environment that supports the exploration, analysis and modelling of Czech prosody. The project has been carried out in cooperation with the Institute of Formal and Applied Linguistics of the Faculty of Mathematics and Physics, Charles University, Prague. The first version of the DIALOG corpus (version 0.1, http://ujc.dialogy.cz) is currently available to the public; it includes 10 selected and revised hour-long talk shows.

Author information

Authors and Affiliations

Charles University, MFF, Prague, Institute of Formal and Applied Linguistics (ÚFAL)
Nino Peterek
The Academy of Sciences of the Czech Republic, Czech Language Institute (ÚJČ)
Petr Kaderka, Zdeňka Svobodová, Eva Havlová, Martin Havlík, Jana Klímová & Patricie Kubáčková

Editor information

Václav Matoušek, Pavel Mautner

About this paper

Cite this paper

Peterek, N. et al. (2007). Digitisation and Automatic Alignment of the DIALOG Corpus: A Prosodically Annotated Corpus of Czech Television Debates. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2007. Lecture Notes in Computer Science, vol 4629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74628-7_78

DOI: https://doi.org/10.1007/978-3-540-74628-7_78
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74627-0
Online ISBN: 978-3-540-74628-7
eBook Packages: Computer Science, Computer Science (R0)
