TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style

Zeyrek, Deniz; Mendes, Amália; Grishina, Yulia; Kurfalı, Murathan; Gibbon, Samuel; Ogrodniczuk, Maciej

doi:10.1007/s10579-019-09445-9

Project Notes
Published: 06 April 2019

Language Resources and Evaluation, Volume 54, pages 587–613 (2020)

Abstract

TED Multilingual Discourse Bank, or TED-MDB, is a multilingual resource in which TED talks are annotated at the discourse level in six languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of the PDTB. We explain the corpus design criteria, which rest on three main considerations: the linguistic characteristics of the languages involved, the interactive nature of TED talks (which led us to annotate Hypophora), and the decision to avoid projection. We report our annotation consistency and post-annotation alignment experiments, and provide a cross-lingual comparison based on corpus statistics.

Notes

The TED-MDB initiative was undertaken by a group of researchers involved in a consortium brought together by the ISCH COST Action (IS1312), Textlink: Structuring discourse in Multilingual Europe, http://textlink.ii.metu.edu.tr/.
https://wit3.fbk.eu/.
TED-MDB is freely available to researchers and can be accessed at: https://github.com/MurathanKurfali/Ted-MDB-Annotations. The corpus now also includes annotations on the transcripts of the same TED talks in a new language, Lithuanian, introduced in Oleskeviciene et al. (2018).
Our procedure for capturing co-occurring multiple connectives was to annotate each connective separately as a different token and to assign a meaning to each token, following the annotation principles of the PDTB. Multiple connectives could alternatively be selected as a single token, as is done by Cuenca and Marín (2009) and Crible (2007), among others.
The German and Russian annotations were carried out and checked by a single, bilingual researcher.
For convenience, here we refer to the linear ordering of the selected text spans (Mírovskỳ et al. 2010); cf. Sect. 3.3.
https://sourceforge.net/projects/aligner/.
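
The two ways of handling co-occurring connectives described in the note above can be sketched as a small data structure. This is only an illustration: the class and function names, the example connectives ("but", "then"), and the sense labels are assumptions made for the sketch, not the corpus file format or the authors' tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectiveToken:
    """One annotated connective with its assigned sense (hypothetical schema)."""
    text: str
    sense: str


def annotate_separately(connectives, senses):
    """PDTB-style option: each co-occurring connective is its own token with its own sense."""
    if len(connectives) != len(senses):
        raise ValueError("each connective needs exactly one sense")
    return [ConnectiveToken(c, s) for c, s in zip(connectives, senses)]


def annotate_as_single_token(connectives, sense):
    """Alternative option: the whole sequence is selected as one token with one sense."""
    return ConnectiveToken(" ".join(connectives), sense)


# Illustrative input only; the sense labels are placeholders.
separate = annotate_separately(
    ["but", "then"], ["Comparison.Contrast", "Temporal.Asynchronous"]
)
merged = annotate_as_single_token(["but", "then"], "Comparison.Contrast")

print(separate)
print(merged)
```

Under the first option the two connectives yield two tokens, each with its own sense; under the second they yield a single token spanning both words.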

References

Aleixo, P., & Pardo, T. A. (2008). CSTTool: um parser multidocumento automático para o Português do Brasil. In Proceedings of the IV workshop on M.Sc dissertation and Ph.D thesis in artificial intelligence (WTDIA) (pp. 140–145). Salvador, Bahia.
Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596.
Asher, N. (1993). Reference to abstract objects in discourse. Dordrecht: Kluwer.
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics (COLING-ACL ’98) (Vol. 1, pp. 86–90). Montreal: Association for Computational Linguistics.
Basile, V., Bos, J., Evang, K., & Venhuizen, N. (2012). Developing a large semantically annotated corpus. In Proceedings of the eighth international conference on language resources and evaluation (LREC 2012) (pp. 3196–3200). Istanbul: European Language Resources Association (ELRA).
Cettolo, M., Girardi, C., & Federico, M. (2012). WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th conference of the European association for machine translation (EAMT) (pp. 261–268). Trento.
Crible, L. (2007). Discourse markers and (dis)fluency across registers: A contrastive usage-based study in English and French. Ph.D thesis, Louvain.
Cuenca, M. J., & Marín, M. J. (2009). Co-occurrence of discourse markers in Catalan and Spanish oral narrative. Journal of Pragmatics, 41, 899–914.
Demirşahin, I., & Zeyrek, D. (2017). Pair annotation as a novel annotation procedure: The case of Turkish Discourse Bank. In N. Ide & J. Pustejovsky (Eds.), Handbook of linguistic annotation (pp. 1219–1240). Berlin: Springer.
Hovy, E., & Lavid, J. (2010). Towards a science of corpus annotation: A new methodological challenge for corpus linguistics. International Journal of Translation, 22(1), 13–36.
Ide, N., & Pustejovsky, J. (Eds.). (2017). Handbook of linguistic annotation. Berlin: Springer.
Joshi, A. (2012). Rememberance of ACLs past. Keynote speech, ACL 50th anniversary lectures. Jeju Island: The Association for Computational Linguistics. https://www.aclweb.org/mirror/acl2012/program/sub01.asp.html. Accessed 25 Feb 2018.
Laali, M., & Kosseim, L. (2017). Improving discourse relation projection to build discourse annotated corpora. Recent advances in natural language processing meet deep learning (RANLP) (pp. 407–416). Varna.
Lanham, R. (1991). A handlist of rhetorical terms. Berkeley: University of California Press.
Lausberg, H. (1998). Handbook of literary rhetoric: A foundation for literary study. Leiden: Brill.
Lee, A., Prasad, R., Webber, B. L., & Joshi, A. K. (2016). Annotating discourse relations with the PDTB Annotator. In Proceedings of COLING 2016, the 26th international conference on computational linguistics: Demos (pp. 121–125). Osaka.
Lin, Z., Ng, H. T., & Kan, M.-Y. (2014). A PDTB-styled end-to-end discourse parser. Natural Language Engineering, 20(02), 151–184.
Marcu, D. (2000). The theory and practice of discourse parsing and summarization. Cambridge: MIT Press.
Mayoral, J. A. (1994). Figuras retóricas. Madrid: Editorial Sintesis.
Maziero, E. & Pardo, T. A. (2012). CSTParser: A multi-document discourse parser. In Proceedings of the international conference, PROPOR 2012: Demonstration. Coimbra. http://conteudo.icmc.usp.br/pessoas/taspardo/PROPOR2012Demo-MazieroPardo.pdf. Accessed 25 Feb 2018.
Mírovskỳ, J., Mladová, L., & Zikánová, Š. (2010). Connective-based measuring of the inter-annotator agreement in the annotation of discourse in PDT. In Proceedings of the 23rd international conference on computational linguistics: Posters Volume (pp. 775–781). Beijing: Association for Computational Linguistics.
Oleskeviciene, G. V., Zeyrek, D., Mazeikiene, V., & Kurfalı, M. (2018). Observations on the annotation of discourse relational devices in TED talk transcripts in Lithuanian. In S. Kübler & H. Zinsmeister (Eds.), Proceedings of the workshop on annotation in digital humanities co-located with ESSLLI 2018 (Vol. 2155, pp. 53–58). Sofia. CEUR-WS.org.
Padó, S., & Lapata, M. (2009). Cross-lingual annotation projection for semantic roles. Journal of Artificial Intelligence Research, 36, 307–340.
Palmer, M., Gildea, D., & Kingsbury, P. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1), 71–106.
Pitler, E. & Nenkova, A. (2009). Using syntax to disambiguate explicit discourse connectives in text. In Proceedings of the ACL-IJCNLP 2009 conference: Short papers (pp. 13–16). Singapore: Suntec, Association for Computational Linguistics.
Prasad, R., Joshi, A., & Webber, B. (2010). Realization of discourse relations by other means: Alternative lexicalizations. In Proceedings of the 23rd international conference on computational linguistics: Posters (pp. 1023–1031). Beijing: Association for Computational Linguistics.
Prasad, R., Webber, B., & Joshi, A. (2014). Reflections on the Penn Discourse TreeBank, comparable corpora, and complementary annotation. Computational Linguistics, 40(4), 921–950.
Rohde, H., Dickinson, A., Schneider, N., Clark, C. N., Louis, A., & Webber, B. (2016). Filling in the blanks in understanding discourse adverbials: Consistency, conflict, and context-dependence in a crowdsourced elicitation task. In Proceedings of the 10th linguistic annotation workshop held in conjunction with ACL 2016 (pp. 49–58). Berlin: Association for Computational Linguistics.
Spooren, W., & Degand, L. (2010). Coding coherence relations: Reliability and validity. Corpus Linguistics and Linguistic Theory, 6(2), 241–266.
Webber, B., Knott, A., & Joshi, A. (2001). Multiple discourse connectives in a lexicalized grammar for discourse. In H. Bunt, R. Muskens, & E. Thijsse (Eds.), Computing meaning, Studies in Linguistics and Philosophy (Vol. 77, pp. 229–245). Berlin: Springer.
Webber, B., Prasad, R., Lee, A., & Joshi, A. (2016). A discourse-annotated corpus of conjoined VPs. In Proceedings of the 10th Linguistics Annotation Workshop (pp. 22–31). Berlin: Association for Computational Linguistics.
Webber, B., Stone, M., Joshi, A., & Knott, A. (2003). Anaphora and discourse structure. Computational Linguistics, 29(4), 545–587.
Zeyrek, D., Mendes, A., & Kurfalı, M. (2018). Multilingual extension of PDTB-style annotation: The case of TED Multilingual Discourse Bank. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (pp. 1913–1919). Miyazaki: European Language Resources Association (ELRA).

Acknowledgements

We thank our annotators (Robin Goodfellow Malamud, Robin Schäfer, Olha Zolotarenko, Nuno Martins, Aida Cardoso, Celina Heliasz, Joanna Bilińska, Daniel Ziembicki, İpek Süsoy). The research has been partially supported by Textlink, by the Scientific and Technological Research Council of Turkey (BIDEB-2219 Postdoctoral Research program), by the Polish National Science Centre (Contract Number 2014/15/B/HS2/03435), and by the FCT, Fundação para a Ciência e a Tecnologia (project ID: PEst-OE/LIN/UI0214/2013). The support of Bonnie Webber and Manfred Stede is gratefully acknowledged, though all errors are our own.

Author information

Authors and Affiliations

Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
Deniz Zeyrek & Murathan Kurfalı
Centre of Linguistics, University of Lisbon, Lisbon, Portugal
Amália Mendes
University of Potsdam, Potsdam, Germany
Yulia Grishina
Department of Linguistics, Stockholm University, Stockholm, Sweden
Murathan Kurfalı
Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
Samuel Gibbon
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Maciej Ogrodniczuk

Corresponding author

Correspondence to Deniz Zeyrek.

Appendix

Here we present confusion matrices of the aligned relations in two talks. Rows show the English tokens aligned to language X, and columns show the language X tokens aligned to English. For example, in Table 11, the sum of the first row (47) is the number of explicit relations in English that are aligned with a discourse relation in German. Of those relations, 31 are also conveyed explicitly in German, while 13 are realized as implicits and 3 as EntRels. The total number of explicit relations in the two English talks is 75 (also see Table 7 above), which leaves 28 non-aligned explicit relations. Bold font indicates that the number of tokens in language X matches the number of tokens in English.
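
The arithmetic behind the Table 11 example can be checked with a short sketch. Only the four counts quoted in the paragraph above (31, 13, 3, and 75) come from the article; the dictionary layout, the function name, and the derived share are illustrative assumptions, and the full tables are not reproduced here.

```python
# Counts for the English "Explicit" row of Table 11 (German), as quoted in the text.
explicit_row = {"Explicit": 31, "Implicit": 13, "EntRel": 3}

# Total number of explicit relations in the two English talks, as quoted in the text.
total_english_explicit = 75


def summarize_row(row, total_in_english):
    """Summarize one confusion-matrix row (hypothetical helper, not from the article)."""
    aligned = sum(row.values())  # relations aligned with any relation type in language X
    return {
        "aligned": aligned,
        "non_aligned": total_in_english - aligned,
        # Share of aligned relations that keep the same (explicit) realization.
        "same_type_share": round(row["Explicit"] / aligned, 2),
    }


summary = summarize_row(explicit_row, total_english_explicit)
print(summary)
# {'aligned': 47, 'non_aligned': 28, 'same_type_share': 0.66}
```

The row sum reproduces the 47 aligned relations and the 28 non-aligned ones; the same-type share is an extra quantity computed here only to show how a row can be read.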

Table 11 German
Table 12 Polish
Table 13 Portuguese
Table 14 Russian
Table 15 Turkish

About this article

Cite this article

Zeyrek, D., Mendes, A., Grishina, Y. et al. TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style. Lang Resources & Evaluation 54, 587–613 (2020). https://doi.org/10.1007/s10579-019-09445-9

Published: 06 April 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10579-019-09445-9
