TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style

  • Deniz Zeyrek
  • Amália Mendes
  • Yulia Grishina
  • Murathan Kurfalı
  • Samuel Gibbon
  • Maciej Ogrodniczuk
Project Notes

Abstract

TED Multilingual Discourse Bank, or TED-MDB, is a multilingual resource in which TED talks are annotated at the discourse level in six languages (English, Polish, German, Russian, European Portuguese, and Turkish), following the aims and principles of the PDTB. We explain the corpus design criteria, which have three main features: the linguistic characteristics of the languages involved, the interactive nature of TED talks (which led us to annotate Hypophora), and the decision to avoid projection. We report our annotation consistency and post-annotation alignment experiments, and provide a cross-lingual comparison based on corpus statistics.

Keywords

Discourse, Discourse relations, Corpus creation, Annotation, Multilingual corpus

Notes

Acknowledgements

We thank our annotators (Robin Goodfellow Malamud, Robin Schäfer, Olha Zolotarenko, Nuno Martins, Aida Cardoso, Celina Heliasz, Joanna Bilińska, Daniel Ziembicki, İpek Süsoy). The research has been partially supported by Textlink, by the Scientific and Technological Research Council of Turkey (BIDEB-2219 Postdoctoral Research program), by the Polish National Science Centre (Contract Number 2014/15/B/HS2/03435), and by the FCT, Fundação para a Ciência e a Tecnologia (project ID: PEst-OE/LIN/UI0214/2013). The support of Bonnie Webber and Manfred Stede is gratefully acknowledged; all errors are our own.

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
  2. Centre of Linguistics, University of Lisbon, Lisbon, Portugal
  3. University of Potsdam, Potsdam, Germany
  4. Department of Linguistics, Stockholm University, Stockholm, Sweden
  5. Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland
  6. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
