TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style


TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource in which TED talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of the PDTB. We explain the corpus design criteria, which were shaped by three main considerations: the linguistic characteristics of the languages involved; the interactive nature of TED talks, which led us to annotate Hypophora; and the decision to avoid projection. We report our annotation consistency and post-annotation alignment experiments, and provide a cross-lingual comparison based on corpus statistics.

  1.

    The TED-MDB initiative was undertaken by a group of researchers involved in a consortium brought together by the ISCH COST Action (IS1312), Textlink: Structuring Discourse in Multilingual Europe, http://textlink.ii.metu.edu.tr/.

  2.


  3.

    TED-MDB is freely available to researchers and can be accessed at: https://github.com/MurathanKurfali/Ted-MDB-Annotations. The corpus now also includes annotations of the transcripts of the same TED talks in a new language, Lithuanian, introduced in Oleskeviciene et al. (2018).

  4.

    To capture co-occurring multiple connectives, we annotate each connective separately as a different token and assign a meaning to each token, following the annotation principles of the PDTB. Alternatively, multiple connectives could be selected as a single token, as is done by Cuenca and Marín (2009) and Crible (2007), among others.

  5.

    The German and Russian annotations were carried out and checked by a single, bilingual researcher.

  6.

    For convenience, here we refer to the linear ordering of the selected text spans (Mírovský et al. 2010; cf. Sect. 3.3).

  7.



We thank our annotators (Robin Goodfellow Malamud, Robin Schäfer, Olha Zolotarenko, Nuno Martins, Aida Cardoso, Celina Heliasz, Joanna Bilińska, Daniel Ziembicki, İpek Süsoy). The research has been partially supported by Textlink, by the Scientific and Technological Research Council of Turkey (BIDEB-2219 Postdoctoral Research program), by the Polish National Science Centre (Contract Number 2014/15/B/HS2/03435) and by the FCT, Fundação para a Ciência e a Tecnologia (project ID: PEst-OE/LIN/UI0214/2013). The support of Bonnie Webber and Manfred Stede is gratefully acknowledged; all errors are our own.

Here we present confusion matrices of the aligned relations in two talks. Rows show the English tokens aligned to language X, and columns show the language X tokens aligned to English. For example, in Table 11, the sum of the first row (47) is the number of explicit relations in English that are aligned with a discourse relation in German. Of those relations, 31 are also conveyed explicitly in German, while 13 are realized as implicits and 3 as EntRels. The total number of explicit relations in the two English talks is 75 (also see Table 7 above), which leaves 28 non-aligned explicit relations. Bold font indicates that the number of tokens in language X matches the number of tokens in English.

Table 11 German
Table 12 Polish
Table 13 Portuguese
Table 14 Russian
Table 15 Turkish
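
The row arithmetic described above for Table 11 can be reproduced with a minimal Python sketch. It uses only the counts quoted in the text (31, 13, 3, and the English total of 75). The function names `confusion_matrix` and `row_total`, the string labels, and the representation of alignments as pairs are illustrative assumptions, not the format of the released annotations.

```python
from collections import Counter

def confusion_matrix(aligned_pairs):
    """Count (english_type, other_type) pairs of aligned relations.

    The pair representation is an assumption made for this sketch.
    """
    return Counter(aligned_pairs)

def row_total(matrix, english_type):
    """Sum one row: all aligned relations with the given English type."""
    return sum(n for (en, _), n in matrix.items() if en == english_type)

# Counts quoted in the text for English explicit relations aligned to German.
pairs = (
    [("Explicit", "Explicit")] * 31
    + [("Explicit", "Implicit")] * 13
    + [("Explicit", "EntRel")] * 3
)
matrix = confusion_matrix(pairs)

aligned_explicit = row_total(matrix, "Explicit")
total_explicit_english = 75  # explicit relations in the two English talks
non_aligned_explicit = total_explicit_english - aligned_explicit

print(aligned_explicit, non_aligned_explicit)
```

The row total counts every English explicit relation that has a German counterpart of any type, so subtracting it from the English total gives the relations with no aligned counterpart.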

  • Discourse
  • Discourse relations
  • Corpus creation
  • Annotation
  • Multilingual corpus