CRPC-DB a Discourse Bank for Portuguese

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2022)

Abstract

We present a new resource for discourse studies in Portuguese, the CRPC Discourse Bank (CRPC-DB). CRPC-DB follows the Penn Discourse Treebank style of annotation. The annotation is performed on the PAROLE corpus, a free subset of the Reference Corpus of Contemporary Portuguese (CRPC) that includes news, fiction and didactic/scientific texts. The discourse bank covers explicit and implicit relations at the intra- and inter-sentential levels and currently includes a total of 14,436 discourse relations. We present the main guidelines of our annotation and discuss specific cases. An inter-annotator agreement experiment yielded an F1-score of 0.88 for discourse relation identification, a Cohen's κ of 0.71 for the classification of discourse relation types, and 0.75 for top-level sense classification. The CRPC-DB will be distributed free of charge through the PORTULAN CLARIN infrastructure.
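For readers unfamiliar with the two agreement measures named in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code: the function names, the toy annotations, and the exact-span matching criterion are assumptions made for illustration, and the relation-type labels are simply PDTB-style examples.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def identification_f1(relations_a, relations_b):
    """F1 over identified relations, treating annotator A as the reference.

    Assumption: two relations match only if their identifiers (here,
    hypothetical character spans) are exactly equal.
    """
    set_a, set_b = set(relations_a), set(relations_b)
    matched = len(set_a & set_b)
    if matched == 0:
        return 0.0
    precision = matched / len(set_b)
    recall = matched / len(set_a)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not taken from CRPC-DB).
types_a = ["Explicit", "Implicit", "Explicit", "AltLex", "Implicit", "Explicit"]
types_b = ["Explicit", "Implicit", "Implicit", "AltLex", "Implicit", "Explicit"]
spans_a = [(0, 5), (6, 12), (13, 20), (21, 30)]
spans_b = [(0, 5), (6, 12), (13, 19), (21, 30), (31, 40)]

kappa = cohen_kappa(types_a, types_b)
f1 = identification_f1(spans_a, spans_b)
print(f"kappa={kappa:.3f} f1={f1:.3f}")
```

F1 suits the identification step because the two annotators may mark different sets of relations, so there is no fixed item list to compute κ over; κ applies once both annotators have labelled the same relations.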

Notes

  1. http://catalogue.elra.info/en-us/repository/browse/ELRA-W0024_01

  2. https://portulanclarin.net

Acknowledgements

This work was partially supported by PORTULAN CLARIN-Research Infrastructure for the Science and Technology of Language, funded by Lisboa2020, Alentejo2020 and FCT-Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016, and by FCT under the project UIDP/00214/2020. Some of its developments were implemented in the scope of the COST Action TextLink - Structuring Discourse in Multilingual Europe. We wish to thank the anonymous reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Amália Mendes.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Mendes, A., Lejeune, P. (2022). CRPC-DB a Discourse Bank for Portuguese. In: Pinheiro, V., et al. Computational Processing of the Portuguese Language. PROPOR 2022. Lecture Notes in Computer Science, vol 13208. Springer, Cham. https://doi.org/10.1007/978-3-030-98305-5_8

  • DOI: https://doi.org/10.1007/978-3-030-98305-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98304-8

  • Online ISBN: 978-3-030-98305-5

  • eBook Packages: Computer Science, Computer Science (R0)
