Iarg-AnCora: Spanish corpus annotated with implicit arguments

Abstract

This article presents the Spanish Iarg-AnCora corpus (400,000 words, 13,883 sentences) annotated with the implicit arguments of deverbal nominalizations (18,397 occurrences). We describe the methodology used to create it, focusing on the annotation scheme and the criteria adopted. The corpus was manually annotated, and an inter-annotator agreement test was conducted (81 % observed agreement) to ensure the reliability of the final resource. The annotation of implicit arguments results in a substantial gain in argument and thematic role coverage (128 % on average). Iarg-AnCora is the first freely available wide-coverage corpus annotated with implicit arguments for the Spanish language. It can subsequently be used by machine learning-based semantic role labeling systems, and for the linguistic analysis of implicit arguments grounded in real data. Semantic analyzers are essential components of current language technology applications, which require a deeper understanding of the text in order to draw inferences at the highest level and obtain qualitative improvements in their results.

Notes

  1.

    Iarg-AnCora is freely available at: http://clic.ub.edu/corpus/en/ancora-descarregues.

  2.

    Since implicit arguments are not annotated in AnCora-Es, the percentage of realization cannot be computed directly. The corresponding figure (0.19 implicit arguments per verb) has been estimated from the corpus, assuming that for a given predicate the number of arguments (explicit or not) is, on average, the same whether the predicate is realized as a verb or as a deverbal nominalization.

  3.

    For the sake of clarity, we underline the discourse entities acting as antecedents of the implicit arguments.

  4.

    In Sect. 4.1, the annotation scheme is presented in more detail.

  5.

    In the AnCora corpus, the tag 'S' stands for clause.

  6.

    http://nlp.cs.nyu.edu/meyers/NomBank.html.

  7.

    https://framenet.icsi.berkeley.edu/.

  8.

    http://clic.ub.edu/corpus/ancora.

  9.

    http://stl.recherche.univ-lille3.fr/programmesetcontrats/NOMAGE/NOMAGEenglish.html.

  10.

    https://code.google.com/p/copenhagen-dependency-treebank/.

  11.

    http://www.coli.uni-saarland.de/projects/semeval2010_FG/.

  12.

    The authors also provided a version of the corpus based on PropBank/NomBank annotations.

  13.

    Predicates annotated: ‘bid’, ‘sale’, ‘loan’, ‘cost’, ‘plan’, ‘investor’, ‘price’, ‘loss’, ‘investment’ and ‘fund’.

  14.

    Henceforth, we will refer to the Moor, Roth and Frank (2013) corpus as the MRF corpus.

  15.

    Predicates annotated: ‘give’, ‘put’, ‘leave’, ‘bring’ and ‘pay’.

  16.

    Note that a coreference chain may consist of only one mention, that is, a singleton.

  17.

    Possessive pronouns and determiners can also be discourse entities, but they do not tend to be implicit arguments of deverbal nouns, since they usually appear explicitly inside the NP headed by the nominalization. For instance, Esto permitirá al banco sanear sus cuentas, que es condición básica para continuar con su privatización, ‘This will enable the bank to consolidate its accounts, which is a basic condition for its privatization'. In this example, the possessive determiner su (‘its') is the explicit argument, with the thematic role theme, of the deverbal noun privatización (‘privatization').

  18.

    Not all the combinations of argument position and thematic roles are valid semantic tags.

  19.

    200,000 words were extracted from the Spanish El Periódico newspaper (http://www.elperiodico.com/es/) and the other 200,000 words from the EFE newswire agency (http://www.efe.es), spanning from January to December 2000.

  20.

    We used Spanish WordNet in the Multilingual Central Repository (MCR), which is linked to Princeton WordNet (Gonzalez-Agirre, Laparra and Rigau 2012), http://adimen.si.ehu.es/web/MCR.

  21.

    Spanish is a pro-drop language; therefore, pronominal subjects can be omitted. Object personal pronouns often appear as clitic forms and can be adjoined to the verb.

  22.

    The AnCora-Verb-Es lexicon is available at: http://clic.ub.edu/corpus/ancoraverb_es.

  23.

    AnCora-Verb contains 3934 different senses and 5117 syntactic-semantic frames in total.

  24.

    http://clic.ub.edu/corpus/webfm_send/50.

  25.

    http://clic.ub.edu/corpus/ancoranom_es.

  26.

    AnCoraPipe is freely available; to obtain access, contact amarti@ub.edu.

  27.

    http://www.eclipse.org/.

  28.

    AnCoraPipe has been used for the treatment of corpora in the Amazighe, Latin and Cyrillic alphabets.

  29.

    For reasons of space, Fig. 3 only shows the discourse entities starting from entity12.

  30.

    We have split the panels into two figures in order to better visualize their content.

  31.

    The mean inter-annotator agreement for the annotation of explicit arguments reached a kappa of 0.75, which translates to 79.2 % observed agreement.
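    For reference, Cohen's kappa discounts the agreement expected by chance; the following is a minimal sketch of the standard relation, with the chance-agreement figure derived here (not reported in the original) from the two values quoted above:

    ```latex
    % Cohen's kappa: observed agreement P_o corrected for chance agreement P_e
    \kappa = \frac{P_o - P_e}{1 - P_e}
    % Solving for P_e with P_o = 0.792 and \kappa = 0.75:
    P_e = \frac{P_o - \kappa}{1 - \kappa} = \frac{0.792 - 0.75}{1 - 0.75} = 0.168
    ```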

  32.

    ‘Instances' refers to the number of occurrences of argument types found in the corpus.

  33.

    It is worth noting that the implicit arguments of verbs are not annotated in Iarg-AnCora, so the number of occurrences and percentages for verbs only includes explicit arguments.

  34.

    The figures are slightly different from those reported in Sect. 4 because the comparison with G&C is performed with the subset of the 8 most frequent monosemous nominalizations.

  35.

    A third explanation could be the use of different criteria in the annotation of both explicit and implicit arguments in the G&C dataset and in AnCora.

  36.

    TCO aims to provide WordNet synsets with a neutral ontological assignment. The ontology contains 63 features organized as 1st order entities (physical things), 2nd order entities (situations) and 3rd order entities (unobservable things).

  37.

    Since AnCora-Es mentions are annotated with correct synsets, no Word Sense Disambiguation was needed.

References

  1. Álvez, J., Atserias, J., Carrera, J., Climent, S., Oliver, A., & Rigau, G. (2008). Consistent annotation of EuroWordnet with the top concept ontology. In Proceedings of 4th international WordNet conference (GWC-08). Association for Computational Linguistics.

  2. Aparicio, J., Taulé, M., & Martí, M. A. (2008). AnCora-Verb: A lexical resource for the semantic annotation of corpora. In Proceedings of the 6th international conference on language resources and evaluation. Marrakech, Morocco.

  3. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics. ACL’98 (Vol. 1, pp. 86–90). Stroudsburg, PA, USA: Association for Computational Linguistics.

  4. Balvet, A., Condette, M. H., Haas, P., Huyghe, R., Marín, R., & Merlo, A. (2011). Nomage: An electronic lexicon of French deverbal nouns based on a semantically annotated corpus. In Proceedings of the 1st international workshop on lexical resources (WoLeR 2011), pp. 8–15.

  5. Bertran, M., Borrega, O., Martí, M. A, & Taulé, M. (2011). AnCoraPipe: A new tool for corpora annotation. Working paper 1: TEXT-MESS 2.0 (Text-Knowledge 2.0). Universitat de Barcelona. http://clic.ub.edu/sites/default/files/pagines/AnCoraPipe.pdf

  6. Chen, D., Schneider, N., Das, D., & Smith, N. A. (2010). SEMAFOR: Frame argument resolution with log-linear models. In Proceedings of the 5th international workshop on semantic evaluation (SemEval’10) (pp. 264–267). Stroudsburg, PA, USA: Association for Computational Linguistics.

  7. Chinchor, N., & Sundheim, B. (2003). Message understanding conference (MUC) 6. Philadelphia: Linguistic Data Consortium.

  8. Erk, K., & Padó, S. (2004). A powerful and versatile XML Format for representing role-semantic annotation. In Proceedings of 4th international conference on language resources and evaluation. Lisbon, Portugal.

  9. Fillmore, C. J. (1986). Pragmatically controlled zero anaphora. Technical report, Department of Linguistics. University of California.

  10. Fillmore, C. J., & Baker, C. F. (2001). Frame semantics for text understanding. In Proceedings of the workshop on WordNet and other lexical resources. NAACL, Pittsburgh, Pennsylvania, Association for Computational Linguistics.

  11. Gerber, M. (2011). Semantic role labeling of implicit arguments for nominal predicates. Ph.D. dissertation, Michigan State University, USA.

  12. Gerber, M., & Chai, J. Y. (2010). Beyond NomBank: A study of implicit arguments for nominal predicates. In Proceedings of the 48th annual meeting of the association for computational linguistics. ACL’10 (pp. 1583–1592). Stroudsburg, PA, USA: Association for Computational Linguistics.

  13. Gerber, M., & Chai, J. Y. (2012). Semantic role labeling of implicit arguments for nominal predicates. Computational Linguistics, 38, 755–798.

  14. Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the human language technology conference of the North American chapter of the association for computational linguistics, HLT-NAACL’06, 57–60, New York.

  15. Kipper, K., Korhonen, A., Ryant, N., & Palmer, M. (2006). Extending VerbNet with novel verb classes. In Proceedings of the 5th international conference on language resources and evaluation (LREC’06), pp. 1027–1032. Genova, Italy.

  16. Laparra, E., & Rigau, G. (2012). Exploiting explicit annotations and semantic types for implicit argument resolution. ICSC, pp. 75–78.

  17. Laparra, E., & Rigau, G. (2013). ImpAr: A deterministic algorithm for implicit semantic role labelling. In Proceedings of the 51st annual meeting of the association for computational linguistics (ACL 2013). Sofia, Bulgaria.

  18. Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago: University of Chicago Press.

  19. Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19, 313–330.

  20. Meyers, A. (2007). Annotation guidelines for NomBank-noun argument structure for PropBank. Technical report, New York University.

  21. Meyers, A., Reeves, R., & Macleod, C. (2004). NP-external arguments, a study of argument sharing in English. In Proceedings of the workshop on multiword expressions: Integrating processing (MWE’04) (pp. 96–103). Stroudsburg, PA, USA: Association for Computational Linguistics.

  22. Mitchell, A., Strassel, S., Przybocki, M., Davis, J. K., Doddington, G., Grishman, R., et al. (2003). ACE-2 version 1.0. Linguistic Data Consortium, Philadelphia.

  23. Moor, T., Roth, M., & Frank, A. (2013). Predicate-specific annotations for implicit role binding: Corpus annotation, data analysis and evaluation experiments. In Proceedings of the 10th international conference on computational semantics (IWCS): Short papers, pp. 369–375. Potsdam, Germany.

  24. Müller, H. (2011). The Copenhagen Dependency Treebank (CDT): Extending syntactic annotation to morphology and semantics. In K. Gerdes, E. Hajičová, & L. Wanner (Eds.), Depling 2011 proceedings. International conference on dependency linguistics: Exploring dependency grammar, semantics, and the lexicon (pp. 125–134). Barcelona: Depling.

  25. Palmer, M., Kingsbury, P., & Gildea, D. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1), 71–106.

  26. Parker, R., Graff, D., Kong, J., Chen, K., & Maeda, K. (2011). English Gigaword (5th ed.). Philadelphia: Linguistic Data Consortium.

  27. Peris, A., & Taulé, M. (2011). AnCora-Nom: A Spanish Lexicon of deverbal nominalizations. Procesamiento del Lenguaje Natural, 46, 11–19.

  28. Peris, A., & Taulé, M. (2012). Annotating the argument structure of deverbal nominalizations in Spanish. Language Resources and Evaluation, 46(4), 667–699, Springer.

  29. Peris, A., & Taulé, M. (2013). Argumentos implícitos de los sustantivos deverbales. Guía de anotación v. 0.2. Working paper: 1 Diana-Construcciones. Universitat de Barcelona.

  30. Peris, A., Taulé, M., Rodríguez, H., & Bertran, M. (2013). LIARc: Labeling implicit ARguments in Spanish deverbal nominalizations. In Computational linguistics and intelligent text processing: 14th international conference, CICLing 2013, Samos, Greece. Proceedings, Part I. Lecture Notes in Computer Science, 7816, pp. 423–434. Berlin: Springer.

  31. Poesio, M. (2004). The MATE/GNOME proposals for anaphoric annotation, revisited. In Proceedings of the 5th SIGdial workshop at HLT-NAACL 2004, pp. 154–162. Boston.

  32. Poesio, M., & Artstein, R. (2005). The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the workshop on frontiers in corpus annotation II: Pie in the sky, pp. 76–83, Ann Arbor, MI.

  33. Recasens, M., & Martí, M. A. (2010). AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4), 315–345, Springer.

  34. Recasens, M., & Vila, M. (2010). On paraphrase and coreference. Computational Linguistics, 36(4), 639–647.

  35. Roth, M., & Frank, A. (2012). Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. In Proceedings of the 1st joint conference on lexical and computational semantics (*SEM) (pp. 218–227). Montreal, Canada: Association for Computational Linguistics.

  36. Roth, M., & Frank, A. (2013). Automatically identifying implicit arguments to improve argument linking and coherence modeling. In Proceedings of the 2nd joint conference on lexical and computational semantics (*SEM) (pp. 306–316). Atlanta, Georgia, USA: Association for Computational Linguistics.

  37. Ruppenhofer, J., Ellsworth, M., Petruck, M., Johnson, C. R., & Scheffczyk, J. (2006). FrameNet II: Extended theory and practice. Berkeley, California: International Computer Science Institute.

  38. Ruppenhofer, J., Gorinski, P., & Sporleder, C. (2011). In search of missing arguments: A linguistic approach. In Proceedings of the international conference recent advances in natural language processing (RANLP 2011), pp. 331–338, Hissar, Bulgaria.

  39. Ruppenhofer, J., Lee-Goldman, R., Sporleder, C., & Morante, R. (2012). Beyond sentence-level semantic role labeling: Linking argument structures in discourse. Language Resources and Evaluation, 47(3), 695–721, Springer.

  40. Ruppenhofer, J., Sporleder, C., Morante, R., Baker, C., & Palmer, M. (2010). Semeval-2010 task 10: Linking events and their participants in discourse. In Proceedings of the 5th workshop on semantic evaluations (ACL 2010), pp. 45–50, Uppsala, Sweden.

  41. Silberer, C., & Frank, A. (2012). Casting implicit role linking as an anaphora resolution task. *SEM 2012: The 1st joint conference on lexical and computational semantics—Vol. 1: Proceedings of the main conference and the shared task, and Vol. 2: Proceedings of the 6th international workshop on semantic evaluation (SemEval 2012) (pp. 1–10). Montréal, Canada, Association for Computational Linguistics.

  42. Taulé, M., Martí, M. A., & Borrega, O. (2011). AnCora 2.0: Argument structure guidelines for Catalan and Spanish, Working paper 4: TEXT-MESS 2.0 (Text-Knowledge 2.0).

  43. Taulé, M., Martí, M. A., & Recasens, M. (2008). AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of 6th international conference on language resources and evaluation, pp. 96–101. Marrakesh, Morocco.

  44. Tonelli, S., & Delmonte, R. (2010). VENSES ++: Adapting a deep semantic processing system to the identification of null instantiations. In Proceedings of the 5th international workshop on semantic evaluation (SemEval’10) (pp. 296–299). Stroudsburg, PA, USA: Association for Computational Linguistics.

  45. Tonelli, S., & Delmonte, R. (2011). Desperately seeking implicit arguments in text. In Proceedings of the ACL 2011 workshop on relational models of semantics (pp. 54–62). Stroudsburg, PA, USA: Association for Computational Linguistics.

  46. Wang, N., Li, R., Lei, Z., Wang, Z., & Jin, J. (2013). Document oriented gap filling of definite null instantiation in FrameNet. In M. Sun, et al. (Eds.), Chinese computational linguistics and natural language processing based on naturally annotated big data (pp. 85–96). Lecture notes in computer science. Berlin, Heidelberg: Springer.

  47. Weischedel, R., Hovy, E., Marcus, M., Palmer, M., Belvin, R., Pradhan, S., et al. (2011). OntoNotes: A large training corpus for enhanced processing. In J. Olive, C. Christianson, & J. McCary (Eds.), Handbook of natural language processing and machine translation: DARPA global autonomous language exploitation. New York: Springer.

Acknowledgments

We are grateful to David Bridgewater for proofreading the English. We would also like to express our gratitude to the three anonymous reviewers for their comments and suggestions to improve this article. This work was partly supported by the DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01) projects from the Spanish Ministry of Economy and Competitiveness.

Author information

Corresponding author

Correspondence to Mariona Taulé.

Cite this article

Taulé, M., Peris, A. & Rodríguez, H. Iarg-AnCora: Spanish corpus annotated with implicit arguments. Lang Resources & Evaluation 50, 549–584 (2016). https://doi.org/10.1007/s10579-015-9334-3

Keywords

  • Implicit argument
  • Deverbal nominalizations
  • Argument structure
  • Thematic roles
  • Semantic corpus annotation
  • Linguistic resource