How the corpus-based Basque Verb Index lexicon was built

Abstract

This article describes the method used to build the Basque Verb Index (BVI), a corpus-based lexicon. The BVI is the result of semiautomatic annotation of the EPEC corpus with verb predicate information, following the PropBank-VerbNet model. The method presented is the product of a deep study of the syntactic–semantic behaviour of verbs in EPEC-RolSem (the EPEC corpus tagged with verb predicate information). During the process of annotating EPEC-RolSem, we have identified and stored in the BVI lexicon the different role-patterns associated with all verbs appearing in the corpus. In addition, each entry in the BVI is linked to the corresponding verb entry in well-known resources such as PropBank, VerbNet, WordNet and FrameNet. We have also implemented a tool called e-ROLda to facilitate the process of looking up verb patterns in the BVI and examples in EPEC-RolSem as a basis for future studies.

Notes

  1. http://ixa.si.ehu.es/Ixa.

  2. Only in the case of the 100 verbs analysed in this study.

  3. PropBank has recently added an Arg6 to tag nominal natural disaster Rolesets (Bonial et al. 2017).

  4. Version 3.3 of VerbNet implements many changes: path_rel semantics, initial lexical features, and many updates to verb classes, frames, and members. Full documentation of these changes is not yet available (http://verbs.colorado.edu/verbnet/). In this paper we use the data as it existed before these changes.

  5. Linguistic InfRastructure for Interoperable resourCes and Systems (http://lirics.loria.fr).

  6. See Bonial et al. (2011) for details on the comparison between the VN and LIRICS lists of roles and the decisions taken in version 3.2 of VN.

  7. See Sect. 2.2.1 for an explanation of the Arg0/Arg1 choice.

  8. The examples under (4) could suggest that this verb is mainly a light verb. Light verbs are not the focus of this research, but we are currently studying how light verbs, and the multiword expressions built on them, should be included in the BVI lexicon.

  9. The EPEC-DEP corpus is the EPEC corpus syntactically tagged using a dependency grammar.

  10. The 3 selected verbs were adierazi (‘to state’), izan (‘to be’) and etorri (‘to come’). We chose very different verbs to be able to draw interesting conclusions. The verb adierazi has a single sense and is very frequent in the corpus. The verb izan is the most frequent verb in the corpus (15.22%). Finally, the verb etorri is a priori a difficult verb, because it has 4 senses (not always easily distinguishable) and it is used extensively in complex expressions.

  11. We do not include the time and personnel involved in earlier phases such as setting up the annotation criteria, creating the guidelines, or preparing the tool for the annotation task.

  12. https://hiztegiak.elhuyar.eus/eu_en.

  13. The ARG_INFO tag is the semantic label we have created to annotate verb predicate information. For more details about this label see Estarrona et al. (2016).

  14. We do not take into consideration adjuncts (ArgM) when building lexicon entries.

  15. This figure is taken from our e-ROLda tool, which we present in detail in Sect. 4.

  16. “These verbs are arrayed in a classic Zipfian distribution, with a few verbs occurring very often (say, for example, is the most common verb, with over 10,000 instances in its various inflectional forms), and most verbs occurring two or fewer times” (Palmer et al. 2005a: 13).

  17. At the time of writing, we are working on a Basque NOMLEX and including the information of this new resource in the e-ROLda tool. Because this work is ongoing, the data is still tentative and incomplete at this stage.

  18. http://adesse.uvigo.es/.

  19. http://grial.uab.es/projectes/SenSem.php.

  20. http://clic.ub.edu/corpus/en/ancora.

References

  1. Aduriz, I., Aranzabe, M. J., Arriola, J. M., Atutxa, A., Díaz de Ilarraza, A., Ezeiza, N., Gojenola, K., Oronoz, M., Soroa, A., & Urizar, R. (2006). Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for automatic processing. In A. Wilson, P. Rayson, D. Archer (Eds.), Corpus Linguistics Around the World. Book series: Language and Computers (Vol. 56, pp. 1–15). Rodopi (Netherlands). ISBN: 90-420-1836-4.

  2. Agirre, E., Aldezabal, I., Etxeberria, J., & Pociello, E. (2006). A preliminary study for building the Basque PropBank. In Proceedings of the 5th international conference on language resources and evaluations (LREC’06) (pp. 981–986). Genoa, Italy. ISBN: 2-9517408-2-4.

  3. Aldabe, I., Gonzáles-Dios, I., López-Gazpio, I., Madrazo, J., & Maritxalar, M. (2013). Two approaches to generate questions in Basque. In Procesamiento del Lenguaje Natural (Vol. 51, pp. 101–108). Print ISSN: 1135-5948. Online ISSN: 1989-7553.

  4. Aldezabal, I. (2004). Aditz-azpikategorizazioaren azterketa. 100 aditzen azterketa zehatza, Levin (1993) oinarri harturik eta metodo automatikoak baliatuz. Ph.D. Thesis, Leioa (Bilbao), University of Basque Country.

  5. Aldezabal, I. (2010). Basis for the annotation of EPEC-RolSem. Interdisciplinary Workshop on Verbs. The Identification and Representation of Verb Features. Scuola Normale Superiore - Laboratori di Linguistica (pp. 92–97). Università di Pisa, Dipartimento di Linguistica. Pisa (Italy).

  6. Aldezabal, I., Aranzabe, M. J., Arriola, J. M., & Díaz de Ilarraza, A. (2009). Syntactic annotation in the Reference Corpus for the Processing of Basque (EPEC): Theoretical and practical issues. Corpus Linguistics and Linguistic Theory, 5(2), 241–269.

  7. Aldezabal, I., Aranzabe, M. J., Díaz de Ilarraza, A., & Estarrona, A. (2010b). Building the Basque PropBank. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner & D. Tapias (Eds.), Proceedings of the seventh conference on international language resources and evaluation (LREC 2010) (pp. 1414–1417). European Language Resources Association (ELRA). LREC 2010, Valletta (Malta), May 19–21, 2010. ISBN: 2-9517408-6-7.

  8. Aldezabal, I., Aranzabe, M. J., Díaz de Ilarraza, A., & Estarrona, A. (2011). Preliminary evaluation of EPEC-RolSem, a Basque corpus labelled at predicate level. In XXVII Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN 2011). Universidad de Huelva.

  9. Aldezabal, I., Aranzabe, M. J., Díaz de Ilarraza, A., Estarrona, A., & Uria, L. (2010a). EusPropBank: Integrating semantic information in the Basque dependency treebank. In A. Gelbukh (Ed.), Lecture notes in computer science (LNCS) no 6008, computational linguistics and intelligent text processing (pp. 60–73). Berlin, Heidelberg, New York: Springer. ISSN: 0302-9743, ISBN-10: 3-642-12115-2.

  10. Aparicio, J. (2007). Clasificación semántica de los predicados del español. Masters Dissertation, Universitat de Barcelona.

  11. Aparicio, J., Taulé, M., & Martí, M.A. (2008). AnCora-Verb: A lexical resource for the semantic annotation of corpora. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis & D. Tapias (Eds.), Proc. of 6th international conference on language resources and evaluation (LREC’08) (pp. 797–802). ELRA. ISBN: 2-9517408-4-0.

  12. Aranzabe, M. J., Atutxa, A., Bengoetxea, K., Díaz de Ilarraza, A., Goenaga, I., Gojenola, K., & Uria, L. (2015). Automatic conversion of the Basque dependency treebank to universal dependencies. In M. Dickinson, E. Hinrichs, A. Patejuk, A. Przepiórkowski (Eds.), Proceedings of the fourteenth international workshop on treebanks and linguistic theories (TLT14) (pp. 233–241). Institute of Computer Science of the Polish Academy of Sciences, Warszawa, Poland. ISBN: 978-83-63159-18-4.

  13. Babko-Malaya, O., Bies, A., Taylor, A., Yi, S., Palmer, M., Marcus, M., Kulick, S., & Shen, L. (2006). Issues in synchronizing the English Treebank and PropBank. In Proc. of the workshop on frontiers in linguistically annotated corpora, a merged workshop with 7th int. workshop on linguistically interpreted corpora (LINC-2006) and frontier in corpus annotation III (Coling/ACL 2006) (pp. 70–77). Association for Computational Linguistics (ACL). Sydney, Australia. ISBN: 1-932432-78-7.

  14. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet Project. In Proceedings of COLING-ACL’98. 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics (COLING-ACL’98) (pp. 86–90). Montréal, Quebec: Morgan Kaufmann Publishers/ACL.

  15. Bhatt, R., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D., & Xia, F. (2009). A multi-representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the third linguistic annotation workshop, ACL-IJCNLP 2009 (pp. 186–189). Association for Computational Linguistics (ACL). Suntec, Singapore.

  16. Bonial, C., Bonn J., Conger K., Hwang, J., Palmer M., & Reese N. (2015). English PropBank annotation guidelines. http://propbank.github.io/. Accessed 4 Sept 2016.

  17. Bonial, C., Conger, K., Hwang, J. D., Mansouri, A., Aseri, Y., Bonn, J., et al. (2017). Current directions in English and Arabic PropBank. In N. Ide & J. Pustejovsky (Eds.), Handbook of linguistic annotation (pp. 737–769). Berlin: Springer.

  18. Bonial, C., Corvey, W., Palmer, M., Petukhova, V., & Bunt, H. C. (2011). A hierarchical unification of LIRICS and VerbNet semantic roles. In Proc. of the workshop on semantic annotation for computational ling. Resources (SACL-ICSC 2011) (pp. 483–489). IEEE. Palo Alto, California, USA. ISBN: 978-1-4577-1648-5.

  19. Bunt, H. C., Petukhova, V., & Schiffrin, A. (2007). LIRICS Deliverable D4.4. Multilingual test suites for semantically annotated data. http://lirics.loria.fr. Accessed 23 June 2014.

  20. Castellón, I., Fernández, A., Vázquez, G., Alonso, L., & Capilla, J. A. (2006). The Sensem corpus: A corpus annotated at the syntactic and semantic level. In Proceedings of the fifth international conference on language resources and evaluation (LREC’06) (pp. 355–358). European Language Resources Association (ELRA). Genoa, Italy. ISBN: 2-9517408-2-4.

  21. Civit, M., Aldezabal, I., Pociello, E., Taulé, M., Aparicio, J., & Márquez, L. (2005). 3LBLEX: Léxico verbal con frames sintáctico-semánticos. In Procesamiento del Lenguaje Natural (Vol. 35, pp. 367–373). Print ISSN: 1135-5948. Online ISSN: 1989-7553.

  22. de Rijk, R. (1969). Is Basque an SOV language? Fontes Linguae Vasconum, 1, 319–351.

  23. Estarrona, A. (2014). EPEC corpusa predikatu-mailan etiketatzeko oinarriak: EPEC-RolSem, BVI eta e-ROLda. Ph.D. Thesis, Basque Language and Communication, Basque Country University (UPV-EHU), Donostia.

  24. Estarrona, A., Aldezabal, I., Díaz de Ilarraza, A., & Aranzabe, M. J. (2016). Methodology for the semiautomatic annotation of EPEC-RolSem, a Basque corpus labelled at predicate level following the PropBank/VerbNet model. In E. Vanhoutte (Ed.), Digital scholarship in the humanities (Vol. 31, No. 3, pp. 470–492). http://dx.doi.org/10.1093/llc/fqv010. First published online: 17 June 2015 (23 pages). Published by Oxford University Press on behalf of EADH: The European Association for Digital Humanities (Online ISSN 2055-768X - Print ISSN 2055-7671). https://academic.oup.com/dsh/article/31/3/470/1745349.

  25. Fellbaum, C. (1998). WordNet, an electronic lexical database. Cambridge: MIT Press. ISBN 0-262-06197-X.

  26. García-Miguel, J., & Albertuz, F. J. (2005). Verbs, semantic classes and semantic roles in the ADESSE project. In K. Erk, A. Melinger & S. Schulte im Walde (Eds.), Proc. of workshop on the identification and representation of verb features and verb classes (pp. 50–55). Saarbrücken, Germany.

  27. Gardent, C., & Cerisara, C. (2010). Semi-automatic Propbanking for French. In Proceedings of the ninth international workshop on treebanks and linguistic theories (pp. 67–78). Northern European Association for Language Technology (NEALT). Tartu, Estonia. Print ISSN: 1736-8197. Online ISSN: 1736-6305.

  28. Hajic, J., Panevová, J., Urešová, Z., Bémová, A., Kolárová, V., & Pajas, P. (2003). PDT-VALLEX: Creating a large-coverage valency lexicon for treebank annotation. In J. Nivre & E. Hinrichs (Eds.), Proc. of the second workshop on treebanks and linguistic theories (pp. 57–68). ISBN: 9176363945 9789176363942.

  29. Hanks, P. (2012). The corpus revolution in lexicography. International Journal of Lexicography, 25(4), 398–436.

  30. Kingsbury, P., & Palmer, M. (2003). PropBank: the next level of treebank. In Proceedings of the second workshop on treebanks and linguistic theories (TLT 2003) (Vol. 3). ISBN: 9176363945 9789176363942.

  31. Kipper, K. (2005). VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. Thesis, U. of Pennsylvania.

  32. Kipper, K., Palmer, M., & Rambow, O. (2002). Extending PropBank with VerbNet semantic predicates. In Workshop on applied interlinguas, AMTA-2002. Tiburon, CA, USA.

  33. Laka, I. (1996). A brief grammar of Euskara, the Basque language. University of the Basque Country. ISBN: 84-8373-850-3. http://www.ehu.es/grammar. Accessed 21 Feb 2017.

  34. Laparra, E., & Rigau, G. (2010). eXtended WordFrameNet. In N. Calzolari (Conference Chair), K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner & D. Tapias (Eds.), Proceedings of the 7th international conference on language resources and evaluation (LREC’10) (pp. 1214–1219). European Language Resources Association (ELRA). ISBN: 2-9517408-6-7.

  35. Levin, B. (1993). English verb classes and alternations. A preliminary investigation. Chicago, London: The University of Chicago Press. ISBN 0-226-47533-6.

  36. Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics Journal, 19(2), 313–330.

  37. Merlo, P., & Van der Plas, L. (2009). Abstraction and generalisation in semantic role labels: PropBank, VerbNet or both? In Proceedings of the 47th annual meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 288–296). Association for Computational Linguistics (ACL). Suntec, Singapore.

  38. Monachesi, P., Stevens, G., & Trapman, J. (2007). Adding semantic role annotation to a corpus of written Dutch. In Proceedings of the linguistic annotation workshop (LAW’07) (pp. 77–84). Association for Computational Linguistics (ACL). Prague, Czech Republic.

  39. Palmer, M., Babko-Malaya, O., Bies, A., Diab, M., Maamouri, M., Mansouri, A., & Zaghouani, W. (2008). A pilot arabic propbank. In N. Calzolari (Conference Chair), K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis & D. Tapias (Eds.), Proceedings of the sixth conference on international language resources and evaluation (LREC’08) (pp. 3467–3471). European Language Resources Association (ELRA). Marrakech, Morocco. ISBN: 2-9517408-4-0.

  40. Palmer, M., Gildea, D., & Kingsbury, P. (2005a). The proposition bank: A corpus annotated with semantic roles. Computational Linguistics Journal, 31(1), 71–106.

  41. Palmer, M., Nianwen, X., Babko-Malaya, O., Chen, J., & Snyder, B. (2005b). A parallel proposition bank II for Chinese and English. In Proceedings of the workshop on frontiers in corpus annotations II: Pie in the Sky (pp. 61–67). Association for Computational Linguistics (ACL). Ann Arbor, Michigan, USA.

  42. Palmer, M., Ryu, S., Choi, J., Yoon, S., & Jeon, Y. (2006). Korean PropBank. LDC2006T03. Philadelphia: Linguistic Data Consortium. ISBN 1-58563-374-7.

  43. Pociello, E., Agirre, E., & Aldezabal, I. (2010). Methodology and construction of the Basque WordNet. Language Resources and Evaluation Journal, 45(2), 121–142.

  44. Pradhan, S., Hovy, E., Marcus, M. P., Palmer, M., Ramshaw, L. A., & Weischedel, R. M. (2007). OntoNotes: A unified relational semantic representation. International Journal of Semantic Computing, 1(4), 405–419.

  45. Salaberri, H., Arregi, O., & Zapirain, B. (2014). First approach toward semantic role labeling for Basque. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk & S. Piperidis (Eds.), Proceedings of the 9th language resources and evaluation conference (LREC’14) (pp. 1387–1393). European Language Resources Association (ELRA). Reykjavik, Iceland. ISBN: 978-2-9517408-8-4.

  46. Schiffrin, A., & Bunt, H. C. (2007). LIRICS deliverable D4.3. Document compilation of semantic data categories. http://lirics.loria.fr. Accessed 23 June 2014.

  47. Talmy, L. (1985). Lexicalization patterns: Semantic structure in lexical forms. In T. Shopen (Ed.), Language typology and syntactic description: Grammatical categories and the lexicon (Vol. 3, pp. 36–149). Cambridge: Cambridge University Press.

  48. Taulé, M., Castellví, J., Martí, M. A., & Aparicio, J. (2006). Fundamentos teóricos y metodológicos para el etiquetado semántico de CESS-CAT y CESS-ESP. Procesamiento del Lenguaje Natural, 37, 75–82.

  49. Van Der Plas, L., Samardžić, T., & Merlo, P. (2010). Cross-lingual validity of PropBank in the manual annotation of French. In Proceedings of the 4th linguistic annotation workshop (LAW IV ‘10) (pp. 113–117). Association for Computational Linguistics (ACL). ISBN 978-1-932432-72-5 / 1-932432-72-8.

  50. Vázquez, G., Fernández, A., & Martí, M. A. (2000). Clasificación Verbal. Alternancias de Diátesis. Quaderns de Sintagma 3. Edicions de la Universitat de Lleida. Lleida. ISBN: 84-8409-067-1.

  51. Xue, N. (2008). Labeling Chinese predicates with semantic roles. Computational Linguistics, 34(2), 225–255.

  52. Xue, N., & Palmer, M. (2009). Adding semantic roles to the Chinese Treebank. Natural Language Engineering, 15(1), 143–172.

  53. Zipf, G. (1949). Human behavior and the principle of least effort. Addison-Wesley Press. ISBN-13: 978-1614273127. ISBN-10: 161427312X.

Acknowledgements

This research has been supported by the Basque Government (IXA group (IT344-10)), the Ministry of Science and Innovation of the Spanish Government (PROSA-MED (TIN2016-77820-C3-1-R)) and MINECO (TUNER (TIN 2015-65308-C5-1-R)).

Author information

Corresponding author

Correspondence to Ainara Estarrona.

About this article

Cite this article

Estarrona, A., Aldezabal, I. & Díaz de Ilarraza, A. How the corpus-based Basque Verb Index lexicon was built. Lang Resources & Evaluation 54, 73–95 (2020). https://doi.org/10.1007/s10579-018-9440-0

Keywords

  • Lexicon
  • PropBank/VerbNet
  • Semantic roles
  • Predicate labelling
  • Valence