How the corpus-based Basque Verb Index lexicon was built

  • Ainara Estarrona
  • Izaskun Aldezabal
  • Arantza Díaz de Ilarraza
Original Paper


This article describes the method used to build the Basque Verb Index (BVI), a corpus-based lexicon. The BVI is the result of the semiautomatic annotation of the EPEC corpus with verb predicate information, following the PropBank-VerbNet model. The method is the product of an in-depth study of the syntactic–semantic behaviour of verbs in EPEC-RolSem (the EPEC corpus tagged with verb predicate information). While annotating EPEC-RolSem, we identified the different role-patterns associated with every verb appearing in the corpus and stored them in the BVI lexicon. In addition, each entry in the BVI is linked to the corresponding verb entry in well-known resources such as PropBank, VerbNet, WordNet and FrameNet. We have also implemented a tool called e-ROLda, which makes it easier to look up verb patterns in the BVI and examples in EPEC-RolSem as a basis for future studies.


Keywords: Lexicon; PropBank/VerbNet; Semantic roles; Predicate labelling; Valence



This research has been supported by the Basque Government (IXA group, IT344-10), the Ministry of Science and Innovation of the Spanish Government (PROSA-MED, TIN2016-77820-C3-1-R) and MINECO (TUNER, TIN2015-65308-C5-1-R).



Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  • Ainara Estarrona (1)
  • Izaskun Aldezabal (1)
  • Arantza Díaz de Ilarraza (1)
  1. IXA NLP Group, Basque Language and Communication Department, University of the Basque Country, San Sebastián, Spain
