How the corpus-based Basque Verb Index lexicon was built


This article describes the method used to build the Basque Verb Index (BVI), a corpus-based lexicon. The BVI is the result of semiautomatic annotation of the EPEC corpus with verb predicate information, following the PropBank-VerbNet model. The method presented is the product of a deep study of the syntactic–semantic behaviour of verbs in EPEC-RolSem (the EPEC corpus tagged with verb predicate information). During the process of annotating EPEC-RolSem, we have identified and stored in the BVI lexicon the different role-patterns associated with all verbs appearing in the corpus. In addition, each entry in the BVI is linked to the corresponding verb entry in well-known resources such as PropBank, VerbNet, WordNet and FrameNet. We have also implemented a tool called e-ROLda to facilitate the process of looking up verb patterns in the BVI and examples in EPEC-RolSem as a basis for future studies.
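As a rough illustration of the kind of entry described above, the following Python sketch models a lexicon entry that stores the role-patterns observed for one verb, together with links to external resources. It is only a sketch under stated assumptions: the class name, field names, sample annotations and link placeholder are hypothetical and are not taken from the BVI or from e-ROLda. The role labels are PropBank-style placeholders. Adjuncts (ArgM) are dropped before a pattern is stored, in line with the note that they are not considered when building lexicon entries.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VerbEntry:
    """Hypothetical lexicon entry: one verb, its role-patterns, its links."""
    lemma: str
    gloss: str
    # Maps a role-pattern (sorted tuple of core role labels) to its count.
    patterns: Counter = field(default_factory=Counter)
    # Maps a resource name (e.g. "PropBank") to an identifier in that resource.
    links: dict = field(default_factory=dict)

    def add_annotation(self, roles):
        """Record one annotated verb instance, ignoring adjuncts (ArgM*)."""
        core = tuple(sorted(r for r in roles if not r.startswith("ArgM")))
        self.patterns[core] += 1


# Invented sample annotations for illustration only (not corpus data).
entry = VerbEntry(lemma="adierazi", gloss="to state")
entry.add_annotation(["Arg0", "Arg1", "ArgM-TMP"])
entry.add_annotation(["Arg1", "Arg0"])
entry.add_annotation(["Arg1"])
# Placeholder link; a real entry would hold an actual identifier.
entry.links["PropBank"] = "<roleset-id-placeholder>"

for pattern, count in entry.patterns.most_common():
    print(entry.lemma, pattern, count)
```

Storing each pattern as a sorted tuple means that two instances with the same core roles in a different order count as the same pattern, which is one possible design choice among several.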

  2.

    Only in the case of the 100 verbs analysed in this study.

  3.

    PropBank has recently added an Arg6 to tag nominal natural disaster Rolesets (Bonial et al. 2017).

  4.

    In version 3.3 of VerbNet many changes have been implemented (path_rel semantics, initial lexical features, and many updates to verb classes, frames and members), but full documentation of these changes is not yet available. In this paper we use the data as they existed before these changes.

  5.

    Linguistic InfRastructure for Interoperable resourCes and Systems.

  6.

    See Bonial et al. (2011) for details on the comparison between VN and LIRICS lists of roles and the decisions taken in the 3.2 version of VN.

  7.

    See Sect. 2.2.1 for an explanation of the Arg0/Arg1 choice.

  8.

    The examples under (4) could suggest that this verb is mainly a light verb. Light verbs are not the focus of this research, but we are currently studying how light verbs, and the multiword expressions built on them, should be included in the BVI lexicon.

  9.

    The EPEC-DEP corpus is the EPEC corpus syntactically tagged using a dependency grammar.

  10.

    The three selected verbs were adierazi (‘to state’), izan (‘to be’) and etorri (‘to come’). We chose very different verbs in order to draw interesting conclusions. The verb adierazi has a single sense and is very frequent in the corpus. The verb izan is the most frequent verb in the corpus (15.22%). Finally, the verb etorri is a priori a difficult verb, because it has four senses (not always easily distinguishable) and is used extensively in complex expressions.

  11.

    We do not include the time and personnel involved in earlier phases such as setting up the annotation criteria, creating the guidelines, or preparing the tool for the annotation task.

  13.

    The ARG_INFO tag is the semantic label we have created to annotate verb predicate information. For more details about this label, see Estarrona et al. (2016).

  14.

    We do not take into consideration adjuncts (ArgM) when building lexicon entries.

  15.

    This figure is taken from our e-ROLda tool, which we present in detail in Sect. 4.

  16.

    “These verbs are arrayed in a classic Zipfian distribution, with a few verbs occurring very often (say, for example, is the most common verb, with over 10,000 instances in its various inflectional forms), and most verbs occurring two or fewer times” (Palmer et al. 2005a: 13).

  17.

    At the time of writing, we are working on a Basque NOMLEX and including the information of this new resource in the e-ROLda tool. Given the fact that the work is ongoing, the data is still tentative and incomplete at this stage.


This research has been supported by the Basque Government (IXA group, IT344-10), the Ministry of Science and Innovation of the Spanish Government (PROSA-MED, TIN2016-77820-C3-1-R) and MINECO (TUNER, TIN 2015-65308-C5-1-R).

  • Lexicon
  • PropBank/VerbNet
  • Semantic roles
  • Predicate labelling
  • Valence