Investigating interoperable event corpora: limitations of reusability of resources and portability of models

Studies on the applicability of heterogeneous semantically interoperable corpora are rare. We investigate to what extent reusability (both of systems and of annotations) is entailed by corpora whose interoperability is based on compliance with standards. In particular, we look at event detection in English texts, supported by the ISO-TimeML annotation scheme. We run two sets of experiments using a common neural network architecture and extensively evaluate our results in both in-distribution and out-of-distribution settings. In all experimental settings, systems obtain state-of-the-art results on in-distribution data and underperform on out-of-distribution data, setting limits to the benefits of semantically interoperable corpora. By means of a detailed error analysis, we show that while compliance with a standard guarantees semantic interoperability, this is only a necessary condition for reusability, with factors such as differences in the quality of the annotations having a much stronger impact.


Introduction
The NLP community has long pursued a specific goal for its language resources, namely that of semantic interoperability. In recent years, most efforts have focused on implementing semantic interoperability for language resource infrastructures with dedicated initiatives and EU-funded projects (de Jong et al., 2020; Rehm et al., 2020, among others). When it comes to (annotated) data, the debate on what constitutes semantic interoperability and how to best implement it is still open (Witt et al., 2009; Ide & Pustejovsky, 2010; Chiarcos, 2012; Hajičová, 2014). Besides specific differences, one of the promises of semantic interoperability is to make annotated data for a specific language phenomenon reusable. Reusability of annotated data has different facets: on the one hand, it can be interpreted as the possibility of searching across different corpora for the same language phenomenon. On the other hand, it can be interpreted as the combination of multiple corpora to increase the diversity of training material for stochastic NLP systems. The aim of this article is to investigate this latter interpretation of semantic interoperability.
Interoperability of (annotated) data is closely related to portability of supervised learning systems. Following Ettinger et al. (2017), good portability of systems would indicate the development of more robust NLP technologies. This means that a system trained for a specific task is expected to perform well (or with very limited losses) across datasets. Moreover, portability requires that predictions made by the system are consistent, even in the presence of small perturbations to the input (Wang et al., 2020). To achieve portability one can apply transfer learning and domain adaptation (Blitzer et al., 2006; Daumé III, 2007; Ma et al., 2014; Ruder & Plank, 2017; Wu & Huang, 2016).
Another strategy for portability, one that is the focus of this article, is training systems with larger and more varied training data (Tu et al., 2020). Previous work (Alex et al., 2006; van Erp et al., 2016) has shown that the lack of interoperability of the categories used to annotate the data is a recipe for failure in both the reusability of annotated resources and the portability of systems. On the other hand, the use of interoperable annotated resources has proven to be successful. A case in point are the interoperable treebanks for syntactic parsing (Niu et al., 2009; Stymne et al., 2018), whose experimental data support the connection between the interoperability of annotated resources and the portability of systems. This suggests that interoperability can be a faster and more reliable way to access diversified supervised training material than simply concatenating datasets regardless of their annotation specifications (Witt et al., 2009; Poch & Bel Rafecas, 2011).
But one swallow doesn't make a summer. In this contribution we conduct a thorough empirical investigation of the benefits of semantically interoperable annotated corpora. The specific cases subject to our scrutiny are corpora annotated with information about events. Events have been at the heart of Linguistics, Philosophy, and Artificial Intelligence. Much work has been conducted in this area, reaching a level of maturity and consensus that has boosted the development of dedicated semantically interoperable corpora. We will empirically assess, for the first time as far as we are aware, the extent to which the relationship between interoperability and reusability (of systems and data) holds for event-annotated corpora. In particular, we address a non-trivial dimension of interoperability, namely the use of a shared vocabulary and markables across annotation schemes. Our main contributions are:
- an in-depth investigation of the interoperability and compatibility of event-annotated corpora for English compliant with the ISO-TimeML standard;
- a detailed analysis of the factors that impact the interoperability of language resources;
- state-of-the-art models for event trigger detection. 1
This article is organised as follows. First, in Sect. 2, we give a thorough background on how interoperability is conceived in the NLP community and the presumed advantages of having semantically interoperable language resources. We then provide an overview of the selected task, i.e., event detection (Sect. 3), and discuss current state-of-the-art methods that have been used to address it. Section 4 discusses the ISO-TimeML annotation scheme, a standard for event and temporal annotation, and introduces three ISO-TimeML compliant corpora. The corpora will be used to investigate semantic interoperability, with particular attention to two aspects: portability of systems (Sect. 5) and reusability of annotated data (Sect. 6).
Section 7 presents a detailed analysis to identify differences in the application of the annotation guidelines that may have had an impact on the interoperability experiments. We conclude with directions for future work in Sect. 8.

Syntactic and semantic interoperability
The increasing popularity of data-driven methods in NLP has resulted in the development of a large and varied number of linguistic corpora with an even larger variation in annotations. For instance, the LRE Map 2 documents more than 2,608 written corpora for different languages. Similarly, there has been a parallel development of tools to represent, annotate, and visualise such varied data. This proliferation of tools is accompanied by a bottleneck: representation formalisms could not be shared or combined, thus limiting the possibility of applying different annotation layers to the same piece of text or of integrating tools into more complex text processing pipelines. As Chiarcos (2012) points out, the desire to address this bottleneck was the driving motivation for investigating interoperability.
Interoperability is a composite notion that takes into account different dimensions and levels of analysis including metadata, data, and tools. Interoperability is now a key goal of standardisation efforts (e.g., ISO TC/37 SC4) and of language resource infrastructures (e.g., META-SHARE, 3 CLARIN-ERIC, 4 European Language Grid 5 ). Two macro areas of application of interoperability must be distinguished: interoperability of NLP systems, and interoperability of corpora. These two areas of applications are strictly connected, although in this contribution we will investigate only the latter: the interoperability of annotated corpora.
There is a consensus in distinguishing two types of interoperability when it comes to corpora: syntactic (sometimes referred to as structural) and semantic interoperability. Syntactic interoperability is defined as convergence towards a common, or pivot, formalism of annotations of different origins to allow uniform processing of different resources (Chiarcos, 2012). Syntactic interoperability aims at representing all annotations of a corpus in a way that allows their storage and querying regardless of their original annotation layer. Essentially, syntactic interoperability is seen as "the ability of systems to process exchanged data either directly or via conversion" (Calzolari et al., 2011, p. 45). Generic examples of syntactically interoperable formats are XML and OWL/RDF. Standardisation efforts such as TMF (ISO 16642), SynAF (Declerck, 2006), LAF/GrAF (Ide & Suderman, 2007), and the NLP Interchange Format (NIF) (Hellmann et al., 2013) all qualify as instances of syntactic interoperability: they propose and define data models that allow uniform representations of different annotation layers.
Semantic interoperability, in contrast, is more challenging and addresses the heterogeneity of linguistic annotations. While representing richness of analyses, heterogeneity of annotations is a major hurdle for reusability and, consequently, for the portability of systems. Following Ide and Pustejovsky (2010), semantic interoperability of corpora is "the ability to automatically interpret exchanged information meaningfully and accurately in order to produce useful results". The key idea behind semantic interoperability is that sharing a common vocabulary, or a repository of linguistic annotation terminology, across language resources is a way to enrich knowledge and exchange of information. The use of a common annotation terminology enables the resolution of the linguistic information from one corpus against the information from another corpus: "[r]eference definitions [i.e., the common terminology repository (au.)] provide an interlingua that allows mapping linguistic annotations from annotation scheme A to annotations in accordance with scheme B" (Chiarcos, 2012, p. 163).
This sounds great in theory, but what does it mean in practice? The creation of shared linguistic repositories, such as ISOCat (Kemps-Snijders et al., 2009; Windhouwer & Wright, 2012) among others, is not inherently free from problems, since different communities develop and maintain them independently. In some cases, linguistic repositories may fail to be compatible with the definitions they provide. The rise of the Semantic Web and the adoption of the Linked Open Data (LOD) principles have contributed to establishing innovative practices and solutions for attaining semantic interoperability. This has led to the development of new community standards, recommendations, and shared vocabularies, e.g., Ontologies of Linguistic Annotation (Chiarcos & Sukhareva, 2015) for linguistic data categories, or OntoLex-Lemon (McCrae et al., 2017) for lexical resources. Using LOD principles to make resources available allows them to be always uniquely identifiable, linked to one another in a uniform way, and immediately retrieved and processed through standard Web protocols. One of the most innovative aspects of this approach is that the same formalism (OWL/RDF) is used to address syntactic and semantic interoperability simultaneously (Chiarcos, 2012; Chiarcos et al., 2013).
A complementary approach to LOD to achieve semantic interoperability is the development and use of common annotation schemes. Hajičová (2014) distinguishes two ways to instantiate this approach: (i) the application of a common scheme to different languages; or (ii) the convergence towards a single representation format, in other words an interlingua (Witt et al., 2009), for a common phenomenon encoded by different annotation schemes. Clear examples of the first method are initiatives such as the Universal Dependency Treebank (Nivre et al., 2016, 2020), SemEval 2010 Task 13: TempEval-2 (Verhagen et al., 2010), SemEval 2016 Task 5: Aspect-Based Sentiment Analysis (Pontiki et al., 2016), and SemEval 2020 Task 12: OffensEval (Zampieri et al., 2020). The second method, i.e., the convergence towards a common scheme, is a more specific definition of semantic interoperability when compared to those by Ide and Pustejovsky (2010) and Chiarcos (2012), stressing concrete ways in which the "meaningful and accurate" interpretation of information can be achieved. This latter method is usually exemplified by conversion between different representation formats (de Marneffe et al., 2006; Niu et al., 2009; O'Gorman et al., 2018). Both methods call for a vision of interoperability that directly promotes the reuse of annotated data rather than creating them from scratch. At the same time, the adoption of a common representation scheme goes a step further by directly promoting the reuse of systems as well.
Directly reusing corpora and their annotations has important consequences for the development of NLP systems. Stochastic NLP approaches are data-hungry, and having access to multiple corpora for the same phenomenon annotated in an interoperable format is a way to address this issue (Comeau et al., 2013). The vision promoted here is that more (varied) data will lead to more robust systems, and consequently, systems that are more portable across heterogeneous datasets (Tu et al., 2020). This does not necessarily guarantee that these systems have also learned to "generalise well" or, in other words, that they have freed themselves from the chains of their datasets and "learned" a language phenomenon. Nevertheless, increasing the robustness of models is a necessary step to attain generalisability (Ettinger et al., 2017).
This contribution investigates the interoperability of event-annotated corpora. In particular, we consider the case of event annotation promoted by ISO-TimeML for English texts. ISO-TimeML represents a peculiar case for the study of semantic interoperability. First, ISO-TimeML is a common representation format for annotating events (an interlingua). Second, the available annotation schemes and corpora are all described as ISO-TimeML compliant, meaning that ISO-TimeML is used as a common reference format to develop new annotation schemes and guidelines. This is the reverse of the convergence process considered by Hajičová (2014). Third, ISO-TimeML compliant corpora match the definition of semantic interoperability following Chiarcos (2012) and Ide and Pustejovsky (2010): the corpora share a common vocabulary for the definition of events, allowing systems to "automatically interpret exchanged information meaningfully and accurately" (Ide & Pustejovsky, 2010). At the same time, they instantiate a case of an inverted interlingua approach. In particular, the annotation schemes are derived from the interlingua (ISO-TimeML), introducing further issues concerning the compatibility and complementarity of the specific annotations.

Event detection: a short overview
In the previous section we presented background information on the notion of semantic interoperability and its advantages for the development of NLP systems. In this section we introduce the specific NLP task we selected to study interoperability: event detection. In particular, we explain what constitutes event detection in the context of NLP, give an overview of the most important annotated corpora, and present the state-of-the-art approaches developed for this task.

Task definition
Event detection is a complex task with a strong tradition in NLP (Ahn, 2006; Bethard, 2013; Ji & Grishman, 2008; Nguyen & Grishman, 2015; Ritter et al., 2012). The problem has been framed as follows: given a document D, identify all pairs of linguistic expressions (w_i, w_j) ∈ D, where w_i is an instance of an event trigger and w_j is an instance of an event participant. Event triggers correspond to those linguistic expressions in a document that denote the happening of something, or a state being valid (i.e., what) (Bach, 1986). Participants refer to the tokens (or phrases) that express the actors (i.e., who and whom), the location (i.e., where), and the time of occurrence (i.e., when). Given this formulation, it is easy to see that events are hubs of information that explicate complex relationships between people, places, objects, actions, and states.
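The trigger side of this formulation is commonly cast as token-level sequence labeling. The following sketch (the sentence, span offsets, and tag names are illustrative, not taken from any of the corpora discussed here) shows how gold trigger spans can be encoded as BIO tags:

```python
def to_bio(tokens, trigger_spans):
    """Encode gold event-trigger spans as BIO tags over a token sequence."""
    tags = ["O"] * len(tokens)
    for start, end in trigger_spans:  # [start, end) token offsets
        tags[start] = "B-EVENT"
        for i in range(start + 1, end):
            tags[i] = "I-EVENT"
    return tags

# Toy sentence: "acquired" is the event trigger (w_i);
# "Google" and "YouTube" would be participants (w_j).
tokens = ["Google", "acquired", "YouTube", "in", "2006", "."]
print(to_bio(tokens, [(1, 2)]))  # ['O', 'B-EVENT', 'O', 'O', 'O', 'O']
```

A classifier then predicts one tag per token, and trigger spans are read back off the predicted tag sequence.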
The identification of event triggers is a particularly challenging task: the "eventiveness" of a linguistic expression w i is highly dependent on its context because there exists a continuum between eventive and non-eventive interpretation in the space of event semantics (Araki et al., 2018). Event participants, on the other hand, are defined as "participant roles that can be filled" (Linguistic Data Consortium, 2005) and their identification is not guided by the syntactic structure of the predicates but by semantic schemes that represent event-related scenarios. 6

Computational approaches
Previous work on event detection has mainly adopted supervised approaches. 7 Two major waves of systems can be identified, namely: (i) feature-based models; and (ii) neural network architectures. Feature-based models (e.g., Support Vector Machines (SVMs), Conditional Random Fields (CRFs)) make use of hand-crafted symbolic features combining linguistic and domain-specific knowledge (Ji & Grishman, 2008; Jung & Stent, 2013; Chen & Ji, 2009; Caselli & Morante, 2018; Venugopal et al., 2014). Neural network architectures, on the other hand, either based on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) and their extensions (e.g., Gated Recurrent Units (GRUs) or Long Short-Term Memory (LSTM) networks), have shown their effectiveness in reducing systems' dependence on toolkits and external resources for feature extraction by automatically learning features from the data 8 (Araki, 2018; Huang et al., 2018; Nguyen & Grishman, 2015; Nguyen et al., 2016).
A difference that cuts across the specific algorithms is how the task is modeled. Early systems followed a two-stage approach: first, event triggers are identified (and classified), and subsequently this information is used to predict the participants in the event. More recent approaches predict event triggers and participants jointly (Li et al., 2013). The advantage of joint modeling is mainly a reduction of error propagation across the NLP pipeline. Furthermore, since event triggers of the same type tend to co-occur with the same set of participants, joint approaches benefit from this additional information and obtain better results. For instance, on the ACE corpus, the joint model by Nguyen et al. (2016) obtains an F1 score of .693 for event trigger detection and classification and .554 for argument roles, improving on the pipeline model by Nguyen and Grishman (2015).
More recently, pre-trained language models have been successfully applied to this task (Caselli & Üstün, 2019; Yang et al., 2019), reaching new state-of-the-art results. Next to this, a new wave of systems has been proposed based on the development of transferable neural network learning techniques with a common semantic space of shared embedding representations (Huang et al., 2018). The approach, also labeled share-and-transfer, first learns the extraction models over this common space, and subsequently applies them to the target data. The advantage of the method is that the learned event knowledge can be transferred, i.e., becomes available, for recognizing unseen content in low-resource settings.

Corpora
Most previous work on event detection has focused on contemporary texts covering different domains, including news articles (Pustejovsky et al., 2003; Linguistic Data Consortium, 2005; Song et al., 2015; Minard et al., 2016), (bio-)medical documents (Bethard et al., 2016, 2017), and social media (Ritter et al., 2012). Recently, event extraction has been applied also to historical texts (Sprugnoli & Tonelli, 2019). Evaluation campaigns and dedicated workshops 9 have played a big role in boosting research by promoting the availability of numerous benchmark corpora and opening discussions for annotation proposals and refinements of the definition of events. For instance, in the Message Understanding Conference (MUC)'s tasks (Sundheim, 1992; Chinchor, 1998), event detection is restricted to predetermined event instances (e.g., joint venture announcements or rocket launches) based on a scenario filling task of template elements, with specific fields roughly corresponding to the event participants. More fine-grained annotations have been proposed in the Automatic Content Extraction (ACE) campaign, TempEval (Verhagen et al., 2007, 2010; UzZaman et al., 2013) and Clinical TempEval (Bethard et al., 2016, 2017), the Knowledge Base Population track at the Text Analysis Conference (TAC KBP), 10 and the i2b2 challenge. 11 Variations in the annotation of events affect different dimensions, the most relevant being: (i) the definition of what constitutes an event and how to annotate its linguistic realization(s); and (ii) the assignment of events to specific classes. Such differences make most of the event-labeled corpora incompatible with each other, preventing the direct reuse of their data and their automatic conversion across formats. For instance, ACE restricts the annotation of events to the occurrence of eight semantic classes (i.e., Life, Movement, Conflict, Business, Contact, Personnel, Justice, Transaction) in news articles.
The TAC KBP campaigns adopt an approach similar to ACE by limiting the annotation to specific semantic classes. However, they introduce a new annotation scheme, Entities Relations Events (Mitamura et al., 2015; Song et al., 2016), where events are annotated on the basis of event nuggets, i.e., "a semantically meaningful unit that expresses an event [being] either a single word (verb, noun, or adjective) or a phrase which is continuous or discontinuous" (Araki, 2018, p. 116).
The TempEval campaigns have promoted the use of TimeML (Pustejovsky et al., 2003a), which rejects any restriction on the semantic classes and linguistic realizations of events. The TimeML annotation philosophy is surface-oriented. The event classes do not correspond to any semantic category but rather are based on both the lexical aspect (Vendler, 1967) and the contextual syntactic structure. The consolidation of TimeML into an ISO standard had a beneficial effect in promoting the development of annotation initiatives that share some (minimal) common elements, such as (i) a common definition of the target phenomenon (i.e., what an event is); (ii) a similar annotation philosophy (i.e., how to annotate events); and (iii) neither semantic nor morpho-syntactic restrictions (i.e., what can realize an event) (Caselli & Sprugnoli, 2017). This makes ISO-TimeML compliant corpora suitable for investigating interoperability. In particular, in this work we investigate the interoperability of three such corpora in English, namely TempEval-3 (UzZaman et al., 2013), RED (O'Gorman et al., 2016), and MEANTIME (Minard et al., 2016).
In Table 1 we report an overview of the major English corpora available for event detection illustrating their domain, definition of event, and event classes. In Table 2, we illustrate the annotation layers available in each corpus.
In the following section, we describe in detail the properties and annotation characteristics of ISO-TimeML and of the three compliant corpora we have selected for our study on interoperability in the event detection task. We will first give an overview of the main characteristics of event annotation following the ISO-TimeML standard (Sect. 4.1). Then we show how each of the three corpora implements this standard and provide a detailed analysis of these corpora (Sect. 4.2). This serves to illustrate potential and unexpected mismatches between the annotation schemes and the actual guidelines that may affect interoperability. Finally, we provide an empirical analysis of the similarities and differences of the data distributions in these corpora to better assess their compatibility and their impact on portability and reusability (Sect. 4.3).

Event definition and annotation in ISO-TimeML
As Pustejovsky et al. (2010) highlight, ISO-TimeML distinguishes between an abstract syntax and a concrete syntax. The abstract syntax "specifies the elements making up the information in annotations, and how these elements may be combined to form complex annotation structures" (Pustejovsky et al., 2010, p. 394). The combinations of annotation structures are independent of any specific representation format. The specification of how to represent the annotation structures is, in turn, delegated to the concrete syntax. While XML is used to represent concrete ISO-TimeML annotations, any representation format that is faithful to the ISO-TimeML abstract syntax can be readily converted into the corresponding ISO-TimeML concrete syntax representation, i.e., XML. The abstract syntax is the key to the interoperability of annotations across different concrete specification formats: "[t]he fact that this semantics is associated with the abstract syntax, rather than with a particular concrete syntax, explains why all concrete representations of ISO-TimeML annotations are semantically equivalent" (Pustejovsky et al., 2010, p. 394).

ISO-TimeML defines an event as anything that happens or occurs, or a state in which something obtains or holds true. This definition describes what is commonly referred to as an eventuality (Bach, 1986). Event mentions are annotated using the tag EVENT. From a morpho-syntactic perspective, ISO-TimeML considers every possible realisation of an eventuality as valid, including verbs, nouns, adjectives, and (some) prepositional phrases. Every annotation of an event is also enriched with various attributes specifying the class, the tense, the grammatical aspect, the polarity (negative or positive), the presence of modal operators, and the cardinality.
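In the XML concrete syntax, an annotated event mention looks roughly like the fragment below. The sentence and attribute values are illustrative, not drawn from any of the corpora under discussion; the attribute names follow the EVENT attributes listed above. A minimal sketch of reading such a fragment:

```python
import xml.etree.ElementTree as ET

# Hypothetical ISO-TimeML-style fragment; sentence and values are illustrative.
fragment = ('<s>The company <EVENT eid="e1" class="OCCURRENCE" tense="PAST" '
            'aspect="NONE" polarity="POS">bought</EVENT> a rival.</s>')

root = ET.fromstring(fragment)
# Collect each annotated event mention with a few of its attributes.
events = [(e.get("eid"), e.text, e.get("class"), e.get("polarity"))
          for e in root.iter("EVENT")]
print(events)  # [('e1', 'bought', 'OCCURRENCE', 'POS')]
```

Because all concrete representations are semantically equivalent, the same annotation structure could equally be serialised in another format and mapped back to this XML form.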
ISO-TimeML also presents core normative instructions on how to annotate events. In particular, the identification of the textual extent of an event is mainly syntax driven and based on the notion of minimal chunk, i.e., the most meaningful component of a phrase mentioning an event. Higher constituents (e.g., a verb phrase) are discarded to avoid nesting of multiple events. In practical terms, the minimal chunk approach means that only a single token of an event mention is annotated, as the following examples demonstrate 12 :
In this section we review the schemes and the associated guidelines for event annotation in these three corpora. We will first have a close look at their definition of event. Then we discuss how events are annotated, and conclude by illustrating the composition of the corpora.
Event definition As already stated, ISO-TimeML adopts a broad definition of event corresponding to the notion of eventuality. As illustrated by Table 1, the corpora follow the same definition, whereby all eventualities are eligible for annotation. Furthermore, the corpora do not filter events using a predefined ontology or set of classes, as in ACE. From this perspective, the corpora are perfectly interoperable, given that they have a shared vocabulary for defining an event. The sharing of vocabulary goes even further: TE3 and RED adopt the same tag (EVENT) to mark event mentions, while MNT uses a slightly different name (EVENT_MENTION). The different naming in MNT is needed to distinguish between event mentions and co-referential event instances (see Tonelli et al. (2014) for details).
Event annotation The annotation guidelines of the corpora follow the same philosophy of adherence to the surface structure of a document.
Unexpected differences emerge in the restrictions on what should not be marked as an event. TE3 excludes generic events from the annotation (see example 4). 14 RED, on the other hand, introduces a restriction concerning the "place upon a timeline" 15 of events. As a result, verbs and nouns expressing the grammatical encoding of relationships (see example 5) or epistemic status (see example 6) are not annotated as events. 16 RED and MNT also allow the annotation of pronouns as events when they are co-referential with an event antecedent (see example 7 17).

13 We used only the gold annotated portion.
14 The example is taken from Saurí et al. (2006).
When it comes to the textual extent of the event tags, all guidelines apply the notion of minimal chunk. However, additional differences emerge at this level. MNT has a more flexible application of this rule allowing the identification of multi-token events. This is restricted to very limited cases corresponding to phrasal verbs, idioms, and prepositional phrases if attested as a single entry in a reference dictionary. On the other hand, RED allows multi-token events only when they correspond to named events (e.g., World War II).
These differences introduce an issue concerning the compatibility of the corpora. While interoperability as shared vocabulary and meaningful and accurate exchange of information is preserved, the specific implementations of the annotation schemes (i.e., the guidelines) have introduced changes that may prevent full portability and reusability.

Table 3 illustrates the distribution of annotated event tokens across the three corpora. Here we show the number of annotated events in the train, development, and test splits for each corpus, and give the distribution over the parts of speech that occur in the corpora: verb, noun, adjective, pronoun, and multi-token types. We have kept the official splits into train, dev(elopment), and test for RED only. For TE3, we have created a development section by using all test documents from the TempEval-2 evaluation campaign. No changes were made to the official TE3 test data. The MNT corpus does not have an official split, and we created one for this study. Development splits are used to optimize the training of the models. TE3 and RED annotated full documents, while MNT only considered the first five sentences of each document.
Unsurprisingly, most of the annotated events correspond to verbs. This is in line with studies in lexical semantics that identify verbs as the prototypical part of speech for the realization of events (Lyons, 1977). Nouns are the second most frequent part of speech among annotated events. The number of event nouns is higher in RED than in TE3 and MNT: around 16% of all nouns in RED are marked as events, against approximately 8% in TE3 and 12% in MNT. A similar difference affects adjectives. These differences are neither expected nor foreseen, considering the definition of event, the annotation scheme, and the annotation guidelines. They appear to be idiosyncrasies of each corpus, which may result from a combination of factors such as the topics covered and the impact of different annotators. Although they do not affect the interoperability of the corpora, these variations in the distribution of the annotations may impact the portability of the systems and the reusability of the data. In particular, it could be the case that the use of TE3 data results in the weakest portable models on RED and MNT because of the annotation differences and event POS distributions (see Table 3), regardless of the fact that TE3 is the largest annotated corpus.

Table 4 presents a summary of the raw data of the three corpora. TE3 is the largest manually annotated corpus for event detection, in size at both document and token levels. It is also the corpus that covers the longest time period (i.e., 24 years). This contrasts with RED and MNT, whose sizes are smaller, corresponding, respectively, to around 74% and around 12% of the tokens of TE3. They also span a shorter, largely overlapping time period of about five years. A further notable difference concerns event density with respect to the total number of tokens. MNT is the most densely annotated corpus, with almost 30% event tokens.

17 Example 7 has been taken from Tonelli et al. (2014).
On the other hand, TE3 and RED have similar proportions, around 11%. Although MNT limits the annotation to the first five sentences of each document (including the title), the higher density of events appears to be a peculiarity of the documents composing the corpus.

Composition of the corpora
Looking at the sources of the documents, TE3 contains only news articles from established news outlets (e.g., the Wall Street Journal, the BBC, the New York Times, and the Associated Press, among others). MNT contains news articles written by volunteers on an online platform, Wikinews. Finally, RED is the only corpus that offers different text types, combining news articles and posts from online forums. Overall, there is a large overlap across the corpora in terms of text types: 93% of all documents are news articles. Nevertheless, the uneven distribution of documents over time raises additional concerns about the compatibility of the corpora and its impact on the reusability of the data.

Compatibility of data distributions
The surface-level analysis presented in Sect. 4.2 indicates that, broadly speaking, the three corpora belong to the same general domain. However, additional differences due to a variety of factors may affect the data distributions and consequently could impact the portability of systems. For instance, one of the corpora may be skewed toward a set of topics, limiting the portability of a trained model to another corpus without violating interoperability. To better assess the impact of such factors, we analyse the corpora for their similarity and diversity (Plank & Van Noord, 2011; Liakata et al., 2012; Ruder & Plank, 2017).

Similarity and diversity Following previous work (Ruder & Plank, 2017), we investigate these aspects by means of two metrics: the Jensen-Shannon (JS) divergence and the Out-of-Vocabulary (OOV) rate.
The JS divergence assesses the similarity between two probability distributions, q and r. We follow the JS implementation in Ruder and Plank (2017), where the Kullback-Leibler (KL) divergence of each distribution against their average is computed and the two terms are averaged:

$$JS(q, r) = \frac{1}{2} KL(q \parallel m) + \frac{1}{2} KL(r \parallel m), \quad m = \frac{1}{2}(q + r)$$

The JS divergence has been computed using the count of each token normalized over the entire vocabulary. The OOV rate, on the other hand, helps in assessing the differences between the corpora, as it highlights the percentage of unknown tokens. Table 5 illustrates the JS divergence between train and test splits, while Table 6 shows the normalized distributions of OOV rates.
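The two corpus-comparison metrics can be sketched as follows. This is a minimal illustration of JS divergence over normalized token counts and OOV rate, not the authors' implementation; function names and the natural-log base are assumptions.

```python
# Sketch of the two similarity metrics used to compare corpora:
# Jensen-Shannon divergence over normalized token counts and OOV rate.
import math
from collections import Counter

def token_distribution(tokens, vocab):
    """Normalized token counts over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); skips zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)

def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def oov_rate(train_tokens, test_tokens):
    """Percentage of test tokens whose type is unseen in the train split."""
    train_vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in train_vocab)
    return 100 * unseen / len(test_tokens)
```

With natural logarithms, the JS divergence is bounded by ln 2, and identical distributions score 0, which makes cross-corpus scores directly comparable.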
When looking at the JS divergence, MNT test is the least similar to TE3 train and RED train (Table 5). The figures for the OOV rates complement the analysis. MNT is the most homogeneous corpus between train and test splits (only 2.16% of OOV tokens) and a potentially challenging test set for TE3 and RED: despite its limited size, MNT test has OOV rates of 26.15% and 35.90% with respect to the TE3 train and RED train distributions, respectively. At the same time, we observe that more than 60% of the tokens in TE3 test and RED test are not present in MNT train.
These figures suggest that, regardless of the differences in size between TE3 train and RED train with respect to MNT, systems trained on TE3 train and RED train may struggle to achieve performances on MNT test comparable to those of a system trained on MNT train. Furthermore, the OOV rates and JS scores suggest that a system trained on TE3 train would obtain better results than a system trained on RED train when tested on MNT test.
The comparison between TE3 and RED, on the other hand, tells a different story. In general, both corpora seem to occupy a relatively homogeneous space given their JS scores (see Table 5). In particular, the differences between the in-distribution splits and the out-of-distribution ones are not very large. The OOV rate, on the other hand, suggests that a system trained on TE3 train should be more portable than one trained on RED train .
The main takeaways from this overview can be summarized as follows:
- semantic interoperability can be preserved at an abstract level but disregarded in the actual realization of a corpus;
- corpora may be interoperable but not necessarily compatible, even if covering the same broad domain;
- TE3, RED, and MNT present differences in the distribution of the annotations for events that were unexpected on the basis of formal checks of their respective annotation schemes and guidelines.
The fact that TE3, RED, and MNT adopt the same definition of event and cover the same types of texts offers an optimal setting to investigate the portability of systems (Sect. 5) and the reusability of their annotations to create more robust systems (Sect. 6). The differences we have observed will help us formulate expectations on the performance of systems, their portability, and the reuse of the annotated data.

Portability of systems
Previous work has investigated portability and robustness in terms of system performance under a distribution shift (Novielli et al., 2018;Hendrycks et al., 2020;Wang et al., 2022). Under this perspective, portability is closer to the notion of domain abstraction or generalization to unforeseen distribution shifts (Muandet et al., 2013;Hendrycks et al., 2020). One of the assumptions is that performance should not degrade due to (minor) differences in the data (e.g., new domain, grammatical errors, speakers' dialect) (Wang et al., 2022). The focus is mainly on the portability of a system's architecture, backgrounding the differences in the data. However, assessing the compatibility of the data composing the corpora is a necessary requirement to study portability (Alex et al., 2006;van Erp et al., 2016;van Son et al., 2018). In Sect. 4.2, we have identified critical issues concerning the compatibility of the three corpora mainly in terms of their annotation. In Sect. 4.3 we have highlighted similarities and differences in the data distributions and formulated expectations on the behavior of systems. As a matter of fact, data from the same domain (e.g., news) may differ due to a variety of factors, some of which are openly identifiable (e.g., time of publication, topics, writing styles) and others yet to be enumerated (Plank, 2016;Ramponi & Plank, 2020).
To properly investigate the portability of systems, it is important to first assess the loss due to data distributions, i.e., the expected loss. To do so, we apply methods developed to predict the performance drop of NLP systems in the presence of domain shift (Elsahar & Gallé, 2019) to the raw data of our corpora (Sect. 5.1). This yields an expected loss in system performance across the corpora due to intrinsic differences in the data rather than to the human annotations. Only after this threshold has been identified can the portability of the systems be assessed (Sect. 5.2). In light of the previous analysis, we expect that the differences in annotation and event part-of-speech distributions may negatively affect portability, despite the corpora being semantically interoperable. This means that we may register losses in performance higher than the expected loss.
For the purpose of our experiments on event trigger detection, portability, and reusability, we deemed it sufficient to use a state-of-the-art method based on a Bi-LSTM network with a CRF classifier as the last layer (Reimers & Gurevych, 2017b), rather than more complex architectures such as pre-trained Language Models (PLMs). We do not fine-tune the hyper-parameters, but follow the suggestions in Reimers and Gurevych (2017a, 2017b) for sequence labeling tasks. The network is composed of two LSTM layers of 100 units each, trained with the Nadam optimizer. Variational dropout is applied, together with gradient normalization (τ = 1) and a batch size of 8. We train the models for a maximum of 30 epochs, with early stopping after 5 consecutive epochs with no improvement. Pre-trained word embeddings from Komninos and Manandhar (2016) are used to initialize the network and are concatenated with character-level features (Ma & Hovy, 2016). Notice that in all experiments the development sets used during training to optimize the losses are always from the same distributions as the train splits. Using an out-of-distribution development set is already a form of domain adaptation, an aspect which is out of the scope of this contribution.
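The training schedule described above (at most 30 epochs, early stopping after 5 consecutive epochs without improvement on the in-distribution development set) can be sketched as follows; all function and variable names are illustrative, not taken from the authors' code.

```python
# Illustrative sketch of the training schedule described in the text:
# at most 30 epochs, early stopping after 5 epochs with no improvement
# on the development set. All names are hypothetical.
def train_with_early_stopping(train_epoch, evaluate_dev,
                              max_epochs=30, patience=5):
    """Run training epochs, keeping the best dev score seen so far."""
    best_score, epochs_without_improvement = float("-inf"), 0
    for epoch in range(max_epochs):
        train_epoch(epoch)            # one pass over the train split
        score = evaluate_dev(epoch)   # e.g., F1 on the dev split
        if score > best_score:
            best_score, epochs_without_improvement = score, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                 # stop: 5 epochs with no gain
    return best_score
```

Because the development set always comes from the same distribution as the train split, this stopping criterion optimizes in-distribution performance only, which is consistent with the paper's decision to exclude any form of domain adaptation.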

Assessing the impact of data distributions
Differences in data distributions are known causes of performance drops. To assess their impact and identify an expected performance loss imputable to differences in the data rather than differences in the annotations, we run a series of experiments inspired by the idea of reverse classification accuracy (RCA) (Fan & Davidson, 2006; Zhong et al., 2010). RCA provides us with a quantitative measure of the expected loss, directly comparable against the performance of the systems. RCA uses a classifier trained on a source dataset, or domain, to label the data of a different target dataset, or domain. The newly annotated target dataset is then used to train a new model, which is evaluated against a held-out portion of the source dataset. Originally, RCA was proposed as a strategy to select the best performing model or dataset for a given target domain.
Following Elsahar and Gallé (2019), we frame RCA as follows: we train one classifier using a source corpus (e.g., TE3 train). We then use this classifier to annotate the train splits of the other two target corpora (e.g., RED train and MNT train). We further train the same system architecture with the newly annotated train portions, obtaining two new classifiers. Finally, we test the performance of all systems against the test split of the source corpus (e.g., TE3 test). The differences in performance that we observe reflect the potential performance loss due to the differences in the intrinsic language varieties of the corpora. The results of these experiments are illustrated in Table 7.
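The RCA pipeline above (train on source, re-annotate a raw target corpus, retrain, evaluate both on the source test split) can be sketched with a toy classifier standing in for the Bi-LSTM-CRF; the per-token majority-label classifier and the data format are assumptions made only to keep the sketch self-contained.

```python
# Sketch of the reverse classification accuracy (RCA) setup used to
# estimate the expected loss. The toy classifier (majority label per
# token type) is a stand-in for the Bi-LSTM-CRF, not the real model.
from collections import Counter, defaultdict

def train_classifier(labelled):
    """Majority label per token type; global majority label as fallback."""
    per_token, overall = defaultdict(Counter), Counter()
    for token, label in labelled:
        per_token[token][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]
    return lambda tok: (per_token[tok].most_common(1)[0][0]
                        if tok in per_token else fallback)

def rca_expected_loss(source_train, source_test, target_raw):
    """Train on source; re-annotate target; retrain; compare on source test."""
    clf_source = train_classifier(source_train)
    # Step 1: annotate the raw target corpus with the source classifier.
    relabelled = [(tok, clf_source(tok)) for tok in target_raw]
    # Step 2: train a new classifier on the re-annotated target data.
    clf_target = train_classifier(relabelled)
    # Step 3: evaluate both classifiers on the same source test split.
    def accuracy(clf):
        return sum(clf(t) == y for t, y in source_test) / len(source_test)
    return accuracy(clf_source) - accuracy(clf_target)
```

The returned difference isolates the loss attributable to the raw target data itself: both models share the same architecture and the same (automatically produced) annotation scheme, so annotation differences play no role.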
In all settings, we observe large drops between the in-distribution and the out-of-distribution splits, as appears by comparing the results on the same test set along every row of Table 7. MNT train is the split that suffers most when used to make predictions on TE3 test (F1 .805, −.015 points) and RED test (F1 .750, −.137 points). In line with the JS scores, MNT train is quite dissimilar from TE3 test and RED test and presents the largest OOV rates. We observe that using TE3 train as a re-annotated corpus for training new systems produces the lowest losses. We confirm that RED has issues when applied to MNT test, despite being more homogeneous with TE3 on the basis of the JS score and OOV rate. Interestingly, the expected loss has a strong negative correlation (Spearman's ρ = −.885, p < .05) with the OOV rates. On the other hand, the correlation with the JS scores is moderately negative (Spearman's ρ = −.552) but not significant (p > .05). Although these results cannot be considered definitive, the correlation outcome indicates that the OOV rate appears to be a stronger predictor of expected loss than the JS divergence.
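For small samples like these, Spearman's ρ can be computed directly from rank differences. The sketch below assumes no tied values, so the simplified closed-form formula applies; in practice a library routine that also handles ties and reports a p-value would be used.

```python
# Sketch of Spearman's rank correlation, as used to relate expected loss
# to OOV rates and JS scores. Assumes no ties, so the simplified formula
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) applies.
def ranks(values):
    """1-based rank of each value (largest value gets the largest rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```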
While the corpora are compatible for the task (i.e., event detection) and text types, we have identified an expected loss due to different linguistic properties of the corpora and mainly due to differences in their vocabulary, as supported by the correlation between OOV rates and cross-corpora results. These differences in performance can be used as lower thresholds, or expected minimal losses, when testing for the portability of systems.

Experiments and results
Our initial working hypothesis was that systems trained on interoperable resources (on the same domain) should minimise their loss when applied across interoperable (test) distributions. We have identified intrinsic differences across the corpora due to their data distributions. In addition, we have observed that, while interoperability is preserved, there are compatibility issues related to the actual annotations of the data. This finding has a non-negligible impact, since it may be the cause of extra losses in performance. The combination of all these elements forces us to reformulate our expectations in a more homogeneous and uniform way, prioritizing the annotation compatibility issues. This means that for portability we may expect the following types of behavior:
- annotation differences will trigger higher losses than the expected ones;
- systems trained with TE3 train will be the weakest portable ones because their annotations are the least compatible with RED and MNT;
- despite the differences in JS score and OOV, systems trained using RED train will be more portable on MNT test because their distribution of event parts of speech is closer to it than that of TE3 train.

(Table caption: Columns represent the raw text corpus used to (re)train the systems using the three different annotation schemes. Each row reports the results on the same test split (e.g., TE3 test) of three different systems, each trained with the same annotation scheme (e.g., TE3) but different raw text data (columns TE3, RED, and MNT). All scores are averaged over 5 different runs; standard deviations are reported in subscripts.)
The figures in Table 8 are in line with our expectations, offering new insights on the interoperability and portability of event corpora: all losses across the test distributions are bigger than the expected ones. While the losses in Table 7 range from 0.7% (RED train − TE3 test) to a maximum of 15.44% (MNT train − RED test), here we face losses ranging from 14.02% (RED train − TE3 test) to 27.84% (MNT train − RED test). A more detailed analysis shows that RED test is challenging for systems trained on TE3 and MNT, while TE3 test seems to be the "easiest". A further peculiar behavior concerns the differences across out-of-distribution data: systems trained using RED train and MNT train consistently tend to maximize recall, while those trained with TE3 train maximize precision.
These results appear to be in line with our revised expectations: differences in annotations have a stronger impact on portability than data distributions. TE3 systems suffer, as expected, when applied both to RED test and MNT test. The lower recall scores clearly point to issues in the ability of the systems to correctly identify events. In spite of having a larger amount of annotated data, the systems perform poorly. Applying such systems to new data distributions would result in a weak event annotation (many positive cases would be missed). RED systems appear to overgeneralize the identification of events on out-of-distribution data, as indicated by the high recall. On closer inspection, we observe that on MNT test these systems behave as expected, being more portable than those trained on TE3 train. As for MNT, systems are expected to suffer when applied to out-of-distribution data because of the limited size of the training data and the OOV rates. At the same time, the distribution of the annotated data indicates a behavior in line with RED systems, as shown by the results on TE3 test and RED test.
The outcome of this first round of experiments clearly points out that interoperability does not necessarily entail portability of systems. The raw data indicate that each corpus is a specific variety of English and that differences in data distribution appear to have a minimal impact on system performance when the annotation scheme does not change (see Table 7). On the other hand, while interoperability can be preserved, unexpected differences in the distribution of the annotations play a major role in the portability of current neural models.

Reusability of annotations
This section investigates to what extent semantically interoperable corpora of the kind we are dealing with can be reused. The focus here is on the direct reuse of the annotated data. By using an increased amount of training data, systems would gain access to a larger and more diversified set of examples of contexts where different lexical items can realise events. This should result in better and more portable systems (Wang et al., 2022). Notice that, in comparison to previous work, interoperability does not require domain adaptation (Daumé III, 2007; Kim et al., 2016; Ruder & Plank, 2017). As shown in Sect. 4, the corpora share common characteristics that qualify them as belonging to the same domain. Additionally, we are not merging data originally annotated for different phenomena into a unified representation (Karan & Šnajder, 2018). This setting directly tests the assumption of reusability of annotated data promoted by semantic interoperability. The experiments in this section also aim at providing a different analysis of the portability of models. Following solutions proposed in previous work (Daumé III, 2007; Niu et al., 2009), the simplest and most immediate method to check the reusability of annotations is to concatenate the data and train a system. Table 9 reports the results of such an experimental setting, where different concatenations of the train sets of the three corpora have been used and the resulting systems tested on every test distribution. For all experiments we used the same system architecture, a Bi-LSTM with a CRF, as described in Sect. 5.1. To facilitate an easy comparison of the results, we repeat the scores reported in Table 8 at the bottom of Table 9.
The outcome of this set of experiments is different from our expectations on the use of semantically interoperable corpora. At the same time, these results are in line with our insights on the data distributions and the differences in annotation. Focusing on the reusability of the data only, the picture is better than that emerging from Table 8. In general, the combination of data annotated for the same language phenomenon has a positive impact when applied to out-of-distribution tests. Not surprisingly, the largest improvements concern combinations of training materials that are subsequently applied to an in-distribution test. For instance, when TE3+RED is applied to TE3 test, the results are better than those of systems trained solely on RED train or MNT train.
This second battery of experiments, however, confirms some of the limits of interoperability as realized by these corpora. While concatenating data has a positive effect (also for the portability of systems), the differences in the annotations prevent systems from improving when compared to the in-distribution test splits. Hence, the results of our experiments question the reusability of these corpora as a strategy to improve on in-distribution results.
From a more general perspective, our experiments consistently indicate that the promise of interoperability has limits and requires a more careful formulation, at least along the dimensions we are investigating with these corpora. As already stated, the three corpora are semantically interoperable: they share a common vocabulary and definitions of events, and they have a similar annotation philosophy. Nevertheless, it appears that interoperability, in its dimension of independent schemes compliant to a standard, does not entail reusability. While the combination of these corpora can enrich each of them, as suggested in Table 9, and can provide a large collection of data for the linguistic investigation of events in English, it only marginally contributes to the creation of more robust models.

(Caption of Table 9: All scores are averaged over 5 different runs; standard deviations are reported in subscripts. Best scores per test set are in bold; second best scores are in italics.)

Exploring the impact of annotation compatibility
Our analyses have shown that corpora can be semantically interoperable and yet they may fail to fulfill expectations concerning portability and reusability. We have identified that one important factor is represented by differences in the actual annotations of the data. In this section we will perform an error analysis on the different systems as a proxy to shed light on the actual annotations of the data and their impact for portability and reusability.
For each corpus, we use the best system and apply it to the development splits for both in-and out-of-distribution settings. We exclude the concatenated models, e.g., TE3+RED, because they would prevent the understanding of the specific features of each composing corpus (TE3, RED, and MNT) and their impact on reusability.
We focus the analysis on the errors per part of speech, as they are the most relevant to understand the behaviors of the different systems in light of the data presented in Table 3 in Sect. 4.2. In the case of the false positives (FPs), the TE3 system tends to label verbs rather than nouns as events: 62.22% of all FPs on TE3 dev are verbs, followed by nouns (32.59%), adjectives (2.96%), and other parts of speech (2.22%). The opposite can be observed for the system trained on RED, where 61.90% of the FPs are nouns, while verbs account for only 32.53%. Adjectives and other POS, on the other hand, are comparable with TE3. The MNT system, finally, distributes its FPs over adjectives (66.66%) and other parts of speech (33.33%) only. However, the limited amount of training data and the high similarity between the train and test distributions make this quite a special corpus for the analysis of in-distribution errors. The analysis of the false negatives (FNs) is more in line with previous error analyses of systems for event detection (Mirza & Tonelli, 2014; Caselli & Morante, 2018). All systems struggle to correctly identify event nouns, although with different proportions: 59.52% for TE3, 46% for RED, and 50% for MNT.
The analysis of the errors on the out-of-distribution data is complementary and helps to better understand the differences in precision and recall. Regardless of the target corpus, the RED system tends to predict nouns as events: 66.66% of FPs on TE3 dev and 78.12% of FPs on MNT dev. The same system fails more often to detect verbal events: 73.33% of FNs on TE3 dev and 44.89% on MNT dev. When systems trained on TE3 and MNT are applied to RED dev, the distributions of the errors are swapped: the majority of FPs are verbs (73.48% for TE3 and 56.15% for MNT, respectively), while the majority of FNs are nouns (63.48% for TE3 and 62.63% for MNT).
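The per-POS error breakdown reported above reduces to a simple count over aligned gold and predicted event labels. The record format below is an assumption introduced for illustration only.

```python
# Sketch of the per-part-of-speech error analysis: percentage of false
# positives and false negatives per POS tag, from aligned gold/predicted
# event labels. The (token, pos, gold, pred) record format is assumed.
from collections import Counter

def errors_by_pos(records):
    """records: iterable of (token, pos, gold_is_event, pred_is_event)."""
    fps, fns = Counter(), Counter()
    for _token, pos, gold, pred in records:
        if pred and not gold:
            fps[pos] += 1          # false positive: spurious event
        elif gold and not pred:
            fns[pos] += 1          # false negative: missed event
    def percentages(counter):
        total = sum(counter.values())
        return {pos: 100 * n / total for pos, n in counter.items()}
    return percentages(fps), percentages(fns)
```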
The behavior of the systems is a direct consequence of the distribution of the annotations in the respective train corpora. Although smaller in size, the differences in the amount of event nouns and verbs in RED train are statistically significant (chi-square test, p < .05) when compared to TE3 train and MNT train . We thus explore the performance of the best trained models per corpus only on events realised by verbs and nouns. The results of this extra evaluation setting will help to better understand whether errors can be restricted to differences in the annotations of events. Figures are reported in Table 10.
In absolute terms, identifying nominal events appears to be a more challenging task than identifying verbal events. This difference is best understood when framed within the notion of a continuum of eventiveness of lexical items (Araki, 2018) and adopting recent theoretical frameworks of word classes as continua (Simone, 2000; Sasse, 2001; Simone, 2003). Nevertheless, performance issues are still at play in the out-of-distribution setting, with RED test being the most challenging. When looking at the performance drops of the systems, none of them can actually be considered minimal or in line with the expected loss due to language variety. However, it appears that the portability of systems (and potentially of annotations) is subject to specific parts of speech.
On the basis of the annotation scheme and guidelines, the impact of event part of speech on the systems' performance is not expected. Nowhere in the annotation guidelines of each corpus is it stated that event nouns are to be treated differently. This raises questions about the annotation process and the quality of the corpora. Previous work (Derczynski & Gaizauskas, 2010) identified inconsistencies in the annotation of the TimeBank corpus (Pustejovsky et al., 2003) that were supposed to be addressed for the release of the TE3 corpus (UzZaman et al., 2013). Another contribution (Inel & Aroyo, 2019) investigated the consistency and completeness of TE3 for the annotation of events and time expressions. Their analysis shows that the TE3 corpus still contains annotation inconsistencies, such as unannotated sentences and missed event tokens.
We assess the differences in event annotations using the average of the ratio between the frequency with which each type is annotated as an event, $freq(ev_{w_i})$, and its overall frequency (regardless of the event annotation), $freq(w_i)$, in the corpus distribution C. We call this measure average event ambiguity (AEA), and it can be expressed as follows:

$$AEA(C) = \frac{1}{|V_{ev}|} \sum_{w_i \in V_{ev}} \frac{freq(ev_{w_i})}{freq(w_i)}$$

where $V_{ev}$ is the set of types annotated at least once as an event in C. AEA can have multiple functions. First, it can be used to assess the diversity of "eventiveness" of different tokens. Consider cases of deverbal nouns like "building" or complex types (Pustejovsky, 1995) like "eulogy": for these nouns, not all occurrences give rise to eventive readings. Second, it gives an estimate of the density of events, i.e., how many times a type is marked as an event with respect to all of its occurrences. Third, it can facilitate the identification of potential inconsistencies in the manual annotation. In general, the nearer AEA is to 1.0, the lower the variety of eventiveness, the higher the density, and the fewer the inconsistencies. We report AEA scores in Table 11 for the train splits of the corpora, for verbs and nouns.
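The AEA measure defined above can be computed from a token-level annotation with a single pass; the (token, is_event) input format is an assumption made to keep the sketch self-contained.

```python
# Sketch of the average event ambiguity (AEA) measure: the mean, over
# all types annotated at least once as an event, of the ratio between
# a type's event-annotated frequency and its overall frequency.
from collections import Counter

def average_event_ambiguity(tokens_with_labels):
    """tokens_with_labels: iterable of (token, is_event) pairs."""
    total, as_event = Counter(), Counter()
    for token, is_event in tokens_with_labels:
        total[token] += 1
        if is_event:
            as_event[token] += 1
    # Average only over types annotated as an event at least once.
    ratios = [as_event[w] / total[w] for w in as_event]
    return sum(ratios) / len(ratios)
```

For example, a type always annotated as an event (e.g., "attack") contributes 1.0, while a deverbal noun annotated as an event in only half of its occurrences (e.g., "building") contributes 0.5, pulling the corpus AEA down.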
The outcome of AEA further confirms that there are remarkable differences in the annotations of event nouns across the three corpora. This helps to explain the unexpected precision and recall results for the trained systems when applied out-of-distribution. The boost in recall we observe for the RED and MNT systems is due to a more consistent annotation of nominal events. Furthermore, we can safely claim that the lower performance of MNT is mainly due to size rather than to the annotation of the data.
As a final check, we manually annotated a random portion of 20% of the FP nouns predicted by RED on TE3 test using the TimeML v1.2 annotation guidelines. It turned out that 54.40% of them could reliably be annotated as instances of events.
Despite their being semantically interoperable, the differences in the actual annotations of the data in the three corpora limit the potentially positive effects of using interoperable corpora.

Conclusion and future directions
In this paper we have tested for the first time to what extent the promise of reusability promoted by interoperability of language resources holds when applied to semantically interoperable corpora for events. In particular, we have investigated a dimension of interoperability where resources share a common vocabulary based on a standard but implement independent schemes and guidelines.
We controlled for factors that may negatively influence the outcome of such an analysis, namely differences in the text types and language varieties of the corpora involved. We have shown that the three corpora are composed of homogeneous text types (Sect. 4) and that differences in language variety can be used as expected lower-bound thresholds that negatively impact the performance of systems (Sect. 5.1).
We have conducted an extensive set of analyses and experiments showing the limits of interoperability based on a shared common vocabulary for portability and reusability. Differences in the annotations (Sect. 7) and in their quality (see Derczynski & Gaizauskas, 2010; Inel & Aroyo, 2019) appear to be the major factors that impact the reuse of interoperable resources. Annotation quality cannot be estimated only on the basis of annotation guidelines or reported IAA scores. The TimeBank corpus, from which TE3 is derived, reports high IAA scores for event annotation, and so do RED and MNT. We propose to use measures like AEA to explore, at a deeper level of analysis, potential differences in annotations that do not emerge from an analysis of the annotation schemes and guidelines.
Even complex system architectures such as neural networks are very sensitive to the composition of the training data, as illustrated by the unsatisfactory performance of systems trained on TE3 when reused on out-of-distribution splits. Interoperability, and consequently reusability, seems best achieved by applying the same annotation scheme (e.g., Universal Dependencies) rather than by creating independent standard-compliant schemes, as in our case.
A further aspect that cannot be ignored concerns the peculiarity of the event detection task. Events are complex entities that are at the heart of the syntax-semantic interface. The semantics of an event in some cases requires access to knowledge that may not be explicitly expressed in texts. This is a further factor that may have played a role in the outcome of our experiments. A natural follow-up question for future work would require the creation of new test sets on multiple and different data distributions against which all systems that have been developed in this work can be further tested. This may help to better understand the potential advantages of semantic interoperable resources.
In our closing remarks, we want to clarify our stance on resource interoperability. We believe that standardisation initiatives such as ISO-TimeML and UD represent valuable contributions to computational linguistics and natural language processing. They promote discussion and advancements in the analysis and documentation of language phenomena, because they create an environment that can be used by multiple communities, potentially triggering new contributions and cross-fertilization across disciplines. Being able to query, find, and observe the same information in different datasets (and potentially different languages) is useful to expand knowledge of language phenomena, refine theoretical frameworks, and help develop more robust systems. In the end, to make substantial progress in the field, we need more semantically interoperable resources of higher quality.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.