1 Introduction

Event extraction [1] aims at extracting structured event records from unstructured text. For example, as shown in Figure 1, the goal of event extraction is to map the document “Two homemade pressure-cooker bombs are detonated remotely by the Tsarnaevs near the finish line of the Boston Marathon, killing three and injuring some 260 others. Seventeen people lost limbs.” to structured records of four predefined event types (highlighted in celeste), such as <event type: Attack, trigger word: detonated, role Attacker: Tsarnaevs, \(\dots \), role ExplosiveDevice: bombs, role Place: Boston Marathon>, as well as other events triggered by the words killing and injuring.
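To make the target structure concrete, the following is a minimal sketch of the first record above as a Python dictionary; the field names mirror the event schema in Figure 1, and the representation is illustrative rather than the model's exact internal format.

```python
# Illustrative target record for the "detonated" trigger in Figure 1.
# Field names mirror the event schema; the exact internal format may differ.
attack_event = {
    "event_type": "Attack",
    "trigger": "detonated",
    "arguments": {
        "Attacker": "Tsarnaevs",
        "ExplosiveDevice": "bombs",
        "Place": "Boston Marathon",
    },
}
```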

Event extraction is challenging due to the diversity of natural language expressions and the complexity of event structures. These challenges are amplified in document-level event extraction where the text is a full document and typically contains more events. Currently, most event extraction methods employ a decomposition-based approach [2], which involves breaking down the structured prediction problem of a complex event into classifications of substructures like trigger detection, entity recognition, and argument classification. Many of these methods tackle the subproblems separately, which necessitates additional annotations for each stage [3].

Fig. 1

The event extraction task. In each event schema, we delineate the event type along with its associated roles. For instance, the "Attack" event schema includes roles such as "Attacker," "ExplosiveDevice," and "Place"

Natural language generation techniques have been successfully applied to a number of NLP tasks [4,5,6]. These techniques have inspired the use of controlled event generation to tackle event extraction. Such approaches use manually designed templates to wrap input sentences and train a model for cloze-style filling. The study in [7] proposes generating linearised event records via a pretrained encoder-decoder architecture combined with a constrained decoding mechanism, which alleviates the complexity of template combination when extracting multiple events. The advantage of the extraction-as-generation approach is that it removes the need for fine-grained token-level annotations, which are typically required by previous event extraction approaches [8], making it more practical.

Although generation-based approaches generalize well on other tasks, we observe a significant drop in performance when they are applied to document-level event extraction or unseen event types. Structured prediction tasks, such as event extraction, rely on an external schema to format the output, whereas natural language generation tasks do not. To bridge this gap, we introduce a novel technique called knowledge-based conditioning, which injects event type information as prefixes into different layers of the underlying pretrained language model. By incorporating this information, we aim to improve event extraction performance. Additionally, to address the challenge of adapting to new scenarios, we consider event extraction from the perspective of zero-shot learning [9, 10]. Our model, KC-GEE, is capable of document-level event extraction and generalizes to the zero-shot setting.

Our main contributions are as follows.

  • We propose a novel knowledge-based conditioning technique that injects event type information into the model, enabling zero-shot learning capability.

  • We carefully design a prefix-based injection mechanism that incorporates cross-attention to improve document-level event extraction.

  • We conducted extensive experiments on two benchmark datasets, in both fully supervised and zero-shot settings. Our evaluation consistently shows strong performance across all settings. In particular, our model achieves substantial superiority in the challenging settings of document-level event extraction and zero-shot transfer, outperforming state-of-the-art models by up to 5.4 absolute F1 points.

2 Related work

Document-level event extraction

Event extraction is a task that extracts structured event records from unstructured text [5]. Many approaches have been proposed for sentence-level event extraction [11, 12], ranging from hand-designed features [13] to neurally learned features [14, 15]. Yet, many real-world applications require document-level event extraction [14,15,16,17,18], in which the information of an event may be spread over multiple sentences [19, 20]. Moreover, most existing work adopts decomposition strategies for event extraction [2], which employ trigger detection [13], entity recognition [21, 22], and argument classification [23]. These decomposition strategies achieve high performance but introduce more detailed annotation requirements for model training [5, 7].

Zero-shot event extraction

Several previous supervised event extraction methods have relied on features derived from manual annotations, limiting their applicability to new event types without additional annotation effort [9, 24, 25]. These methods often struggle to generalize to new label taxonomies and domains. In contrast, [26] proposes a zero-shot event extraction approach: they first utilize existing tools, such as Semantic Role Labeling (SRL), to identify events and subsequently map them to a predefined taxonomy of event types without task-specific training data. Lyu et al. [27] explore zero-shot event extraction by formulating it as a series of Textual Entailment (TE) and/or Question Answering (QA) queries; for instance, they utilize pretrained TE/QA models for direct knowledge transfer, e.g., the statement "A city was attacked" entails "There is an attack." In this paper, we propose a novel approach for zero-shot event extraction by jointly training a prefix generator for event schemas. Our method is parameter-efficient and lightweight, allowing for effective event extraction even in scenarios with limited or no training data.

Fig. 2

A comparison of KC-GEE with other prompt-based generative methods, e.g., BART-Gen [5] and Degree [6]

Generative event extraction

Generative event extraction has emerged as a promising approach for automatically extracting event information from text with generative models. Motivated by the achievements of pretrained language models and the associated natural language generation-based approaches in diverse NLP tasks [4, 28,29,30,31,32], some researchers have approached event extraction as controlled event generation. As shown in Figure 2, [5, 6] are end-to-end conditional generation methods with manually designed discrete prompts for each event type, requiring more human effort to find the optimal prompt. To remove the complexity of template combination in extracting multiple events, [7] proposed a method that generates the event records directly using a pretrained encoder-decoder architecture and a constrained decoding mechanism. This extraction-as-generation approach does not require the fine-grained token-level annotations that previous event extraction methods typically need. Liu et al. [33] propose a generative template-based event extraction method that uses dynamic prefixes, integrating context information with type-specific prefixes to learn a context-specific prefix for each context. However, this method considers neither zero-shot extraction nor document-level extraction, both of which we address in this paper.

3 Generation-based event extraction

Problem definition

We denote \(\mathcal {E}\) and \(\mathcal {R}\) as the set of predefined event types and role categories, respectively. An input sequence \(\varvec{x} := \{x_1,\ldots ,x_{\mid {\varvec{x}}\mid }\}\) comprises tokens \(x_i\), where \(\mid {\varvec{x}}\mid \) denotes the sequence length. Given an input document, an event extraction model aims to extract one or more structured events, where each event is specified by (i) the event type \(e \in \mathcal {E}\) along with the trigger word t from the document, and (ii) the roles \(\mathcal {R}_e \subseteq \mathcal {R}\) along with their corresponding arguments from the document.

Event extraction as generation

Given \(\mathcal {E}\) and \(\mathcal {R}\) in the predefined event schema, generation-based event extraction models generate a structured sequence based on an input document that is constrained by the schema [7].

The generated sequence is a linearised representation of the events mentioned in the document. Specifically, given a document with token sequence \({\varvec{x}}\) as input, a generation-based extraction model, such as KC-GEE, outputs the linearised event representation \(\varvec{y} = \langle y_1, y_2, \dots , y_{\mid \varvec{y} \mid }\rangle \), in which each event is encoded as a subsequence \(\langle e_i, t_i, \langle r_{i,1}, a_{i,1}\rangle , \dots , \langle r_{i,\mid r_i \mid }, a_{i,\mid r_i \mid }\rangle \rangle \). The angled brackets \(\langle \cdot \rangle \) are special tokens indicating the sequence structure. Here, \(e_i \in \mathcal {E}\) and \(t_i\) are the event type and the trigger word (a subspan of the document \({\varvec{x}}\)), while \(r_{i,j} \in \mathcal {R}\) and \(a_{i,j}\) denote roles and their arguments (subspans of the document \({\varvec{x}}\)).
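As a concrete illustration, the sketch below linearises the Attack record from Figure 1 into the bracketed format described above; using plain "<" and ">" characters as the structure tokens is an assumption for readability, and the real model may reserve dedicated special tokens in its vocabulary.

```python
# Hypothetical linearisation of the Attack event from Figure 1.
def linearise(event_type, trigger, role_args):
    # One bracketed segment per (role, argument) pair, nested inside the event segment.
    parts = [f"< {role} {arg} >" for role, arg in role_args]
    return f"< {event_type} {trigger} " + " ".join(parts) + " >"

y = linearise("Attack", "detonated",
              [("Attacker", "Tsarnaevs"),
               ("ExplosiveDevice", "bombs"),
               ("Place", "Boston Marathon")])
# -> "< Attack detonated < Attacker Tsarnaevs > < ExplosiveDevice bombs > < Place Boston Marathon > >"
```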

Architecture

Our KC-GEE model adopts a Transformer-based encoder-decoder architecture for event structure generation. KC-GEE outputs the sequentialized event representation \(\varvec{y}\) for an input document \({\varvec{x}}\). First, it computes the hidden representation \(\textbf{H}_{\varvec{x}} = ({\varvec{h}}_1, {\varvec{h}}_2, \dots , {\varvec{h}}_{\mid {\varvec{x}} \mid }) \in \mathbb {R}^{\mid {\varvec{x}} \mid \times d}\) for each token in the document via a multi-layer Transformer encoder:

$$\begin{aligned} \textbf{H}_{\varvec{x}} = \text {Encoder}({\varvec{x}}), \end{aligned}$$
(1)

where each layer of \(\text {Encoder}(\cdot )\) is a Transformer block [34] with the multi-head self-attention mechanism.

Given the encoding \(\textbf{H}_{\varvec{x}}\), the decoder generates each token sequentially to produce the sequence of events. At step t, the Transformer-based decoder generates the token \(y_t\) and hidden state \({{\varvec{h}}}_t\) as:

$$\begin{aligned} y_t, {\varvec{h}}_t = \text {Decoder}(y_{t-1}; \textbf{H}_{\varvec{y}_{<t}}, \textbf{H}_{{\varvec{x}}}), \end{aligned}$$
(2)

where each layer of \(\text {Decoder}(\cdot )\) is a Transformer block, with both the self-attention to past hidden states \(\textbf{H}_{\varvec{y}_{<t}}\in \mathbb {R}^{(t-1)\times d}\) during decoding and the cross-attention to the encoding \(\textbf{H}_{{\varvec{x}}}\). The conditional probability of the output sequence \(p(\varvec{y}\mid {\varvec{x}})\) is then,

$$\begin{aligned} p_{\theta }(\varvec{y}\mid {\varvec{x}}) = \prod _{t=1}^{\mid \varvec{y} \mid }p_{\theta }(y_{t}\mid \varvec{y}_{<t},{\varvec{x}}), \end{aligned}$$
(3)

where \(\theta \) denotes the parameters of the Transformer-based encoder-decoder model.
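For concreteness, the following is a minimal sketch of this teacher-forced likelihood computation using an off-the-shelf encoder-decoder PLM from HuggingFace Transformers (T5-base, the backbone used in our experiments); the document and the linearised target string are illustrative.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

document = ("Two homemade pressure-cooker bombs are detonated remotely by the "
            "Tsarnaevs near the finish line of the Boston Marathon.")
target = "< Attack detonated < Attacker Tsarnaevs > < Place Boston Marathon > >"

enc = tokenizer(document, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Teacher-forced forward pass: the decoder self-attends to y_<t and
# cross-attends to the encoder states H_x, as in (1)-(3).
with torch.no_grad():
    out = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask,
                labels=labels)

# out.loss is the mean token-level negative log-likelihood,
# i.e. -(1/|y|) * sum_t log p_theta(y_t | y_<t, x).
print(out.loss.item())
```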

Fig. 3

A high-level illustration of three candidate knowledge-based conditioning injection paradigms for encoder-decoder models: fine-tuning, adapter-tuning, and prefix-tuning. For each tuning type, the gray blocks indicate the frozen parameters of a pretrained model, and the blue blocks indicate the trainable parameters

4 Knowledge-based conditioning in event generation

This paper investigates the best way to leverage pretrained language models (PLMs) as the backbone encoder-decoder model for event extraction. Using PLMs is now standard practice in NLP, as they lead to strong performance and generalisation.

Given a labeled training dataset \(\mathcal {D}\), we investigate the best way to specialise the PLM for the event extraction task via prefix-tuning [35]. In this section, we show how to effectively condition the generation process on the event extraction task as well as the given document.

One may specialise the underlying PLM to the event extraction task through other methods, such as fine-tuning the PLM parameters or injecting adapters to the encoder and/or decoder of the PLM (see Figure 3). Our experiments show that prefix-tuning is more effective than those methods.

Fig. 4

An illustration of our end-to-end framework KC-GEE, where the main architecture is a transformer-based encoder-decoder in the center. The lower blocks represent the conditioning construction modules for encoder and decoder, respectively. The upper blocks represent the conditioning injection modules for encoder and decoder, respectively

Our desiderata for prefix-conditioning of a PLM for event extraction are as follows: It should enable the model to be aware of (i) the candidate event schemas in the task, (ii) the specific input document, and (iii) flexible schema modifications that may occur after the model is trained in real-world settings. In what follows, we explain how we achieve these desiderata by producing prefixes for the encoder and the decoder based on the events of the task and the input document. Please refer to Figure 4 for an overview of the framework.

4.1 Encoder conditioning

We condition the encoder on the event types of the underlying event extraction task. Given the event types \(\varvec{e} = \{e_1, e_2, \dots , e_{\mid \varvec{e} \mid }\} \subseteq \mathcal {E}\) for a task, we use the encoder to obtain the encoding representation of the event types \(\textbf{H}_{\varvec{e}} \in \mathbb {R}^{\mid \varvec{e}\mid \times d}\). We then combine these event type representations through a function \(f_{enc}:\mathbb {R}^{\mid \varvec{e} \mid \times d}\mapsto \mathbb {R}^{d'}\) to create the event conditioning context, i.e.

$$\begin{aligned} \begin{aligned} \textbf{H}_{\varvec{e}} = \textrm{Encoder}(\varvec{e});\ \ {\varvec{h}}_{\varvec{e}, enc} = f_{enc}(\textbf{H}_{\varvec{e}}) \end{aligned} \end{aligned}$$
(4)

Since we assume each event type is equally probable a priori, we use the average pooling operator as \(f_{enc}\). The vector \({\varvec{h}}_{\varvec{e}, enc}\) is used by a prefix generation network \(g_{enc}\) to produce the prefix. As shown in Figure 4, the ± in \(f_{enc}(\cdot )\) indicates that event type representations can be flexibly added to or removed from the knowledge-based conditioning.
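A minimal sketch of this encoder conditioning step, assuming the event type encodings are stacked row-wise and that \(d' = d\) (i.e. no extra projection after pooling):

```python
import torch

def f_enc(H_e: torch.Tensor) -> torch.Tensor:
    """Average-pool event type encodings into one conditioning vector.

    H_e: (|e|, d) tensor of encoder representations, one row per event type.
    Because every event type is treated as equally probable a priori, adding
    or removing a type simply adds or removes a row before pooling (the ± in
    Figure 4).
    """
    return H_e.mean(dim=0)  # (d,)
```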

4.2 Decoder conditioning

We expect the representation of the input instance to help the downstream generation in the decoder. Hence, we use the representations of both the task and the input document to create a prefix for the decoder.

Specifically, let \(\textbf{H}_{{\varvec{x}}}\) denote the representation of the tokens of the input document \({\varvec{x}}\). We combine the document representation \(\textbf{H}_{\varvec{x}}\) and the task representation \(\textbf{H}_{\varvec{e}}\) through the function \(f_{dec}:\mathbb {R}^{\mid \varvec{e}\mid \times d} \times \mathbb {R}^{\mid {\varvec{x}}\mid \times d} \mapsto \mathbb {R}^{d'}\times \mathbb {R}^{d'}\) as follows,

$$\begin{aligned} {\varvec{h}}_{\varvec{e}, dec},{\varvec{h}}_{{\varvec{x}}, dec} = f_{dec}(\textbf{H}_{\varvec{e}}, \textbf{H}_{{\varvec{x}}}) \end{aligned}$$
(5)

where \(f_{dec}\) uses dot-product cross-attention, and \({\varvec{h}}_{\varvec{e}, dec} \in \mathbb {R}^{d'}\), \({\varvec{h}}_{{\varvec{x}}, dec} \in \mathbb {R}^{d'}\) are the resulting fixed-dimensional summary vectors for decoder conditioning.
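The sketch below gives one plausible instantiation of such a dot-product cross-attention pooling; it is an assumption for illustration (here \(d' = d\), and a linear projection to \(d'\) could be added), not necessarily the exact design of \(f_{dec}\).

```python
import torch

def f_dec(H_e: torch.Tensor, H_x: torch.Tensor):
    """Pool event type and document encodings into two summary vectors.

    H_e: (|e|, d) event type encodings; H_x: (|x|, d) document token encodings.
    """
    d = H_e.size(-1)
    scores = H_e @ H_x.transpose(0, 1) / d ** 0.5   # (|e|, |x|) token relevance per event type
    attn = scores.softmax(dim=-1)
    H_e_ctx = attn @ H_x                            # (|e|, d) document-aware event type reps
    h_e_dec = H_e_ctx.mean(dim=0)                   # (d,) event-side summary
    h_x_dec = attn.mean(dim=0) @ H_x                # (d,) document summary weighted by event relevance
    return h_e_dec, h_x_dec
```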

4.3 Prefix generation

We create the encoder prefix \(\textbf{Z}_{enc}\) and decoder prefix \(\textbf{Z}_{dec}\) as follows,

$$\begin{aligned} \begin{aligned} \textbf{Z}_{enc}&= g_{enc}({\varvec{h}}_{\varvec{e}, enc}) \\ \textbf{Z}_{dec}&= g_{dec}([{\varvec{h}}_{\varvec{e}, dec};{\varvec{h}}_{{\varvec{x}}, dec}]) \end{aligned} \end{aligned}$$
(6)

where \(g_{enc}:\mathbb {R}^{d'}\mapsto \mathbb {R}^{k\times \mid \textbf{H}_i\mid }\) and \(g_{dec}:\mathbb {R}^{2d'}\mapsto \mathbb {R}^{k\times \mid \textbf{H}_i\mid }\) are prefix generation networks, in which k is the length of the injected prefix and \(\mid \textbf{H}_i \mid \) is the number of parameters of the i-th injected prefix maintained in the Transformer architecture. With the injection of \(\textbf{Z}_{enc}\) and \(\textbf{Z}_{dec}\), the encoder and the decoder in (1) and (2) are modified as follows:

$$\begin{aligned} \textbf{H}_{{\varvec{x}}}&= \text {Encoder}({\varvec{x}};\textbf{Z}_{enc}) \end{aligned}$$
(7)
$$\begin{aligned} y_t, {\varvec{h}}_t&= \text {Decoder}(y_{t-1}; \textbf{H}_{\varvec{y}_{<t}},\textbf{Z}_{dec},\textbf{H}_{{\varvec{x}}}), \end{aligned}$$
(8)

where \(\textbf{Z}_{enc}\) and \(\textbf{Z}_{dec}\) can be thought of as pseudo-prefix tokens influencing the generation process [35].
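A minimal sketch of a prefix generation network g in the spirit of prefix-tuning [35]: a small MLP maps the conditioning vector to k pseudo-prefix key/value pairs for every Transformer layer. The MLP shape and the exact way the prefixes are consumed by the backbone's attention layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PrefixGenerator(nn.Module):
    """Map a conditioning vector to k key/value prefix slots per layer (a sketch)."""

    def __init__(self, cond_dim, k, num_layers, num_heads, head_dim):
        super().__init__()
        self.k, self.num_layers = k, num_layers
        self.num_heads, self.head_dim = num_heads, head_dim
        out_dim = num_layers * 2 * k * num_heads * head_dim  # keys and values
        self.mlp = nn.Sequential(nn.Linear(cond_dim, 512), nn.Tanh(),
                                 nn.Linear(512, out_dim))

    def forward(self, h):  # h: (cond_dim,) conditioning vector
        z = self.mlp(h)
        # (num_layers, 2, num_heads, k, head_dim): per-layer key/value prefixes
        return z.view(self.num_layers, 2, self.num_heads, self.k, self.head_dim)

# Example with T5-base-like dimensions (cond_dim = d = 768, prefix length k = 20).
gen = PrefixGenerator(cond_dim=768, k=20, num_layers=12, num_heads=12, head_dim=64)
Z_enc = gen(torch.randn(768))
```

In practice, prefix-tuning implementations feed these tensors to the attention layers as extra key/value states; the precise hook depends on the backbone implementation.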

4.4 Training and inference

We train the model by minimising the negative log-likelihood loss:

$$\begin{aligned} \theta ^* = \arg \min _{\theta } \ \sum _{({\varvec{x}},\varvec{y}) \in \mathcal {D}} -\log p_{\theta }(\varvec{y}\mid {\varvec{x}}, \varvec{e}) \end{aligned}$$
(9)

where \(\mathcal {D}\) is the training set, \(\theta ^*\) denotes the optimal parameters, \(\varvec{e} = \{e_1, e_2, \dots , e_{\mid \varvec{e} \mid }\} \subseteq \mathcal {E}\) denotes the event types of the task, \({\varvec{x}}\) is the input document, and \(\varvec{y}\) is the target event structure. As we formulate the event extraction problem as a sequence generation problem, the overall likelihood \(p_{\theta }(\varvec{y}\mid {\varvec{x}}, \varvec{e})\) factorises as follows:

$$\begin{aligned} p_{\theta }(\varvec{y}\mid {\varvec{x}}, \varvec{e}) = \prod _{t=1}^{\mid \varvec{y} \mid }p_{\theta }(y_{t}\mid \varvec{y}_{<t}, {\varvec{x}}, \varvec{e}). \end{aligned}$$
(10)

where \(y_t\) is the t-th token in the output sequence \(\varvec{y}\) and the product runs over all decoding steps.

For inference, we use constrained decoding [7].
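The sketch below shows one way to wire schema-constrained decoding into generation via HuggingFace's prefix_allowed_tokens_fn hook; the trie over valid linearised prefixes follows the spirit of [7], but its construction is dataset-specific and only stubbed out here, so this is a schematic rather than the exact mechanism.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Token-id trie over all valid linearised prefixes (structure tokens, event type
# names, role names, and spans copied from the document). Building it is
# dataset-specific and omitted here.
trie = {}  # nested dict: token_id -> sub-trie

def allowed_tokens(batch_id, generated_ids):
    node = trie
    for tok in generated_ids.tolist():
        node = node.get(tok, {})
    # Fall back to the full vocabulary if the (empty) trie has no entry; a real
    # implementation would always return the schema-legal continuations.
    return list(node.keys()) or list(range(len(tokenizer)))

enc = tokenizer("Evidence at a makeshift morgue points to mass executions.",
                return_tensors="pt")
outputs = model.generate(enc.input_ids,
                         max_length=128,
                         prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```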

5 Experiments

We compare our KC-GEE model with several recent strong models, evaluating in both supervised learning and zero-shot learning settings, as well as for the document-level extraction task. Our aim is to demonstrate the greater generalizability and effectiveness of our model in these challenging scenarios.

5.1 Evaluation setup

5.1.1 Datasets

We carry out experiments on two event extraction datasets: the sentence-level dataset Automatic Content Extraction 2005 (ACE05-EN) [11] and the document-level dataset WikiEvents [20]. The statistics for both are provided in Table 1. Note that we use the official splits of the two datasets to ensure reproducibility. It is worth noting that WikiEvents presents significant challenges due to three factors. (1) Context length: each instance in ACE05-EN contains only one sentence, whereas instances in WikiEvents are documents. (2) Event density: almost every instance in ACE05-EN contains only one event, whereas multiple events can be present in one instance of WikiEvents. (3) Data scarcity: the amount of training data in ACE05-EN is more than 77 times greater than that in WikiEvents.

5.1.2 Evaluation metrics

We employ the same evaluation metrics used in previous work [7, 36] for both trigger extraction (Trig-C) and arguments extraction (Arg-C). These metrics include F1, precision, and recall.

As KC-GEE is a text generation model, we reconstruct the offset of each predicted trigger mention by scanning the input sequence token by token for the matching utterance. For argument mentions, we take the matched utterance nearest to the predicted trigger mention as the argument offset.
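A small sketch of this offset reconstruction, assuming whitespace tokenisation for readability; in practice the matching operates on the model tokeniser's output, and the helper name and signature are illustrative.

```python
def find_offset(doc_tokens, mention_tokens, anchor=None):
    """Return the start index of `mention_tokens` inside `doc_tokens`.

    If the mention occurs several times and `anchor` (e.g. the trigger offset)
    is given, pick the occurrence closest to the anchor.
    """
    n, m = len(doc_tokens), len(mention_tokens)
    hits = [i for i in range(n - m + 1) if doc_tokens[i:i + m] == mention_tokens]
    if not hits:
        return None
    if anchor is None:
        return hits[0]
    return min(hits, key=lambda i: abs(i - anchor))

doc = "Evidence at a makeshift morgue points to mass executions".split()
trigger_offset = find_offset(doc, ["executions"])                 # 8
arg_offset = find_offset(doc, ["morgue"], anchor=trigger_offset)  # 4
```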

Table 1 Statistics of the event extraction datasets used in the paper, including the numbers of event types, argument types, the type of instances, events per instance, and the number of instances in different splits

5.1.3 Baselines

We evaluate KC-GEE against three groups of baselines that use annotations of decreasing granularity: both token-level and entity-level annotation, token-level annotation only, and parallel text-record annotation.

Some methods utilise token annotations, in which each token in an instance is annotated with event labels, together with gold entity annotations, to facilitate event extraction. Joint3EE [36] is a multi-task model that jointly performs entity, trigger, and argument extraction via shared Bi-GRU hidden representations. DYGIE++ [8] is a BERT-based extraction framework that models text spans and captures within-sentence and cross-sentence context. GAIL [37] is an ELMo-based model that proposes a joint entity and event extraction framework based on generative adversarial imitation learning, an inverse reinforcement learning method. OneIE [36] introduces a classification-based information extraction system that employs global features and beam search to extract event structures.

Other methods use token-level annotation only. For instance, TANL [3] is a sequence generation-based method that tackles event extraction in a trigger-argument pipeline. Multi-task TANL is an extended version of TANL that transfers structural knowledge from other tasks. BERT-QA [38] and MQAEE [39] cast event extraction as a sequence of extractive question-answering problems.

Similar to Text2Event [7], we use parallel text-record annotation, which only requires (instance, event) pairs without expensive, fine-grained token-level or entity-level annotations. As shown in an instance of such an annotation, \(\langle \)“Evidence at a makeshift morgue points to mass executions by the Iraqi regime.”, {Type: Execute, Trigger: executions, ...}\(\rangle \), parallel text-record annotation is the least demanding and therefore the most practical annotation level. We compare our method with Text2Event [7], which introduces a sequence-to-structure generation model that addresses the missing event structure issue via constrained decoding. Since BART-Gen [5] and Degree [6] both use BART-large as the backbone model, whereas we use T5-base, and both methods require the detected event type as a prior, we list their results as a separate group distinguished from our method and Text2Event. Furthermore, we evaluate KC-GEE against zero-shot approaches on ACE05-EN [9, 10, 27].

5.1.4 Implementation details

We develop our KC-GEE method based on the T5-base pretrained language model, and train it for 50 epochs with a learning rate of 1e-4 and batch size of 8 for the supervised setting. For the zero-shot setting, we use a learning rate of 5e-5 and batch size of 16. To optimize KC-GEE, we employ label smoothing [41] and AdamW [42]. The prefix length is set to 20 for all experiments in Section 5.2.
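For reference, the optimiser setup can be expressed as follows; the weight-decay and label-smoothing values are illustrative defaults rather than values reported above, and the backbone is loaded as in the earlier sketch.

```python
import torch.nn as nn
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Supervised setting: lr 1e-4, batch size 8, 50 epochs; zero-shot: lr 5e-5, batch size 16.
# Weight decay and label-smoothing factor below are illustrative, not reported values.
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=-100)
```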

5.2 Main results

We compare our KC-GEE model in two evaluation settings: fully supervised and zero-shot. For each setting, we organise the model evaluation by the characteristics of the datasets including sentence-level (ACE05-EN) and document-level (WikiEvents).

5.2.1 Supervised setting

In this setting, each model is trained on the full training data of the respective dataset. Table 2 presents the sentence-level event extraction results on ACE05-EN. Note that except for the last block, performance numbers of all baselines are taken directly from Text2Event [7].

From the table, it can be observed that our KC-GEE model outperforms Text2Event in terms of F1 for both argument extraction and trigger extraction.

Sentence-level performance

As discussed above, among all compared models, our KC-GEE model, together with Text2Event [7], is trained on parallel text-record annotations, which represent the weakest form of supervision. In contrast, the other baseline models require token-level and entity annotations, which are more fine-grained and expensive to collect. It is expected that models trained with richer supervision would perform better. The last column of the table also shows that the better-performing models use larger pretrained language models (PLMs), such as BERT-large; the larger capacity of these PLMs also contributes to their performance.

Table 2 Experiment results for the fully supervised event extraction on ACE05-EN. PLM represents the pretrained language model used by each model. We use text-record annotation, which only provides (instance, event) pairs without expensive, fine-grained token-level or entity-level annotations
Table 3 Results for supervised learning on the document-level event extraction dataset WikiEvents

Document-level performance

Table 3 shows the performance of the baseline (Text2Event), our model KC-GEE, and its different variants for document-level event extraction on the WikiEvents dataset. Please note that BART-Gen [5] and Degree [6] rely on explicit annotations of the event type and assume event-specific templates are given for document-level argument and trigger extraction, whereas both Text2Event and our model implicitly perform event detection followed by extraction. Furthermore, the remaining models listed in Table 2 are designed for sentence-level tasks and do not support this task. To ensure a fair comparison with other methods, we report our results under identical settings, as indicated in Table 3.

The majority of document-level baselines focus only on event argument extraction on the WikiEvents dataset and do not handle event types and triggers [20, 43, 44]. In contrast, our model supports the joint extraction of both event triggers and arguments from WikiEvents.

We can observe from the table that our full model achieves the best F1 values for both argument extraction (Arg-C) and trigger detection (Trig-C) on WikiEvents. It is especially noteworthy that KC-GEE achieves significant performance advantages over Text2Event of +11.1 and +9.4 absolute F1 points for Arg-C and Trig-C, respectively.

The superiority of our model can be attributed to two design features. Firstly, our cross-attention mechanism filters event type tokens and argument tokens, allowing the model to handle long contexts better. Secondly, our knowledge-based conditioning mechanism injects event type information into the model, enabling it to learn more effectively with less data. A detailed analysis of the contribution of each model component is presented below.

Table 4 Experiment results for zero-shot learning on sentence-level (ACE05-EN) and document-level (WikiEvents) datasets

5.2.2 Zero-shot setting

We evaluate KC-GEE’s ability to generalize to unseen event types in the zero-shot setting for both sentence-level (ACE05-EN) and document-level (WikiEvents) event extraction. Specifically, for each dataset, we randomly split the instances into two subsets, Source and Target. Source contains the annotations of 23 event types, while Target retains only 10 instances for each of 10 unseen event types. In this experiment, we first pretrain each model on the Source subset and then evaluate it on the 10 new event types in the Target subset without fine-tuning.

The results for both datasets are shown in Table 4. Once again, our full model significantly outperforms the baselines. On ACE05-EN, it obtains F1 gains of 27.7 and 9.2 absolute points for Arg-C and Trig-C, respectively. On WikiEvents, the F1 gains over Text2Event are 4.4 and 5.4 absolute points for Arg-C and Trig-C, respectively. We attribute the strong zero-shot generalizability of our model to knowledge-based conditioning: by casting event extraction as a generation problem and injecting event type names, the model gains task-specific information that is especially valuable when no training instances of the target event types are available.

Table 5 The ablation study in the supervised learning setting on the ACE05-EN dataset based on T5-base

5.3 Ablation study

This section analyzes the effects of prefix encoder conditioning, prefix decoder conditioning, prefix cross-attention, and constrained decoding in KC-GEE. We designed five ablated variants based on T5-base:

  • w/o enc-cond indicates KC-GEE without prefix encoder conditioning.

  • w/o dec-cond indicates KC-GEE without prefix decoder conditioning.

  • w/o both-cond indicates KC-GEE without both prefix encoder and prefix decoder conditioning.

  • w/o const-dec discards the constrained decoding during inference and generates event structures as an unconstrained generation model.

  • w/o cross-att indicates KC-GEE without prefix cross-attention.

Table 5 shows the results of ACE05-EN on the test set for the supervised learning setting. We observe that:

  • constrained decoding helps, but only marginally;

  • prefix encoder and decoder conditioning are most effective when used together.

Furthermore, as constrained decoding restricts the generated arguments and trigger words to spans of the input, our method does not suffer from hallucination problems.

5.4 Analysis

In this section, we conduct comprehensive studies to analyze the design of our method from different perspectives.

5.4.1 Prefix length

Longer prefixes provide more knowledge-based conditioning information to the model. Table 6 summarizes model performance with different prefix lengths on the WikiEvents dataset. As shown in the table, longer prefixes improve model performance on Arg-C, while performance on Trig-C improves with increasing prefix length up to 20, after which the F1 value plateaus. However, longer prefixes require more model parameters. Therefore, we set the prefix length to 20 as a trade-off between model performance and computational efficiency.

Table 6 Zero-shot learning on WikiEvents with different prefix lengths
Table 7 Zero-shot learning on WikiEvents with different knowledge-based conditioning

5.4.2 Knowledge-based conditioning

A key contribution of our method is the introduction of knowledge-based conditioning information. We analyze this component from two perspectives: (1) conditioning information and (2) injection mechanism.

Conditioning information

In Table 7, we analyze in detail the effect of different types of knowledge-based conditioning information, fixing the prefix length at 20. As can be seen, having no knowledge-based conditioning (None) results in poor performance across the board. Injecting task-agnostic information (Pseudo token) provides noticeable gains on Trig-C. Furthermore, injecting event type information substantially improves performance on both Arg-C and Trig-C. Adding role information improves performance on Arg-C but decreases performance on Trig-C. Finally, having all three types of conditioning does not bring additional benefits.

This comparison highlights the effectiveness of knowledge-based conditioning on event type information. Additionally, incorporating role information enhances argument extraction performance, although it comes at the expense of trigger extraction.

Injection mechanism

The bottom four rows in Tables 3 and 4 display variants of our KC-GEE model, where knowledge-based conditioning information is injected in different ways, as depicted in Figure 3. Specifically, the “Adapter” variant injects knowledge-based conditioning information in an adapter layer over each Transformer layer while freezing the parameters of the underlying language model. The “Fine tuning+Adapter” variant employs adapter layers and updates the language model’s parameters. The “Prefix” variant prepends the knowledge-based conditioning vectors \({\varvec{h}}\) to each layer in the language model while keeping the language model’s parameters frozen. Finally, the “Fine tuning+Prefix” variant additionally updates the parameters of the language model. We can make the following observations from Tables 3 and 4.

As expected, updating the language model’s parameters (i.e. “Fine-tuning”) is much more effective than keeping the parameters frozen, regardless of whether the knowledge-based conditioning information is injected as adapters or prefixes.

The “Adapter” style of injection performs especially poorly on WikiEvents in both the supervised and zero-shot settings. In comparison, on WikiEvents, “Prefix” injection outperforms Text2Event in the zero-shot setting and achieves competitive performance in the supervised setting.

6 Conclusion

In this paper, we formulate the problem of event extraction as a natural-language generation task. We propose KC-GEE, a generation-based document-level event extraction technique that leverages large pretrained language models. A key component of KC-GEE is a novel knowledge-based conditioning technique that injects event type information into the model as prefixes to enable zero-shot learning capability. The cross-attention mechanism in the prefix generator also facilitates effective document handling. Extensive experiments on two benchmark datasets demonstrate the effectiveness of KC-GEE, which achieves state-of-the-art performance in document-level extraction in both fully supervised and zero-shot settings. In the challenging zero-shot setting, KC-GEE outperforms the current best model by up to 27.7 absolute F1 points. In future work, we will investigate incorporating attention mechanisms or graph-based techniques to integrate external knowledge, such as event descriptions, to further improve zero-shot event extraction performance.

7 Limitations

In this paper, we explore a new method for solving zero-shot and document-level event extraction through knowledge-based conditioning. The model's zero-shot transfer ability derives primarily from compositions of roles seen during training. This means that although the model has not seen any instances of the zero-shot event types during training, the schemas of these event types are available at the training stage. A gap therefore remains with respect to true zero-shot generalization.