1 Introduction

In the economic domain, information extraction from text is widely used to make available the fundamental knowledge present in economic text, such as business news (Day & Lee, 2016; Khedr et al., 2017), regulatory disclosures (Cavar & Josefy, 2018; Feuerriegel & Gordon, 2018), and social media (Gémar & Jiménez-Quintero, 2015; Oliveira et al., 2017). Extracting factual data (using named entity recognition, relation extraction, and event extraction) or subjectivity data (using sentiment analysis) from economic text can enrich the available numerical data on markets with fundamental information for financial applications. In business and finance, events encapsulate new information on the market, and automatically collecting novel event data thus has applications in stock prediction (Bholat et al., 2015; Chen et al., 2019; Nardo et al., 2016; Nassirtoussi et al., 2014; Zhang et al., 2018), risk analysis (Hogenboom et al., 2015; Wei et al., 2019), policy assessment (Karami et al., 2018; Tobback et al., 2018), brand management (De Clercq et al., 2017; Geetha et al., 2017), and marketing (Rambocas & Pacheco, 2018) (Fig. 1).

Fig. 1 Examples of annotated economic event schemata with argument roles and attributes. Boldface indicates event trigger spans, which can be discontinuous

While learning-based methodologies are currently predominant in NLP research, economic event extraction is still dominated by pattern-based approaches. To enable data-driven supervised event extraction for company-specific news, we discuss in this work the construction of the SENTiVENT Economic Event corpus, a representative corpus enabling both NLP and market research. The SENTiVENT annotation scheme aims to be compatible with the benchmark ACE/ERE event datasets.

We believe this work delivers the following contributions to the field of economic event processing:

  • The first manually annotated SENTiVENT news corpus of fine-grained economic events: we annotated full ACE/ERE-like event schemata with attributes (type, subtype, negation, modality, realis), event co-reference, and arguments (participant/filler), enabling structured event extraction in the economic domain for English. No datasets of fine-grained economic events were previously available for economic information extraction. Compatibility with ACE/ERE-like event nuggets allows straightforward porting of state-of-the-art models in event extraction, for which ACE2005 remains a benchmark dataset.

  • Representative collection and typology creation methodology: we applied a dataset collection approach for obtaining a corpus representative of the business news genre while minimizing topical specialization across sectors. Coverage and representativeness are guiding principles in the creation of the event typology. The choice of which events to include in event typologies is usually arbitrary and not transparent. We applied a typology creation method of iterative refinement, resulting in annotator agreement on type and subtype identification that is substantial to (almost) perfect.

  • A sentence-level event detection pilot study producing satisfactory results, demonstrating that this dataset is adequate for machine learning purposes.

The remainder of this paper is organized as follows. First, Sect. 2 discusses related work on event extraction and information extraction in the economic domain. Section 3 describes our article collection approach used to obtain a corpus representative of business news. Section 4 presents the annotated units and definitions in the corpus, the tools and procedure used for annotation. Section 5 covers the inter-annotator agreement study. Next, we present the properties and statistics of the annotated corpus in Sect. 6. Finally, we elaborate on the methodology and results of our event detection pilot study in Sect. 7. We conclude with some notes on the availability of the corpus and prospects for future research.

2 Related research

2.1 General domain fine-grained event extraction

Our work on economic event annotation and processing is situated within the rich history of automatic event processing. Starting with a pilot study in 1999, the ACE (Automatic Content Extraction) annotation scheme and program was highly influential in event processing. The objective of the ACE program was to develop technology to automatically infer entities being mentioned in text, the relations among these entities that are directly expressed, and the events in which these entities participate (Doddington et al., 2004). The ACE program also periodically organized “Event Detection and Recognition” evaluation competitions in which event extraction corpora were released (Walker et al., 2006; Linguistic Data Consortium, 2005), thus leading to different types of methodologies to tackle the problem of event extraction. Some years later, the ERE (Entities, Relations, Events) standard was conceived under the DARPA DEFT program as a continuation of ACE with the goal of making annotation easier and more consistent. It consolidated and simplified the ACE typology and removed many complex annotation features, resulting in the Light ERE standard (Aguilar et al., 2014; Bies et al., 2016). In both the ACE and Light ERE annotations, an event is defined as an explicit occurrence involving participants, belonging to a pre-defined set of types/subtypes. Subsequently, expansions to this standard were added as Rich ERE (Song et al., 2015): the typology and attributes were expanded slightly and more attention was devoted to event coreference (“event hoppers”). The DEFT ERE standards formed the basis of the Event Nugget annotation scheme proposed by Mitamura et al. (2015b). An event nugget was defined as a semantically meaningful unit referring to an event. The event trigger could be a single word, as in ACE/Light ERE, or a discontinuous multi-token expression. The Knowledge Base Population track of the Text Analysis Conference (TAC-KBP) organized several shared tasks based on this event nugget standard (Mitamura et al., 2015a, 2016a, 2017). Our annotation scheme is inspired by the Rich ERE event annotation schemes (Linguistic Data Consortium, 2016, 2015a, 2015b) so as to provide compatibility with ongoing research in event extraction, where ACE and ERE datasets remain dominant benchmarks.

2.2 Events in the economic domain

Business and finance are domains in which new information is produced and consumed at a high rate due to the economic proposition of information acquisition. In practice, investors and analysts widely use text data such as news announcements and commentary on recent events as a major source of market information (Oberlechner & Hocking, 2004). In the financial domain, the way companies are perceived by investors is influenced by the news published about those companies (Engle and Ng, 1993; Mian and Sankaraguruswamy, 2012; Tetlock, 2007).

Aside from this economic purpose, there is also academic interest in automated event extraction: financial economists conduct so-called “event studies” to gain insight into the way markets react to new information about companies. These event studies measure the impact of a specific event on the value of a firm (MacKinlay, 1997). Identifying news published about certain events in an automated way enables researchers in the field of event studies to process more data in less time. This can consequently lead to new insights into the correlation between events and stock market movements.

Most existing approaches to the detection of economic events, however, are knowledge- and pattern-based (Arendarenko & Kakkonen, 2012; Chen et al., 2019; Du et al., 2016; Hogenboom et al., 2013). These use rule-sets or ontology knowledge bases which are largely or fully created by hand and do not rely on fully manually annotated supervised datasets for machine learning. The Stock Sonar project (Feldman et al., 2011a) notably uses domain experts to formulate event rules for rule-based stock sentiment analysis. This technology has been successfully used to assess the impact of events on the stock market (Boudoukh et al., 2016) and to formulate trading strategies (Ben Ami and Feldman, 2017). Along the same lines, Hogenboom et al. (2013) rely on a hand-crafted financial event ontology for pattern-based event detection in the economic domain and incorporate lexicons, gazetteers, PoS-tagging, and morphological analysis.

A rich variety of representations of events in economic news exists in experiments on stock market prediction. Here, event information is used as an information source in machine learning-based price or movement prediction. On a continuum of coarse- to fine-grained event representations as features, we can distinguish approaches using raw text as embeddings (Del Corro & Hoffart, 2020; Hu et al., 2018; Xu & Cohen, 2018), topic modeling (Nguyen and Shirai, 2015; Zhou et al., 2015), (subject, predicate, object) tuples (Ding et al., 2015, 2016; Zhang et al., 2018) or (entities, time, location, key words) tuples (Zhou et al., 2015), fine-grained semantic frames (Xie et al., 2013), and ACE/ERE-like event structures (Chen et al., 2019; Yang et al., 2018). The latter rely on rule-based data generation methods (distant supervision) to produce ACE/ERE-like event data, as no supervised economic datasets are available. Chen et al. (2019) highlight the improved performance in stock prediction of fine-grained event structures compared to coarse-grained representations.

Several semi- or distantly supervised approaches exist in which seed sets are manually labeled or rule-sets are used to generate or enhance training data. However, these often lack the granularity and structure of ACE-like events and are annotated at the document level. Qian et al. (2019), for example, used a semi-supervised approach for event type detection using word vector clustering in Chinese economic news. Clusters were manually labeled for event type and unseen documents were classified by cluster similarity. The authors annotated 4 event types (“Financing”, “M&A”, “IPO”, “Delisting”) on a gold-standard evaluation corpus of 14,556 documents containing article headlines and leads from October 2011 to February 2017. Rönnqvist and Sarlin (2017) used weak supervision in the financial context of bank distress events. They relied on a knowledge base of known bank distress events and thus collected 386K sentences. A sentence is taken to describe a weakly supervised event when the article’s publication date is near the date at which the bank distress event occurred. Here, bank distress events are conceptualized as mentions of bank entities in a time window and no typology classification is assigned. Ein-Dor et al. (2019) used a weakly supervised training dataset for identifying company-related events at the sentence level in English news articles. Their main interest was in the binary task of detecting sentences containing events. Sentences from a company’s Wikipedia article were selected if they appeared in an event section and started with a date pattern (such as ‘On/In/By/As of’ + month + year). The training set was re-balanced to enforce an equal number of positive and negative instances. The target evaluation dataset consisted of articles scraped through Seeking Alpha containing the name of one of the 10 most eventful S&P companies in Wikipedia for 2019. As test annotations, events from 2019 were manually collected from the Wikipedia pages of those companies, and all sentences containing a reference to these were marked as positive (26 different events). Yang et al. (2018) also used a hybrid weak-supervision approach to create an event dataset for Chinese economic news. They relied on a financial event knowledge database consisting of labelled event trigger and argument samples of nine pre-determined events collected by financial professionals. The text data came from official announcements released by companies on the web and obtained from Sohu securities net. A weakly supervised dataset was created by dictionary matching the knowledge-base event triggers and arguments against the corpus. Han et al. (2018) describe a hybrid approach to ACE-like event extraction in Chinese business news. They labeled triggers, argument types, and event types for 1500 Chinese news articles from the Sina.com news aggregator with 8 economic event types. Unfortunately, detailed information on the annotation scheme is missing and it is not clear to what extent manual annotation was applied. In their experimental work, the authors mainly relied on a knowledge-based methodology with ML-based expansion, using trigger and argument dictionaries and a pattern-matching approach for argument extraction. They expanded a handmade event trigger dictionary by finding synonyms using a Word2Vec-based word vector model of a background corpus of Chinese business news.

Few strictly supervised approaches exist in the literature, precisely due to the lack of annotated training data.

Our previous work in the SentiFM project (Van de Kauter et al., 2015a, 2015b) mainly focused on annotating implicit and explicit sentiment in Dutch and English economic text. Besides sentiment annotations, event annotations were made, but these were limited to continuous trigger spans of 10 event types and 64 subtypes. Only one argument relation was tagged, linking the company involved to the event span. No other event attributes were annotated. The English SentiFM corpus consists of 497 news articles containing 2522 annotated events. The English subcorpus was collected from The Financial Times using keyword search, where the title had to contain one out of seven companies. The Dutch part has 126 articles on seven BEL20 companies containing 1152 events. Several supervised event detection experiments have been conducted on this dataset (Jacobs et al., 2018; Lefever & Hoste, 2016). Our current corpus has more fine-grained event annotations with additional attributes, a more diverse typology, more annotated events, and a higher event density, meaning that more relevant events are tagged. It was also collected by random crawling, making it more representative of the business news genre.

We are not aware of any other published fine-grained ACE/ERE-like event extraction dataset for the economic domain. This work presents the first attempt at the creation of a gold-standard event dataset in the economic domain that incorporates an agreement and annotation quality study and detailed guidelines.

3 Dataset collection

First, we crawled English online business news articles on all companies in the S&P 500 from various sources over a period of 14 months (June 2016–May 2017) using the business news aggregator Yahoo! FinanceFootnote 1 API. Next, we filtered all articles pertaining to a hand-selected set of 30 companies based on publication frequency and sector diversification.Footnote 2 Finally, we selected a subset of articles for each company based on temporal spread, topical diversity, and quality criteria for linguistic annotation. The goal of this selection method was to ensure temporal spread and to avoid that a company, news event, sector, or topic was over-represented in the data. Avoiding topical specialisation ensures the data is general enough to represent business news as a text genre.

The initial crawl was restricted to all companies that are historically and currently present in the Standard & Poor’s 500 (S&P 500) Index. The S&P 500 was chosen over other indices because it contains companies that are commonly reported on and has more sector diversification than other indices such as the DJIA or the technology-heavy NASDAQ. Made up of 500 of the most widely traded stocks in the U.S., it represents about 80% of the total value of U.S. stock markets. In general, the S&P 500 index thus gives a good indication of movement in the U.S. marketplace as a whole. Additionally, technical data on stock performance and metrics is readily available for stocks listed in the S&P 500, which also enables stock performance and market research. We scraped the linked article pages and their metadata (such as tags, author, publisher source, etc.) daily from June 2016 to May 2017.

The selection of 30 companies was made based on sectorial diversification and frequency in the crawled data. This selection procedure balances the need for sector diversification with the representation of industries in the business news genre. All included companies are listed in Table 1 alongside their GICS-codified industries. For each company, a final random selection of articles was made based on the temporal spread of all available articles. Using temporal binning of publication dates, we ensure that reported topics differ over time, because articles that are published at the same time often report on the same topic.
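To make the temporal binning concrete, the sketch below illustrates how articles for one company could be spread over equal-width time bins before random selection. It is a simplified reconstruction under our own assumptions (the record field `published` and the helper `select_articles` are hypothetical), not the actual corpus construction script.

```python
import random
from collections import defaultdict

def select_articles(articles, n_per_company=10, n_bins=12, seed=42):
    """Spread a random article selection for one company over temporal bins
    so that articles clustered around a single news event are not
    over-represented. `articles` is a list of dicts with a 'published'
    datetime.date field (hypothetical format)."""
    random.seed(seed)
    articles = sorted(articles, key=lambda a: a["published"])
    start, end = articles[0]["published"], articles[-1]["published"]
    span_days = max((end - start).days, 1)

    # Assign each article to one of n_bins equal-width temporal bins.
    bins = defaultdict(list)
    for art in articles:
        idx = min(n_bins - 1, (art["published"] - start).days * n_bins // span_days)
        bins[idx].append(art)

    # Draw (roughly) evenly from every non-empty bin until the quota is met.
    selection = []
    while len(selection) < n_per_company and any(bins.values()):
        for idx in sorted(bins):
            if bins[idx] and len(selection) < n_per_company:
                selection.append(bins[idx].pop(random.randrange(len(bins[idx]))))
    return selection
```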

To extract the textual content from the article web pages, three article content extraction tools were used: (1) a self-written rule-based parser, (2) the rule-based Goose (Plush, 2014), and (3) the machine learning-based BoilerPipe (Kohlschütter et al., 2010). The best output for each article was manually chosen in an article selection phase and missing text was manually corrected.

Table 1 Overview of companies on which articles are annotated in the English SENTiVENT corpus

Articles that did not meet the following quality and relevancy criteria were removed from the selection: (a) company-specific focus, (b) topical diversity (articles focusing on an event already reported earlier in the corpus were removed), (c) pertinence and novelty (only articles relevant to the present state of the market were selected), and (d) human language production, leading to the exclusion of robo-written and templated articles.

This method ensures a representative, topically diverse, and temporally spread corpus enabling both text-based information extraction as well as market research.

4 Annotation definitions and procedure

In this section, we define the event annotations as well as the annotation process. While an event can be defined as a specific real-world occurrence involving participants, our focus is on economic events, i.e. ‘textually reported real-world occurrences, actions, relations, and situations involving companies and firms’. Since we could not rely on annotation guidelines from previous work, nor on publicly available datasets, we decided to seek inspiration in the rich body of previous work on event annotation from the ACE/ERE (Song et al., 2015) competitions, for which detailed guidelines and annotations are available. For the annotation of economic events we thus draw on the operationalization of event structures from the DEFT Rich ERE annotation guidelines (Linguistic Data Consortium, 2016), and more specifically the technical reports on annotating events (Linguistic Data Consortium, 2015b) and arguments (Linguistic Data Consortium, 2015a).

There is one important difference with ACE/ERE datasets: We do not tag pre-defined entity types and relations. Unlike ACE/ERE, argument role definitions are hence not restricted by entity types, i.e. any text span that describes a relevant event argument within the event scope is tagged.

Events are limited to a type and subtype from our event typology, which we will further discuss in Sect. 4.1. A sentence can contain multiple events and event argument links can cross sentence boundaries. The full event annotation consists of the event trigger, argument span, and attributes (type, subtype, modality, negation) in running text (i.e. an event mention in ACE/ERE terminology). Figure 2 shows examples of annotated events with attributes.

The event trigger is the minimal span of text (a single word or a small phrase) that most succinctly expresses the occurrence of an event. Generally, we think of the trigger as the word that most strongly refers to an event. Event triggers are annotated at the token-level and discontinuous token-spans are allowed. In Fig. 2, “sell” is a single-word trigger while “sales \(\ldots \) stall” and “revenues \(\ldots \) rise” are discontinuous multi-token triggers. Annotators are instructed to keep the trigger as small as possible while maintaining the core lexical semantics of the event.

Event arguments are the participating entities that are involved in the event by filling a certain prototypical semantic role. Arguments fill a number of roles specific to an event type. Examples are the person or company performing an action, the amount of capital involved, or some other piece of information, like the time or place where the event happens. It is possible that none of the participants or only a select few are tagged. We distinguish two types of event arguments:

  1. Participant arguments: these are the text spans describing a Participant that is involved in the event. For each event type/subtype there is a specific set of Participant roles to be filled. We only tag Participant arguments that occur within the event mention scope of an event trigger. The event mention scope is defined as the span of the document from the first trigger of an event to the next trigger of the same, coreferent event. This avoids double tagging of arguments for co-referent events.

  2. Filler arguments: Filler arguments are non-central arguments of the event but provide descriptive and discriminative information. They denote the time, place, capital (amount of money), etc. attached to an event. Filler arguments differ from Participant arguments in two ways: (i) they are not bound by the event mention scope and can be tagged anywhere in the document; (ii) they have to be full noun phrases (NPs), unlike Participants, which can be anaphoric pronouns.

We also annotated two features pertaining to the factuality of the event: Modality, with values Certain vs. Other, and Polarity, with values Affirmed vs. Negated. Polarity captures whether an event mention is affirmed (the event actually took place) or negated (the event did not happen). Modality captures the degree of certainty (i.e., epistemic modality) about the event being the case as expressed by the author or another source. Automatic processing of event factuality is necessary for downstream applications such as knowledge-base population or market modelling, so that events that are speculative or did not take place in the real world can be filtered out.Footnote 3
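For illustration, the snippet below sketches how a single annotated event mention with its trigger, arguments, and factuality attributes could be represented programmatically. The field names, type/subtype values, and role labels are our own simplification of the scheme described above, not the release format of the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    role: str                       # e.g. "Amount", "Time" (illustrative role names)
    tokens: List[Tuple[int, int]]   # (sentence index, token index) pairs; may be discontinuous
    kind: str                       # "participant" or "filler"

@dataclass
class EventMention:
    event_type: str                 # one of the 18 main types, e.g. "Revenue"
    subtype: Optional[str]          # illustrative subtype value
    trigger: List[Tuple[int, int]]  # trigger token span(s); discontinuous triggers allowed
    polarity: str = "affirmed"      # "affirmed" or "negated"
    modality: str = "certain"       # "certain" or "other"
    arguments: List[Argument] = field(default_factory=list)
    hopper_id: Optional[int] = None # coreference: mentions in the same hopper co-refer

# Hypothetical sentence: "Revenues are expected to rise 5% in Q3"
mention = EventMention(
    event_type="Revenue", subtype="Increase",
    trigger=[(0, 0), (0, 4)],       # discontinuous trigger "Revenues ... rise"
    modality="other",               # speculative: "are expected to"
    arguments=[Argument("Amount", [(0, 5)], "filler"),
               Argument("Time", [(0, 7)], "filler")],
)
```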

Fig. 2 Examples of annotated economic event schemata with argument roles. Boldface indicates event trigger spans, which can be discontinuous

Additionally, two types of relations are annotated. Event coreference indicates whether two event mentions intuitively refer to the same real-world event. It is required that type and subtype are identical and that the location and temporal scope match. For event coreference, we directly inherited Rich ERE’s concept of “event hoppers”. A hopper is a group of annotated events that semantically refer to the same event occurrence. This differs from strict event coreference in ACE and Light ERE, in which arguments also have to corefer strictly through exact matching of all event features and argument semantics. We also annotate the Canonical referent of a pronominal or anaphoric noun phrase. When participant arguments are realized as a pronoun (“it”) or an anaphoric noun phrase (“the company”, “the stock”), annotators are instructed to draw a link to the nearest canonical nominal referent (containing the name of the company, person, or stock). This link is useful for producing a dataset variant in which all arguments are nominal phrases (names).

4.1 Event typology creation

The event typology specifies which types of events we annotate and the semantic argument roles in which they are involved. The aim of defining an event typology for economic event detection is to obtain adequate coverage of the factual content of economic news articles. In designing such a typology, we have to balance relevance from an investor point-of-view, generality of event types across industries, and coverage. Our event type selection methodology was similar to FrameNet’s, where types start with an annotator’s intuition and are gradually defined and grouped based on corpus examples (cf. Ruppenhofer et al., 2016, p. 12 for details on frame construction and type selection).

We iteratively developed the event typology on a subcorpus in close collaboration with a financial domain expert and in accordance with the prior state of the art. The expert-created event typology of the Stock Sonar project (Feldman et al., 2011b) identifies the event types “Legal”, “Analyst Recommendation”, “Financial”, “Stock Price Change”, “Deals”, “Mergers and Acquisitions”, “Partnerships”, “Product”, and “Employment”. Boudoukh et al. (2019) identified the following 18 event categories based on Capital-IQ typesFootnote 4 and a cross-section from academic event studies: “business trends, capital returns, CSR/brand, deals, earnings factors, employment, facility, financial, financing, forecast, general, investment, legal, mergers & acquisitions, product, ratings, stocks, and stock holdings”. The event typology we arrived at by gradual refinement overlaps strongly with this work. The rule-based SPEED event detection system of Hogenboom et al. (2013) differentiates between ten financial events (announcements regarding “CEOs”, “presidents”, “products”, “competitors”, “partners”, “subsidiaries”, “share values”, “revenues”, “profits”, and “losses”). The PULS business intelligence system of Du et al. (2016) detects 15 event types pre-categorized as “positive” or “negative”: positive business events include “acquisition”, “investment”, “order marketing”, “product launch”, and “merger”; negative events are “bankruptcy”, “lawsuit”, “business closing”, “layoff”, “product recall”, and “accident”. The SentiFM project for sentiment analysis and business event detection defined 10 events: “Buy rating”, “Debt”, “Dividend”, “Merger&Acquisition”, “Profit”, “Quarterly results”, “Sales volume”, “Share repurchase”, “Target price”, and “Turnover”. The ACE/ERE event typology defines 8 types and 33 subtypes, of which one event type, namely “Business” with its 4 subtypes (“Start-org”, “Merge-org”, “Declare-bankruptcy”, and “End-org”), is also relevant to the economic domain. These types, however, lack the descriptiveness to capture the information commonly conveyed in business-specific news and are too coarse to be of much use for business analysis (Fig. 3).

Fig. 3 Diagram illustrating the iterative refinement of the event typology

These existing event typologies were taken as the basis for our typology. We joined similar types into a seed list of event types, and on a random subsample of 40 crawled articles, each belonging to a different company, we counted the presence of these events. Additionally, we labeled special-interest events and news pegsFootnote 5 that were not in the seed typology. We counted the frequency of these types, subtypes, and participants in the subcorpus and adjusted the typology by adding new event types or removing infrequent ones. Relevant participant roles were also added in this process of iterative refinement. After this iterative process, we ended up with a typology consisting of 18 types and 42 subtypes of company-specific economic events, as listed in Table 2. For a full description of the types and subtypes, we refer to the guidelines (Jacobs, 2020b).

Table 2 Event types and subtypes

The types “Profit/loss” (a.k.a. “Earnings”), “Revenue”, “Expense”, and “Sales Volume” represent major metrics in financial reporting on the income statement. We found these to be the most commonly reported financial metrics in business news. Other commonly reported metrics such as debt ratios or cash flow statements, as well as highly industry-specific indicators, were collected under the “Financial Report” event type. Asset performance event types include “Rating” and “Security Value”. These involve analysts giving investor advice and discussions of a company’s stock value. “Macroeconomics” is a broad category pertaining to events that do not involve decisions or interactions of specific firms. While this category is not company-specific, it was devised to capture the highly common discussions of sector trends, economy-wide phenomena, and governmental policy in the news.

4.2 Annotation process

We used the online text annotation platform WebAnno v4.5 for all our annotations. Three annotators were trained under supervision. All annotators have professional linguistic training in English equivalent to C2 level, and all had taken an undergraduate-level course on economics. We also trained the annotators in the terminology and financial concepts related to the events. An independent annotator corrected first-pass annotations on a subsample of 104 (40%) documents. Potentially problematic documents were prioritized for correction by searching for common annotation mistakes, e.g., error-prone event types, an unusual event-to-article-length ratio, or a high number of weakly realized events.
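A prioritization heuristic of this kind could look as follows; this is an illustrative reconstruction under our own assumptions (document fields, thresholds, and the set of error-prone types are placeholders), not the actual correction tooling.

```python
import statistics

def flag_for_correction(docs, error_prone_types, z=1.5):
    """Rank documents for a second correction pass. Each doc is a dict with
    'doc_id', 'n_tokens', and 'events' (a list of event-type strings).
    A document is flagged if its event density deviates strongly from the
    corpus mean or if it contains an error-prone event type."""
    densities = [len(d["events"]) / d["n_tokens"] for d in docs]
    mean = statistics.mean(densities)
    std = statistics.pstdev(densities) or 1.0  # avoid division by zero
    flagged = []
    for d, dens in zip(docs, densities):
        unusual_density = abs(dens - mean) / std > z
        risky_types = any(t in error_prone_types for t in d["events"])
        if unusual_density or risky_types:
            flagged.append(d["doc_id"])
    return flagged

# Hypothetical usage; the error-prone type set below is a placeholder.
# flag_for_correction(corpus_docs, error_prone_types={"Macroeconomics"})
```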

5 Inter-annotator agreement and annotator performance

To determine the feasibility of the annotation task, we conducted an inter-annotator agreement study on a subsample of 30 documents pertaining to 10 different companies. Each document was fully annotated by the three annotators. Our task demands a large degree of interpretation as events are abstract semantic categories (the event type, subtype and arguments of an event have the potential to be ambiguous) and as such an agreement metric should highlight the strength of consensus among annotators. We applied two approaches to determine annotation agreement: (a) Chance-corrected span-aligned agreement of event token spans, (b) Event nugget scoring to capture performance of annotators on the full event structure compared to an adjudicated reference.

5.1 Span-aligned agreement

Event annotations are made at the token level and token spans can be discontinuous. In line with Lee and Sun (2019), who examined agreement for span annotations in medical text, we computed agreement metrics by using overlapping span alignment. This allows us to determine agreement on the labels of events, while the exact span overlap is not of interest. Events can be considered semantic categories which are not bound to specific syntactic or lexical boundary rules, leading to quite some variation in the boundaries of token spans, as exemplified in Fig. 4. Examples 1–3 in Fig. 4 show typical annotations where boundaries do not match exactly, but the core event trigger remains valid. Applying exact span agreement, in which spans only match at the boundaries, would underestimate the agreement of span annotations.

Other work in computational linguistics with span annotations performs agreement scoring at the sentence level (Pavlopoulos and Androutsopoulos, 2014; Pontiki et al., 2016) or clause level (Thet et al., 2010) and thus circumvents the issue of alignment. These approaches, however, produce larger units and lose granularity, resulting in higher agreement scores because unmatched spans within the same sentence or clause are not discounted. Our span-alignment approach presents a more rigorous attempt at unitizing spans for computing agreement.

Fig. 4 Examples from the corpus of non-exact span overlap. Double lines in example 4 indicate that the annotator has made an overlapping annotation

For calculating span-aligned agreement, we first unitized the text by matching overlapping spans of annotations into alignment groups and subsequently computed agreement metrics over those groups. In other work where overlapping span alignment is used (Lee & Sun, 2019; Liu et al., 2015), this is done pair-wise between annotators, with the final agreement score being the mean over all annotator pairs. However, averaging pair-wise scores does not allow calculating the mean expected agreement distribution of labels across all annotators, which is necessary for Krippendorff’s \(\alpha \). Hence, we made alignment groups for all three annotators at once. When one annotator had multiple candidates that overlapped with the other annotators’ spans, one annotation had to be selected. This selection was made based on the Dice coefficient (equivalent to the \(F_1\)-score) of overlapping tokens. The other non-matched annotations were then placed in their own alignment groups. For example, in Sentence 4 in Fig. 4, annotator 3 made a self-overlapping annotation “increasing commodity prices” and “increasing commodity”. The selection is made by taking the mean Dice overlap score with all other annotations, which results in the alignment of “increasing commodity prices” with the spans of the other annotators. The span “commodity prices” was then placed in its own alignment group with no match. Example 5 is a similar case with a discontinuous annotation by annotator 2, showing that our alignment approach handles discontinuity correctly. Annotator 2 made a discontinuous annotation “upgraded \(\ldots \) to outperform” and annotator 3 made two different annotations, “upgraded” and “outperform”. The “upgraded” annotations of annotators 1 and 3 and the discontinuous “upgraded \(\ldots \) to outperform” are placed in the same alignment group, while “outperform” remains unmatched, as desired.
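The alignment step can be sketched as follows. This is a simplified, greedy reconstruction under our own assumptions (span and group representations are ours), not the exact implementation used for the agreement study.

```python
def dice(a, b):
    """Dice coefficient over sets of token ids (equivalent to token-level F1)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def align_spans(annotations):
    """Greedily group overlapping spans across annotators into alignment groups.
    `annotations` maps annotator id to a list of (label, token_ids) spans;
    token id sets may be discontinuous. Returns lists of (annotator, label,
    token_ids) tuples; spans with no overlap end up in their own group."""
    groups = []
    for annotator, spans in annotations.items():
        for label, tokens in spans:
            # Score the span against each existing group by its mean Dice
            # overlap with the spans already in that group.
            best, best_score = None, 0.0
            for group in groups:
                if any(ann == annotator for ann, _, _ in group):
                    continue  # at most one span per annotator per group
                scores = [dice(tokens, t) for _, _, t in group]
                score = sum(scores) / len(scores)
                if score > best_score:
                    best, best_score = group, score
            if best is not None and best_score > 0.0:
                best.append((annotator, label, tokens))
            else:
                groups.append([(annotator, label, tokens)])  # unmatched: own group
    return groups
```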

To measure the amount of agreement that is above chance across the three annotators, we used the chance-corrected agreement measures Fleiss’ kappa (\(\kappa \)), Krippendorff’s alpha (\(\alpha \)), and Gwet’s AC1 on the aligned span annotations. Fleiss’ kappa (Fleiss, 1971) is a chance-corrected agreement measure that generalizes Scott’s Pi to any number of annotators. Krippendorff (2004) argues for the use of alpha over other measures such as kappa because of its independence from the number of annotators and its robustness against imperfect data. AC1 was introduced by Gwet (2001) to provide a more robust measure of inter-annotator reliability than Fleiss’ kappa: AC1 overcomes kappa’s sensitivity to trait prevalence and to annotators’ classification probabilities (i.e., marginal probabilities) (Gwet, 2002). The interpretation of AC1 is similar to that of generalized kappa, which is used to assess inter-annotator reliability when there are multiple annotators. We used the R package irrCACFootnote 6 to compute these metrics and their statistical properties.
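As an illustration of what these coefficients compute over the aligned spans, the following minimal Fleiss' kappa implementation operates on a matrix of per-span label counts; it is a reimplementation for exposition only (the study itself used the irrCAC R package), and Krippendorff's alpha and Gwet's AC1 follow the same general structure but estimate chance agreement differently.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa over an (n_items, n_categories) matrix where
    counts[i, j] is the number of annotators who assigned category j to
    aligned span i. Assumes the same number of raters for every item."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts.sum(axis=1)[0]
    # Observed agreement per item, averaged over items.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 annotators, 4 aligned spans, 3 event-type categories.
toy = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
print(round(fleiss_kappa(toy), 3))
```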

Table 3 shows the agreement analysis with coefficients, their standard error, the 95% confidence interval (C.I.), and p-values. We benchmark these coefficients using the cumulative membership probabilities within the agreement ranges as set out by Landis and Koch (1977): \(\kappa < 0\): poor agreement, \(0 \le \kappa \le 0.2\): slight agreement, \(0.2 < \kappa \le 0.4\): fair agreement, \(0.4 < \kappa \le 0.6\): moderate agreement, \(0.6 < \kappa \le 0.8\): substantial agreement, and \(0.8 < \kappa \le 1\): almost perfect agreement. As recommended by Gwet (2014), we set the benchmark membership probability cut-off point at \(95\%\).Footnote 7

Agreement on the event main type is “almost perfect” for all metrics with \(\kappa = 0.88\), \(\alpha =0.87\), and \(AC1=0.89\), indicating that annotators are highly consistent in assigning the 18 event type categories. For event subtype labeling, “substantial” agreement is obtained on all metrics with \(\kappa = 0.81\), \(\alpha =0.81\), and \(AC1=0.82\). The most common mistake we see in the data is that annotators do not assign a subtype. These high agreement scores for both the type and subtype labels show that the event typology is sufficiently discriminative. The negation label obtains a “substantial” agreement score using Fleiss’ \(\kappa \) (\(\kappa = 0.85\)) and Krippendorff’s \(\alpha \) (\(\alpha =0.78\)) and an “almost perfect” agreement according to AC1 (\(AC1=0.99\)). Annotators thus have little difficulty determining whether events lie within a negated context. Event modality falls in the “moderate” agreement range for \(\kappa \) (\(\kappa = 0.48\)) and \(\alpha \) (\(\alpha = 0.46\)), but a “substantial” agreement is obtained for AC1 (\(AC1=0.84\)). This indicates that determining whether an event is presented as uncertain remains difficult for annotators. The different AC1 scores for the negation and modality annotations can be explained by the prevalence bias of the \(\kappa \) and \(\alpha \) metrics, which AC1 attempts to mitigate, since both categories are highly skewed. While AC1 indicates “substantial” agreement on modality, we err on the side of caution and recommend refining the modality labels in a secondary revision pass before using this dataset for factuality processing. Overall, we can state with high certainty that substantial agreement is obtained on events, with the exception of modality. This indicates that the guidelines and typology produce consistent event annotations of high quality.

Table 3 Span-aligned agreement scores on events for type, full type (type and subtype), negation, and modality

5.2 Event nugget performance analysis

While token-level metrics can provide information on the difficulty of identifying token span extents in text, they do not relay useful information on the combined performance of event span annotation and its attributes (i.e., event type, event subtype, polarity, modality).

The second approach is not strictly a chance-corrected agreement measure of the degree to which annotators agree on annotation units; rather, it seeks to capture the quality of the annotations. The event nugget scoring method integrates token span overlap and attribute accuracy. This method was included because it is also used in the ACE/ERE project to assess annotation quality. The Liu et al. (2015) event nugget scorer, developed on DEFT Rich ERE data for the TAC-KBP Event Nugget Extraction task (Mitamura et al., 2015b), produces a span similarity score between 0 and 1 for a pair of event mentions. This span score is the Dice coefficient over token ids (equivalent to the \(F_1\)-score for token overlap). Each system mention is mapped to the reference by selecting the maximum Dice overlap score. This captures the performance of annotators on identifying the correct token spans. Subsequently, attribute label accuracy is computed for the matched events. The attributes used by Mitamura et al. (2015b) in their assessment of event nugget annotation in DEFT ERE were limited to “Type” and “Realis”. We obtain a corresponding value for Rich ERE “Realis” by combining the values for “Negation” and “Modality” from our annotations.
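The core of this procedure can be sketched as follows. It is a simplified illustration of the idea (greedy mapping by maximum Dice overlap plus attribute accuracy), not the official TAC-KBP scorer, and the mention field names are our own.

```python
def dice(a, b):
    """Dice coefficient over two collections of token ids."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def nugget_score(system, reference, attributes=("type", "realis")):
    """Map each system mention to the reference mention with the highest Dice
    token overlap, then compute attribute accuracy over the matched pairs.
    Each mention is a dict with 'tokens' (iterable of token ids) plus the
    attribute keys; field names are illustrative."""
    span_scores, matches = [], []
    for sys_m in system:
        overlaps = [(dice(sys_m["tokens"], ref_m["tokens"]), ref_m) for ref_m in reference]
        best_score, best_ref = max(overlaps, key=lambda x: x[0], default=(0.0, None))
        span_scores.append(best_score)
        if best_ref is not None and best_score > 0.0:
            matches.append((sys_m, best_ref))
    span_quality = sum(span_scores) / len(span_scores) if span_scores else 0.0
    attr_accuracy = {
        attr: (sum(s[attr] == r[attr] for s, r in matches) / len(matches) if matches else 0.0)
        for attr in attributes
    }
    return span_quality, attr_accuracy
```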

The authors compared the nugget scores of the annotators against each other and scored each annotator separately against a reference set. In line with Mitamura et al. (2015b), we created an adjudicated set (ADJ) by correcting and combining all annotations by an independent judge. In their analysis, they also compared their Light ERE annotation scores with earlier ACE2005 annotations. Both of those datasets involved two annotators, while our SENTiVENT dataset involves three. For comparison, we list the reported scores of the original work (Mitamura et al., 2015b) next to the nugget score performance on our data in Table 4.

Table 4 Nugget scorer performance of annotators vs. adjudicated reference (scores are averaged across annotators)

Our annotators seem to perform worse on the identification of span boundaries compared to the work on ACE and ERE, whereas the accuracy scores of the attributes are on par or better. Having three annotators instead of two introduced more potential disagreement on span boundaries.

For future annotations, we suggest mitigating span identification issues by reducing the complexity of the event annotation task: trigger span identification and attribute assignment would be split into two completely separate phases, an event trigger phase in which all potentially relevant events in business news are tagged, followed by an attribute phase in which attributes and arguments are tagged. In this manner, the cognitive load on annotators, who need to memorize rules for span identification and attribute assignment, is lessened. The disadvantage is that annotation time and cost would increase substantially.

Overall, similarly to the chance-corrected IAA study of aligned spans, we observe that the event attribute scores are highly satisfactory, although the identification of event spans shows room for improvement. We believe this consistency study shows that the dataset quality is adequate for data-driven machine learning techniques.

6 Corpus analysis and statistics

In total, our annotators labeled 258 training and 30 evaluation documents, which makes our corpus comparable in size to the largest of the monolingual TAC-KBP corpora. The standardized evaluation (holdout test) set of 30 documents stems from the adjudicated reference set in the inter-annotator agreement study. In the 2016 TAC-KBP event nugget track, which implements Rich ERE event guidelines, the annotated training corpus sizes were 200 documents for English, 200,000 words for Chinese and 120,000 for Spanish (Mitamura et al., 2016b) with an evaluation set size of 60 documents for each language.

We examined the distribution of annotation units as well as some statistical properties which might be relevant for classification (see Table 5). Considering the event type and subtype distribution, as visualised in Fig. 5, we can observe that the “Macroeconomics” class was the most frequently used event annotation. We introduced this class as a low-specificity category to capture all non-company-specific events, such as market trends, market share, competition, regulatory issues, and governmental policy.

Fig. 5 Frequency of event categories in the SENTiVENT English corpus

Table 5 Basic corpus statistics for the suggested split of the development training set and the adjudicated test set

6.1 Lexical richness of events

A known difficulty in event processing and other semantically oriented NLP tasks (a.k.a. “deep semantics”) is the large lexical richness of categories (Van de Kauter et al., 2015a). Because events are abstract semantic categories, they can be expressed in a large variety of lexicalizations. To quantify the lexical variation within our event type categories, we computed several lexical richness metrics over the event triggers of each event type.

Lexical richness measures give an idea of which event types could cause difficulty when building classifiers that rely on lexical features, such as the transformer models we used in the pilot study. Lexical diversity is a measure of how many different words are used in a stretch of text. In its simplest form, lexical diversity is measured through the type-token ratio (TTR), in which the number of unique words is divided by the total number of word occurrences. TTR, however, has an inverse relationship with sample size, introducing a bias when comparing texts of different lengths. As shown by McCarthy and Jarvis (2007, 2010), Maas (Maas, 1972), the Measure of Textual Lexical Diversity (MTLD) (McCarthy, 2005), and Hypergeometric distribution diversity (HDD) (McCarthy & Jarvis, 2007) correct for this bias and are more robust measures of lexical richness than TTR-based measures such as CTTR (Guiraud, 1959), RTTR (Carroll, 1964), MSSTR (Johnson, 1944), MATTR (Covington & McFall, 2010), Herdan (Herdan, 1964), Somers (Somers, 1966), and Dugast (Dugast, 1978).

We computed these robust non-TTR-based measures using the LexicalRichness Python module (Shun, 2019), as well as lexical entropy in the form of Shannon entropy. We pre-processed the text by stemming and removing stopwords using the Porter2/Snowball stemmer in NLTK. Because text lengths differ across event types, we only consider HDD, Maas, and MTLD in our discussion, as they are substantially less sensitive to differences in text length, as recommended by McCarthy and Jarvis (2010).
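A minimal sketch of this pre-processing and of the entropy computation is given below. The commented-out lines indicate how the length-robust measures would be obtained from the LexicalRichness package; its exact attribute names may differ across versions, so treat them as assumptions.

```python
import math
from collections import Counter

from nltk.corpus import stopwords           # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize     # requires nltk.download("punkt")

def preprocess(triggers):
    """Stem and remove stopwords from a list of trigger strings (one per event)."""
    stemmer = SnowballStemmer("english")
    stops = set(stopwords.words("english"))
    tokens = []
    for trigger in triggers:
        for tok in word_tokenize(trigger.lower()):
            if tok.isalpha() and tok not in stops:
                tokens.append(stemmer.stem(tok))
    return tokens

def shannon_entropy(tokens):
    """Lexical (Shannon) entropy of the stemmed trigger tokens, in bits."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Length-robust measures via the LexicalRichness package (assumed API):
# from lexicalrichness import LexicalRichness
# lex = LexicalRichness(" ".join(preprocess(triggers)))
# print(lex.mtld(), lex.hdd(draws=42), lex.Maas)
```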

Figure 6 shows the mean-normalized measures of lexical richness per event type for event triggers. Note that the Maas measure inversely measures lexical richness. The figure shows that “CSR/Brand” and “Macroeconomics” exhibit the most lexically rich triggers. That “FinancialReport” and “Macroeconomics” show large amounts of lexical variety is due to them being conceived as event categories that encompass a large number of distinct real-world possibilities. Event triggers of “Merger/Acquisition”, “Investment”, “Dividend”, “Revenue”, “Expense”, and the financial metric type “SalesVolume” are unsurprisingly not very lexically diverse. These events are typically narrowly defined, and inspection of the corpus shows that they are not often the target of creative language use such as metonymy or metaphor. We expect these types to perform well in trigger classification.

Fig. 6 Lexical richness measures (mean-normalized) of triggers for all event types. Ordering is from least to most lexically diverse

7 Event detection pilot study

To validate this dataset, we performed a set of event detection experiments aimed at determining the type of economic event in text. Event detection is a simpler task than event extraction, for which this dataset was designed: instead of extracting events, their attributes (type, subtype, factuality), and arguments, event detection consists of assigning an event type to a given piece of text. For this pilot study, we cast event type detection as a multilabel sentence-level classification problem and tested the fine-tuning of several neural transformer-based language models, which are currently state-of-the-art in text classification.

7.1 Method

The field of NLP has transitioned from developing task-specific models to fine-tuning approaches based on large general-purpose language models (Howard & Ruder, 2018; Peters et al., 2018). Pre-trained models of this type are based on the transformer architecture (Vaswani et al., 2017). Currently, the most widely-used model of this type is BERT (Devlin et al., 2019) and its many variants. Fine-tuning BERT-like models has resulted in state-of-the-art performance in many NLP tasks.

We experimented with fine-tuning pre-trained transformer models for English using the “huggingface/transformers” PyTorch library, which provides pre-trained models in its repository (Wolf et al., 2019). We used the provided default hyper-parameters, tokenizers, and configuration for all models. To add sentence classification capability to the language models, a sequence classification head was added on top of the original transformer architecture. The batch size was set to 8 instances and the sequence length to 128 for all models. The only hyperparameter set in cross-validation grid search was the number of training epochs (\(e=\{4,8,16\}\)). The best hyper-parametrization was chosen for each architecture by macro-averaged F\(_1\)-score over 10 folds in cross-validation. The macro-averaged F\(_1\)-score was chosen because we consider each class equally important despite severe class imbalance. We also perform a held-out test in which the model is trained on the development set and tested on the adjudicated test set.
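A condensed sketch of this fine-tuning setup is shown below. It uses the current transformers API (the experiments were run with an earlier library version), and the multilabel head configuration, learning rate, and example sentence are our own assumptions rather than the exact experimental code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-base"   # also "bert-base-cased", "xlnet-base-cased", ...
NUM_TYPES = 18                # one label per main event type (multilabel setting)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_TYPES,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss per label
)

def encode(sentences, labels):
    """Tokenize sentences to max length 128 and attach multi-hot label vectors."""
    enc = tokenizer(sentences, truncation=True, padding="max_length",
                    max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor(labels, dtype=torch.float)  # shape (n, NUM_TYPES)
    return enc

# One fine-tuning step on a toy batch (in practice: batch size 8, 4/8/16 epochs,
# best setting chosen by macro-F1 in 10-fold cross-validation).
batch = encode(["Apple beat revenue expectations this quarter."],
               [[0.0] * NUM_TYPES])
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr is a typical value
model.train()
loss = model(**batch).loss
loss.backward()
optimizer.step()
```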

We provide baselines in the form of “dummy” classifiers: (a) the “majority-class baseline” always predicts the most frequent label, (b) the “stratified baseline” generates random predictions by respecting the training set class distribution, and (c) the “uniform baseline” generates predictions uniformly at random.
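These baselines correspond to the strategies offered by scikit-learn's DummyClassifier; the small sketch below runs them on synthetic multi-hot labels (the feature matrix is a placeholder because the dummy strategies ignore the input features, and the synthetic data is for illustration only).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_train, X_test = np.zeros((200, 1)), np.zeros((50, 1))      # placeholder features
y_train = (rng.random((200, 18)) < 0.1).astype(int)          # multi-hot event-type labels
y_test = (rng.random((50, 18)) < 0.1).astype(int)

for strategy in ("most_frequent", "stratified", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(strategy, f1_score(y_test, pred, average="macro", zero_division=0))
```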

7.2 Pre-trained models

We performed text classification using pre-trained BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLM (Conneau and Lample 2019), and XLNet (Yang et al., 2019) models.

BERT is an attention-based auto-encoding model trained with two unsupervised task objectives. The first task is word masking or Masked Language Modelling (MLM), where the model has to predict which word has been masked at a given position in the text. The second task is Next Sentence Prediction (NSP), performed by predicting whether two sentences are subsequent in the corpus or randomly sampled from it. The specific English BERT model used is “bert-base-cased”, in which capitalization is preserved, as provided with the original paper.Footnote 8

The XLM model we used is based on the work of Conneau and Lample (2019). We only used the monolingual English MLM model. The main difference with BERT is that XLM uses Byte-Pair Encoding (BPE) tokens as input instead of WordPiece tokens and drops the NSP objective. The difference between BPE and WordPiece lies in the way token symbol pairs are chosen for addition to the vocabulary: instead of relying on the frequency of the pairs, WordPiece chooses the pair which maximises the likelihood of the training data. These changes have been shown to improve performance on text-classification tasks. The specific model we used is “xlm-mlm-en-2048”.

The RoBERTa model (Liu et al., 2019) improved over BERT by dropping next sentence prediction and using only the MLM task on multiple sentences instead of single sentences. The authors argue that while NSP was intended to learn inter-sentence coherence, it actually learned topic similarity because of the random sampling of sentences in negative instances. The specific RoBERTa models used are the “roberta-base” and “roberta-large” cased models provided alongside the original paper.Footnote 9

XLNet is a permutation language model that combines strengths of auto-regressive and auto-encoding modeling approaches: permutation language models are trained to predict tokens given preceding context, like a traditional unidirectional language model, but instead of predicting the tokens in sequential order, they predict tokens in a random order, sampling from both the left and right context. XLNet incorporates two key ideas from the TransformerXL architecture (Dai et al., 2019): relative positional embeddings and the recurrence mechanism. In combination with the permutation objective, these techniques effectively capture bidirectional context while avoiding the independence assumption and the pretrain-finetune discrepancy caused by the use of masked tokens in BERT. The specific XLNet models used are the “xlnet-base” and “xlnet-large” cased models released alongside the original work.Footnote 10

7.3 Results

Table 6 shows the macro-averaged precision, recall, and F\(_1\)-scores for the highest-scoring hyper-parametrization of each model. Results show that RoBERTa (the large 335M-parameter variant, trained for 16 epochs) obtained the best results with a \(60.1\%\,(std \, 3\%)\) macro-averaged F\(_1\)-score in cross-validation and \(59.1\%\) on holdout. Not shown in Table 6, RoBERTa-Large obtained a micro-averaged F\(_1\)-score of \(62.1\%\,(std \, 2.7)\) in cross-validation and \(63.7\%\) on holdout, and a ROC-AUC of \(0.81\,(std \, 0.02)\) in cross-validation and 0.90 on holdout. RoBERTa’s smaller base variant and XLNet (base variant) perform similarly, while the other models trail behind. The other models are also less stable in cross-validation, with a relatively large standard deviation in scores across folds. The stable results between the cross-validated development set and the test set indicate the adequacy of the pre-determined train-test split for future benchmarking purposes.
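The reported macro- and micro-averaged F\(_1\)-scores and ROC-AUC can be obtained from the multilabel model outputs with standard scikit-learn calls; the sketch below uses synthetic predictions purely for illustration (variable names and the 0.5 decision threshold are our assumptions).

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random((100, 18)) < 0.15).astype(int)   # multi-hot gold labels
y_prob = rng.random((100, 18))                        # sigmoid scores from the model
y_pred = (y_prob >= 0.5).astype(int)                  # thresholded label decisions

print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("micro-F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("ROC-AUC :", roc_auc_score(y_true, y_prob, average="macro"))
```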

Table 6 10-Fold cross-validation and held-out precision (P), recall (R), \(F_1\)-score (macro-averaged) for sentence-level event detection on 18 classes

7.4 Discussion and error analysis

The results of the best performing model were further analysed to gain more insight into the event type classification; the per-type results are displayed in Fig. 7. At the left-hand side of the figure, the event type “CSR/Brand” (n = 365, 5.9% of all events) yields the lowest score with a 33% F1-score and markedly low recall. “Financing” (n = 141, 2.3%), “FinancialReport” (n = 417, 6.7%), “Deal” (n = 230, 3.7%), “Investment” (n = 139, 2.2%), and “Macroeconomics” (n = 827, 13.3%) have fairly low F1-scores of around 50%. This is likely due to a combination of class infrequency and lexical variation, as reflected in the corresponding lexical diversity of the event triggers.

Fig. 7 Macro-averaged precision, recall, and F1-scores in cross-validation per type of the best system. (Only the F1-score is labeled.)

We also performed a qualitative error analysis of misclassifications in the output. We manually annotated a randomly selected sample of half of the instances (169/335) for which a classification error was made. This type of analysis allows us to identify classification difficulties in context. The types of misclassifications, examples, and their frequencies are given in Table 7.

Table 7 Qualitative error analysis of misclassification types

Misclassifications were most often caused by the absence of strong lexical cues (“weak cues”), i.e. lexical terms that strongly reference an event type, e.g. “Hold” for “Rating” or “acquire” for “Acquisition”. This is due to anaphoric or less specific references that are not captured within the sentence-level context and require document-level processing to resolve co-reference, which is lacking in our pilot classifier. Another type of error was caused by ambiguous event triggers, e.g. “Buy” as a “Buy rating” or as “Merger/Acquisition”, or “spending” as “SalesVolume” (customer spending) vs. “Expense”. We believe that both weak cues and ambiguous triggers could be mitigated by machine learning methods that take into account the larger document-level context; entity and event co-reference processing would likely also need to be incorporated. Some errors also occurred in highly lexically specialized contexts with many industry-specific terms or idiomatic expressions such as proverbs and sayings. The uniqueness of these expressions makes it unlikely that these words or similar contexts were seen in the training set. This is a difficult problem to resolve with lexically based machine learning methods.

Finally, in some instances a label outside the true label set was predicted that could plausibly be assigned. This usually happened when the plausible event type was not annotated because it was not deemed relevant, although strong lexical cues for that event were present. For these rare cases, our evaluation set should be revised to include the plausible types.

8 Data availability and replication

The annotation scheme guidelines, pilot study source code, sentence-level replication data, and trained models are publicly available at Jacobs (2020a). The original dataset will be freely downloadable at the end of the SENTiVENT project through this repository. Up until then, the fully annotated corpus is available on request for academic research purposes.

9 Conclusion and future work

Event extraction is a productive field of research and a required step in many data-driven tasks in which factual information is needed to capture changes in the real-world. Various general domain fine-grained event extraction corpora are freely available but no economically focused corpus exists. Currently, this resource scarcity limits the development of information extraction for economic and financial applications. The aim of the research presented in this paper was to construct a detailed, manually annotated, high-quality gold standard dataset for event extraction in the economic domain.

We present an annotation scheme of company-specific events in English economic news articles. A representative corpus was crawled and annotated with an iteratively developed economic event typology of 18 types and 42 subtypes. This resulted in around 6200 annotated events in 288 documents with event triggers, participant arguments, event co-reference, and event attributes such as type, subtype, negation, and modality. Agreement was generally substantial and annotator performance was adequate, indicating that the annotation scheme produces consistent event annotations of high quality. In an event classification pilot study, satisfactory results were obtained with a macro-averaged \(F_1\)-score of \(59\%\), validating the dataset for machine learning purposes. This dataset provides a rich resource on events as training data for supervised machine learning in economic and financial applications.

In future work, we will apply recent methods for fine-grained event extraction based on neural attention-based methods. Extracting full event schemata remains a highly challenging task even on well-established datasets such as ACE2005. As in the current state-of-the-art event extraction methods, we will experiment with semi-supervised data generation methods.Footnote 11 This dataset provides a strong supervised seed set which should aid such approaches. We also plan on moving from purely data-driven to hybrid machine learning pipelines that integrate existing knowledge-based semantic resources such as DBpedia (Auer et al., 2007) and economic ontologies (Fani & Bagheri, 2015; Kakkonen & Mufti, 2011; Lösch & Nikitina, 2009).

In currently ongoing work, we are adding “investor sentiment” annotations to this corpus. We will add a “positive”, “neutral”, “negative” polarity attribute on top of events, as well as separate sentiment expression annotations with their targets. These sentiment annotations allow the joint processing of “common-sense” sentiment of events, and we will investigate how extracted event schemata can be used upstream of aspect-based sentiment analysis.