4.1 Introduction

Target concept analysis finds and resolves mentions of persons that can be subject to the bias forms analyzed by person-oriented framing analysis (PFA), in particular, word choice and labeling, source selection, and commission and omission of information (Sect. 3.3.2). The task is of particular importance in PFA and particularly difficult in slanted coverage, for example, because of the divergent word choice across differently slanted news articles. While one article might refer to “undocumented immigrants,” others may refer to “illegal aliens.” Also, within a single article, different terms may be used to refer to the same persons, such as referring to “Kim Jong-un” and later quoting a politician using the term “little rocket man” [74].

In PFA, target concept analysis is the first analysis component (Fig. 3.1). The input to the target concept analysis is a set of news articles reporting on the same event. The output should be the set of all persons mentioned in the event coverage and each person’s mentions resolved across all news articles.

We investigate two conceptually different approaches to tackle the task of target concept analysis. Section 4.2 describes our first approach, main event extraction. The approach extracts phrases describing a given article’s main event, e.g., who is the main actor and what action is performed by the main actor.

Section 4.3 describes our method for context-driven cross-document coreference resolution, which is a technique to find mentions of semantic concepts, such as persons, and resolve them across one or more text documents, i.e., news articles in the context of our thesis. Compared to event extraction, coreference resolution allows for directly finding and resolving all mentions of persons (and other concept types) across the given news articles.

Lastly, Sect. 4.4 contrasts both approaches and reasons why coreference resolution is used as the primary approach employed in target concept analysis.

4.2 Event Extraction

Methods in the field of event extraction aim to determine one or more events in a given document and extract specific properties of these events [388]. In the context of target concept analysis, event extraction can be useful as the first of two steps. In the first step, we would use event extraction to extract phrases describing important properties of each article’s events, such as the actor, which action the actor performed, where, when, and to whom. In the second step, we would then analyze across the articles which events are identical. Using the events matched in this way, we could deduce that their individual properties refer to the same concepts. At the end of this process, mentions, even those apparently dissimilar, could be resolved across the set of articles. Table 4.1 shows a simple example of the previously outlined idea, where, for the sake of simplicity, only two event properties, i.e., the actor (“who”) and the action (“what”) performed by the actor, are extracted from articles’ sentences and headlines (first column). Despite the textual and semantic difference between “illegal aliens” and “undocumented immigrants,” our approach could resolve them to the same semantic concept, here a group of persons, due to the similarity of the action, which in both cases is “cross [the] border.”

Table 4.1 Simplistic example showing two of the six 5W1H properties where the “what” properties are semantically identical

To tackle the first step of the previously outlined idea, we propose Giveme5W1H, a method to extract phrases answering the journalistic 5W1H questions. Figure 4.1 depicts an example of 5W1H phrases, which describe an article’s main event, i.e., who does what, when, where, why, and how. We also introduce an annotated dataset for the evaluation of the approach.

Fig. 4.1

News article consisting of title (bold), lead paragraph (italic), and first of the remaining paragraphs. Highlighted phrases represent the 5W1H event properties (who did what, when, where, why, and how). Source [2]

Specifically, our objective is to devise Giveme5W1H as an automated method for extracting the main event being reported on by a given news article. For this purpose, we exclude non-event-reporting articles, such as commentaries or press reviews. First, we define the extracted main event descriptors to be concise (requirement R1). This means they must be as short as possible and contain only the information describing the event while also being as long as necessary to contain all information of the event. Second, the descriptors must be of high accuracy (R2). For this reason, we give higher priority to extraction accuracy than execution speed.

The remainder of this section is structured as follows. Section 4.2.1 discusses prior work in event extraction. Section 4.2.2 details our event extractor. Section 4.2.3 presents the results of our evaluation of the system. Section 4.2.4 discusses the system’s performance both concerning general event extraction and concerning PFA. Section 4.2.5 concludes this line of research and reasons why we focus on a second line of research for target concept analysis, which we then describe in Sect. 4.3.

Giveme5W1H and the datasets for training and evaluation are available at


4.2.1 Related Work

This section gives a brief overview of 5W and 5W1H extraction methods in the news domain. Most systems focus only on the extraction of 5W phrases without “how” phrases (cf. [67, 345, 392, 393]). The authors of prior work do not justify this, but we suspect two reasons. First, the “how” question is particularly difficult to extract due to its ambiguity, as we will explain later in this section. Second, “how” (and “why”) phrases are considered less important in many use cases when compared to the other phrases, particularly those answering the “who,” “what,” “when,” and “where” (4W) questions (cf. [156, 349, 397]). For the sake of readability in this section, we include approaches to extract both 5W and 5W1H when referring to 5W1H extraction. Aside from the “how” extraction, the analysis of approaches for 5W1H or 5W extraction is conceptually identical.

The task is closely related to closed-domain question answering, which is why some authors call their approaches 5W question answering (QA) systems.

Systems for 5W QA on news texts typically perform three tasks to determine the article’s main event [374, 393]: (1) preprocessing; (2) phrase extraction [88, 176, 321, 392, 397], where, for instance, linguistic rules are used to extract phrase candidates; and (3) candidate scoring, which selects the best answer for each question by employing heuristics, such as the position of a phrase within the document. The input data to QA systems is usually text, such as a full article including headline, lead paragraph, and main text [321], or a single sentence, e.g., in news ticker format [392]. Other systems use automatic speech recognition (ASR) to convert broadcasts into text [393]. The outcome of the process is six textual phrases, one for each of the 5W1H questions, which together describe the main event of a given news text, as highlighted in Fig. 4.1.

The preprocessing task (1) splits the text into sentences, tokenizes them, and often applies further NLP methods, including part-of-speech (POS) tagging, coreference resolution [321], NER [88], parsing [225], or semantic role labeling (SRL) [47].

For the phrase extraction task (2), various strategies are available. Most systems use manually created linguistic rules to extract phrase candidates from the preprocessed text [176, 321, 393]. Noun phrases (NP) yield candidates for “who,” while sibling verb phrases (VP) are candidates for “what” [321]. Other systems use NER to retrieve only phrases that contain named entities, e.g., a person or an organization [88]. Yet other approaches use SRL to identify the agent (“who”) performing the action (“what”) as well as location and temporal information (“where” and “when”) [392]. Determining the reason (“why”) can be difficult even for humans because often the reason is only described implicitly, if at all [108]. The applied methods range from simple approaches, e.g., looking for explicit markers of causal relations [176], such as “because,” to complex approaches, e.g., training machine learning (ML) methods on annotated corpora [11]. The clear majority of research has focused on explicit causal relations, while only a few approaches address implicit causal relations; these approaches also achieve lower precision than methods for explicit causes [27].

The candidate scoring task (3) estimates the best answer for each 5W question. The reviewed 5W QA systems provide only a few details on their scoring. Typical heuristics include the shortness of a candidate, as longer candidates may contain too many irrelevant details [321], whether a “who” candidate contains an NE, and active speech [393]. More complex methods are discussed in various linguistic publications and involve supervised ML [165, 392]. Yaman, Hakkani-Tur, and Tur [392] use three independent subsystems to extract 5W answers. A trained SVM then decides which subsystem is “correct” using features such as the agreement among subsystems or the number of non-null answers per subsystem.

The extraction of phrases answering the “why” and “how” questions poses a particular challenge in comparison to the other questions. Determining the reason or cause (i.e., the “why”) can be difficult even for humans. Often the reason is unknown, or it is only described implicitly, if at all [108]. Extracting the “how” answer is also difficult because this question can be answered in many ways. To find “how” candidates, the system by Sharma et al. [321] extracts the adverb or adverbial phrase within the “what” phrase. The tokens extracted with this simplistic approach detail the verb, e.g., “He drove quickly,” but do not answer how the action was performed, e.g., by ramming an explosive-laden car into the consulate (in the example in Fig. 4.1), which is a prepositional phrase. Other approaches employ ML [175] but have not been devised for the English language. In summary, few approaches exist that extract “how” phrases. The reviewed approaches provide no details on their extraction method and achieve poor results, e.g., they extract adverbs rather than the tool or method by which an action was performed (cf. [157, 175, 321]).

While the evaluations of the reviewed papers generally indicate sufficient quality to be usable for news event extraction, e.g., the system by Yaman, Hakkani-Tur, and Tur [392] achieved macro F1 = 0.85 on the Darpa corpus from 2009, they lack comparability for two reasons: (1) There is no gold standard for journalistic 5W1H question answering on news articles. A few datasets exist for automated question answering, specifically for the purpose of disaster tracking [202, 350]. However, these datasets are so specialized to their own use cases that they cannot be applied to the use case of automated journalistic question answering. Another challenge to the evaluation of news event extraction is that the evaluation datasets of previous papers are no longer publicly available [279, 392, 393]. (2) Previous papers use different quality measures, such as precision and recall [67] or error rates [393].

Another weakness of the reviewed prior work is that none of them yield canonical or normalized data. Canonical output is more concise and also less ambiguous than its original textual form (cf. [378]), e.g., polysemes, such as crane (animal or machine), have multiple meanings. Hence, canonical data is often more useful in downstream analyses (see Sect. 4.2). Phrases containing temporal information or location information may be canonicalized, e.g., by converting the phrases to dates or timespans [48, 343] or to precise geographic positions [207]. Phrases answering the other questions could be canonicalized by employing NERD on the contained NEs and then linking the NEs to concepts defined in a knowledge graph, such as YAGO [150] or WordNet [239].

In sum, methods for extracting events from articles suffer from three main shortcomings. First, most approaches only detect events implicitly, e.g., by employing topic modeling [90, 355]. Second, they are specialized for the extraction of task-specific properties, e.g., extracting only the number of injured people in an attack [267, 355]. Lastly, some methods extract explicit descriptors, but are not publicly available, or are described in insufficient detail to allow researchers to reimplement the approaches [279, 374, 392, 393].

4.2.2 Method

Giveme5W1H is a method for main event retrieval from news articles that addresses the objectives we defined in Sect. 4.2. The system extracts 5W1H phrases that describe the most defining characteristics of a news event, i.e., who did what, when, where, why, and how. This section describes the analysis workflow of Giveme5W1H, as shown in Fig. 4.2. Due to the lack of a large-scale dataset for 5W1H extraction (see Sect. 4.2.1), we devise the system using traditional machine learning techniques and domain knowledge.

Fig. 4.2

The three-phase analysis pipeline preprocesses a news text, finds candidate phrases for each of the 5W1H questions, and scores these. Giveme5W1H can easily be accessed via Python and via a RESTful API

Besides its intended use in PFA, Giveme5W1H can be accessed by other software as a Python library and via a RESTful API. Due to its modularity, researchers can efficiently adapt or replace components. For example, researchers can integrate a custom parser or adapt the scoring functions to the characteristics of their data. The system builds on our earlier system, Giveme5W [133], but improves the extraction performance by addressing the previously planned future work directions: Giveme5W1H uses coreference resolution, question-specific semantic distance measures, and combined scoring of candidates, and it extracts phrases for the “how” question. The values of the parameters introduced in this section result from a semi-automated search for the optimal configuration of Giveme5W1H using an annotated learning dataset, including a manual, qualitative revision (see the Parameter Learning subsection below).

Preprocessing

Giveme5W1H accepts as input the full text of a news article, including headline, lead paragraph, and body text. The user can specify these three components as one or separately. Optionally, the article’s publishing date can be provided, which helps Giveme5W1H parse relative dates, such as “yesterday at 1 pm.”

During preprocessing, we use Stanford CoreNLP for sentence splitting, tokenization, lemmatization, POS tagging, full parsing, NER (with Stanford NER’s seven-class model), and pronominal and nominal coreference resolution. Since our main goal is high 5W1H extraction accuracy (rather than fast execution speed), we use the best-performing model for each of the CoreNLP annotators, i.e., the “neural” model if available. We use the default settings for English in all libraries.

After the initial preprocessing, we bring all NEs in the text into their canonical form. Following from requirement R1, canonical information is the preferred output of Giveme5W1H, since it is the most concise form. Because Giveme5W1H uses the canonical information to extract and score “when” and “where” candidates, we implement the canonicalization task during preprocessing.

We parse dates written in natural language into canonical dates using SUTime [362]. SUTime looks for NEs of the type date or time and merges adjacent tokens to phrases. SUTime also handles heterogeneous phrases, such as “yesterday at 1 pm,” which consist not only of temporal NEs but also other tokens, such as function words. Subsequently, SUTime converts each temporal phrase into a standardized TIMEX3 instance [292]. TIMEX3 defines various types, also including repetitive periods. Since events according to our definition occur at a single point in time, we only retrieve datetimes indicating an exact time, e.g., “yesterday at 6pm,” or a duration, e.g., “yesterday,” which spans the whole day.
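The TIMEX3 filtering described above can be sketched as a simple type filter. The dict layout below is illustrative only, not SUTime’s actual output format, and the function name is hypothetical:

```python
def filter_when_candidates(annotations):
    """Keep only TIMEX3 annotations denoting an exact datetime or a duration.

    annotations: list of dicts like {"type": "DATE", "value": "2016-11-10"}.
    Repetitive periods (TIMEX3 type SET) are dropped, since events in our
    definition occur at a single point in time.
    """
    return [a for a in annotations if a.get("type") in {"DATE", "TIME", "DURATION"}]
```

A duration such as “yesterday” (a DATE spanning the whole day) passes the filter, while a recurring expression such as “every Monday” (type SET) is discarded.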

Geocoding is the process of parsing places and addresses written in natural language into canonical geocodes, i.e., one or more coordinates referring to a point or area on earth. We look for tokens classified as NEs of the type location (cf. [392]). We merge adjacent tokens of the same NE type within the same sentence constituent, e.g., within the same NP or VP. Similar to temporal phrases, locality phrases are often heterogeneous, i.e., they contain not only location NEs but also function words. Hence, we introduce a locality phrase merge range \(r_{\text{where}} = 1\), which allows up to \(r_{\text{where}}\) arbitrary tokens between two location NEs when merging phrases. Lastly, we geocode the merged phrases with Nominatim, which uses free data from OpenStreetMap.
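The merge-range rule can be sketched as follows. This is a minimal illustration of the idea, assuming tokens arrive as (text, NE-type) pairs; the function name and data layout are hypothetical, not Giveme5W1H’s actual code:

```python
def merge_location_phrases(tokens, r_where=1):
    """Merge runs of LOCATION tokens, allowing up to r_where arbitrary
    tokens between two location NEs. tokens: list of (text, ne_type) pairs
    from one sentence constituent. Returns merged phrase strings."""
    phrases, current, gap, last_loc = [], [], 0, -1

    def close():
        nonlocal current, gap, last_loc
        if last_loc >= 0:
            # Trim trailing non-location gap tokens before emitting the phrase.
            phrases.append(" ".join(current[:last_loc + 1]))
        current, gap, last_loc = [], 0, -1

    for text, ne in tokens:
        if ne == "LOCATION":
            current.append(text)
            last_loc = len(current) - 1
            gap = 0
        elif current and gap < r_where:
            current.append(text)  # tolerated gap token between two location NEs
            gap += 1
        else:
            close()
    close()
    return phrases
```

With \(r_{\text{where}} = 1\), a phrase such as “Mazar-i-Sharif , Afghanistan” is merged into one candidate despite the intervening comma, while a location followed only by function words yields just the location itself.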

We canonicalize NEs of the remaining types, e.g., persons and organizations, by linking NEs to concepts in the YAGO graph [221] using AIDA [150]. The YAGO graph is a knowledge base, where nodes in the graph represent semantic concepts that are connected to other nodes through attributes and relations. The data is derived from other well-established knowledge bases, such as Wikipedia, WordNet, Wikidata, and GeoNames [345].

Phrase Extraction

Giveme5W1H performs four independent extraction chains to retrieve the article’s main event: (1) the action chain extracts phrases for the “who” and “what” questions, (2) environment for “when” and “where,” (3) cause for “why,” and (4) method for “how.”

The action extractor identifies who did what in the article’s main event. The main idea for retrieving “who” candidates is to collect the subject of each sentence in the news article. Therefore, we extract the first NP that is a direct child of the sentence in the parse tree and that has a VP as its next right sibling. We discard all NPs that contain a child VP, since such NPs yield lengthy “who” phrases. Take, for instance, this sentence: “((NP) Mr. Trump, ((VP) who stormed to a shock election victory on Wednesday)), ((VP) said it was […]),” where “who stormed […]” is the child VP of the NP. We then put the NPs into the list of “who” candidates. For each “who” candidate, we take the VP that is its next right sibling as the corresponding “what” candidate. To avoid long “what” phrases, we cut VPs after their first child NP, which long VPs usually contain. However, we do not cut the “what” candidate if the VP contains at most \(l_{\text{what,min}} = 3\) tokens and the right sibling of the VP’s child NP is a prepositional phrase (PP). This way, we avoid short, undescriptive “what” phrases. For instance, in the simplified example, “((NP) The microchip) ((VP) is ((NP) part) ((PP) of a wider range of the company’s products)),” the truncated VP “is part” would contain no descriptive information; hence, the presented rules prevent this truncation.
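The NP–VP sibling rule can be sketched on a toy parse-tree representation. The node layout (nested (label, children) pairs) and helper names are hypothetical illustrations of the rule, not Giveme5W1H’s actual implementation, and the truncation rules for long VPs are omitted for brevity:

```python
# A toy constituency node is a (label, children) pair; children are
# nodes or token strings. Labels follow Penn-Treebank conventions.

def leaves(node):
    """Collect the token strings under a node, left to right."""
    _, children = node
    tokens = []
    for child in children:
        if isinstance(child, str):
            tokens.append(child)
        else:
            tokens.extend(leaves(child))
    return tokens

def extract_who_what(sentence):
    """Return the first NP with a VP right sibling as ("who", "what")."""
    _, children = sentence
    for i, child in enumerate(children):
        if isinstance(child, str) or child[0] != "NP":
            continue
        nxt = children[i + 1] if i + 1 < len(children) else None
        if nxt is None or isinstance(nxt, str) or nxt[0] != "VP":
            continue
        # Discard NPs containing a child VP: they yield lengthy "who" phrases.
        if any(not isinstance(c, str) and c[0] == "VP" for c in child[1]):
            continue
        return " ".join(leaves(child)), " ".join(leaves(nxt))
    return None, None
```

For the toy parse of “Mary ate an apple,” the sketch returns the subject NP as the “who” candidate and its sibling VP as the “what” candidate.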

The environment extractor retrieves phrases describing the temporal and locality context of the event. To determine “when” candidates, we take TIMEX3 instances from preprocessing. Similarly, we take the geocodes as “where” candidates.

The cause extractor looks for linguistic features indicating a causal relation within a sentence’s constituents. We look for three types of cause-effect indicators (cf. [176, 177]): causal conjunctions, causative adverbs, and causative verbs. Causal conjunctions, e.g., “due to,” “result of,” and “effect of,” connect two clauses, where the second clause yields the “why” candidate. For causative adverbs, e.g., “therefore,” “hence,” and “thus,” the first clause yields the “why” candidate. If one or more subsequent tokens of a sentence match one of the indicator tokens adapted from Khoo et al. [176], we take all tokens to the right (for a causal conjunction) or to the left (for a causative adverb) as the “why” candidate.

Causative verbs, e.g., “activate” and “implicate,” are contained in the middle VP of the causative NP-VP-NP pattern, where the last NP yields the “why” candidate [108, 177]. For each NP-VP-NP pattern we find in the parse tree, we determine whether the VP is causative. To do this, we extract the VP’s verb, retrieve the verb’s synonyms from WordNet [239], and compare the verb and its synonyms with the list of causative verbs from Girju [108], which we also extended by their synonyms (cf. [108]). If there is at least one match, we take the last NP of the causative pattern as the “why” candidate. To reduce false positives, we check the NP and VP against the causal constraints for verbs proposed by Girju [108].
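The three indicator types can be sketched as follows. The marker lists are small illustrative subsets, not the full lists from Khoo et al. [176] or Girju [108], and the function names are hypothetical:

```python
CAUSAL_CONJUNCTIONS = [["because"], ["due", "to"], ["result", "of"]]  # illustrative subset
CAUSATIVE_ADVERBS = {"therefore", "hence", "thus"}                    # illustrative subset
CAUSATIVE_VERBS = {"cause", "trigger", "activate", "implicate"}       # illustrative subset

def extract_why_candidate(tokens):
    """tokens: lowercase word list of one sentence."""
    # Causal conjunction: the clause to the RIGHT yields the "why" candidate.
    for marker in CAUSAL_CONJUNCTIONS:
        n = len(marker)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == marker:
                return " ".join(tokens[i + n:])
    # Causative adverb: the clause to the LEFT yields the "why" candidate.
    for i, tok in enumerate(tokens):
        if tok in CAUSATIVE_ADVERBS:
            return " ".join(tokens[:i])
    return None

def why_from_causative_pattern(np1, vp_verb_lemma, np2, synonyms=None):
    """NP-VP-NP pattern: if the middle verb (or a synonym) is causative,
    the last NP yields the "why" candidate. synonyms: lemma -> set of lemmas."""
    syns = (synonyms or {}).get(vp_verb_lemma, set())
    if vp_verb_lemma in CAUSATIVE_VERBS or syns & CAUSATIVE_VERBS:
        return np2
    return None
```

The Girju-style causal constraints that reduce false positives are omitted here for brevity.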

The method extractor retrieves “how” phrases, i.e., the method by which an action was performed. It consists of two sub-tasks, one analyzing copulative conjunctions and the other looking for adjectives and adverbs. Often, sentences with a copulative conjunction contain a method phrase in the clause that follows the conjunction, e.g., “after [the train came off the tracks].” Therefore, we look for copulative conjunctions compiled from the Oxford English Dictionary [268]. If a token matches, we take the right clause as the “how” candidate. To avoid long phrases, we cut off candidates longer than \(l_{\text{how,max}} = 10\) tokens. The second sub-task extracts phrases that consist purely of adjectives or adverbs (cf. [321]), since these often represent how an action was performed. We use this extraction method as a fallback, since we found the conjunction-based extraction too restrictive in many cases.

Candidate Scoring

The last task is to determine the best candidate for each 5W1H question. The scoring consists of two sub-tasks. First, we score candidates independently for each of the 5W1H questions. Second, we perform a combined scoring where we adjust scores of candidates of one question dependent on properties, e.g., position, of candidates of other questions. For each question q, we use a scoring function that is composed as a weighted sum of n scoring factors:

$$\displaystyle \begin{aligned} s_q = \sum_{i=0}^{n - 1}{w_{\mathrm{q},i}s_{\mathrm{q},i}}, \end{aligned} $$

where \(w_{\mathrm{q},i}\) is the weight of the scoring factor \(s_{\mathrm{q},i}\).

To score “who” candidates, we define three scoring factors: the candidate shall (1) occur early and (2) often in the article and (3) contain a named entity. The first scoring factor targets the concept of the inverted pyramid [52]: news articles mention the most important information, i.e., the main event, early in the article, e.g., in the headline and lead paragraph, while later paragraphs contain details. However, journalists often use so-called hooks to get the reader’s attention without revealing all content of the article [283]. Hence, for each candidate, we also consider the frequency of similar phrases in the article, since the primary actor involved in the main event is likely to be mentioned frequently. Furthermore, if a candidate contains an NE, we score it higher, since in news, the actors involved in events are often NEs, e.g., politicians. Table 4.2 shows the weights and scoring factors.

Table 4.2 Weights and scoring factors for “who” phrases

To calculate these factors, we define \(\operatorname{pos}(c)=1-\frac{n_{\mathrm{pos}}(c)}{d_{\mathrm{len}}}\) and \(f(c)=\frac{n_f(c)}{\max_{c' \in C}(n_f(c'))}\), where \(n_{\mathrm{pos}}(c)\) is the candidate c’s position measured in sentences within the document, \(d_{\mathrm{len}}\) the document length in sentences, \(n_f(c)\) the frequency of phrases similar to c in the document, and \(\operatorname{NE}(c)=1\) if c contains an NE, else 0 (cf. [88]). To measure \(n_f(c)\) of the actor in candidate c, we use the number of the actor’s coreferences, which we extracted during coreference resolution (see the Preprocessing subsection above). This allows Giveme5W1H to recognize and count name variations as well as pronouns. Due to the strong relation between agent and action, we rank VPs according to their NPs’ scores. Hence, the most likely VP is the sibling in the parse tree of the most likely NP: \(s_{\mathrm{what}} = s_{\mathrm{who}}\).
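The “who” scoring can be sketched as a weighted sum over the three factors. The weights below are placeholders for illustration, not the learned values from Table 4.2, and the candidate dict layout is hypothetical:

```python
def score_who(candidate, candidates, d_len, weights=(0.9, 0.095, 0.005)):
    """Weighted sum of the three "who" scoring factors.

    candidate: dict with n_pos (sentence index), n_f (coreference count),
    and has_ne (bool). candidates: all "who" candidates in the document.
    d_len: document length in sentences. weights: placeholder values.
    """
    w0, w1, w2 = weights
    pos = 1 - candidate["n_pos"] / d_len                      # early in the article
    freq = candidate["n_f"] / max(c["n_f"] for c in candidates)  # mentioned often
    ne = 1.0 if candidate["has_ne"] else 0.0                  # contains a named entity
    return w0 * pos + w1 * freq + w2 * ne
```

A candidate mentioned in the first sentence, with the highest coreference count and a contained NE, receives the maximum score of 1.0 under these placeholder weights.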

We score temporal candidates according to four scoring factors: the candidate shall occur in the article (1) early and (2) often. It should also be (3) close to the publishing date of the article and (4) of relatively short duration. The first two scoring factors have the same motivation as in the scoring of “who” candidates. The idea behind the third scoring factor is that events reported on by news articles often occurred on the same day as, or the day before, the article was published. For example, a candidate representing a date one or more years before the article’s publishing date achieves the lowest possible score in this factor. The fourth scoring factor prefers temporal candidates of short duration, since events according to our definition happen at a specific point in time with a short duration. We logarithmically normalize the duration factor between 1 minute and 1 month (cf. [397]). The resulting scoring formula for a temporal candidate c is the sum of the weighted scoring factors shown in Table 4.3.

Table 4.3 Weights and scoring factors for “when” phrases

To count \(n_f(c)\), we consider two TIMEX3 instances similar if their start and end dates are at most 24 h apart. \(\Delta_s(c, d_{\text{pub}})\) is the difference in seconds between candidate c and the publication date \(d_{\text{pub}}\) of the news article, \(s(c)\) the duration in seconds of c, and the normalization constants are \(e_{\max} \approx 2.5\,\mathrm{Ms}\) (1 month in seconds), \(s_{\min} = 60\,\mathrm{s}\), and \(s_{\max} \approx 31\,\mathrm{Ms}\) (1 year).
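The logarithmic normalization of the duration factor can be sketched as follows; the function name is hypothetical, and the constants mirror \(s_{\min}\) and \(e_{\max}\) from the text:

```python
import math

S_MIN = 60       # 1 minute in seconds (s_min)
E_MAX = 2.5e6    # approx. 1 month in seconds (e_max)

def duration_score(duration_s):
    """Log-normalized duration factor: shorter durations score higher.
    Returns 1.0 at 1 minute and 0.0 at 1 month; values are clamped."""
    d = min(max(duration_s, S_MIN), E_MAX)
    return 1 - (math.log(d) - math.log(S_MIN)) / (math.log(E_MAX) - math.log(S_MIN))
```

A candidate such as “yesterday at 6 pm” (pointlike, clamped to 1 minute) scores 1.0, while a whole-day candidate such as “yesterday” scores lower but still well above a month-long duration.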

The scoring of location candidates follows four scoring factors: the candidate shall occur (1) early and (2) often in the article. It should also (3) often be geographically contained in other location candidates and be (4) specific. The first two scoring factors have the same motivation as in the scoring of “who” and “when” candidates. The third scoring factor favors locations that occur often, either by being similar to other candidates or by being geographically contained in them. The fourth scoring factor favors specific locations, e.g., Berlin, over broader mentions of location, e.g., Germany or Europe. We logarithmically normalize the location specificity between \(a_{\min} = 225\,\mathrm{m}^2\) (a small property’s size) and \(a_{\max} = 530{,}000\,\mathrm{km}^2\) (approx. the mean area of all countries [43]). We discuss other scoring options in Sect. 4.2.4. The used weights and scoring factors are shown in Table 4.4. We measure \(n_f(c)\), the number of similar mentions of candidate c, by counting how many other candidates have the same Nominatim place ID. We measure \(n_e(c)\) by counting how many other candidates are geographically contained within the bounding box of c, where \(a(c)\) is the area of the bounding box of c in square meters.

Table 4.4 Weights and scoring factors for “where” phrases
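The specificity factor can be sketched analogously to the duration factor; the constants mirror \(a_{\min}\) and \(a_{\max}\) from the text (converted to square meters), and the function name is hypothetical:

```python
import math

A_MIN = 225.0    # m^2, a small property's size (a_min)
A_MAX = 5.3e11   # m^2, i.e., 530,000 km^2, approx. mean country area (a_max)

def specificity_score(area_m2):
    """Log-normalized specificity: smaller bounding boxes score higher.
    Returns 1.0 at a_min and 0.0 at a_max; values are clamped."""
    a = min(max(area_m2, A_MIN), A_MAX)
    return 1 - (math.log(a) - math.log(A_MIN)) / (math.log(A_MAX) - math.log(A_MIN))
```

Under this sketch, a city-sized bounding box (e.g., Berlin, roughly \(10^9\,\mathrm{m}^2\)) scores higher than a country-sized one (e.g., Germany, roughly \(3.6 \times 10^{11}\,\mathrm{m}^2\)).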

Scoring causal candidates is challenging, since it often requires semantic interpretation of the text, and simple heuristics may fail [108]. We define two objectives: candidates shall (1) occur early in the document, and (2) their causal type shall be reliable [177]. The second scoring factor rewards causal types with low ambiguity (cf. [11, 108]); e.g., “because” has a very high likelihood that the subsequent phrase contains a cause [108]. The weighted scoring factors are shown in Table 4.5. The causal type \(\mathrm{TC}(c) = 1\) if c is extracted due to a causal conjunction, 0.62 if it starts with a causative adverb, and 0.06 if it contains a causative verb (cf. [176, 177]).

Table 4.5 Weights and scoring factors for “why” phrases

The scoring of method candidates uses three simple scoring factors: the candidate shall occur (1) early and (2) often in the news article, and (3) its method type shall be reliable. The weighted scoring factors for method candidates are shown in Table 4.6.

Table 4.6 Weights and scoring factors for “how” phrases

The method type \(\mathrm{TM}(c) = 1\) if c is extracted because of a copulative conjunction, else 0.41. We determine the number of mentions of a method phrase \(n_f(c)\) by the term frequency (including inflected forms) of its most frequent token (cf. [374]).

The final sub-task in candidate scoring is combined scoring, which adjusts scores of candidates of a single 5W1H question depending on the candidates of other questions. To improve the scoring of method candidates, we devise a combined sentence-distance scorer. The assumption is that the method of performing an action should be close to the mention of the action. The resulting equation for a method candidate c given an action candidate a is:

$$\displaystyle \begin{aligned} s_{\mathrm{how},\mathrm{new}}(c,a) = s_{\mathrm{how}}(c)-w_0 \frac{|n_{\mathrm{pos}}(c)-n_{\mathrm{pos}}(a)|}{d_{\mathrm{len}}}, \end{aligned} $$

where \(w_0 = 1\). Section 4.2.4 describes additional scoring approaches.

Output

The highlighted phrases in Fig. 4.1 are candidates extracted by Giveme5W1H for each of the 5W1H event properties of the shown article. Giveme5W1H enriches the returned phrases with additional information that the system extracted for its own analysis or during custom enrichment, with which users can integrate their own preprocessing. The additional information for each token is its POS tag, parse tree context, and NE type if applicable. Enriching the tokens with this information increases the efficiency of the overall analysis workflow in which Giveme5W1H may be embedded, since later analysis tasks can reuse the information.

For the temporal phrases and locality phrases, Giveme5W1H also provides their canonical forms, i.e., TIMEX3 instances and geocodes. For the news article shown in Fig. 4.1, the canonical form of the “when” phrase represents the entire day of November 10, 2016. The canonical geocode for the “where” phrase represents the coordinates of the center of the city of Mazar-i-Sharif (36°42′30.8″N 67°07′09.7″E), with the bounding box representing the area of the city, and further information from OSM, such as a canonical name and a place ID, which uniquely identifies the place. Lastly, Giveme5W1H provides linked YAGO concepts [221] for other NEs.

Parameter Learning

Determining the best values for the parameters introduced in Sect. 4.2.2, e.g., weights of scoring factors, is a supervised ML problem [162]. Since there is no gold standard for journalistic 5W1H extraction on news (see Sect. 4.2.1), we created an annotated dataset.

The dataset is available in the open-source repository (see Sect. 4.2.5). To facilitate diversity in both content and writing style, we selected 13 major news outlets from the USA and the UK. We sampled 100 articles from the news categories politics, disaster, entertainment, business, and sports for November 6–14, 2016. We crawled the articles (see Sect. 3.5) and manually revised the extracted information to ensure that it was free of extraction errors.

We asked 3 assessors (graduate IT students, aged between 22 and 26, all male) to read each of the 100 news articles and to annotate the single most suitable phrase for each 5W1H question. Finally, for each article and question, we combined the annotations using a set of combination rules, e.g., if all phrases were semantically equal, we selected the most concise phrase, or, if there was no agreement between the annotators, we selected each annotator’s first phrase, resulting in three semantically diverging but valid phrases. We also manually added a TIMEX3 instance to each “when” annotation, which the error function for “when” uses. The inter-rater reliability was \(\mathrm{IRR}_{\text{ann}} = 81.0\%\), measured as average pairwise percentage agreement.

We divided the dataset into two subsets for training (80% randomly sampled articles) and testing (20%). To find the optimal parameter values for our extraction method, we used an exhaustive grid search over all possible parameter configurations. For each parameter configuration, we then calculated the mean error (ME) on the training set. To measure the ME of a configuration, we devised three error functions measuring the semantic distance between candidate phrases and annotated phrases. For the textual candidates, i.e., “who,” “what,” “why,” and “how,” we used the Word Mover’s Distance (WMD) [192]. WMD is a generic measure of the semantic similarity of two phrases. For “when” candidates, we computed the difference in seconds between candidate and annotation. For “where” candidates, we computed the distance in meters between the two coordinates. We linearly normalized all measures.

We then validated the 5% best-performing configurations on the test set and discarded all configurations that yielded a significantly different ME. Finally, we selected the best-performing parameter configuration for each question.

4.2.3 Evaluation

We conducted a survey with three assessors (graduate IT students, aged between 22 and 26, all male) and a dataset of 120 news articles, which we sampled from the BBC dataset [115]. The dataset contains 24 news articles in each of the following categories: business (“Bus”), entertainment (“Ent”), politics (“Pol”), sport (“Spo”), and tech (“Tec”). We asked the assessors to read one article at a time. After reading each article, we showed the assessors the 5W1H phrases that had been extracted by the system and asked them to judge the relevance of each answer on a 3-point scale: non-relevant (if an answer contained no relevant information, score s = 0), partially relevant (if only part of the answer was relevant or if information was missing, s = 0.5), and relevant (if the answer was completely relevant without missing information, s = 1).

Table 4.7 shows the mean average generalized precision (MAgP), a score suitable for multi-graded relevance assessments [168]. MAgP was 73.0 over all categories and questions. If only considering the first 4Ws, which the literature considers as sufficient to represent an event (cf. [156, 349, 397]), overall MAgP was 82.0.
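To illustrate how multi-graded judgments aggregate into a single score, the following simplified sketch averages the graded scores of the 3-point scale per question and then over questions. It mirrors the spirit of generalized precision for multi-graded assessments but is not necessarily the exact MAgP formulation of [168]; all labels are illustrative.

```python
# Graded scores of the 3-point relevance scale used in the evaluation.
SCORES = {"non-relevant": 0.0, "partial": 0.5, "relevant": 1.0}

def generalized_precision(judgments):
    """Mean graded score of one question's answers over all articles."""
    return sum(SCORES[j] for j in judgments) / len(judgments)

def mean_avg_gp(per_question_judgments):
    """Average the per-question generalized precision over all questions."""
    values = [generalized_precision(j) for j in per_question_judgments.values()]
    return sum(values) / len(values)

print(mean_avg_gp({
    "who": ["relevant", "partial"],
    "what": ["non-relevant", "relevant"],
}))
```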

Table 4.7 IRR and MAgP-performance of Giveme5W1H. The last row displays the mean when evaluated on only the first four W questions

Of the few existing approaches capable of extracting phrases that answer all six 5W1H questions (see Sect. 4.2.1), only one publication reported the results of an evaluation: the approach developed by Khodra achieved a precision of 74.0 on Indonesian articles [175]. Others did not conduct any evaluation [321] or only evaluated the extracted “who” and “what” phrases of Japanese news articles [157].

We also investigated the performance of systems that are only capable of extracting 5W phrases. Our system achieves MAgP5W = 75.0, which is 5pp. higher than the MAgP of our earlier system Giveme5W [133]. Directly comparing our system to other systems was not possible (cf. [133]): other systems were tested on non-disclosed datasets [279, 392, 393], were translated from other languages [279], were devised for different languages [157, 175, 374], or used different evaluation measures, such as error rates [393] or binary relevance assessments [392], neither of which is optimal because of the non-binary relevance of 5W1H answers (cf. [168]). Finally, none of the related systems have been made publicly available or have been described in sufficient detail to enable a reimplementation.

Therefore, a direct comparison of the results and related work was not possible, but we compared the reported evaluation metrics. Compared to the fraction of correct 5W answers of the best system by Parton et al. [279], Giveme5W1H achieves a 12pp. higher MAgP5W. The best system by Yaman, Hakkani-Tur, and Tur [392] achieved a precision P5W = 89.0, which is 14pp. higher than our MAgP5W and, as a rough approximation of the best achievable precision [152], surprisingly almost identical to the inter-rater reliability (IRR) of our assessors.

We found that different forms of journalistic presentation in the five news categories of the dataset led to different extraction performance. Politics articles, which yielded the best performance, mostly reported on single events. The performance on sports articles was unexpectedly high, even though such articles not only report on single events but may also be background reports or announcements, for which event detection is more difficult. Determining the “how” in sports articles was difficult (MAgPhow = 51.0), since articles often described the method of an event only implicitly, e.g., how one team won a match, by reporting on multiple key events during the match. Some categories, such as entertainment and tech, achieved lower extraction performance, mainly because they often contained much background information on earlier events and the actors involved.

4.2.4 Future Work

We plan to improve the extraction quality for the “what” question, one of the important 4W questions. We aim to achieve an extraction performance similar to that of the “who” extraction (MAgPwho = 91.0), since both are very important in event description. In our evaluation, we identified two main issues: (1) joint extraction of optimal “who” candidates with non-optimal “what” candidates and (2) cut-off “what” candidates. In some cases (1), the headline contained a concise “who” phrase, but the “what” phrase did not contain all information, e.g., because it only aimed to catch the reader’s interest, a journalistic hook (Sect. 4.2.1). We plan to devise separate extraction methods for both questions. Thereby, we need to ensure that the top candidates of both questions fit each other, e.g., by verifying that the semantic concepts of both answers, e.g., represented by the nouns in the “who” phrase or verbs in the “what” phrase, co-occur in at least one sentence of the article. In other cases (2), our strategy to avoid overly detailed “what” candidates cut off the relevant information, e.g., “widespread corruption in the finance ministry has cost it $2m,” where the trailing information (“has cost it $2m”) was cut off. We will investigate dependency parsing and further syntax rules, e.g., to always include the direct object of a transitive verb.

For “when” and “where” questions, we found that in some cases, an article does not explicitly mention the main event’s date or location. The date of an event may be implicitly defined by the reported event, e.g., “in the final of the Canberra Classic.” The location may be implicitly defined by the main actor, e.g., “Apple Postpones Release of […],” which likely happened at the Apple headquarters in Cupertino. Similarly, the proper noun “Stanford University” also defines a location. We plan to investigate how we can use the YAGO concepts, which are linked to NEs, to gather further information regarding the date and location of the main event. If no date can be identified, the publishing date of the article or the day before it might sometimes be a suitable fallback date.

Using the TIMEX3 instances from SUTime is an improvement (MAgPwhen = 78.0) over a first version, where we used dates without a duration (MAgPwhen = 72.0).

The extraction of “why” and “how” phrases was most challenging, which manifests in lower extraction performance compared to the other questions. One reason is that articles often do not explicitly state a single cause or method of an event but describe it implicitly throughout the article, particularly in sports articles (see Sect. 4.2.3). In such cases, NLP methods are currently not advanced enough to find and abstract or summarize the cause or method. However, we plan to improve the extraction accuracy by preventing the system from returning false positives. For instance, in cases where no cause or method could be determined, we plan to introduce a score threshold to prevent the system from outputting candidates with a low score, which are presumably wrong. Currently, the system always outputs a candidate if at least one cause or method was found.
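The proposed score threshold could look as follows; the threshold value, candidate phrases, and scores are illustrative assumptions.

```python
def best_candidate(candidates, threshold=0.3):
    """candidates: list of (phrase, score) pairs for one question.
    Return the top-scored phrase, or None if its score is below the
    threshold (suppressing presumably wrong answers)."""
    if not candidates:
        return None
    phrase, score = max(candidates, key=lambda c: c[1])
    return phrase if score >= threshold else None

print(best_candidate([("due to the storm", 0.8), ("after talks", 0.2)]))
print(best_candidate([("in a statement", 0.1)]))  # suppressed
```

Compared to the current behavior, which always outputs a candidate if one was found, the thresholded variant prefers returning no answer over a low-confidence answer.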

To improve the performance of all textual questions, i.e., who, what, why, and how, we will investigate two approaches. First, we want to improve measuring a candidate’s frequency, an important scoring factor in multiple questions. We currently use the number of coreferences, which does not include synonymous mentions. We plan to count the number of YAGO concepts that are semantically related to the current candidate. Second, we found that a few top candidates of the four textual questions were semantically correct but contained only a pronoun referring to the more meaningful noun. We plan to add the coreference’s original mention to extracted answers.

Section 4.2 outlined a two-task approach within which Giveme5W1H could be used to tackle the goal of target concept analysis, i.e., identifying and resolving mentions of persons. In the first step, Giveme5W1H would extract the 5W1H event properties. In the second step, these could be resolved across all articles, e.g., by deducing that the actors (“who”) of two events refer to the same person if the events’ actions are identical. In a simplistic example of two sentences “illegal aliens cross the border” and “undocumented immigrants cross border,” this two-task approach could resolve both actors, i.e., “illegal aliens” and “undocumented immigrants,” to the same semantic concept, here a group of persons, since their action is identical (see Table 4.1).
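A minimal sketch of this resolution idea, using a simplistic token-overlap similarity as a stand-in for a semantic similarity measure; the threshold and all names are illustrative assumptions.

```python
def similarity(a, b):
    """Jaccard overlap of lowercased token sets (toy semantic measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def resolve_actors(events, threshold=0.5):
    """events: list of (who, what) pairs. Group actors whose actions
    are sufficiently similar into the same semantic concept."""
    groups = []  # each group: (representative "what", [who phrases])
    for who, what in events:
        for rep_what, whos in groups:
            if similarity(what, rep_what) >= threshold:
                whos.append(who)
                break
        else:
            groups.append((what, [who]))
    return [whos for _, whos in groups]

events = [("illegal aliens", "cross the border"),
          ("undocumented immigrants", "cross border")]
print(resolve_actors(events))
```

Here, the textually dissimilar actors are merged because their actions overlap strongly, mirroring the example from Table 4.1.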

However, the difficulty of this task increases strongly when not only one but multiple event properties are dissimilar. For example, Table 4.8 shows an additional, third sentence, “The migrant caravan invades the country,” which has a different actor (“migrant caravan”) and a different action (“invades the country”). In a qualitative investigation of the extracted 5W1H phrases, we found that real-world news coverage often has divergent 5W1H phrases, especially in the presence of bias, making the previously outlined idea of resolving mentions infeasible. Moreover, since we want to find and resolve not only a single main actor per news article, we would additionally need to extract fine-grained side events at the sentence level. Lastly, PFA focuses on individual persons, whereas the actors of main events extracted by Giveme5W1H can also be groups of persons, countries, and other concept types. Given these issues of using event extraction, i.e., the strongly increased complexity in real-world news articles and the additional work required to devise methods for extracting fine-grained side events and resolving the event descriptors afterward, we choose to focus our research on a different line of research, which we describe in Sect. 4.3.

Table 4.8 Simplistic example showcasing the difficulty of resolving phrases when an event property is only ambiguously similar to others (“what” in the third row)

4.2.5 Conclusion

In this section, we proposed Giveme5W1H, the first open-source system that extracts answers to the journalistic 5W1H questions, i.e., who did what, when, where, why, and how, to describe a news article’s main event. The system canonicalizes temporal mentions in the text to standardized TIMEX3 instances, locations to geocoordinates, and other NEs, e.g., persons and organizations, to unique concepts in a knowledge graph. The system uses syntactic and domain-specific rules to extract and score phrases for each 5W1H question. Giveme5W1H achieved a mean average generalized precision (MAgP) of 73.0 on all questions and an MAgP of 82.0 on the first four W questions (who, what, when, and where), which alone can represent an event. Extracting the answers to “why” and “how” performed more poorly, since articles often only imply causes and methods. Answering the 5W1H questions is at the core of understanding any article and thus an essential task in many research efforts that analyze articles. We hope that redundant implementations and non-reproducible evaluations can be avoided with Giveme5W1H as the first universally applicable, modular, and open-source 5W1H extraction system. In addition to benefiting developers and computer scientists, our system especially benefits researchers from the social sciences, for whom automated 5W1H extraction was previously not accessible.

In the context of this thesis, the event extraction achieved by Giveme5W1H represents the first step of a two-step approach that could be used to tackle target concept analysis. This approach, outlined in more detail in Sect. 4.2, relies on the idea of first extracting events and matching these across articles in order to determine which event properties refer to the same semantic concepts. Due to conceptual issues as described in Sect. 4.2.4, such as high ambiguity when matching events, we focus our research for target concept analysis on cross-document coreference resolution. Taking this approach also has the advantage of directly extracting and resolving mentions in one method and thus decreases the conceptual complexity of target concept analysis.

Giveme5W1H and the datasets for training and evaluation are available at


4.3 Context-Driven Cross-Document Coreference Resolution

Methods in the field of coreference resolution aim to resolve mentions of semantic concepts, such as persons, in a given text document [298]. Context-driven cross-document coreference resolution (CDCDCR) is a special form of coreference resolution with two differences. First, mentions are identified and resolved across multiple documents. Second, mentions can be less strictly related to one another but can still be considered coreferential. Such mentions include those that are typically non-coreferential, such as “White House” and “US President,” or that are even contradictory in other contexts, such as “activist” and “extremist.” This is an extension to regular (cross-document) coreference resolution, which resolves only mentions that have an identity relation, i.e., that are strictly identical, such as “US President” and “Biden” [298].

In the context of our analysis workflow and in particular the target concept analysis, we use context-driven cross-document coreference resolution to find and identify mentions of persons across the set of news articles reporting on the given event. While our overall bias analysis focuses on person-targeting biases only, we devise our method for context-driven cross-document coreference resolution in this section to resolve also other types of semantic concepts, such as countries, so that the method can be used outside the scope of our analysis.

The remainder of this section is structured as follows. Section 4.3.1 describes related work, highlighting that most research focuses on coreference resolution, while only few works exist for cross-document coreference resolution or aim to resolve mentions with less strictly identical relations. We then describe in Sect. 4.3.2 how we create and annotate our test dataset named NewsWCL50. We annotate not only coreferential mentions but also so-called frame properties, e.g., how the persons we annotate are portrayed. We do this because we use the dataset not only for the evaluation of our method for context-driven cross-document coreference resolution but also for the evaluation of a second approach, which aims to identify how the persons are portrayed (see Sect. 5.2). Section 4.3.3 introduces and describes our approach for context-driven cross-document coreference resolution. We then evaluate the approach (Sect. 4.3.4) and derive future work ideas (Sect. 4.3.5). Lastly, we summarize our research and set the results in context of the overall approach (Sect. 4.3.6).

In this section, for improved readability, we use the term sentence-level bias forms to refer to the three bias forms that cover the broad spectrum of text means to slant coverage, i.e., word choice and labeling, source selection, and commission and omission of information. We deliberately do not use the term person-targeting bias forms, to highlight that our method can resolve various concept types and not only individual persons.

NewsWCL50 and its codebook are available at


4.3.1 Related Work

The task of coreference resolution entails techniques that aim to resolve mentions of entities, typically in a single text document. Coreference resolution is employed as an essential analysis component in a broad spectrum of use cases, such as identifying potential targets in sentiment analysis or as a part of discourse interpretation. While traditional coreference resolution focuses on single documents, cross-document coreference resolution (CDCR) resolves concept mentions across a set of documents. Compared to traditional coreference resolution, CDCR is a less-researched task. Moreover, CDCR can be considered more difficult than traditional coreference resolution since multiple documents yield a larger search space than only a single document. Adding to the difficulty, multiple documents are more likely to differ in their writing style (cf. “word choice” as described in Sect. 2.3.4). In this thesis, especially the varying word choice represents an important issue that current methods for coreference resolution and CDCR fail to tackle.

Only a few methods and datasets for CDCR have been proposed, especially compared to traditional single-document coreference resolution. Albeit measured on different datasets, the mildly lower performance of CDCR (F1 = [71.2; 79.5] [18]) compared to single-document coreference resolution (F1 = 80.2 [389]) can serve as an indicator of the increased difficulty and the lower research attention. Additionally, in initial experiments, we noticed strong performance losses when applying most techniques in a more realistic setup reflecting real-world use (cf. [403]). The key difference between established evaluation practices and real-world use is that no gold standard mentions are available in the latter. Instead, other techniques must first find and extract mentions before coreference resolution can resolve them. Naturally, such automated extraction is prone to errors, and imperfectly resolved concepts and mentions may degrade the performance of coreference resolution. To our knowledge, there is only one approach that jointly extracts and resolves mentions [200].

Missing the highly context-specific coreferences of varying word choice as they occur in biased news coverage is the fundamental shortcoming of prior CDCR. Identifying and resolving such mentions is especially important in person-oriented framing analysis (PFA). A fitting CDCR method would need to identify and resolve not only clearly defined concepts and identity coreferences. Additionally, the method would need to resolve near-identity mentions, such as in specific contexts “the White House” and “the US,” and highly event-dependent coreferences, such as “Kim Jong-un” and “little rocket man” [74]. However, prior CDCR focuses on clearly defined concepts that are either event-centric or entity-centric. This narrowly defined structural distinction leads to corresponding methods and dataset annotations missing the previously mentioned concept types and highly context-dependent coreferences.

As stated previously, established CDCR datasets are either event-centric or entity-centric. When comparing the relevant datasets, we find a broad spectrum of concept scopes, e.g., whether two mentions are considered coreferential and which phrases are to be annotated as mentions in the first place. Correspondingly, individual datasets “miss” concepts and mentions that would have been annotated if the other annotation scheme had been used. The EventCorefBank (ECB) dataset entails two types of concepts, i.e., action and entity [23]. ECB+ is an event-centric corpus that extends ECB to consist of 502 news articles. Compared to ECB, annotations in ECB+ are more detailed, e.g., the dataset distinguishes various sub-types of actions and entities [62]. ECB+ contains only those mentions that describe an event, i.e., location, time, and human or non-human participants. NP4E is a dataset for entity-only CDCR [143]. NiDENT is an explorative CDCR evaluation dataset based on NP4E. Compared to the previously mentioned datasets, NiDENT also contains more abstract and less obvious coreference relations coined near-identity [299]. Zhukova et al. [403] provide an in-depth discussion of these and further datasets.

To our knowledge, all CDCR methods focus on resolving only events and, if at all, resolve entities as subordinate attributes of the events [174, 217]. There are two common, supervised approaches for event-centric CDCR: easy-first and mention-pair [217]. Easy-first models are so-called sieve-based models, where sieves are executed sequentially. Thereby, each sieve merges, i.e., resolves, mentions concerning specific characteristics. Initial sieves address reliable and straightforward properties, such as heads of phrases. Later sieves address more complex or specialized cases using techniques such as pairwise scoring of pre-identified concepts with binary classifiers [158, 199, 216]. Recently, a mention-pair model was proposed, which uses a neural model trained to score the likelihood of a pair of events or entity mentions to be the same semantic concept. The model represents such mentions using spans of text, contexts, and semantic dependencies [18].
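The mention-pair idea can be illustrated schematically as follows. The cosine-over-token-counts scorer is only a stand-in for the trained neural scorer of [18], which operates on spans, contexts, and semantic dependencies, and the greedy clustering is a simplification; all names and the threshold are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def pair_score(m1, m2):
    """Toy pairwise scorer: cosine similarity over bag-of-words counts."""
    c1, c2 = Counter(m1.lower().split()), Counter(m2.lower().split())
    dot = sum(c1[t] * c2[t] for t in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def cluster_mentions(mentions, threshold=0.5):
    """Greedy agglomeration: attach each mention to the first cluster
    containing a sufficiently similar mention, else open a new cluster."""
    clusters = []
    for m in mentions:
        for cluster in clusters:
            if any(pair_score(m, other) >= threshold for other in cluster):
                cluster.append(m)
                break
        else:
            clusters.append([m])
    return clusters

print(cluster_mentions(["President Biden", "Biden", "the White House"]))
```

A purely lexical scorer like this one illustrates exactly the shortcoming discussed above: context-dependent pairs such as “the White House” and “President Biden” share no tokens and thus remain unresolved.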

In sum, the reviewed CDCR methods suffer from at least one of three essential shortcomings. First, they only resolve clearly defined identity mentions. Second, they only focus on event-related mentions. Third, they suffer performance losses when evaluated in real-world use cases due to requiring gold standard mentions, which are not available in real-world use cases. These shortcomings hold correspondingly for the current CDCR datasets. Thus, to our knowledge, there is no CDCR method that resembles the annotation of persons and other concept types as established in framing analyses, including broadly defined concepts and generally concepts independent of fine-grained event occurrences. In the remainder of this section, we thus create a dataset and method that addresses these shortcomings.

4.3.2 NewsWCL50: Dataset Creation

To create NewsWCL50, the first dataset for the evaluation of methods for context-driven cross-document coreference resolution and methods to automatically identify sentence-level bias forms, we conducted a manual content analysis. Thereby, we follow the procedure established in the social sciences, e.g., we first use an inductive content analysis to explore the data and derive categories to be annotated as well as annotation instructions. Afterward, we conduct a deductive content analysis, following these instructions and using only these categories. NewsWCL50 consists of 50 news articles that cover 10 political news events, each reported on by 5 online US news outlets representing the ideological spectrum, e.g., including left-wing, center, and right-wing outlets. The dataset contains 8656 manual annotations, i.e., each news article has on average approximately 170 annotations.

Collection of News Articles

We selected ten political events that happened during April 2018 and manually collected for each event five articles. To ease the identification and annotation of sentence-level bias forms, we aimed to increase the diversity of both writing style and content. Therefore, we selected articles published by different news outlets and selected events associated with different topical categories. We selected five large, online US news outlets representing the political and ideological spectrum of the US media landscape [245, 364]:

  • HuffPost (formerly The Huffington Post, far left, abbreviated LL)

  • The New York Times (left, L)

  • USA Today (center or middle, M)

  • Fox News Channel (right, R)

  • Breitbart News Network (far right, RR)

News outlets with different slants likely use different terms when reporting on the same topic (see Sect. 2.3.4), e.g., the negatively slanted term “illegal aliens” is used by RR, whereas “undocumented immigrants” is rather used by L when referring to DACA recipients.

To increase the content diversity, we aimed to gather events for each of the following political categories (cf. [95]): economic policy (focusing on US economy), finance policy, foreign politics (events in which the USA is directly involved), other national politics, and global interventions (globally important events, which are part of the public, political discourse).

Table 4.9 shows the collected events of NewsWCL50. One frequent issue during data gathering was that even major events were not reported on by all five news outlets; especially the far-left or far-right outlets did not report on otherwise popular events (which may contribute to a different form of bias, named event selection; see Sect. 2.3.1). We could not find any finance policy event in April that all five outlets reported on; hence, we discarded this category.

Table 4.9 Overview of the events part of NewsWCL50. All dates are in 2018

Training Phase: Creation of the Codebook

We create and use NewsWCL50 to evaluate two methods. The first method is for context-driven cross-document coreference resolution as used in target concept analysis. Additionally, we use NewsWCL50 to evaluate a method we will propose in Sect. 5.2, which aims to identify how persons are portrayed in news articles. This second method is used in the frame analysis component. Integrating the annotation of both properties, i.e., coreferential relations of mentions and how the mentions, e.g., of persons, are portrayed, approximates the manual procedure of frame analysis. Thus, in the following, we describe the creation of NewsWCL50, including coreferential mentions of persons (and other semantic concept types) and the framing categories representing how the persons are portrayed.

The goal of the training phase was to get an understanding of news articles concerning their types of mentions as well as the mentions’ coreference relations and portrayal. The training phase was conducted on news articles not contained in NewsWCL50. We collected these articles as described previously but for different time frames. In a first, inductive content analysis, we asked three coders (students in computer science or political sciences, aged between 20 and 29) to read five news articles and use MAXQDA, a content analysis software, to annotate any phrase that they felt was influencing their perception or judgment of a person or other semantic concept mentioned in the article. Specifically, coders were asked to (1) mark such phrases and state which (2) perception, judgment, or feeling the phrase caused in them, e.g., affection, and its (3) target concept, i.e., which concept the perception effect was ascribed to. We then used the initial codings to derive coding rules and a set of frame properties, representing how the annotators felt a target was portrayed.

We use the stated perception or judgment (see step 2 described previously) to derive so-called frame properties. Frame properties are pre-defined categories representing specific dimensions of political frames. A detailed definition and discussion follow in Sect. 5.2, where we introduce our method that determines the frame properties. Our desired characteristics of frame properties are, on the one hand, to be general enough to apply meaningfully to a variety of political news events and, on the other hand, to be specific enough to allow a fine-grained categorization of persons’ portrayal. Thus, during training, we added, removed, refined, or merged frame properties, e.g., we found that “unfairness” was always accompanied by (not necessarily physical) aggression and hence was better, i.e., more fine-grained, represented by “aggressor” or “victim,” which convey additional information on the perception of the target. We created a codebook including definitions of frame properties, coding rules, and examples.

During training, we refined the codebook until we reached an acceptable inter-coder reliability (ICR) after six training iterations. The inter-coder reliability at the end of our training was 0.65 for frame properties and 0.86 for target mentions (calculated with mean pairwise average agreement). The comparably low inter-coder reliability of the annotations concerning frame properties is in line with results of other studies that aim to annotate topic-independent “frame types” [45]. This indicates the complexity and difficulty of the task.
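Mean pairwise percentage agreement, as used for the reported ICR values, can be computed as in the following sketch; the coders’ labels are illustrative.

```python
from itertools import combinations

def pairwise_agreement(codings):
    """codings: one label list per coder, all in the same item order.
    Returns the mean over all coder pairs of the fraction of items
    on which the pair agrees."""
    pair_scores = []
    for a, b in combinations(codings, 2):
        matches = sum(1 for x, y in zip(a, b) if x == y)
        pair_scores.append(matches / len(a))
    return sum(pair_scores) / len(pair_scores)

coder1 = ["aggressor", "victim", "aggressor"]
coder2 = ["aggressor", "victim", "victim"]
coder3 = ["aggressor", "aggressor", "victim"]
print(pairwise_agreement([coder1, coder2, coder3]))
```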

In total, we derived 13 bi-polar frame properties, i.e., frame properties that have an antonym, and 3 without an antonym. Since the frame properties are not used in the target concept analysis, further details concerning the frame properties are described in Sect. 5.2.2, which also describes the method that uses them.

Target concepts can be “actors” (single individuals), “actions,” “countries,” “events,” “groups” (of individuals acting collectively, e.g., demonstrators), other (physical) “objects,” and also more abstract or broadly defined semantic concepts, such as “Immigration issues,” coined “misc” (see Table 4.12). To define these seven types, we used established named entity types [359] and refined them during our annotation training to better fit our use case, e.g., by removing types that were never subject to bias, such as “TIME,” and adding fine-grained sub-types, such as “countries” and “groups” instead of only “ORG.”

The codebook is available as part of NewsWCL50 (see Sect. 4.3.6).

Deductive Content Analysis

To create our NewsWCL50 dataset, we conducted a deductive content analysis. One coder read and annotated the news articles, and two researchers reviewed and revised the annotations to ensure adherence to the codebook (cf. [260, 316]). For the annotation process, we used the two coding concepts devised during the training phase: target concepts, i.e., semantic concepts, including persons, that can be the target of sentence-level bias forms, and frame properties, i.e., categorized framing effects.

To facilitate the use of our dataset and our method for context-driven cross-document coreference resolution also outside the scope of this thesis, we do not restrict the annotation to persons but include all seven previously mentioned types of semantic concepts. This way, the dataset and the context-driven coreference resolution method we evaluate on it can more realistically cover the broad spectrum of coreferential mentions as they occur in real-world news coverage.

Following the codebook, the coder was asked to code any relevant phrase that represents either a target concept or a frame property. This is in contrast to the beginning of the training phase, where any annotation originated from a change in perception or judgment of a concept. Said differently, while during training mentions of persons and other semantic concepts were only annotated if the coder felt that a sentence or, generally, an (adjacent) expression changed the perception of the concept, in the deductive annotation, we annotated all mentions of any semantic concept. To improve annotation efficiency, however, we asked the coder to first briefly read the given news article and determine which semantic concepts are mentioned at least three times. Then, only mentions of these concepts and of semantic concepts identified in previously annotated news articles had to be annotated.
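The annotation-efficiency rule, i.e., annotating only concepts mentioned at least three times plus concepts known from earlier articles, can be sketched as follows; mention counts and names are illustrative.

```python
from collections import Counter

def concepts_to_annotate(article_mentions, known_concepts=frozenset(), min_count=3):
    """article_mentions: list of concept identifiers, one per mention.
    Returns the concepts whose mentions must be annotated: those
    mentioned at least min_count times, plus previously known concepts
    that appear in the article."""
    counts = Counter(article_mentions)
    frequent = {c for c, n in counts.items() if n >= min_count}
    return frequent | (known_concepts & set(counts))

mentions = ["Trump", "Trump", "Trump", "Kim", "Kim", "Pompeo"]
print(concepts_to_annotate(mentions, known_concepts={"Kim"}))
```

Here, “Pompeo” is skipped (mentioned only once and not known from earlier articles), while “Kim” is kept despite only two mentions because it was identified in a previously annotated article.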

For each frame property, additionally, the corresponding target concept had to be assigned. For example, in “Russia seizes Ukrainian naval ships,” “Russia” would be coded as a target concept (type “country”), and “seizes” as a frame property (type “Aggressor”) with “Russia” being its target. Each mention of a target concept in a text segment can be targeted by multiple frame property phrases. More details on the coding instructions can be found in NewsWCL50’s codebook. The dataset consists of 5926 target concept codings and 2730 frame property codings. NewsWCL50 is openly available in an online repository (see Sect. 4.3.6).

4.3.3 Method

Given a set of news articles reporting on the same event, our method finds and resolves mentions referring to the same semantic concepts. The method consists of three sub-tasks, preprocessing, candidate extraction, and candidate merging, as shown in Fig. 3.1. Our evaluation dataset (Sect. 4.3.2) is not sufficiently large to train a method that uses deep learning techniques, and creating a large-scale dataset would cause high cost, e.g., the OntoNotes dataset, commonly used to train methods for coreference resolution, consists of more than 2000 documents and more than 25000 coreference chains [7]. By devising a rule-based method as described in the following, we are able to reduce the otherwise high annotation cost (see also Sect. 4.3.2).

While the previous section on the dataset creation (Sect. 4.3.2) also described information important for the frame analysis component, e.g., how we devised and annotated frame properties, this section describes only our method for context-driven cross-document coreference resolution. The method for the identification of frame properties, as devised for the frame analysis component, is described in Sect. 5.2.

In the target concept analysis, the goal of the first sub-task, candidate extraction, is to identify phrases that contain a semantic concept, i.e., phrases that could be the target of sentence-level bias forms (Sect. 3.3.2). We identify noun phrases (NPs) and verb phrases (VPs). We coin such phrases candidate phrases; they correspond to the mentions of target concepts annotated in the content analysis (Sect. 4.3.2). The goal of the second sub-task, candidate merging, is to merge candidates referring to the same semantic concept, i.e., groups of phrases that are coreferential (see Sect. 4.3.1). Candidate merging includes state-of-the-art coreference resolution but also aims to find coreferences across documents and in a broader sense (see Sects. 4.3 and 4.3.1), e.g., “undocumented immigrants” and “illegal aliens.”

Preprocessing and Candidate Extraction

We perform natural language preprocessing, including part-of-speech (POS) tagging, dependency parsing, full parsing, named entity recognition (NER), and coreference resolution [56, 57], using Stanford CoreNLP with neural models where available, otherwise using the defaults for the English language [224].

As initial candidates, we extract coreference chains, noun phrases (NPs), and verb phrases (VPs). First, we extract each coreference chain, including all its mentions found by coreference resolution, as a single candidate. Conceptually, this can be seen as an initial merging of candidates, since we merge all mentions of the coreference chain into one candidate. Second, we extract each NP found by the parser as a single candidate. We avoid long phrases by discarding any NP consisting of 20 or more words. If a phrase contains one or more child NPs, we extract only the parent, i.e., longest, phrase. We follow the same extraction procedure for VPs. In the following, when referring to NPs, we also refer to VPs, if not noted otherwise.
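The parent-phrase selection and length filter can be sketched as follows. This is a minimal illustration, assuming NP/VP spans are given as token-offset pairs from the constituency parse; the spans and the function name are illustrative, not taken from our implementation.

```python
# Sketch of the candidate extraction filter: keep only top-level (parent)
# phrases and discard phrases of 20 or more words. Spans are (start, end)
# token offsets; the example spans below are illustrative.

def extract_candidates(spans, max_len=20):
    """Keep only parent phrases shorter than max_len tokens."""
    # Discard overly long phrases (20 or more words).
    spans = [s for s in spans if s[1] - s[0] < max_len]
    # A span is a child if it is strictly contained in another span;
    # we keep only the longest (parent) phrases.
    parents = []
    for s in spans:
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans):
            parents.append(s)
    return parents

nps = [(0, 3), (1, 3), (5, 30), (4, 8)]  # (1,3) is nested in (0,3); (5,30) is too long
print(extract_candidates(nps))  # -> [(0, 3), (4, 8)]
```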

We set a representative phrase for each candidate, which represents the candidate’s meaning. For coreference chain candidates, we take the representative mention defined by CoreNLP’s coreference resolution [334]. For NP-based candidates, we take the whole NP as the representative phrase. We use the representative phrases as one property to determine the similarity of candidates.

We also determine a candidate’s type, which is one of the types shown in Table 4.10. For each phrase in a candidate, we determine whether its head is a “person,” “group,” or “country,” using the lexicographer dictionaries from WordNet [239] and NE types from NER [88]; e.g., “crowd” or “hospital” are of type “group.” In linguistics, the head is defined as the word that determines a phrase’s syntactic category [240]; e.g., the noun “aliens” is the head of “illegal aliens,” determining that the phrase is an NP. In WordNet, individual words, i.e., here the head words, can have multiple senses; e.g., “hospital” could be a building or an organization. We use WordNet’s ranked list of senses for each head word to determine the head word’s most likely type. Specifically, for a head \(h\) and a sense \(s\) of rank \(n_s \in \{1, 2, 3, \ldots\}\), we define \(m(s) = 1/n_s\) as a weighting factor. We then calculate the head’s type score for each type \(t\) individually as follows:

$$\displaystyle \begin{aligned} s(h,t) = \sum_{s \in W(h)}{m(s)T(s,t)}, \end{aligned} $$

where \(W(h)\) yields all senses of \(h\) in WordNet and \(T(s, t)\) returns 1 if the queried type \(t\) is identical to the type of sense \(s\) defined in WordNet, else 0. For a candidate \(c\) consisting of head words \(h \in c\), we then calculate \(c\)’s type score for each type \(t\) individually as follows:

$$\displaystyle \begin{aligned} S(c,t) = \sum_{h \in c}{s(h,t)}. \end{aligned} $$

Lastly, we assign to candidate c the type t for which S(c, t) is maximized. This way, our fine-grained type determination reflects the different senses a word can have and their likelihood.
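To make the scoring concrete, the following minimal sketch computes \(s(h,t)\) and \(S(c,t)\) on plain Python data. We assume each head word comes with a ranked list of sense types (in our method derived from WordNet’s lexicographer files); the sense lists below are illustrative, not taken from WordNet.

```python
# Sketch of the WordNet-based type scoring. Each head is represented by its
# ranked list of sense types; the example data is illustrative.

def head_type_score(ranked_sense_types, t):
    """s(h, t) = sum over senses s of m(s) * T(s, t), with m(s) = 1 / n_s."""
    return sum(1.0 / rank
               for rank, sense_type in enumerate(ranked_sense_types, start=1)
               if sense_type == t)

def candidate_type(heads, types=("person", "group", "country")):
    """Assign the type t that maximizes S(c, t) = sum over heads h of s(h, t)."""
    return max(types,
               key=lambda t: sum(head_type_score(senses, t) for senses in heads))

# First head: most likely a building (rank 1), but also a group (rank 2).
heads = [["building", "group"], ["group", "person"]]
print(candidate_type(heads))  # -> "group"
```

The rank weighting \(m(s) = 1/n_s\) makes a word's most likely sense dominate while still letting lower-ranked senses contribute.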

Table 4.10 Candidate types identified during preprocessing

If the candidate contains at least one NE mention, we set the NE flag. For example, if most phrases of a candidate are NE mentions of a “person,” we set the candidate type “person-ne.” If the type is a person, we distinguish between singular and plural by counting the heads’ POS types: NN and NNP for singular and NNS and NNPS for plural. If a candidate is neither a “person,” “group,” nor “country,” e.g., because the candidate is an abstract concept, such as “program,” we set its type to “misc.” We use the candidate types to determine which candidates can be subject to merging and for type-to-type-specific merge thresholds.

We refer to the previously described preprocessing as our standard preprocessing. Since CoreNLP is prone to merging (large) coreferential chains incorrectly [57], we propose a second preprocessing variant. Our split preprocessing executes an additional task after all tasks of standard preprocessing. It takes CoreNLP’s coreference chains and splits likely incorrectly merged chains and mentions into separate chains. To determine which mentions of a chain are likely not truly part of that chain, split preprocessing employs named entity linking. Specifically, it attempts to link each mention of a coreference chain to its Wikipedia page [112]. Given a coreference chain and its mentions, our preprocessing removes mentions having a different linked entity than the entity linked by the majority of the chain’s mentions. For the removed mentions, our preprocessing creates new chains [242].

Candidate Merging

The goal of the sub-task candidate merging is to find and merge candidates that refer to the same semantic concept. Current methods for coreference resolution (see Sect. 4.3.1) cannot resolve abstract and broadly defined coreferences as they occur in sentence-level bias forms, especially in bias by word choice and labeling (see Table 1.1). Thus, we propose a merging method consisting of six sieves for our rule-based merging system (see Fig. 3.1), where each sieve analyzes specific characteristics of two candidates to determine whether the candidates should be merged. Merging sieves 1 and 2 determine the similarity of two candidates, particularly of the “Actor” type, as to their (core) meaning. Sieves 3 and 4 focus on multi-word expressions. Sieves 5 and 6 focus on frequently occurring words common in two candidates.

Sieves and Examples

(1) Representative phrases’ heads: we merge two candidates by determining the similarity of their core meaning (as a simplified example, we would merge “Donald Trump” and “President Trump”).
(2) Sets of phrases’ heads: we determine the similarity as to the meaning of all phrases of two candidates ({Trump, president} and {billionaire}).
(3) Representative labeling phrases: similarity of adjectival labeling phrases. Labeling is an essential property in sentence-level bias forms, especially in bias by word choice and labeling (“illegal immigrants” and “undocumented workers”).
(4) Compounds: similarity of nouns bearing additional meaning to the heads (“DACA recipients” and “DACA applicants”).
(5) Representative wordsets: similarity of frequently occurring words common in two candidates (“United States” and “US”).
(6) Representative frequent phrases: similarity of longer multi-word expressions where the order is important for the meaning (“Deferred Action of Childhood Arrival” and “Childhood Arrivals”).

For each merging sieve \(i\), we define a 9 × 9 comparison matrix \(\text{cmat}_i\) spanned over the nine candidate types listed in Table 4.10. The normalized scalar in each cell \(\text{cmat}_{i,u,v}\) defines whether two candidates of types \(u\) and \(v\) are considered comparable (if \(\text{cmat}_{i,u,v} > 0\)). As described later, for some merging sieves, we also use \(\text{cmat}_{i,u,v}\) as a threshold, i.e., we merge two candidates of types \(u\) and \(v\) if the similarity of both candidates is larger than \(\text{cmat}_{i,u,v}\). We found generally usable default values for the comparison matrices’ cells and the other parameters described in the following through experimentation and domain knowledge (see Sect. 4.3.5). The specific values of all comparison matrices can be found in the source code (Appendix A.4).

We organize candidates in a list sorted by their number of phrases, i.e., mentions in the texts; thus, larger candidates are at the top of the list. In each merging sieve, we compare the first, thus largest, candidate with the second candidate, then the third, etc. If two candidates at comparison meet a specific similarity criterion, we merge the current (smaller) candidate into the first candidate, thereby removing the smaller candidate from the list. Once the pairwise comparison reaches the end of the list for the first candidate, we repeat the procedure for each remaining candidate in the list, e.g., we compare the second (then third, etc.) candidate pairwise with the remainder of the list. Once all candidates have been compared with one another, we proceed with the next merging sieve.
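The size-sorted pairwise merging loop can be sketched as follows. This is a simplified sketch; `sieve_merges` stands in for any of the six sieve-specific similarity tests (an assumed callable), and the toy head-overlap test is illustrative only.

```python
# Sketch of the merging loop applied in each sieve: candidates are sorted by
# size, and smaller candidates are absorbed into larger ones whenever the
# sieve's similarity test fires.

def run_sieve(candidates, sieve_merges):
    """Merge smaller candidates into larger ones whenever the sieve fires."""
    # Largest candidates (most phrases) first.
    cands = sorted(candidates, key=len, reverse=True)
    i = 0
    while i < len(cands):
        j = i + 1
        while j < len(cands):
            if sieve_merges(cands[i], cands[j]):
                cands[i] = cands[i] + cands[j]  # absorb the smaller candidate
                del cands[j]
            else:
                j += 1
        i += 1
    return cands

# Toy similarity test: merge candidates sharing any phrase.
same_head = lambda a, b: bool(set(a) & set(b))
print(run_sieve([["Trump"], ["Trump", "president"], ["caravan"]], same_head))
# -> [['Trump', 'president', 'Trump'], ['caravan']]
```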

As stated previously, the first and second sieves aim to determine the similarity of two candidates as to their (core) meaning. In the first merging sieve, we merge two candidates if the heads of their representative phrases (see the candidate extraction step above) are identical by string comparison. By default, we apply the first merging sieve only to candidates of identical NE-based types, but one can configure the sieve’s comparison matrix \(\text{cmat}_1\) to be less restrictive, e.g., to also allow inter-type comparisons.

In the second merging sieve, we merge two candidates if their sets of phrases’ heads are semantically similar. For each candidate, we create a set H consisting of the heads from all phrases belonging to the candidate. We then vectorize each head within H into the word embedding space of the enhanced word2vec model trained on the GoogleNews corpus (300M words, 300 dimensions) [238], using an implementation that also handles out-of-vocabulary words [280]. We then compute the mean vector \(\overrightarrow {m_{H}}\) for the whole set of head vectors.

Then, to determine whether two candidates \(c_0\) and \(c_1\) are semantically similar, we compute their similarity \(s\left ( c_{0},c_{1} \right ) = \operatorname {cossim}\left ( \overrightarrow {m_{H}},\overrightarrow {n_{H}} \right )\), where \(\overrightarrow {m_{H}}\) is the mean head vector of \(c_0\), \(\overrightarrow {n_{H}}\) the mean head vector of \(c_1\), and \(\operatorname {cossim}(\ldots )\) the cosine similarity function. We merge the candidates if \(c_0\) and \(c_1\) are of the same type, e.g., each represents a person, and if their cosine similarity \(s\left ( c_{0},c_{1} \right ) \geq t_{2,\mathrm {low}} = 0.5\). We also merge candidates that are of different types if we consider them comparable (defined in \(\text{cmat}_2\)), e.g., NEs such as “Trump” with proper nouns (NNP) such as “President,” and if \(s\left ( c_{0},c_{1} \right ) \geq t_{2,\text{high}} = 0.7\). We use a higher, i.e., more restrictive, threshold since the candidates are not of the same type.
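The second sieve's similarity test can be sketched as follows, assuming word vectors are available as a plain dictionary (our method uses an enhanced word2vec model); the toy two-dimensional vectors are illustrative.

```python
# Sketch of the second sieve: compare the mean head vectors of two candidates
# by cosine similarity, with a type-dependent threshold (t2,low = 0.5 within
# a type, t2,high = 0.7 across comparable types). Toy vectors are illustrative.
import numpy as np

def mean_head_vector(heads, vectors):
    return np.mean([vectors[h] for h in heads], axis=0)

def cossim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sieve2_merges(c0, c1, vectors, same_type, t_low=0.5, t_high=0.7):
    s = cossim(mean_head_vector(c0, vectors), mean_head_vector(c1, vectors))
    return s >= (t_low if same_type else t_high)

vecs = {"immigrants": np.array([1.0, 0.2]),
        "aliens": np.array([0.9, 0.4]),
        "border": np.array([0.0, 1.0])}
print(sieve2_merges(["immigrants"], ["aliens"], vecs, same_type=True))  # -> True
print(sieve2_merges(["immigrants"], ["border"], vecs, same_type=True))  # -> False
```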

The third and fourth sieves focus on resolving multi-word expressions, such as “illegal immigrants” and “undocumented workers.” In the third merging sieve, we merge two candidates if their representative labeling phrases are semantically similar. First, we extract all adjective NPs from a candidate containing a noun and one or more labels, i.e., adjectives attributing to the noun. If the NP contains multiple labels, we extract for each label one NP, e.g., “young illegal immigrant” is extracted as “young immigrant” and “illegal immigrant.” Then, we vectorize all NPs of a candidate and cluster them using affinity propagation [91]. To vectorize each NP, we concatenate its words, e.g., “illegal_worker,” and look it up in the embedding space produced by the enhanced word2vec model (see second merging sieve), where frequently occurring phrases were treated as separate words during training [197]. If the concatenated NP is not part of the model, we calculate a mean vector of the vectors of the NP’s words. Each resulting cluster consists of NPs that are similar in meaning. For each cluster within one candidate, we select the single adjective NP with the global most frequent label, i.e., the label that is most frequent among all candidates. This way, selected NPs are the representative labeling phrases of a candidate.

Then, to determine the similarity between two candidates \(c_0\) and \(c_1\) in the third merging sieve, we compute a similarity score matrix \(S\left ( V,W \right )\) spanned by the representative labeling phrases \(v_i \in V\) of \(c_0\) and \(w_j \in W\) of \(c_1\). We look up a type-to-type-specific threshold \(t_{3} = \text{cmat}_{3}\left \lbrack \text{type}\left ( c_{0} \right ) \right \rbrack \lbrack \text{type}(c_{1})\rbrack \), where \(\text{type}\left ( c \right )\) returns the type of candidate \(c\) (see Table 4.10). For each cell \(s_{i,j}\) in \(S\left ( V,W \right )\), we define a three-class similarity score

$$\displaystyle \begin{aligned} s_{i,j} = \begin{cases} 2,& \text{if } \operatorname{cossim}( \overrightarrow{v_{i}},\overrightarrow{w_{j}}) \geq t_{3} + t_{3,r} \\ 1, & \text{if } \operatorname{cossim}( \overrightarrow{v_{i}},\overrightarrow{w_{j}}) \geq t_{3} \\ 0, & \text{otherwise}, \end{cases} \end{aligned} $$

where \(\text{cossim}\left ( \overrightarrow {v_{i}},\overrightarrow {w_{j}} \right )\) is the cosine similarity of both vectors and \(t_{3,r} = 0.2\) rewards highly similar vectors by placing them in the highest similarity class. We found the three-class score to yield better results than using the cosine similarity directly. We merge \(c_0\) and \(c_1\) if \(V \sim W\), i.e., \(\text{sim}\left ( V,W \right ) = \frac {\sum _{s_{i,j} \in S}s_{i,j}}{\left | V \right |\left | W \right |} \geq t_{3,m} = 0.3\). When merging candidates, we merge transitively: for candidates \(U\), \(V\), \(W\) with \(U \sim W\) and \(V \sim W\), we conclude \(U \sim V\) and merge all three candidates into one.
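The three-class scoring and the matrix-level decision can be sketched as follows. For simplicity, we assume the pairwise cosine similarities are precomputed; the matrix values are illustrative.

```python
# Sketch of the third sieve's scoring: map each cosine similarity to a score
# class in {0, 1, 2}, then merge if the mean score over the |V| x |W| matrix
# reaches t3,m = 0.3. The similarity matrix below is illustrative.

def three_class_score(sim, t, t_r=0.2):
    """Map a cosine similarity to the score classes {0, 1, 2}."""
    if sim >= t + t_r:
        return 2
    if sim >= t:
        return 1
    return 0

def set_similarity(sims, t, t_m=0.3):
    """sim(V, W): decide the merge from the mean three-class score."""
    total = sum(three_class_score(s, t) for row in sims for s in row)
    return total / (len(sims) * len(sims[0])) >= t_m

# 2 x 2 matrix of cosine similarities between representative labeling phrases.
sims = [[0.85, 0.10],
        [0.55, 0.20]]
print(set_similarity(sims, t=0.5))  # -> True (scores 2+0+1+0 = 3; 3/4 >= 0.3)
```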

In the fourth merging sieve, we merge two candidates if they contain compounds that are semantically similar. In linguistics, a compound is a word or multi-word expression that consists of more than one stem and that cannot be separated without changing its meaning [209]. We focus only on multi-word compounds, such as “DACA recipient.”

In this sieve, we first analyze the semantic similarity of the stems common in multiple candidates. To do so, we find all words that appear in at least one compound of each candidate at comparison. In each candidate, we then select as its compound phrases all phrases that contain at least one of these words and vectorize the compound phrases into the word embedding space. Then, to determine the similarity of two candidates, we compute a similarity score matrix \(S\left ( V,W \right )\) spanned by all compound phrases \(v_i \in V\) of candidate \(c_0\) and \(w_j \in W\) of \(c_1\), using the same approach we used for the third merging sieve (including merging candidates that are transitively similar). If \(\operatorname {sim}\left ( V,W \right ) \geq t_{4,m}\), we merge both candidates. Else, we proceed with the second merge method.

In the second method, we check for the lexical identity of specific stems in multiple candidates. Specifically, we merge two candidates \(c_0\) and \(c_1\) if there is at least one phrase in \(c_0\) whose head is a dependent in at least one phrase in \(c_1\) and if both candidates are comparable according to \(\text{cmat}_4\). For instance, two candidates are of type person-ne (see Table 4.10), one phrase in \(c_0\) has the headword “Donald,” and one phrase in \(c_1\) is “Donald Trump,” where “Donald” is the dependent word.

The fifth and sixth sieves focus on the special cases of coreferences that are still unresolved, particularly candidates that have frequently occurring words in common, such as “United States” and “US.” In the fifth merging sieve, we merge two candidates if their representative wordsets are semantically similar. To create the representative wordset of a candidate, we perform the following steps. We create frequent itemsets of the words contained in the candidate’s phrases excluding stopwords (we currently use an absolute support supp = 4) and select all maximal frequent itemsets [3]. Note that this merging sieve thus ignores the order of the words within the phrases. To select the most representative wordsets from the maximal frequent itemsets, we introduce a representativeness score

$$\displaystyle \begin{aligned} r(w) = \log(1+l(w)) \log(f(w)), \end{aligned} $$

where \(w\) is the current itemset, \(l(w)\) the number of words in the itemset, and \(f(w)\) the frequency of the itemset in the current candidate. The representativeness score balances two factors: first, the descriptiveness of an itemset, i.e., the more words an itemset contains, the more comprehensively it describes its meaning, and second, its importance, i.e., the more often the itemset occurs in phrases of the candidate, the more relevant it is. We then select as the representative wordsets the \(N\) itemsets with the highest representativeness score, where \(N = \min (6,f_{p}(c))\) and \(f_{p}\left ( c \right )\) is the number of phrases in the candidate. If a word appears in more than \(r_{s5} = 0.9\) of all phrases of a candidate but is not present in the maximal frequent itemsets, we select only \(N - 1\) representative wordsets and add an itemset consisting only of that word. Lastly, we compute the mean vector \(\overrightarrow {v}\) of each representative wordset \(v\) by vectorizing each word in the wordset using the word embedding model introduced in the second merging sieve.
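The representativeness score and the top-\(N\) selection can be sketched as follows; the itemsets and their frequencies below are illustrative.

```python
# Sketch of the representativeness score r(w) = log(1 + l(w)) * log(f(w)) and
# the selection of the N = min(6, f_p(c)) highest-scoring maximal itemsets.
import math

def representativeness(itemset, freq):
    return math.log(1 + len(itemset)) * math.log(freq)

def select_representative(itemsets, n_phrases, cap=6):
    """Pick the N = min(cap, n_phrases) itemsets with the highest score."""
    n = min(cap, n_phrases)
    ranked = sorted(itemsets, key=lambda iw: representativeness(*iw), reverse=True)
    return [iw[0] for iw in ranked[:n]]

# (itemset, frequency) pairs; illustrative values.
itemsets = [({"united", "states"}, 8), ({"united"}, 12), ({"white", "house", "staff"}, 4)]
top = select_representative(itemsets, n_phrases=2)
print(sorted(top[0]))  # -> ['states', 'united']
```

Note how the longer, frequent itemset {"united", "states"} outranks the more frequent singleton {"united"}: the score trades off descriptiveness against raw frequency.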

Then, to determine the similarity of two candidates \(c_0\) and \(c_1\) in the fifth merging sieve, we compute a similarity score matrix \(S\left ( V,W \right )\) spanned by all representative wordsets \(v_i \in V\) of candidate \(c_0\) and \(w_j \in W\) of \(c_1\), constructed analogously to the matrix described in the third merging sieve. We merge \(c_0\) and \(c_1\) if \(\text{sim}\left ( V,W \right ) \geq t_{5} = 0.3\).

In the sixth merging sieve, we merge two candidates if they have similar representative frequent phrases. To determine the most representative wordlists of a candidate, we conceptually follow the procedure from the fifth merging sieve but apply the steps to phrases instead of wordsets. Specifically, the representativeness score of a phrase \(o\) is calculated using Eq. 4.6 with phrase \(o\) instead of itemset \(w\). We then select as the representative frequent phrases the \(N\) phrases with the highest representativeness score, where \(N = \min (6,f_{p}(c))\).

Then, to determine the similarity of two candidates \(c_0\) and \(c_1\) in the sixth merging sieve, we compute a similarity score matrix \(S\left ( V,W \right )\) spanned by all representative wordlists \(v_i \in V\) of candidate \(c_0\) and \(w_j \in W\) of \(c_1\). We look up a type-to-type-specific threshold \(t_{6} = \text{cmat}_{6}\left \lbrack \text{type}\left ( c_{0} \right ) \right \rbrack \lbrack \text{type}(c_{1})\rbrack \). We calculate the similarity score of each cell \(s_{i,j}\) in \(S\left ( V,W \right )\):

$$\displaystyle \begin{aligned} s_{i,j} = \begin{cases} 2,& \text{if } \text{levend}\left( v_{i},\ w_{j} \right) \leq t_{6} - t_{6,r} \\ 1,& \text{if } \text{levend}\left( v_{i},\ w_{j} \right) \leq t_{6} \\ 0, & \text{otherwise,} \end{cases} \end{aligned} $$

where \(\text{levend}\left ( v_{i},\ w_{j} \right )\) is the normalized Levenshtein distance [204, 226] of both phrases and \(t_{6,r} = 0.2\). Then, we find the maximum normalized sum of similarity scores over all rows, \(\text{sim}_{\text{hor}}\), and likewise over all columns, \(\text{sim}_{\text{vert}}\):

$$\displaystyle \begin{aligned} \text{sim}_{\text{hor}} = \max_{i}{\frac{\sum_{j}{s_{i,j}}}{\left| W \right|}}, \qquad \text{sim}_{\text{vert}} = \max_{j}{\frac{\sum_{i}{s_{i,j}}}{\left| V \right|}}. \end{aligned} $$
We calculate a similarity score for the matrix:

$$\displaystyle \begin{aligned} \text{simval}(V,W) = \begin{cases} \text{sim}_{\text{hor}}, & \text{if } \text{sim}_{\text{hor}} \geq \ \text{sim}_{\text{vert}}\ \land \ \left| W \right| > 1 \\ \text{sim}_{\text{vert}}, & \text{else if } |V|>1 \\ 0, & \text{otherwise.} \end{cases} \end{aligned} $$

Finally, we merge candidates \(c_0\) and \(c_1\) if \(\text{simval}(V,W) \geq t_{6,m} = 0.5\).
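The distance-based cell scoring of this sieve can be sketched as follows, using a plain dynamic-programming implementation of the normalized Levenshtein distance; the threshold values follow the text, while the example phrases are illustrative.

```python
# Sketch of the sixth sieve's cell scoring: a normalized Levenshtein distance
# mapped to the score classes {0, 1, 2}, with t6,r = 0.2 rewarding very close
# phrases. The example strings are illustrative.

def levend(a, b):
    """Normalized Levenshtein distance between two strings (0 = identical)."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 0.0
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (a[i - 1] != b[j - 1]))  # substitution
    return d[n] / max(m, n)

def cell_score(v, w, t6, t6_r=0.2):
    dist = levend(v, w)
    if dist <= t6 - t6_r:
        return 2
    if dist <= t6:
        return 1
    return 0

print(round(levend("Childhood Arrivals", "Childhood Arrival"), 2))  # -> 0.06
print(cell_score("US", "US", t6=0.4))  # -> 2
```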

Using the previously outlined series of six sieves, our method merges those candidates that CoreNLP’s coreference resolution used in the preprocessing step did not identify as coreferential. In practical terms, our method relies on CoreNLP as an established method for single-document coreference resolution and uses the six sieves to enhance CoreNLP’s results in two ways. First, our method merges the coreferences found in single documents across multiple documents. Second, it additionally merges highly context-dependent coreferences as they occur frequently in the sentence-level bias forms that our PFA approach seeks to identify.

4.3.4 Evaluation

To measure the effectiveness of context-driven cross-document coreference resolution in the context of our overall analysis for the automated identification and communication of media bias, we perform an in-depth evaluation of the approach in this section. After presenting the evaluation results, we discuss the strengths and weaknesses of our approach, from which we derive future research directions in Sect. 4.3.5.

Setup and Metrics

We evaluate our method and all baselines on the events and articles contained in the NewsWCL50 dataset (Sect. 4.3.2). Similar to prior work, we evaluate only the coreference resolution performance, not the extraction performance [18]. Thus, we do not automatically extract the mentions for coreference resolution but pass the set of all true mentions as annotated in NewsWCL50 to the evaluated methods.

Our primary evaluation metric is weighted macro F1 (F1m), where we weight the F1 score of each candidate (as automatically resolved) by the number of samples from its true class (as annotated in our dataset). Secondary metrics are precision (P) and recall (R). We generally prefer recall over precision within the secondary metrics since CoreNLP is prone to yield many small or even singleton coreference chains on our dataset. CoreNLP thus achieves very high precision scores, while at the same time, the larger coreference chains miss many mentions, i.e., those mentions that are part of the singleton chains. To measure the metrics, we compare resolved candidates, i.e., coreference chains, with manually annotated target concepts, e.g., “USA/Donald Trump.” For each target concept annotated in NewsWCL50, we find the best matching candidate extracted automatically (cf. [226]), i.e., the candidate whose phrases yield the largest overlap to the mentions of the target concept. To account for the subjectivity of the coding task in the content analysis, particularly when coding abstract target concepts (Sect. 4.3.2), we allow in our evaluation multiple true labels to be assigned to the candidates, i.e., a predicted candidate can have multiple true annotated target concepts.
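The matching between automatically resolved candidates and annotated target concepts can be sketched as follows: each target concept is matched to the candidate whose mentions overlap it most. The mention identifiers below are illustrative, not from NewsWCL50.

```python
# Sketch of the candidate-to-concept matching used in the evaluation: for an
# annotated target concept, find the resolved candidate (coreference chain)
# with the largest mention overlap. Mention IDs are illustrative.

def best_match(concept_mentions, candidates):
    """Return the index of the candidate with the largest mention overlap."""
    overlaps = [len(set(concept_mentions) & set(c)) for c in candidates]
    return max(range(len(candidates)), key=overlaps.__getitem__)

concept = ["m1", "m2", "m3"]                       # annotated target concept
candidates = [["m1", "m9"], ["m2", "m3", "m4"], ["m7"]]  # resolved chains
print(best_match(concept, candidates))  # -> 1
```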

We report and discuss the performances of the evaluated methods for all concept types. However, we use the “Actor” type as the primary concept type since the person-oriented framing approach focuses on the analysis of individual persons (Sect. 3.3.2).

Baselines

We compare our approach with three baselines, which we describe briefly in the following.

EECDCR represents the state of the art in cross-document coreference resolution. The authors reported the highest evaluation results [18] compared to Kenyon-Dean, Cheung, and Precup [170], Lee et al. [199], and NLP Architect [158]. EECDCR resolves event and entity mentions jointly. To reproduce EECDCR’s performance, we use the model’s full set of optional features: semantic role labeling (SRL), ELMo word embeddings [284], and dependency relations. Since we could not set up SwiRL, the SRL parser used originally, we used the most recent SRL method by AllenNLP [323]. To resolve intra-document mentions, we use CoreNLP [56]. We use default parameters for the model inference.

Two further baselines represent the state of the art in single-document coreference resolution; both employ CoreNLP [56, 57]. Since CoreNLP is designed for single-document coreference resolution, each baseline uses a different adaptation to make CoreNLP suitable for the cross-document task. The baseline CoreNLP-merge creates a virtual document by appending all documents and then performs CoreNLP’s coreference resolution on this virtual document. The baseline CoreNLP-cluster employs CoreNLP’s coreference resolution on each document individually and then clusters the mentions of all coreference chains in the word2vec space [197] using affinity propagation [91]. Each resulting cluster of phrases yields one candidate (cf. [43]), i.e., coreference chain.

Results

Table 4.11 shows the CDCR performance of the evaluated methods. Our method achieved the highest performance concerning our primary metric (F1m = 59.0). The CoreNLP baselines yield worse performance (at best, F1m = 53.2). Our method also performed slightly better than EECDCR (F1m = 57.8). We found that our split preprocessing tackled CoreNLP’s merging issue, where large chains are merged incorrectly, as described in Sect. 4.3.3. Specifically, our split preprocessing improved the F1m performance from at best 57.0 to 59.0. The effect of our split preprocessing can be seen when comparing the recall scores after preprocessing (standard preprocessing, R = 15.8; split preprocessing, R = 34.2). Since our method performed better with split preprocessing, we refer to this variant in the remainder of this section if not noted otherwise.

Table 4.11 Performance of the context-driven cross-document coreference resolution method and baselines on NewsWCL50. Best-performing approaches are highlighted

Table 4.12 shows the performance achieved on the individual concept types. Importantly, our method achieved the highest performance on the “Actor” type (F1m = 88.7 compared to the best baseline’s F1m = 81.9). The high “Actor” performance allows for using our method in target concept analysis, since person-oriented framing analysis focuses on individual persons, i.e., as part of the “Actor” type. Further, our method performed better than the baselines on most of the other concept types: “Action,” “Group,” “Misc,” and “Object.” On the “Event” type, EECDCR performed best (F1m = 63.6 compared to our F1m = 60.8), which is expected since the method was specifically designed for event-centric coreference resolution.

Table 4.12 Performance of the context-driven cross-document coreference resolution method and baselines for each concept type. Macro F1 is shown for each of the three baselines: CoreNLP-merge, CoreNLP-cluster, and EECDCR

In general, we found that our method and the baselines performed best on concepts that consist mainly of NPs and that are narrowly defined. In contrast, our method performed worse on concepts that (1) consist mainly of VPs, (2) are broadly defined, or (3) are abstract. Our method achieved a low macro F1 on the “Action” type (F1m = 39.7), whose candidates consist mainly of VPs. The concept type “Actor-I” is very broadly defined in our codebook and yields the lowest performance (F1m = 36.4). One reason for the low performance is that in the content analysis, different individuals were subsumed under one Actor-I concept to increase annotation speed (Sect. 4.3.2). We propose means to address this issue in Sect. 4.3.5. The extraction of candidates of the type “Misc” is, as expected, challenging (F1m = 43.2), since by definition its concepts are mostly abstract or complex. For example, the concept “Reaction to IRN deal” (event #9) contains both actual and possible future (re)actions to the event (the “Iran deal”) as well as assessments and other statements by persons regarding the event.

Table 4.13 shows the performance of our method on the individual events of NewsWCL50. The approach performed best on events #1, #4, and #9 (F1m,1 = 68.2) and worst on events #6, #7, and #3 (F1m,6 = 46.2). When investigating the events’ compositions of concept types, we found that higher performances were generally achieved for events that consist mainly of NPs; e.g., 44.1% of all mentions in event #1 are of type “Actor.” In contrast, events with lower performance typically contain a higher number of broadly defined concepts or “Action” concepts.

Table 4.13 Performance of the context-driven cross-document coreference resolution method for each event

We also found that our approach was able to extract and merge unknown concepts, i.e., concepts that are not contained in the word embeddings used during the candidate merging process [197, 280]. For example, when the GoogleNews corpus was published in 2013 [238], many concepts, such as “US President Trump” or “Denuclearization,” did not exist yet or had a different, typically more general, meaning than in 2018. Yet, the approach was able to correctly merge phrases with similar meanings, e.g., in event #2, the target concept “Peace” contains among others “a long-term detente,” “denuclearization,” and “peace.” In event #6, the approach was able to resolve, for example, “many immigrants,” “the caravan,” “the group marching toward the border,” “families,” “refugees,” “asylum seekers,” and “unauthorized immigrants.” In event #1, the approach resolved, among others, “allegations,” “the infamous Steele dossier,” “the salacious dossier,” and “unsubstantiated allegations.”

Table 4.11 shows that using only sieves 1 and 2 achieved the highest performance over all concept types (F1m = 59.0). However, the subsequent sieves further improved recall (from R = 53.2 after sieve 2 to R = 56.1 after sieve 6) while only slightly reducing F1m to 58.5. We thus recommend generally running all sieves. In the context of person-oriented framing analysis (PFA), however, we recommend using only sieves 1 and 2. Table 4.14 shows that these two sieves suffice to achieve the best possible performance for the “Actor” type (F1m = 88.8). This is expected since sieves 1 and 2 focus specifically on resolving mentions of the “Actor” type.

Table 4.14 Performance of the context-driven cross-document coreference resolution method evaluated only on the “Actor” type

In sum, the results of the evaluation showed an improved performance of our method in resolving highly context-dependent coreferences compared to the state of the art, especially on the Actor concept type, which is most relevant for PFA.

4.3.5 Future Work

When devising our method, we focused on using it only as part of person-oriented framing analysis (PFA). In this use case, i.e., resolving individual persons (concept type: “Actor”) in event coverage, our method outperformed the state of the art in coreference resolution (F1m = 88.7 compared to 81.9). Moreover, the evaluation showed that our method achieved competitive or higher performance compared to current methods for cross-document coreference resolution when evaluated on all concept types or individual concept types. However, our evaluation cannot elucidate how effective our method is in other domains and applications. We seek to address this by creating a larger dataset, for which we plan to implement and validate minor improvements in the codebook; e.g., infrequent individuals are currently coded jointly into a single “[Actor]-I” target concept. While such coding requires less effort, it also negatively skews the measured evaluation performance (see Sect. 4.3.4). One idea is to either not code infrequent target concepts at all or to code each as a separate concept.

To further strengthen the evaluation of the coreference resolution task, we plan to test our method on established datasets for coreference resolution (Sect. 4.3.1). Doing so will help to investigate how our method performs in other use cases and domains. Albeit standard in the evaluation practice of coreference resolution, evaluating methods only on true mentions means that the practical performance of such methods may differ strongly. We expect the performance of a CDCR method to be lower in real-world use cases, where true mentions, e.g., as annotated in a ground truth, are almost never available. Instead, mentions typically have to be extracted automatically. We thus propose a future evaluation that uses automatically extracted mentions and compares pure coreference resolution performance with the performance achieved in settings resembling real-world use.

A larger dataset would also enable the training of recent deep learning models, which we expect could achieve higher performance. Recent models for single-document coreference resolution achieve increased performance compared to earlier, traditional methods [390]. A key issue that prevented us from creating a sufficiently large dataset is the tremendous cost required for its annotation (see Sect. 4.3).

4.3.6 Conclusion

The previous sections introduced our method for context-driven cross-document coreference resolution. Our method is the first to find and resolve coreferences as they are relevant for sentence-level bias forms. When evaluated on our dataset for the PFA use case, our method outperformed the state of the art (on the concept type “Actor,” our method achieved F1m = 88.7 compared to 81.9). When considering all concept types, our method performed slightly better than the state of the art (F1m = 59.0 compared to 57.8).

As noted in Sect. 4.3.5, our use case-specific evaluation can only serve as a first indicator for the general coreference resolution performance in other use cases and on other text domains. Moreover, coreference resolution is a broad and complex field of research, with a diverse spectrum of sub-tasks and use cases. Our thesis can thus only contribute a first step toward the task of context-driven cross-document coreference resolution. Despite that, because of the high performance of our method on the “Actor” type as evaluated on the NewsWCL50 dataset, we conclude that the method is an effective means for use within the PFA approach.

NewsWCL50 and its codebook are available at


4.4 Summary of the Chapter

This chapter introduced target concept analysis as the first analysis component in person-oriented framing analysis (PFA). Target concept analysis aims to find and resolve the mentions of persons across the given news articles. This task is of particular importance and difficulty in slanted coverage, for example, due to the divergent word choice across differently slanted news articles. We explored two approaches to enable target concept analysis: event extraction and coreference resolution.

Our approach for event extraction (Sect. 4.2) achieved overall high performance in extracting answers for the journalistic 5W1H questions, which describe an article’s main event. However, additional research effort would be required to employ this approach in target concept analysis. First, we would need to extend the approach to extract not only the single main event of each article but also multiple side events. The latter is crucial for PFA, which requires multiple mentions and persons to identify the overall frame of a news article. In contrast, relying on only a single characteristic of each article, e.g., the actor of the single main event, would be expected to lower the reliability of the frame identification. Second, for the event extraction to be used in PFA, we would need to devise a second method for resolving the events and their phrases across the event coverage. Besides these two shortcomings concerning its use in PFA, our event extraction approach represents a universally usable means for event extraction, achieves high extraction performance, and is publicly available (Sect. 4.2.5).

Due to these two shortcomings of the event extraction line of research, we decided to focus on a second line of research. The main contribution of this chapter is the first method to perform context-driven cross-document coreference resolution (Sect. 4.3). In the evaluation, our method resolved highly context-dependent mentions of persons and other concept types as they commonly occur in person-targeting forms of media bias. When considering mentions of individual persons as relevant for PFA, our method achieved a high performance (F1m = 88.7) and outperformed the state of the art (F1m = 81.9). We will thus use our method for coreference resolution in the target concept analysis component.