5.1 Introduction

This chapter describes the frame analysis component, which aims to identify how persons are portrayed in the given news articles, both at the article and sentence levels. This task is difficult for various reasons, one being that news articles frame persons rather implicitly or indirectly, for example, by describing actions performed by a person. Consequently, a high level of interpretation is necessary to identify how news articles portray persons. Because of this and other issues highlighted later in this chapter, prior approaches to analyze frames or their derivatives yield inconclusive or superficial results or require high manual effort, e.g., to create large annotated datasets.

Frame analysis is the second and last analysis component in person-oriented framing analysis (PFA). The input to frame analysis is a set of news articles reporting on the same policy event, including the persons involved in the event and their mentions across all articles. The output of the frame analysis component should be information concerning how each person is portrayed and groups of articles similarly framing the persons involved in the event.

An approach in principle suitable for the frame analysis component is identifying political frames as defined by Entman [79]. Doing so would most closely approximate the content analysis, particularly the frame analysis, as conducted in social science research on media bias. However, as pointed out in Sect. 3.3.2, taking such an approach would result in infeasibly high annotation cost. Further, political frames are defined for a specific topic or analysis question [79], whereas PFA is meant to analyze bias in news coverage on any policy event. Thus, identifying political frames is out of the thesis’s scope and currently also methodologically infeasible. We revisit this design decision later in the chapter and in our discussion of the thesis’s limitations and future work ideas in Sect. 7.3.2.

In this chapter, we explore two conceptually different approaches to determine how individual persons are portrayed. The first approach more closely resembles how researchers in the social sciences analyze framing. It aims to identify categories representing topic-independent framing effects, which we call frame properties, such as whether a person is portrayed as being competent, wise, powerful, or trustworthy (or not). The second approach follows a more pragmatic route to the task of the frame analysis component, which we devise to address fundamental issues of the first approach, such as high annotation cost, high annotation difficulty, and low classification performance. Both approaches have in common that they do not analyze frames, which would be the standard procedure in the social sciences, but instead analyze categorized effects of framing. We focus on framing effects since frames as analyzed in social science research on media bias are topic-specific. In contrast, our approach is meant to analyze news coverage on any policy issue (see also Sects. 1.3 and 3.3.2).

Note that the frame analysis component consists of an additional second task, i.e., frame clustering (Fig. 3.1). Once the frame analysis has identified how each person is portrayed at both the article and sentence levels, frame clustering aims to find groups of articles that frame the persons similarly. For frame clustering, we use a simple technique that we will describe when introducing our prototype system (Chap. 6).

5.2 Exploring Person-Targeting Framing(-Effects) in News Articles

This section presents the results of a research direction we initially pursued for the frame analysis component. We explore a simple approach that aims to identify so-called frame properties, which are fine-grained constituents of how a person is portrayed at the sentence level, e.g., whether a person is shown as competent or powerful. Frame properties resemble framing effects in social science research. We propose frame properties as pre-defined categorical characteristics that might be attributed to a target person due to one or more frames. For example, in a sentence that frames immigrants as intruders that might harm a country’s culture and economy (rather than victims that need protection, cf. [369]), respective frame properties of the mentioned immigrants could be “dangerous” and “aggressive.”

The remainder of this section is structured as follows. Section 5.2.1 briefly summarizes prior work on automated frame analysis. We then present our exploratory approach for frame property identification in Sect. 5.2.2. Afterward, we discuss the results of an exploratory, qualitative analysis in Sect. 5.2.3. Section 5.2.4 highlights the shortcomings and difficulties of this approach and discusses how to address or avoid them. Specifically, our approach achieved only mixed results in the evaluation, but we use the identified issues to derive our main approach for the frame analysis component, which is then described in Sect. 5.3. Lastly, Sect. 5.2.5 provides a brief summary of our exploratory research on frame property identification.

The dataset used for the approach and its codebook are available at

https://github.com/fhamborg/NewsWCL50.

5.2.1 Related Work

This section briefly summarizes key findings of our literature review concerning the analysis of political framing. An in-depth discussion of the following and other related approaches can be found in Chap. 2.

To analyze how persons (or other semantic concepts) are portrayed, i.e., framed, researchers from the social sciences primarily employ manual content analyses and frame analyses (see Sect. 2.3.4). In content analyses focusing on the person-targeting bias forms (Sect. 3.3.2), social scientists typically analyze how news articles frame individual persons or groups of persons, for example, whether there is a systematic tendency in coverage to portray immigrants in a certain way, such as being aggressive or helpless [263]. Observing these tendencies may then yield specific frames, such as the “intruder” or “victim” frames mentioned at the beginning of Sect. 5.2. Other analyses not concerned with persons but, e.g., with topics may focus on whether news outlets use emotional or factual language when reporting on a specific topic or which topical aspects of an issue are highlighted in coverage [274].

In sum, in our literature review on identifying media bias, we find that no automated system focuses on the analysis of person-targeting bias forms at the sentence level (see Sects. 2.3.2–2.3.4). However, two prior works are of special interest. First, the research conducted for the creation of the media frame corpus (MFC) aims to directly represent political frames [80] as established in the social sciences [45]. In contrast to political frames, the MFC’s “frame types” are topic-independent and thus in principle highly relevant for our task. However, from a conceptual perspective, the MFC’s frame types are independent of any target, i.e., they holistically describe the content or “frame” of a news article or a sentence within it. Moreover, this approach suffers from high annotation cost and low inter-coder reliability (ICR) [45]. As a consequence, classifiers trained on the MFC yield low classification accuracy on the sentence level [46].

Second, a recent approach aims to automatically extract so-called microframes from a set of text documents, e.g., news articles [193]. Given a set of text documents, the respective microframes are defined as semantic axes that are over-represented in individual documents. Each such semantic axis is a bi-polar adjective pair as used in semantic differential scales established in psychology [145]. Since the microframes are extracted for a given set of documents, they are topic-dependent or, more specifically, dataset-dependent. For example, for a topic and corresponding documents on immigration, the adjective pair of one microframe could be “illegal”-“legal.” After the extraction of these microframes for a given dataset, users then review the microframes and select a subset of them for further analysis. The approach yields qualitative microframes that closely resemble our frame properties but are, in contrast to them, dataset-specific.

Most of the other reviewed approaches that are only partially related to our task use quantitative methodology, and their results are mostly superficial, especially when compared to the results of manual content analyses. For example, one approach investigates the frequency of affective words close to user-defined words [116], e.g., names of politicians. Another approach aims to find bias words by employing IDF [211].

Another field that is relevant for the task of determining how persons are portrayed is sentiment analysis and more specifically target-dependent sentiment classification [212]. However, researchers have questioned whether the one-dimensional polarity scale of sentiment classification suffices to capture the actual fine-grained effects of framing (see Sect. 2.3.4). We will investigate sentiment classification and this question in Sect. 5.3.

5.2.2 Method

Given a person mention and its context, the objective of our method is to determine which frame properties the context and the mention yield regarding the person. Specifically, our simple method looks for words that express one or more frame properties on the person mention. Afterward, for each target person, we aggregate the frame properties from all of the person’s mentions in the given news article, i.e., from the sentence level to the article level.

The idea of our approach is to extend the one-dimensional polarity scale, i.e., positive, neutral, and negative, established in traditional sentiment classification with further classes, i.e., fine-grained properties, such as competent, powerful, and antonyms thereof. We call these fine-grained properties frame properties. Conceptually, some frame properties can be subsumed using sentiment polarity, but they also extend the characteristics that can be represented using polarity. For example, while affection and refusal can be represented as more specific forms of positive and negative portrayal, respectively, other frame properties cannot be projected into the positive-negative scale. Such frame properties include, for example, (being portrayed as an) aggressor versus a victim, where the victim has neither positive nor negative sentiment polarity (the aggressor, though, has clearly negative polarity). This way, our approach seeks to overcome the shortcomings of the one-dimensional polarity scale used in sentiment classification (see Sect. 2.3.4).

Using a series of inductive and deductive small-scale content analyses, we devised in total 29 frame properties, of which 13 are bi-polar pairs and 3 have no antonym. Specifically, to derive the final set of frame properties, we initially asked coders to annotate any phrase that they felt was influencing their assessment of a person or other semantic concepts mentioned and also to state which perception, judgment, or feeling the phrase caused in them. We then used these initial open statements to derive a set of frame properties, which represented how the annotators felt a target was portrayed. Table 5.1 shows the final set of frame properties.

Table 5.1 Frame properties in NewsWCL50. Parentheses in the first column show the name of the respective antonym, if any

We refined these initial frame properties in an iterative process following best practices from the social sciences until three goals or factors were achieved or maximized. First, an acceptable inter-coder reliability was reached. Second, the set of frame properties covers a broad spectrum of person characteristics highlighted by “any” news coverage while, third, still being as specific as possible. The second and third goals aim to achieve a balance between being topic-independent and generic while preferring specific categories, such as “competent,” over general categories, such as “positive.” We achieved these goals after conducting six test iterations, each consisting of reviewing previous annotation results, refining the codebook and frame properties, and discussing the changes with the annotators. We reached an acceptable inter-coder reliability of 0.65 (calculated as mean pairwise average agreement). The comparatively low inter-coder reliability indicates the complexity and difficulty of the task (cf. [45]). Further information on the training prior to the main annotation is described in Sect. 4.3.2.2.

Finally, we conducted a deductive content analysis on 50 news articles reporting on policy issues using the codebook created during the training annotations. In the main annotation, 5926 mentions of persons and other target concepts were annotated. Further, 2730 phrases that each induce at least one frame property were annotated. For each frame property phrase, additionally, the corresponding target concept had to be assigned. For example, in “Russia seizes Ukrainian naval ships,” “Russia” would be annotated as a target concept of type “country,” and “seizes” as a frame property phrase of type “Aggressor” that targets “Russia.” Each mention of a target concept in a text segment can be targeted by multiple frame property phrases. Further information on the main annotation is described in Sect. 4.3.2.3.

After the annotation and in a one-time process, we manually defined a set of seed words for each of the frame properties \(S_k \in S\). For each frame property \(S_k\), we gathered seed words by carefully selecting common synonyms from a dictionary [233], e.g., for the frame property “affection,” we selected the seed words: attachment, devotion, fondness, love, and passion.

For each news article passed to the frame analysis component, our method performs the following procedure. First, to identify frame property words, the method iterates all words in the given news article and determines for each word its semantic similarity to each of the frame properties. Specifically, we calculate the cosine similarity of the current word \(w\) and each seed word \(s \in S_k\) of the current frame property \(S_k\) in a word embedding space [330]. We define the semantic similarity

$$ \mathrm{sim}(w, S_k) = \mathrm{cossim}(\vec{w}, \vec{s}). \tag{5.1} $$

We assign to a word \(w\) any frame property \(S_k\) where \(\mathrm{sim}(w, S_k) > t_p = 0.4\). At the end of this procedure, each word has a set of weighted frame properties. The weight of a frame property on a word is defined by \(\mathrm{sim}(w, S_k)\).
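To make this step concrete, the following is a minimal sketch of Eq. 5.1 and the thresholding. It assumes pre-trained word vectors are available as a plain Python dictionary (filled here with toy random vectors) and takes the maximum seed-word similarity as a property’s weight, which is one plausible reading of the equation; the names and data structures are illustrative, not the thesis’s implementation.

```python
import numpy as np

def cossim(a, b):
    """Cosine similarity of two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def frame_properties_for_word(word, embeddings, seed_words, t_p=0.4):
    """Return {frame property: weight} for all properties with sim(w, S_k) > t_p.

    `embeddings` maps tokens to vectors; `seed_words` maps each frame property
    to its manually selected seed words. The weight is the maximum cosine
    similarity to any seed word (one plausible reading of Eq. 5.1).
    """
    if word not in embeddings:
        return {}
    weighted = {}
    for prop, seeds in seed_words.items():
        sims = [cossim(embeddings[word], embeddings[s]) for s in seeds if s in embeddings]
        if sims and max(sims) > t_p:
            weighted[prop] = max(sims)
    return weighted

# Toy usage with random vectors (real word embeddings would be loaded instead).
rng = np.random.default_rng(0)
vocab = ["adores", "attachment", "devotion", "fondness", "love", "passion"]
embeddings = {w: rng.normal(size=50) for w in vocab}
seed_words = {"affection": ["attachment", "devotion", "fondness", "love", "passion"]}
print(frame_properties_for_word("adores", embeddings, seed_words))
```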

Second, for each target person \(c_i\), we aggregate frame properties \(S_k \in S\) from all modifiers \(w_j\) of \(c_i\) found by dependency parsing [6]. We use manually devised rules to handle the different types of relations between head \(c_i\) and modifier \(w_j\), e.g., to assign the frame properties of an attribute (modifier) to its noun (target person mention) or a predeterminer (modifier) to its head (target person mention).

Given a news article and a set of persons or other semantic concepts with one or more mentions, the output of the proposed method is as follows. For each mention, the method determines a set of weighted frame properties, yielded by the sentence of the mention. Further, for each semantic concept, the method returns a set of weighted frame properties by aggregating them from mention level to article level.
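As an illustration of the dependency-based aggregation, the sketch below uses spaCy’s parser and a small, illustrative subset of relation rules; the actual rule set is manually devised and more elaborate, and `word_to_props` stands in for the word-level assignment sketched above.

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

# Dependency relations whose frame properties are propagated to the head;
# an illustrative subset, not the full manually devised rule set.
MODIFIER_DEPS = {"amod", "appos", "acl", "advmod", "predet"}

def mention_level_properties(doc, target_indices, word_to_props):
    """Aggregate weighted frame properties from modifiers onto a target mention.

    `target_indices` are the token indices of the person mention in `doc`;
    `word_to_props` maps a word to {frame property: weight}, e.g., the
    seed-similarity function sketched above.
    """
    scores = defaultdict(float)
    for tok in doc:
        if tok.head.i in target_indices and tok.dep_ in MODIFIER_DEPS:
            for prop, weight in word_to_props(tok.text.lower()).items():
                scores[prop] += weight
    return dict(scores)

def article_level_properties(mention_scores):
    """Sum mention-level scores to the article level (one plausible aggregation)."""
    article = defaultdict(float)
    for scores in mention_scores:
        for prop, weight in scores.items():
            article[prop] += weight
    return dict(article)

# Toy usage: "candidate" (token index 2) is the target person mention.
toy_props = {"incompetent": {"incompetence": 0.8}}
doc = nlp("The incompetent candidate avoided all questions.")
print(mention_level_properties(doc, {2}, lambda w: toy_props.get(w, {})))
# e.g., {'incompetence': 0.8}
```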

5.2.3 Exploratory Evaluation

We discuss the usability of this exploratory approach for determining frame properties in a set of news articles reporting on the same event in two use cases. Due to the low inter-coder reliability of the frame property annotations (see Sect. 5.2.2), we expected low classification performance of our approach. Thus, we did not conduct a quantitative evaluation but instead qualitatively investigated the approach to derive future research ideas [46]. In contrast to the research objective of this thesis, i.e., identifying and communicating biases targeting persons, in this investigation we also considered semantic concepts of the type “group of persons.” This allows us to better demonstrate and discuss the results of the method.

In the first use case, we investigated the frame properties of persons and other semantic concepts in an event, where the DNC, a part of the Democratic Party in the USA, sued Russia and associates of Trump’s presidential campaign in 2018 (see event #3 in Table 4.9). Table 5.2 shows exemplary frame properties of the three main actors involved in the event, Donald Trump, the Democratic Party, and the Russian Federation, each being a different concept type (shown in parentheses in Table 5.2). The first column shows each candidate’s representative phrase (see Sect. 4.3.3.1). The linearly normalized scores s(c, a, f) in the three exemplary frame property columns represent how strongly each article a (row) portrays a frame property f regarding a candidate c: s = 1 or − 1 indicates the maximum presence of the property or its antonym, respectively. A value of 0 indicates the absence of the property or equal presence of the property and its antonym.
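The exact linear normalization is not spelled out here; the following is a rough sketch, under the assumption that s(c, a, f) is the signed difference between the aggregated weights of a property and its antonym, scaled by the largest absolute difference observed across the articles of the event.

```python
def normalized_property_scores(weights_prop, weights_antonym):
    """Sketch of s(c, a, f) in [-1, 1] for one candidate c and property f.

    `weights_prop[a]` and `weights_antonym[a]` are the aggregated article-level
    weights of the property and its antonym in article a (assumed inputs).
    """
    diffs = {a: weights_prop.get(a, 0.0) - weights_antonym.get(a, 0.0)
             for a in set(weights_prop) | set(weights_antonym)}
    max_abs = max((abs(d) for d in diffs.values()), default=0.0)
    return {a: (d / max_abs if max_abs else 0.0) for a, d in diffs.items()}

# e.g., scores per outlet/article for one candidate and one frame property
print(normalized_property_scores({"LL": 2.5, "R": 1.0}, {"LL": 0.0, "R": 0.15}))
# {'LL': 1.0, 'R': 0.34}
```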

Table 5.2 Excerpt of exemplary frame properties as determined automatically in the first use case

Left-wing outlets (LL and L) more strongly ascribe the property “aggressor” to Trump, e.g., s(Trump, LL, aggressor) = 1, than right-wing outlets do, for example, s(Trump, R, aggressor) = 0.34. This is consistent with the findings of manual analyses of news coverage of left- versus right-wing outlets regarding Republicans [71, 116, 117]. The Democratic Party is portrayed in all outlets as rather aggressive (s ∈ [0.91, 1]), which can be expected due to the nature of the event, since the DNC sued various political actors.

The difficulty of frame property classification is visible in other frame properties that yielded inconclusive trends, such as “reason.” We found that an increased level of abstractness is the main cause for lower frame identification performance (cf. [45, 211, 276]). For example, in the content analysis (see Sect. 4.3.2), we noticed that “reason” was often not induced by single words but rather more abstractly through actions that were assessed as reasonable by the human coders.

In the second use case, we investigated frame properties in an event where special counsel Mueller provided a list of questions to Trump in 2018 (see event #8 in Table 4.9). Table 5.3 shows selected frame properties of the two main actors involved in the event: Trump and Mueller. Since both are individual persons, their semantic concept type is “Actor.” The results of our method indicated that the reviewed left-wing outlets ascribe more positive frame properties to Mueller, e.g., s(Mueller, LL, confidence) = 1, than right-wing outlets do, s(…, RR, …) = 0. For Trump, we identified the opposite, e.g., s(Trump, LL, trustworthiness) = −0.19 and s(…, RR, …) = 1. Left-wing news outlets even ascribe non-trustworthiness to Trump, e.g., s(Trump, LL, trustworthiness) = −0.93. Besides these expected patterns, other frame properties again showed inconclusive trends, such as power.

Table 5.3 Excerpt of exemplary frame properties as determined automatically in the second use case

Due to the difficulty of automatically estimating frames (see Sect. 2.5), the identification of frame properties ascribed to persons and other semantic concepts did not always yield clear or expected patterns. We found this to be especially true for abstract or implicitly ascribed frame properties. For example, we could not find clear patterns for the frame properties “reason” in the first use case (Table 5.2) and “power” in the second use case (Table 5.3), which is mainly due to the abstract language used to portray a person as being powerful or reasonable.

5.2.4 Future Work

In our exploratory evaluation, we found that our basic approach yielded trends concerning frame properties that are in part as expected and in part inconclusive. We see two main causes for the inconclusive results: first, the simplicity of the approach, and second, the general difficulty of annotating or determining frames and, in our case, the frames’ effects [45].

To address the first issue, we propose to improve the approach using more sophisticated techniques, such as deep learning and recent language models. Fundamentally, our current approach is word-based and may often fail to catch the “meaning between the lines” (see Sect. 2.2.5). This is in contradiction to the substantial character of frames [80], which typically requires a higher degree of interpretation and is one key reason for the comparably “superficial” results of many automated approaches to date compared to manual content analyses (see Sect. 2.5). For example, determining implicitly ascribed frame properties, such as “reason,” requires a high degree of interpretation since typically a news article would not state that a person acted reasonably; instead, this conclusion would be drawn by news consumers after reading one or more sentences describing, for example, actions that portray the person in a specific way. One idea to improve the classification performance is to use deep language models, such as RoBERTa, pre-trained on large amounts of text, including news articles [214]. RoBERTa and other language models have significantly improved natural language understanding capabilities across many tasks [373]. Given these advancements, we expect that such language models can also reliably determine complex and implicitly ascribed frame properties. However, fine-tuning language models, especially for multi-label classification tasks that require a high degree of interpretation, as for frames and frame properties, also requires the creation of very large datasets (cf. [46]).

The second cause is the difficulty of annotating and classifying frames or, in our case more specifically, the effects of framing, i.e., frame properties. As our comparatively low inter-coder reliability (see Sect. 5.2.2) and prior work indicate [45], the annotation of frames and frame properties is highly complex, and some “degree of subjectivity in framing analysis [is] unavoidable” [45]. In our view, the most effective idea to address the high annotation difficulty is to reduce the number of frame properties that are to be annotated. Other commonly used means to tackle the subjectivity, such as performing more training iterations, might be less effective since we already conducted as many iterations as we could to improve the inter-coder reliability. Another promising idea is to determine frame properties on a much larger set of news articles. While our exploratory evaluation showed in principle the expected framing patterns, we tested our method only on five articles for each use case. We think that the task’s ambiguity could—besides the technical improvements mentioned previously—be addressed by identifying framing patterns across more articles instead of attempting to pinpoint frame properties in individual articles.

5.2.5 Conclusion

This section presented the results of our exploratory research on imitating manual frame analysis. Albeit effective, such analysis entails the definition of topic-specific and analysis question-specific frames. Such dependencies are in contrast to the objectives of the person-oriented framing analysis approach, which is intended to be applied to any news coverage on policy issues. Instead of frames, we proposed to analyze frame properties, which represent the effects of person-oriented framing, such as whether a person is shown as being “aggressive.”

In our view, the approach represents a promising line of research but at the same time suffers from shortcomings that are common to prior approaches aiming to imitate frame analysis, especially the high annotation cost for the required training dataset. Likewise, we noticed a degree of subjectivity that could not be reduced without lowering the “substance” of the frame properties (cf. [45]), which is required to interpret the “meaning between the lines” as is done in frame analysis.

The dataset used for the approach and its codebook are available at

https://github.com/fhamborg/NewsWCL50.

5.3 Target-Dependent Sentiment Classification

This section describes the second approach for our frame analysis component. Specifically, we describe a dataset and method for target-dependent sentiment classification (TSC, also called aspect-term sentiment classification) in news articles. In the context of the overall person-oriented framing analysis (PFA) and in particular the frame analysis component, we use TSC to classify a fundamental effect of person-oriented framing, i.e., whether sentences and articles portray individual persons positively or negatively. As we show in this section and our prototype evaluation (Chap. 6), TSC represents a pragmatic and effective alternative to the fine-grained but expensive approach of classifying frame properties. The advantages of TSC over approaches aiming to capture frames or frame derivatives are the reduced annotation cost and high reliability.

We define our objective in this section as follows: we seek to detect polar judgments toward target persons [335]. Following the TSC literature, we include only in-text, specifically in-sentence, means to express sentiment. In news texts, such means are, for example, word choice or the description of actions performed by the target, e.g., “John and Bert got in a fight” or “John attacked Bert.” Sentiment can also be expressed indirectly, e.g., through quoting another person, such as “According to John, an expert on the field, the idea ‘suffers from fundamental issues’ such as […]” [335]. Other means may also alter the perception of persons and topics in the news but are not in the scope of the task [16], e.g., because they do not act at the sentence level, for example, story selection, source selection, an article’s placement and size (Sect. 2.1), and epistemological bias [297]. Albeit excluding non-sentence-level means from our objective in this section, in the context of the overall thesis, the TSC method will still be able to catch the effects of source selection and of commission and omission of information. For example, when journalists write articles and include mostly information of one perspective that is in favor of specific persons, the resulting article will mostly reflect that perspective and thus be in favor of these persons (Sect. 3.3.2).

The main contributions of this section are as follows: (1) We create a small-scale dataset and train state-of-the-art models on it to explore characteristics of sentiment in news articles. (2) We introduce NewsMTSC, a large, manually annotated dataset for TSC in political news articles. We analyze the quality and characteristics of the dataset using an on-site, expert annotation. Because of its fundamentally different characteristics compared to previous TSC datasets, e.g., as to how sentiment is expressed and text lengths, NewsMTSC represents a challenging novel dataset for the TSC task. (3) We propose a neural model that improves TSC performance on news articles compared to prior state-of-the-art models. Additionally, our model yields competitive performance on established TSC datasets. (4) We perform an extensive evaluation and ablation study of the proposed model. Among other things, we investigate the recently claimed “degeneration” [161] of TSC to sequence-level classification, finding a performance drop in all models when comparing single- and multi-target sentences.

The remainder of this section is structured as follows. In Sect. 5.3.1, we provide an overview of related work and identify the research gap of sentiment classification in news articles. In Sect. 5.3.2, we explore the characteristics of how sentiment is expressed in news articles by creating and analyzing a small-scale TSC dataset. We then use and address the findings of this exploratory work to create our main dataset (Sect. 5.3.3) and model (Sect. 5.3.4). Key differences and improvements of the main dataset compared to the small-scale dataset are as follows. We significantly increase the dataset’s size and the number of annotators per example and address class imbalance. Further, we devise annotation instructions specifically created to capture a broad spectrum of sentiment expressions specific to news articles. In contrast, the exploratory dataset misses the more implicit sentiment expressions commonly used by news authors (see Sect. 5.3.2.5). Also, we test various consolidation strategies and conduct an expert annotation to validate the dataset.

We provide the dataset and code to reproduce our experiments at

https://github.com/fhamborg/NewsMTSC.

5.3.1 Related Work

Analogously to other NLP tasks, the TSC task has recently seen a significant performance leap due to the rise of language models [73]. Pre-BERT approaches yield up to \(F1_m = 63.3\) on the SemEval 2014 Twitter set [182]. They employ traditional machine learning combining hand-crafted sentiment dictionaries, such as SentiWordNet [13], and other linguistic features [29]. On the same dataset, vanilla BERT (also called BERT-SPC) yields 73.6 [73, 400]. Specialized downstream architectures improve performance further, e.g., LCF-BERT yields 75.8 [400].

The vast majority of recently proposed TSC approaches employ BERT and focus on devising specialized downstream architectures [329, 346, 400]. More recently, to improve performance further, additional measures have been proposed, for example, domain adaption of BERT, i.e., domain-specific language model fine-tuning prior to the TSC fine-tuning [76, 300]; use of external knowledge, such as sentiment or emotion dictionaries [151, 401], rule-based sentiment systems [151], and knowledge graphs [102]; use of all mentions of a target and/or related targets in a document [50]; and explicit encoding of syntactic information [286, 398].

To train and evaluate recent TSC approaches, three datasets are commonly used: Twitter [257, 258, 305], Laptop, and Restaurant [289, 290]. These and other TSC datasets [273] suffer from at least one of the following shortcomings. First, implicitly or indirectly expressed sentiment is rare in them. In their domains, e.g., social media and reviews, authors typically express their sentiment regarding a target explicitly [402]. Second, they largely neglect that a text may contain coreferential mentions of the target as well as mentions of different concepts, each with potentially different polarities [161].

Texts in news articles differ from reviews and social media in that news authors typically do not express sentiment toward a target explicitly (exceptions include opinion pieces and columns). Instead, journalists implicitly or indirectly express sentiment (Sect. 2.3.4) because language in news is typically expected to be neutral and journalists to be objective [16, 110].

Our objective as described in the beginning of Sect. 5.3 is largely identical to that of prior news TSC literature [16, 335], with key differences: we do not generally discard the “author level” and “reader level.” Doing so would neglect large parts of sentiment expressions. Thus, it would degrade the real-world performance of the resulting dataset and of models trained on it. For example, word choice (listed as “author level” and discarded from their problem statement) is in our view an in-text means that may in fact strongly influence how readers perceive a target, e.g., “compromise” or “consensus.” While we do not exclude the “reader level,” we do seek to exclude polarizing or contentious cases, where no uniform answer can be found in a set of randomly selected readers (Sects. 5.3.3.3 and 5.3.3.4). As a consequence, we generally do not distinguish between the three levels of sentiment (“author,” “reader,” and “text”).

Previous news TSC approaches mostly employ sentiment dictionaries, e.g., created manually [16, 335] or extended semi-automatically [110], but yield poor or even “useless” [335] performances. To our knowledge, there exist two datasets for the evaluation of news TSC methods. Steinberger et al. [335] proposed a news TSC dataset, which—perhaps due to its small size (N = 1274)—has not been used or tested in recent TSC literature. Another dataset contains quotes extracted from news articles, since quotes more likely contain explicit sentiment (N = 1592) [16].

In summary, no suitable datasets for news TSC exist nor have news TSC approaches been proposed that exploit recent advances in NLP.

5.3.2 Exploring Sentiment in News Articles

We describe the procedure used to create our exploratory TSC dataset for the domain of news articles, including the collection of articles and the annotation procedure. Afterward, we discuss the characteristics of the dataset. Then, we report the results of our evaluation, in which we test state-of-the-art TSC models on the dataset. Lastly, we discuss the findings and shortcomings of our qualitative dataset investigation and quantitative evaluation to derive means to address these shortcomings in our main dataset.

5.3.2.1 Creating an Exploratory Dataset

Our procedure to create the exploratory dataset for sentiment classification on news articles entails the following steps. First, we create a base set of articles of high diversity in topics covered and writing styles, e.g., whether emotional or factual words are used (cf. [96]). Using our news extractor (Sect. 3.5), we collect news articles from the Common Crawl news crawl (CCNC, also known as CC-NEWS), consisting of over 250M articles until August 2019 [256]. To ensure diversity in writing styles, we select 14 US news outlets, which are mostly major outlets that represent the political spectrum from left to right, based on selections by Budak et al. [38], Groseclose and Milyo [117], and Baum and Groeling [21]. We cannot simply select the whole corpus, because CCNC lacks articles for some outlets and time frames. By selecting articles published between August 2017 and July 2019, we minimize such gaps while covering a time frame of 2 years, which is sufficiently large to include many diverse news topics. To facilitate the balanced contribution of each outlet and time range, we perform binning: we create 336 bins, one for each outlet and month, and randomly draw 10 articles reporting on politics for each bin, resulting in 3360 articles in total. During binning, we remove any article duplicates by text equivalence.
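A compact sketch of the outlet-month binning and sampling, assuming each article is a dictionary with 'outlet', 'date', 'text', and 'is_political' fields (a hypothetical schema; the actual pipeline operates on the news-extractor output):

```python
import random
from collections import defaultdict

def sample_articles(articles, n_per_bin=10, seed=42):
    """Draw up to 10 political articles per (outlet, month) bin,
    removing duplicates by text equivalence."""
    random.seed(seed)
    bins = defaultdict(list)
    seen_texts = set()
    for art in articles:
        if not art["is_political"] or art["text"] in seen_texts:
            continue
        seen_texts.add(art["text"])
        bins[(art["outlet"], art["date"].strftime("%Y-%m"))].append(art)
    sample = []
    for bin_articles in bins.values():  # 14 outlets x 24 months = 336 bins
        sample.extend(random.sample(bin_articles, min(n_per_bin, len(bin_articles))))
    return sample
```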

To create examples for annotation, we select all mentions of NEs recognized as PERSON, NORP, or ORG for each article [376]. We discard NE mentions in sentences shorter than 50 characters. For each NE mention, we create an example by using the mention as the target and its surrounding sentence as its context. We remove any example duplicates. Afterward, to ensure diversity in writing styles and topics, we use the outlet-month binning described previously and randomly draw examples from each bin.
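The example-creation step could look roughly as follows; spaCy’s NER is used here as a stand-in for the tagger cited above [376], and the corpus-wide duplicate removal is omitted for brevity:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
TARGET_TYPES = {"PERSON", "NORP", "ORG"}

def examples_from_article(text, min_sentence_len=50):
    """Create (target mention, context sentence) examples from one article:
    keep PERSON/NORP/ORG mentions, use the enclosing sentence as context,
    and drop short sentences and within-article duplicates."""
    doc = nlp(text)
    examples = set()
    for ent in doc.ents:
        if ent.label_ not in TARGET_TYPES:
            continue
        sentence = ent.sent.text.strip()
        if len(sentence) < min_sentence_len:
            continue
        examples.add((ent.text, sentence))
    return list(examples)
```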

Different means may be used to address the expected class imbalance; e.g., for the Twitter set, only examples that contained at least one word from a sentiment dictionary were annotated [257, 258]. While doing so yields higher frequencies of classes that are infrequent in the real-world distribution, it also causes dataset shift and selection bias [293]. Thus, we instead investigate the effectiveness of different means to address class imbalance during training and evaluation (see Sect. 5.3.2.4).

5.3.2.2 Annotating the Exploratory Dataset

We set up an annotation process following best practices from the TSC literature [258, 290, 305, 335]. For each example, we asked three coders to read the context, in which we visually highlighted the target, and to assess the target’s sentiment. Examples were shown in random order to each coder. Coders could choose from positive, neutral, and negative polarity, whereby they were allowed to choose positive and negative polarity at the same time. Coders were asked to reject an example, e.g., if it was not political or a meaningless text fragment. Beforehand, coders read a codebook that included coding instructions and examples. Five coders, students aged between 24 and 32, participated in the process.

In total, 3288 examples were annotated, from which we discard 125 (3.8%) that were rejected by at least one coder, resulting in 3163 non-rejected examples. From these, we discard 3.3% that lacked a majority class, i.e., examples where each coder assigned a different sentiment class, and 1.8% that were annotated as positive and negative sentiment at the same time, to allow for better comparison with previous TSC datasets and methods (see Sect. 5.3.1). Lastly, we split the remaining 3002 examples into training and test sets; see Table 5.4.

Table 5.4 Class frequencies of the splits in the exploratory TSC dataset

We use the full set of 3163 non-rejected examples to illustrate the degree of agreement between coders: 3.3% lack a majority class; for 62.7%, two coders assigned the same sentiment; and for 33.9%, all coders agreed. On average, the accuracy of individual coders is \(A_h = 72.9\%\). We calculate two inter-rater reliability (IRR) measures. For completeness, Cohen’s kappa is \(\kappa = 25.1\), but it is unreliable in our case due to Kappa’s sensitivity to class imbalance [55]. The mean pairwise observed agreement over all coders is 72.5.
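For reference, one common way to compute the mean pairwise observed agreement reported above is to average, over all examples, the share of agreeing coder pairs per example (a sketch, not necessarily the exact computation used):

```python
from itertools import combinations

def mean_pairwise_agreement(annotations):
    """`annotations` is a list of examples, each a list of the labels its
    coders assigned, e.g., [["neu", "neu", "neg"], ...] (hypothetical format)."""
    per_example = []
    for labels in annotations:
        pairs = list(combinations(labels, 2))
        per_example.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_example) / len(per_example)

print(mean_pairwise_agreement([["neu", "neu", "neg"], ["pos", "pos", "pos"]]))
# (1/3 + 3/3) / 2 = 0.67
```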

5.3.2.3 Exploring the Characteristics of Sentiment in News Articles

In a manual, qualitative analysis of our exploratory dataset, we found two key differences of news compared to established domains: First, we confirmed that news contains mostly implicit and indirect sentiment (see Sect. 5.3.1). Second, determining the sentiment in news articles typically requires a greater degree of interpretation (cf. [335]). The second difference is caused by multiple factors, particularly the implicitness of sentiment (mentioned as the first difference) and that sentiment in news articles is more often dependent on non-local, i.e., off-sentence, context. In the following, we discuss annotated examples (part of the dataset and discarded examples) to understand the characteristics of target-dependent sentiment in news texts.

In our analysis, we found that in news articles, a key means to express targeted sentiment is to describe actions performed by the target. This is in contrast, e.g., to product reviews where more often a target’s feature, e.g., “high resolution,” or the mention of the target itself, e.g., “the camera is awesome,” expresses sentiment. For example, in “The Trump administration has worked tirelessly to impede a transition to a green economy with actions ranging from opening the long-protected Arctic National Wildlife Refuge to drilling […],” the target (underlined) was assigned negative sentiment due to its actions.

We found sentiment in ≈3% of the examples to be strongly reader-dependent (cf. [16]). In the previous example, the perceived sentiment may, in part, depend on the reader’s own ideological or political stance, e.g., readers focusing on economic growth could perceive the described action positively, whereas those concerned with environmental issues would perceive it negatively.

In some examples, targeted sentiment expressions can be interpreted differently due to ambiguity. As a consequence, we mostly found such examples in the discarded examples, and thus they are not contained in our exploratory dataset. While this can be true for any domain (cf. “polarity ambiguity” in [290]), we think it is especially characteristic for news articles, which are lengthier than tweets and reviews, giving authors more ways to refer to non-local statements and to embed their arguments in larger argumentative structures. For instance, in “And it is true that even when using similar tactics, President Trump and President Obama have expressed very different attitudes towards immigration and espoused different goals,” the target was assigned neutral sentiment. However, when considering this sentence in the context of its article [356], the target’s sentiment may be shifted (slightly) negatively.

From a practical perspective, considering more context than only the current sentence seems to be an effective means to determine otherwise ambiguous sentiment expressions. By considering a broader context, e.g., the current sentence and previous sentences, annotators can get a more comprehensive understanding of the author’s intention and the sentiment the author may have wanted to communicate. The greater degree of interpretation required to determine non-explicit sentiment expressions may naturally lead to a higher degree of subjectivity. Due to our majority-based consolidation method (see Sect. 5.3.2.2), examples with non-explicit or apparently ambiguous sentiment expressions are not contained in our exploratory dataset.

5.3.2.4 Experiments and Discussion

We evaluated three TSC methods that represent the state of the art on the established TSC datasets Laptop, Restaurant, and Twitter: AEN-BERT [329], BERT-SPC [73], and LCF-BERT [400]. Additionally, we tested the methods using a domain-adapted language model, which we created by fine-tuning BERT (base, uncased) for 3 epochs on 10M English sentences sampled from CCNC (cf. [300]). For all methods, we tested the hyperparameter ranges suggested by their respective authors. Additionally, we investigated the effects of two common measures to address class imbalance: weighted cross-entropy loss (using inverse class frequencies as weights) and oversampling of the training set. Of the training set, we used 2001 examples for training and 300 for validation.
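As an illustration of the first imbalance measure, a class-weighted cross-entropy loss with (normalized) inverse class frequencies can be set up in PyTorch as follows; the exact weighting scheme used in the experiments may differ slightly:

```python
from collections import Counter
import torch
from torch import nn

def class_weighted_loss(train_labels, num_classes=3):
    """Cross-entropy with weights proportional to inverse class frequencies.
    `train_labels` holds the integer class ids of all training examples."""
    counts = Counter(train_labels)
    weights = torch.tensor(
        [len(train_labels) / (num_classes * counts[c]) for c in range(num_classes)],
        dtype=torch.float,
    )
    return nn.CrossEntropyLoss(weight=weights)

# Toy usage: 0 = negative, 1 = neutral, 2 = positive (assumed label encoding).
loss_fn = class_weighted_loss([1, 1, 1, 1, 0, 2, 1, 0])
```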

We used average recall (\(R_a\)) as our primary measure, which was also chosen as the primary measure in the TSC task of the latest SemEval series, due to its robustness against class imbalance [305]. We also measured accuracy (\(A\)), macro F1 (\(F1_m\)), and average F1 on positive and negative classes (\(F1_{pn}\)) to allow comparison to previous works [258].
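The measures can be computed with scikit-learn as sketched below, assuming integer labels with 0 = negative, 1 = neutral, and 2 = positive (an assumed encoding); average recall corresponds to macro-averaged recall:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def tsc_metrics(y_true, y_pred, neg=0, neu=1, pos=2):
    """Average recall (primary), accuracy, macro F1, and F1 averaged over the
    positive and negative classes."""
    f1_per_class = f1_score(y_true, y_pred, average=None, labels=[neg, neu, pos])
    return {
        "avg_recall": recall_score(y_true, y_pred, average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "f1_posneg": (f1_per_class[0] + f1_per_class[2]) / 2,
    }

print(tsc_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```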

Table 5.5 shows that LCF-BERT performed best (\(R_a = 67.3\) using BERT and 69.8 using our news-adapted language model). Class-weighted cross-entropy loss helped best to address class imbalance (\(R_a = 69.8\) compared to 67.2 using oversampling and 64.6 without any measure).

Table 5.5 TSC performance on the exploratory dataset. LM refers to the language model used, where base is BERT (base, uncased) and news is our fine-tuned BERT model

Performance in news articles was significantly lower than in established domains, where the top model (LCF-BERT) yielded in our experiments \(R_a = 78.0\) (Laptop), 82.2 (Restaurant), and 75.6 (Twitter). For Laptop and Restaurant, we used domain-adapted language models [300]. News TSC accuracy \(A = 66.0\) was lower than single human level \(A_h = 72.9\) (see Sect. 5.3.2.3).

We carried out a manual error analysis (up to 30 randomly sampled examples for each true class). We found target misassociation as the most common error cause: In 40%, sentences express the predicted sentiment toward a different target. In 30%, we cannot find any apparent cause. The remaining cases contain various potential causes, including usage of euphemisms or sayings (12% of examples with negative sentiment). Infrequently, we found that sentiment is expressed by rare words or figurative speech or is reader-dependent (the latter in 2%, approximately matching the 3% of reader-dependent examples reported in Sect. 5.3.2.3).

Previous news TSC approaches, mostly dictionary-based, could not reliably classify implicit or indirect sentiment expressions (see Sect. 5.3.1). In contrast, our experiments indicate that BERT’s language understanding suffices to interpret implicitly expressed sentiment correctly (cf. [16, 73, 110]). Our exploratory dataset does not contain instances in which the broader context defines sentiment, since human coders could or did not classify them in our annotation procedure. Our experiments therefore cannot elucidate this particular characteristic discussed in Sect. 5.3.2.3.

5.3.2.5 Summary

We explored how target-dependent sentiment classification (TSC) can be applied to political news articles. After creating an exploratory dataset of 3000 manually annotated sentences sampled from news articles reporting on policy issues, we qualitatively analyzed its characteristics. We found notable differences concerning how authors express sentiment toward targets as compared to other, well-researched domains of TSC, such as product reviews or posts on social media. In these domains, authors tend to explicitly express their opinions. In contrast, in news articles, we found dominant use of implicit or indirect sentiment expressions, e.g., by describing actions performed by a given target and their consequences. Thus, sentiment expressions may be more ambiguous, and determining their polarity requires a greater degree of interpretation.

In our quantitative evaluation, we found that current TSC methods performed worse on the news domain (average recall \(R_a = 69.8\) using our news-adapted BERT model, \(R_a = 67.3\) without) than on popular TSC domains (\(R_a = [75.6, 82.2]\)).

While our exploratory dataset contains clear sentiment expressions, it lacks other sentiment types that occur in real-world news coverage, for example, sentences that express sentiment more implicitly or ambiguously. To create a labeled TSC dataset that better reflects real-world news coverage, we suggest adjusting the annotation instructions to raise annotators’ awareness of these sentiment types and to clearly define how they should be labeled. Technically, apparently ambiguous sentiment expressions might be easier to label when considering a broader context, e.g., not only the current sentence but also previous sentences. Considering more context might also help to improve a classifier’s performance.

5.3.3 NewsMTSC: Dataset Creation

This section describes the procedure to create our main dataset for TSC in the news domain. When creating the dataset, we rely on best practices reported in the literature on the creation of datasets for NLP [291], especially for the TSC task [305]. As our previous exploration has shown (Sect. 5.3.2.5), however, the nature of sentiment in news articles requires key changes compared to previous TSC datasets, especially in the annotation instructions and the consolidation of answers [335].

5.3.3.1 Data Sources

We use two datasets as sources: our POLUSA dataset [96] and the Bias Flipper 2018 (BF18) dataset [49]. Both satisfy five criteria that are important to our problem. First, they contain news articles reporting on political topics. Second, they approximately match the online media landscape as perceived by an average US news consumer. Third, they have a high diversity in topics due to the number of articles contained and time frames covered (POLUSA: 0.9M articles published between Jan. 2017 and Aug. 2019; BF18: 6447 articles associated with 2781 events). Fourth, they feature high diversity in writing styles because they contain articles from across the political spectrum, including left- and right-wing outlets. Fifth, we find that they contain only a few minor content errors despite being created through scraping or crawling.

In early tests when selecting data sources, we also evaluated other datasets. While we found that other factors are more important for the resulting quality of annotated examples (filtering of candidate examples, annotation instructions, and consolidation strategy), we also found that other datasets are slightly less suitable as to the five previously mentioned criteria, e.g., because they contain only contentious news topics and articles [45] or hyperpartisan sentences [178], are of mixed content quality [264], or contain too few sentences [4, 5].

5.3.3.2 Creation of Examples

To create a batch of examples for annotation, we devise a three-task process. First, we extract example candidates from randomly selected articles. Second, we discard non-optimal candidates. Third, only for the train set, we filter candidates to address class imbalance. We repeatedly execute these tasks so that each batch yields 500 examples for annotation, contributed equally by both sources.

First, we randomly select articles from the two sources. Since both are approximately uniformly distributed over time [49, 96], randomly drawing articles yields sufficiently high diversity in both writing styles and reported topics (Sect. 5.3.3.1). To extract from an article examples that contain meaningful target mentions, we employ coreference resolution (CR). We iterate all resulting coreference clusters of the given article and create a single example for each mention and its enclosing sentence.

Extraction of mentions of named entities (NEs) is the method commonly employed to create examples in previous TSC datasets [257, 258, 305, 335]. We do not use it since we find it would miss \(\gtrapprox\) 30% of the mentions of relevant target candidates, e.g., pronominal or near-identity mentions.

Second, we perform a two-level filtering to improve the quality and “substance” of candidates. On the coreference cluster level, we discard a cluster \(c\) in a document \(d\) if \(|M_c| \leq 0.2 |S_d|\), where \(|M_c|\) and \(|S_d|\) are the number of mentions of the cluster and the number of sentences in the document, respectively. Also, we discard non-person clusters, i.e., if \(\exists m \in M_c : t(m) \notin \{-, P\}\), where \(t(m)\) yields the NE type of \(m\), and \(-\) and \(P\) represent the unknown and person type, respectively. On the example level, we discard short and similar examples \(e\), i.e., if \(|s_e| < 50 \lor \exists \hat{e}: \mathrm{sim}(s_e, s_{\hat{e}}) > 0.6 \land m_e = m_{\hat{e}} \land t_e = t_{\hat{e}}\), where \(s_e\), \(m_e\), and \(t_e\) are the sentence of \(e\), its mention, and the target’s cluster, respectively, and \(\mathrm{sim}(\cdot)\) is the cosine similarity. Lastly, if a cluster has multiple mentions in a sentence, we try to select the most meaningful example. In short, we prefer the cluster’s representative mention over nominal mentions and those over all other instances.
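A sketch of the two-level filtering, under the simplifying assumption that each coreference cluster carries a single NE type and that candidate examples are plain dictionaries (hypothetical structures); `similarity` is a sentence-level cosine similarity function, e.g., over sentence embeddings:

```python
def filter_candidates(clusters, num_sentences, similarity, min_len=50, max_sim=0.6):
    """`clusters` maps a cluster id to {'ne_type': ..., 'examples': [...]},
    each example being {'sentence': ..., 'mention': ...}."""
    kept = []
    for cid, cluster in clusters.items():
        # Cluster level: discard sparse clusters and non-person clusters.
        if len(cluster["examples"]) <= 0.2 * num_sentences:
            continue
        if cluster["ne_type"] not in (None, "PERSON"):
            continue
        # Example level: discard short sentences and near-duplicates that
        # share both the mention and the target cluster.
        for ex in cluster["examples"]:
            if len(ex["sentence"]) < min_len:
                continue
            if any(similarity(ex["sentence"], k["sentence"]) > max_sim
                   and ex["mention"] == k["mention"] and cid == k["cluster"]
                   for k in kept):
                continue
            kept.append({"sentence": ex["sentence"],
                         "mention": ex["mention"],
                         "cluster": cid})
    return kept
```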

Third, for only the train set, we filter candidates to address class imbalance. Specifically, we discard examples \(e\) that are likely of the majority class (\(p(\mathrm{neutral} \mid s_e) > 0.95\)) as determined by a simple binary classifier [310]. Whenever annotated and consolidated examples are added to the train set of NewsMTSC, we retrain the classifier on them and all previous examples in the train set.
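The classifier-based filter for the train set could be sketched as follows; a TF-IDF plus logistic regression pipeline serves here as a stand-in for the simple binary classifier cited above [310]:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def filter_likely_neutral(candidate_sentences, labeled_sentences, labeled_is_neutral,
                          threshold=0.95):
    """Drop candidates with p(neutral | sentence) > threshold. The classifier is
    retrained on all consolidated train-set examples gathered so far
    (`labeled_is_neutral`: 1 for neutral, 0 otherwise)."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(labeled_sentences, labeled_is_neutral)
    neutral_col = list(clf.classes_).index(1)
    probs = clf.predict_proba(candidate_sentences)[:, neutral_col]
    return [s for s, p in zip(candidate_sentences, probs) if p <= threshold]
```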

5.3.3.3 Annotation

Instructions used in popular TSC datasets plainly ask annotators to rate the sentiment of a text toward a target [290, 305]. For news texts, we find that doing so yields two issues [16]: low inter-rater reliability (IRR) and low suitability. Low suitability refers to examples where annotators’ answers can be consolidated but the resulting majority answer is incorrect as to the task. For example, instructions from prior TSC datasets often yield low suitability for polarizing targets, independent of the sentence they are mentioned in. Figure 5.1 depicts our final annotation instructions.

Fig. 5.1 Final version of the annotation instructions as shown on Amazon Mechanical Turk

In an iterative process with multiple test annotations (six on-site and eight on Amazon Mechanical Turk, MTurk), we test various measures to address the two issues. We find that asking annotators to think from the perspective of the sentence’s author strongly facilitates that annotators overcome their personal attitude. Further, we find that we can effectively draw annotators’ attention not only to the event and other “facts” described in the sentence (the “what”) but also to the word choice (“how” it is described) by exemplarily mentioning both factors and abstracting these factors as the author’s holistic “attitude.” We further improve IRR and suitability, e.g., by explicitly instructing annotators to rate sentiment only regarding the target but not regarding other aspects, such as the reported event.

We also test other means to address low IRR and suitability in news TSC annotation but find our means to be more efficient while similarly effective. For example, Balahur et al. [16] ask annotators to only rate the target’s sentiment but not to consider whether the news is “good” or “bad.” They also ask annotators to interpret only “what is said” and not to use their own background knowledge. Additionally, we test a design where we ask annotators to select the more negative sentence of a pair of sentences sharing a target. We use semantic textual similarity (STS) datasets [4, 5] and extract all pairs with an STS score >2.5. While this design yields high IRR, suitability (especially political framing through word choice is found more effectively [166]), and efficiency, the STS datasets contain too few examples. On MTurk, we find consistently across all instruction variants that short instructions yield higher suitability and IRR than more comprehensive instructions. Surprisingly, the average duration of each crowdworker’s first assignment is shorter for the latter. This is perhaps because crowdworkers have a high incentive to minimize the duration per task to increase their salary, and if they deem instructions too long, they will not read them at all or only very briefly [302, 322].

While most TSC dataset creation procedures use 3- or 5-point Likert scales [16, 257, 258, 289, 290, 305, 335], we use a 7-point scale to encourage annotators to also rate only slightly positive or negative examples as such.

Technically, we closely follow previous literature on TSC datasets [290, 305]. We conduct the annotation of our examples on MTurk. Each example is shown to five randomly selected crowdworkers. To participate in our annotation, crowdworkers must have the “Master” qualification, i.e., have a record of successfully completed, high-quality work on MTurk. To ensure quality, we implement a set of objective measures and tests [180]. While we always pay all crowdworkers (USD 0.07 per assignment), we discard all of a crowdworker’s answers if at least one of the following conditions is met: (a) a crowdworker was not shown any test question or answered at least one incorrectly, (b) a crowdworker provided answers to invisible fields in the HTML form (0.3% of crowdworkers did so, supposedly bots), or (c) the average duration of time spent on the assignments was extremely low (<4s).

The IRR is sufficiently high (\(\kappa_C = 0.74\)) when considering only examples in NewsMTSC. The expected mixed quality of crowdsourced work becomes apparent when considering all examples, including those that could not be consolidated and answers of those crowdworkers who did not pass our quality checks (\(\kappa_C = 0.50\)).

5.3.3.4 Consolidation

We consolidate the answers of each example to a majority answer by employing a restrictive strategy. Specifically, we consolidate the set of five answers \(A\) to the single-label three-class polarity \(p \in \{\mathrm{pos.}, \mathrm{neu.}, \mathrm{neg.}\}\) if \(\exists C \subseteq A : |C| \geq 4 \land \forall c \in C : s(c) = p\), where \(s(c)\) yields the three-class polarity of an individual seven-class answer \(c\), i.e., neutral ⇒ neutral, any positive (from slightly to strongly) ⇒ positive, and analogously for negative. If there is no such consolidation set \(C\), \(A\) cannot be consolidated, and the example is discarded. Consolidating to three-class polarity allows for direct comparison to established TSC datasets.
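In code, the consolidation rule can be sketched as follows, assuming the seven answer options are encoded as integers from −3 (strongly negative) to +3 (strongly positive) with 0 = neutral (an assumed encoding):

```python
def consolidate(answers):
    """Map five 7-point answers to one 3-class label, or None if the example
    cannot be consolidated (i.e., fewer than four answers share a polarity)."""
    def to_polarity(a):
        return "positive" if a > 0 else "negative" if a < 0 else "neutral"

    polarities = [to_polarity(a) for a in answers]
    for p in ("positive", "neutral", "negative"):
        if polarities.count(p) >= 4:
            return p
    return None  # discarded

print(consolidate([2, 1, 3, 1, 0]))   # 'positive' (four positive answers)
print(consolidate([1, 1, -1, 0, 0]))  # None
```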

While the strategy is restrictive (only 50.6% of all examples are consolidated this way), we find it yields the highest quality. We quantify the dataset’s quality by comparing the dataset to an expert annotation (Sect. 5.3.3.6) and by training and testing models on dataset variants with different consolidations. Compared to consolidations employed for previous TSC datasets, quality is improved significantly on our examples, e.g., our strategy yields \(F1_m = 86.4\) when compared to the experts’ annotations, and models trained on the resulting set yield up to \(F1_m = 83.1\), whereas the two-step majority strategy employed for the Twitter 2016 set [258] yields 50.6 and 53.4, respectively.

5.3.3.5 Splits and Multi-Target Examples

NewsMTSC consists of three sets as depicted in Tables 5.6 and 5.7. For the train set, we employ class balancing prior to annotation (Sect. 5.3.3.2). To minimize dataset shift, which might yield a skewed sentiment distribution in the dataset compared to the real world [293], we do not use class balancing for either of the two test sets. Sentences can have multiple targets (MT) with potentially different polarities. We call this the MT property. To investigate the effect of considering or neglecting the MT property on TSC performance [161], we devise a test set named test-mt, which consists only of examples that have at least two semantically different targets, i.e., each belonging to a separate coreference cluster (Sect. 5.3.3.2). Since the additional filtering required for test-mt naturally yields dataset shift, we create a second test set named test-rw, which omits the MT filtering and is thus designed to be as close as possible to the real-world distribution of sentiment. We seek to provide a sentiment score for each person in each sentence in train and test-rw, but mentions may be missing, e.g., because of erroneous coreference resolution or because crowdworkers’ answers could not be consolidated. Table 5.7 shows the frequencies of the targets and sentiment classes with added coreferential mentions.

Table 5.6 Class frequencies of NewsMTSC. Columns (f.l.t.r.): name; count of targets with any, positive, neutral, and negative sentiment, respectively; and count of examples with multiple targets of any and different polarity, respectively
Table 5.7 Statistics of coreference-related examples in NewsMTSC. Columns (f.l.t.r.): name and count of targets and their coreferential mentions with any, positive, neutral, and negative sentiment, respectively

5.3.3.6 Quality and Characteristics

We conducted an expert annotation of a random subset of 360 examples used during the creation of NewsMTSC with 5 international graduate students (studying Political or Communication Science at the University of Zurich, Switzerland; 3 female, 2 male; aged between 23 and 29). The key differences compared to the MTurk annotation are as follows. First, extensive training until a high IRR was reached (considering all examples, κC = 0.72; only consolidated examples, κC = 0.93); we conducted five iterations, each consisting of individual annotations by the students, quantitative and qualitative review, adaptation of the instructions, and individual and group discussions. Second, comprehensive instructions (4 pages). Third, no time pressure, since the students were paid per hour (crowdworkers per assignment).

When comparing the expert annotation with our dataset, we found that NewsMTSC is of high quality (F1m = 86.4). The quality of unfiltered answers from MTurk is, as expected, much lower (50.1).

What is contained in NewsMTSC? In a random set of 50 consolidated examples from MTurk, we found that the most frequent, non-mutually exclusive means to express a polar statement (62% of the 50) are the usage of quotes (in total, direct, and indirect: 42%, 28%, and 14%, respectively), the target being subject to an action (24%), an evaluative expression by the author or an opinion holder mentioned outside of the sentence (18%), the target performing an action (16%), and loaded language or connotated terms (14%). Direct quotes often contain evaluative expressions or connotated terms; indirect quotes do so less often. Neutral examples (38% of the 50) mostly contain objective storytelling about neutral events (16%) or variants of “[target] said that […]” (8%). Yet, “said” variants cannot be used as a reliable indicator for neutral sentiment, e.g., if the target has multiple mentions in the sentence or if the target’s statement is considered positive or negative, e.g., “‘Not all of that is preventable, but a lot of it is preventable if we’ve got better cooperation […],’ Obama said.”

What is not contained in NewsMTSC? We qualitatively reviewed all examples whose individual answers could not be consolidated to identify potential causes of why annotators did not agree. The predominant reason is technical, i.e., the restrictiveness of the consolidation (MTurk compared to experts: 26% ≈ 30%). Other examples lack apparent causes (24% ≫ 8%). Further potential causes are (not mutually exclusive) as follows: ambiguous sentence (16% ≈ 18%), sentence contains positive and negative parts (8% ≈ 6%), and opinion holder is the target (6% ≈ 8%), e.g., “[…] Bauman asked supporters to ‘push back’ against what he called a targeted campaign to spread false rumors about him online.” In a subset of such instances, more context could have helped to resolve ambiguity, e.g., by also showing annotators the sentence preceding the mention.

What are qualitative differences in the annotations by crowdworkers and experts? We reviewed all 63 cases (18%) where answers from MTurk could be consolidated but differ from the experts’ answers. The major reason for disagreement is the restrictiveness of the consolidation (53 cases have no consolidation among the experts). In ten cases, the consolidated answers differ. We found that in a few examples (2–3%), crowdsourced annotations are superficial and fail to interpret the full sentence correctly.

Texts in NewsMTSC are much longer than in prior TSC datasets (mean over all examples): 152 characters compared to 100, 96, and 90 in Twitter, Restaurant, and Laptops, respectively.

5.3.4 Method

The goal of TSC is to find a target’s polarity y ∈ {pos., neu., neg.} in a sentence. Our model consists of four key components (Fig. 5.2): a pre-trained language model (LM), a representation of external knowledge sources (EKS), a target mention mask, and a bidirectional GRU (BiGRU) [51]. We adapt our model from Hosseinia, Dragut, and Mukherjee [151] and change the design as follows: we employ a target mask (which they did not) and use multiple EKS simultaneously (instead of one). Further, we use a different set of EKS (Sect. 5.3.5) and do not exclude the LM’s parameters from fine-tuning.

Fig. 5.2 Architecture of the proposed model for target-dependent sentiment classification

Input Representation

We construct three model inputs. The first is a text input T constructed as suggested by Devlin et al. [73] for question answering (QA) tasks. Specifically, we concatenate the sentence and the target mention and tokenize the two segments using the LM’s tokenizer and vocabulary, e.g., WordPiece for BERT [386] (Footnote 14). This step results in a text input sequence \(T=[\mathrm {CLS}, s_{0}, s_{1}, \ldots , s_{p}, \mathrm {SEP}, t_0, t_1, \ldots , t_q, \mathrm {SEP}] \in \mathbb {N}^{n}\) consisting of n word pieces, where n is the manually defined maximum sequence length.
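As an illustration of this input format, the following sketch uses the Hugging Face tokenizer API; the model name and maximum sequence length are assumptions chosen for demonstration, not the exact experimental settings.

```python
from transformers import AutoTokenizer

# Any LM with a compatible tokenizer could be used; "bert-base-uncased" is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Not all of that is preventable, but a lot of it is preventable, Obama said."
target = "Obama"

# Passing two segments yields: [CLS] sentence word pieces [SEP] target word pieces [SEP]
encoding = tokenizer(
    sentence,
    target,
    max_length=150,          # assumed maximum sequence length n
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(encoding["input_ids"].shape)   # (1, 150), i.e., T ∈ N^n
```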

Various forms of this representation have been proposed, e.g., reversing the order of sentence and target, or using a natural language question or a pseudo sentence instead of the plain target mention [151, 346]. We find that, on average in the TSC domain, they yield lower performance than the plain variant we employ.

The second input is a feature representation of the sentence, which we create using one or more EKS, such as dictionaries [151, 401]. Given an EKS with d dimensions, we construct an EKS representation \(E \in \mathbb {R}^{n\times d}\) of the sentence, where each vector \(e_i\), \(i \in \{0,1,\ldots ,p\}\), is a feature representation of word piece i in the sentence. For example, when using a sentiment dictionary with two mutually non-exclusive polarity dimensions, positive and negative [153], d = 2. Given a sentence “Good […],” we set \(e_1 = [1, 0]\). To facilitate learning associations between the token-based EKS representation and the WordPiece-based sequence T, we create E so that it contains k repeated vectors for each token, where k is the token’s number of word pieces. Thereby, we also account for special tokens, such as CLS. If multiple EKS with a total number of dimensions \(\hat {d} = \sum d\) are used, their representations of the sentence are stacked, resulting in \(E \in \mathbb {R}^{n\times \hat {d}}\).
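The repetition of token-level features across word pieces can be sketched as follows; the toy dictionary lookup and helper names are assumptions, not the EKS used in the experiments.

```python
import numpy as np

# Toy sentiment "dictionary" with two non-mutually exclusive dimensions (positive, negative).
TOY_DICT = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}

def build_eks_matrix(tokens, wordpieces_per_token, n, d=2, num_leading_special=1):
    """Build E in R^{n x d}: one feature vector per word piece, repeated k times
    per token (k = the token's number of word pieces). Special tokens such as
    CLS/SEP and padding positions receive zero vectors."""
    E = np.zeros((n, d), dtype=np.float32)
    pos = num_leading_special                        # skip CLS at position 0
    for token, k in zip(tokens, wordpieces_per_token):
        feat = TOY_DICT.get(token.lower(), [0.0] * d)
        for _ in range(k):
            if pos < n:
                E[pos] = feat
            pos += 1
    return E

# "Good" maps to [1, 0]; it occupies one word piece right after CLS, hence row 1.
E = build_eks_matrix(["Good", "movie"], [1, 1], n=8)
print(E[1])   # -> [1. 0.]
```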

The third input is a target mask \(M \in \mathbb {R}^{n}\), i.e., \(m_i = 1\) for each word piece i in the sentence that belongs to the target, and \(m_i = 0\) otherwise [94].
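A minimal sketch of the target mask, assuming the character offsets of the target mention in the sentence are known and that a fast Hugging Face tokenizer provides character-to-word-piece offsets:

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "Critics say the senator misled the public."
target_start, target_end = 16, 23      # character span of "senator" (assumed to be known)

enc = tokenizer(sentence, return_offsets_mapping=True, max_length=150,
                padding="max_length", truncation=True)
M = np.zeros(len(enc["input_ids"]), dtype=np.float32)
for i, (start, end) in enumerate(enc["offset_mapping"]):
    # Special tokens and padding have (0, 0) offsets; word pieces overlapping
    # the target span get m_i = 1.
    if end > start and start < target_end and end > target_start:
        M[i] = 1.0
```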

Embedding Layer

We feed T into the LM to yield a contextualized word embedding of shape \(\mathbb {R}^{n\times h}\), where h is the number of hidden states in the language model, e.g., h = 768 for BERT [73]. We feed E into a randomly initialized matrix \(W_E \in \mathbb {R}^{\hat {d} \times h}\) to yield an EKS embedding. We repeat M to be of shape \(\mathbb {R}^{n \times h}\). By creating all embeddings in the same shape, we facilitate a balanced influence of each input to the model’s downstream components. We stack all embeddings to form a matrix \([TEM] \in \mathbb {R}^{n\times 3h}\).
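A sketch of the embedding layer under the assumptions of the previous snippets (a batch dimension is added; the projection is implemented as a bias-free linear layer, which is one possible reading of “randomly initialized matrix”):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

h, d_hat, n = 768, 2, 150
lm = AutoModel.from_pretrained("bert-base-uncased")     # hidden size h = 768
W_E = nn.Linear(d_hat, h, bias=False)                    # randomly initialized projection

def embed(input_ids, attention_mask, E, M):
    """input_ids, attention_mask: (B, n); E: (B, n, d_hat); M: (B, n)."""
    lm_emb = lm(input_ids=input_ids,
                attention_mask=attention_mask).last_hidden_state   # (B, n, h)
    eks_emb = W_E(E)                                               # (B, n, h)
    mask_emb = M.unsqueeze(-1).expand(-1, -1, h)                   # (B, n, h)
    return torch.cat([lm_emb, eks_emb, mask_emb], dim=-1)          # (B, n, 3h), i.e., [TEM]
```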

Interaction Layer

We allow the three embeddings to interact using a single-layer BiGRU [151], which yields hidden states \(H = \mathrm {BiGRU}([TEM]) \in \mathbb {R}^{n\times 6h}\). RNNs, such as LSTMs and GRUs, are commonly used to learn a higher-level representation of a word embedding, especially in state-of-the-art TSC approaches prior to BERT-based models, but also in recent work [151, 208, 213, 401]. We choose a BiGRU over an LSTM because of the smaller number of parameters in GRUs, which may in some cases result in better performance [54, 118, 151, 161].
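Continuing the sketch, the interaction layer is a single bidirectional GRU whose hidden size matches the 3h-dimensional input, so that the concatenated forward and backward states yield 6h dimensions:

```python
import torch
import torch.nn as nn

h, n, B = 768, 150, 4
bigru = nn.GRU(input_size=3 * h, hidden_size=3 * h,
               num_layers=1, bidirectional=True, batch_first=True)

TEM = torch.randn(B, n, 3 * h)     # stacked embeddings from the previous step
H, _ = bigru(TEM)                  # H: (B, n, 6h)
print(H.shape)                     # torch.Size([4, 150, 4608])
```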

Pooling and Decoding

We employ three common pooling techniques to turn the interacted, sequenced representation H into a single vector [151]. We calculate the element-wise (1) mean and (2) maximum over all hidden states H and retrieve the (3) last hidden state \(h_{n-1}\). Then, we stack the three vectors into P, feed P into a fully connected layer FC so that z = FC(P), and calculate y = σ(z).
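The pooling and decoding step can then be sketched as follows (softmax as σ; the output dimension corresponds to the three polarity classes):

```python
import torch
import torch.nn as nn

h, n, B = 768, 150, 4
fc = nn.Linear(3 * 6 * h, 3)          # stacked pooled vectors (18h) -> 3 polarity classes

H = torch.randn(B, n, 6 * h)          # hidden states from the BiGRU
p_mean = H.mean(dim=1)                # (1) element-wise mean over all hidden states
p_max, _ = H.max(dim=1)               # (2) element-wise maximum
p_last = H[:, -1, :]                  # (3) last hidden state h_{n-1}

P = torch.cat([p_mean, p_max, p_last], dim=-1)   # (B, 18h)
z = fc(P)
y = torch.softmax(z, dim=-1)          # predicted distribution over {pos., neu., neg.}
```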

5.3.5 Evaluation

This section describes the experiments we conducted to evaluate our model for target-dependent sentiment classification.

Data and Metrics

In addition to NewsMTSC, we used the three established TSC sets: Twitter, Laptop, and Restaurant. We used metrics established in the TSC literature: macro F1 over all classes (F1m) and over only the positive and negative classes (F1pn), accuracy (A), and average recall (R a). If not otherwise noted, performances are reported for our primary metric, F1m.
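These metrics can be computed, for example, with scikit-learn; the label encoding below is an assumption for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Assumed label encoding: 0 = negative, 1 = neutral, 2 = positive.
y_true = [2, 0, 1, 1, 0, 2]
y_pred = [2, 0, 1, 0, 0, 1]

f1_m  = f1_score(y_true, y_pred, average="macro")                 # F1m (primary metric)
f1_pn = f1_score(y_true, y_pred, labels=[0, 2], average="macro")  # F1 over pos./neg. only
acc   = accuracy_score(y_true, y_pred)                            # A
r_a   = recall_score(y_true, y_pred, average="macro")             # average recall R_a
```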

Baselines

We compared our model with TSC methods that yield state-of-the-art results on at least one of the established datasets. SPC-BERT [73]: the input is identical to our text input; FC and softmax are calculated on the CLS token. TD-BERT [94]: masks hidden states depending on whether they belong to the target mention. LCF-BERT [400]: similar to TD-BERT but additionally weights hidden states depending on their token-based distance to the target mention; we used the improved implementation [394] and enabled the dual-LM option, which yields slightly better performance than using only one LM instance [400]. We also planned to test LCFS-BERT [286], but due to technical issues, we were not able to reproduce the authors’ results and thus excluded LCFS from our experiments.

Implementation Details

To find the best parameter configuration for each model, we performed an exhaustive grid search. Any number we report is the mean of five experiments that we ran per configuration. We randomly split each test set into a dev set (30%) and the actual test set (70%). We tested the base version of three LMs: BERT, RoBERTa, and XLNET. For all methods, we tested the parameters suggested by their respective authors (Footnote 15). We tested all 15 combinations of the following 4 EKS: (1) SENT [153], a sentiment dictionary (number of non-mutually exclusive dimensions: 2; domain: customer reviews); (2) LIWC [357], a psychometric dictionary (73, multiple); (3) MPQA [383], a subjectivity dictionary (3, multiple); and (4) NRC [247], a dictionary of sentiment and emotion (10, multiple).
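For illustration, the 15 EKS combinations correspond to all non-empty subsets of the four resources. A minimal sketch of such a grid follows; the model identifiers are plausible base versions, and the training call is only a placeholder.

```python
from itertools import combinations

EKS = ["SENT", "LIWC", "MPQA", "NRC"]

# All non-empty subsets of the four EKS: 2^4 - 1 = 15 combinations.
eks_combinations = [c for r in range(1, len(EKS) + 1) for c in combinations(EKS, r)]
assert len(eks_combinations) == 15

for lm_name in ["bert-base-uncased", "roberta-base", "xlnet-base-cased"]:
    for eks_combo in eks_combinations:
        for run in range(5):                  # results are averaged over five runs per configuration
            # train_and_evaluate(...) is a placeholder for the actual training routine.
            # train_and_evaluate(lm_name, eks_combo, seed=run)
            pass
```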

Overall Performance

Table 5.8 reports the performances of the models using different LMs and evaluated on both test sets. We found that the best performance was achieved by our model (F1m = 83.1 on test-rw compared to 81.8 by the prior state of the art). For all models, performance improved when using RoBERTa, which is pre-trained on news texts, or XLNET, likely because of its large pre-training corpus. XLNET is not reported in Table 5.8 since its performances were generally similar to those of RoBERTa, except for the TD model, where XLNET degraded performance by 5–9pp. Looking at BERT, we found no significant improvement of the proposed model over the prior state of the art. Even when we domain-adapted BERT [300] for 3 epochs on a random sample of 10M English sentences [96], BERT’s performance (F1m = 81.8) was lower than RoBERTa’s. We noticed a performance drop for all models when comparing test-rw and test-mt. It seems that RoBERTa is better able to resolve in-sentence relations between multiple targets (performance degradation of only up to −0.6pp.) than BERT (−2.9pp.). We suggest using RoBERTa for TSC on news, since fine-tuning it was faster than fine-tuning XLNET and it achieved similar or better performance than the other LMs.

Table 5.8 Overview of experimental results on NewsMTSC

While the proposed model yielded competitive results on previous TSC datasets (Table 5.9), LCF was the top-performing model (Footnote 16). When comparing the performances across all four datasets, the importance of the consolidation became apparent, e.g., performance was lowest on the Twitter set, where a simplistic consolidation was employed during the dataset’s creation (Sect. 5.3.3.4). The performance differences of individual models between the prior datasets and NewsMTSC highlight the differences between the domains: LCF performed consistently best on prior datasets but worse than the proposed model on NewsMTSC. One reason might be that LCF’s weighting approach relies on a static distance parameter, which seems to degrade performance when used on longer texts as in NewsMTSC (Sect. 5.3.3.6). When increasing LCF’s window width SRD, we noticed a slight improvement of 1pp. (SRD = 5) but degradation for larger SRD.

Table 5.9 Classification performance on previous TSC datasets

Ablation Study

We performed an ablation study to test the impact of four key factors: target mask, EKS, coreferential mentions, and fine-tuning the LM’s parameters. We tested all LMs and, if not noted otherwise, report results for RoBERTa since it generally performed best (Sect. 5.3.5). We report results for test-mt (the performance influence was similar on either test set, with performances generally being ≈3–5pp. higher on test-rw). Overall, we found that our changes to the initial design [151] contributed to an improvement of approximately 1.9pp. The most influential changes were the selected EKS and, in part, the use of coreferential mentions. Using the target mask input channel without coreferences and fine-tuning the LM each yielded insignificant improvements of up to 0.3pp. We did not test the VADER-based sentence classification proposed by Hosseinia, Dragut, and Mukherjee [151] since we expected no improvement from using it for various reasons: for example, VADER uses a dictionary created for a domain other than news, and it classifies the sentence’s overall sentiment and is thus target-independent.

Table 5.10 details the results of exemplary EKS combinations, showing that the best combination (SENT, MPQA, and NRC) yielded an improvement of 2.6pp. compared to not using an EKS (zeros). The best single EKS (LIWC or SENT) each yielded an improvement of 2.4pp. The two baseline configurations “no EKS” and “zeros” represent a model lacking the EKS input channel and an EKS that yields only zeros, respectively.

Table 5.10 Classification influence of exemplary EKS combinations

The use of coreferences had a mixed influence on performance (Table 5.11). While using coreferences had no or even a negative effect in our model for large LMs (RoBERTa and XLNET), it can be beneficial for smaller LMs (BERT) or smaller batch sizes (8). When using the modes “ignore,” “add coref. to mask,” and “add coref. as example,” we ignored coreferences, added them to the target mask, and created an additional example for each coreferential mention, respectively. Mode “none” represents a model that lacks the target mask input channel.
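The three coreference modes can be sketched as follows; the data structures are hypothetical simplifications of the actual pipeline.

```python
from copy import deepcopy

def apply_coref_mode(example, mode):
    """example: dict with a word-piece target 'mask' (0/1 list) and
    'coref_positions' (word-piece indices of coreferential mentions).
    Returns a list of training examples according to the chosen mode."""
    if mode == "ignore":
        return [example]                              # coreferences are not used
    if mode == "add coref. to mask":
        ex = deepcopy(example)
        for i in ex["coref_positions"]:
            ex["mask"][i] = 1                         # coref. mentions join the target mask
        return [ex]
    if mode == "add coref. as example":
        examples = [example]
        for i in example["coref_positions"]:
            ex = deepcopy(example)
            ex["mask"] = [0] * len(example["mask"])
            ex["mask"][i] = 1                         # one extra example per coref. mention
            examples.append(ex)
        return examples
    raise ValueError(f"unknown mode: {mode}")
```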

Table 5.11 Influence of target mask and coreferences

5.3.6 Error Analysis

To understand the limitations of the proposed model, we carried out a manual error analysis by investigating a random sample of 50 incorrectly predicted examples for each of the test sets. For test-rw, we found the following potential causes (not mutually exclusive): edge cases with very weak, indirect, or in part subjective sentiment (22%) or where both the predicted and true sentiment can actually be considered correct (10%), and the sentiment toward the given target being confused with that toward a different target (14%). The latter occurred especially often for long sentences consisting mostly of phrases that indicate the predicted sentiment but concern a different target, e.g., “By the time he and Mr. Smith (predicted: negative, true: neutral) were trading texts, […], John was already fired by his boss.” Further, the sentence’s sentiment was unclear due to missing context (10%), and the consolidated answer in NewsMTSC was wrong (10%). In 16%, we found no apparent reason. For test-mt, the potential causes occurred approximately as often as in test-rw, except that targets were confused more often (20%).

5.3.7 Future Work

We identify three main areas for future work. The first area is related to the dataset. Instead of consolidating multiple annotators’ answers during the dataset creation, we propose testing whether the label selection can be integrated into the model [295]. Integrating the label selection into the machine learning part could improve classification performance. It could also allow us to include more sentences in the dataset, especially the edge cases that our restrictive consolidation currently discards.

To improve the model design, we propose designing the model specifically for sentences with multiple targets, for example, by classifying multiple targets in a sentence simultaneously. While we tested various such designs early on, we did not report them due to their comparably poor performance. Further work in this direction should perhaps also focus on devising specialized loss functions that relate multiple targets and their polarities to one another. Lastly, one can improve various technical details of the proposed model, e.g., by testing other interaction layers, such as LSTMs, or by using layer-specific learning rates in the overall model, which can increase performance [347].

5.3.8 Conclusion

In this section, we presented NewsMTSC, a dataset for target-dependent sentiment classification (TSC) on news articles consisting of 11.3k manually annotated examples. Compared to prior TSC datasets, the dataset differs in key factors, such as that its texts are on average 50% longer, sentiment is expressed explicitly only rarely, and there is a separate test set for sentences containing multiple targets. In part as a consequence of these differences, state-of-the-art TSC models yielded suboptimal performance in our evaluation.

We proposed a model that uses a bidirectional GRU on top of a language model (LM) and other embeddings, instead of the masking or weighting mechanisms employed by the prior state of the art. We found that the proposed model achieved superior performance on NewsMTSC and was competitive on prior TSC datasets. RoBERTa yielded better results than BERT because RoBERTa is pre-trained on news texts and, as we found, can better resolve in-sentence relations between targets, i.e., it can better distinguish the individual sentiments if multiple targets are present in a sentence.

In the context of the PFA approach, TSC represents the method we propose to use in the frame analysis component. Conceptually, the TSC method is simpler than the fine-grained framing effect classification proposed in Sect. 5.2. At the same time, the TSC method represents a pragmatic alternative to imitating part of the manual frame analysis as conducted in social science research on media bias. Due to its simplicity, the TSC method achieves substantially higher classification performance than the approach for frame property identification.

We provide the dataset and the code to reproduce our experiments at https://github.com/fhamborg/NewsMTSC.

5.4 Summary of the Chapter

This chapter presented frame analysis as the second and last analysis component of person-oriented framing analysis (PFA). The component aims to identify how persons are portrayed in the given news articles, both at the article and sentence levels. This task is difficult for various reasons, such as that news articles rather implicitly or indirectly frame persons, for example, by describing actions performed by a person. In sum, reliably inferring how news articles and sentences portray persons is much more complex than the tasks addressed in prior work in related fields. For example, target-dependent sentiment classification (TSC) is concerned with inferring a sentence’s sentiment toward a target concept. TSC methods achieve high classification performance, but only in domains where authors explicitly state their attitude toward the targets, such as product reviews or Twitter posts. Because of this difficulty and the other issues highlighted in the chapter, prior approaches concerning frame analysis yield inconclusive or superficial results or require high manual effort. At the same time, automatically and reliably identifying how persons are portrayed in news articles is essential for the success of PFA. We explored two approaches for the frame analysis component: identifying fine-grained frame properties and classifying target-dependent sentiment.

Our first, exploratory approach identifies how a person is portrayed using so-called frame properties (Sect. 5.2). Frame properties are categories that represent predefined, topic-independent effects of political framing. As such, the approach aims to resemble how media bias is analyzed in social science research while avoiding the topic-specific and analysis-question-specific frames used there. Early during our research on this approach, we conducted a short qualitative evaluation and found inconclusive results. These inconclusive results and the already very high annotation cost at that point are difficulties common among prior automated approaches that aim to resemble frame analyses (Sect. 5.2.5).

We took these issues as a motivation to explore a more pragmatic route to our frame analysis component. Specifically, we devised a dataset and deep learning model for target-dependent sentiment classification (TSC) in news articles (Sect. 5.3). Similar to the frame properties approach outlined previously, the TSC method aims to identify how a person is portrayed using categories representing predefined, topic-independent effects of political framing. In contrast to frame properties, TSC uses only a single dimension as a fundamental effect of framing: polarity, i.e., whether a person is portrayed positively, negatively, or neither. This way, we avoid the infeasibly high annotation cost and ambiguity of analyzing frames or frame derivatives while still capturing an essential framing effect. In contrast to any prior work, our method is the first to reliably classify sentiment in news articles (F1m = 83.1) despite the high level of interpretation required.

In the evaluation described in Chap. 6, we will investigate whether analyzing only a single framing effect dimension, i.e., sentiment polarity, instead of fine-grained framing effects, such as the frame properties, suffices to identify meaningful person-oriented frames.