6.1 Introduction

As stated in Chap. 1, empowering newsreaders to recognize biases in political coverage is crucial since slanted news coverage can decisively impact public opinion and societal decisions, such as in elections. Fitting means to enable a more balanced interaction with news media include practicing media literacy. However, while such non-technical means can be highly effective, they require high effort, e.g., for researching the articles on an event and contrasting their coverage. This high effort may represent an insurmountable barrier, preventing critical assessment during daily news consumption. Automated approaches that effortlessly identify and expose potential biases can complement manual media literacy techniques or even enable them in the first place during daily news consumption (Chap. 3).

This chapter introduces and evaluates Newsalyze, our prototype system to reveal biases in news articles by employing person-oriented framing analysis (PFA). While the previous chapters devised methods for PFA and evaluated their technical performance, this chapter employs a large-scale user study to evaluate the practical effectiveness of the PFA approach in revealing biases. Our goal is to encourage non-expert news consumers to contrast how news articles report on individual events and to investigate whether our prototype supports its users in doing so.

The remainder of the chapter is structured as follows. Section 6.2 summarizes the most related findings of the literature review described in Chap. 2. Section 6.3 introduces our prototype system Newsalyze, which implements PFA by integrating target concept analysis and frame analysis. Section 6.4 introduces layouts and components to build modular visualizations to reveal biases. Section 6.5 presents the study design to evaluate our prototype in a setting that resembles real-world news consumption. In Sect. 6.6, we use two pre-studies to confirm and refine the design and visualizations. Section 6.7 presents the results of the study, and Sect. 6.8 discusses the limitations of both the prototype and the study to derive future research ideas. Lastly, Sect. 6.9 summarizes the main findings of our approach, and Sect. 6.10 concludes this chapter by discussing these findings in the context of this doctoral thesis.

We publish the survey materials, including questionnaires, news articles, visualizations, and anonymized respondents’ data at

https://doi.org/10.5281/zenodo.4704891.

The source code of the Newsalyze prototype is available at

https://github.com/fhamborg/newsalyze-backend/.

6.2 Background

This section briefly defines terms relevant to our study (see Sect. 6.2.1) and summarizes prior work on the tasks of bias identification and bias communication (see Sect. 6.2.2). More in-depth information concerning the reviewed approaches can be found in Chap. 2.

6.2.1 Definitions

We use our definition of media bias as introduced in Sect. 3.2. Specifically, we define bias as the effect of framing, i.e., the promotion of “a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation” [79] that arises from one or more of the bias forms defined by the news production process.

We define bias-awareness generally as an effect of bias communication. In practical terms, we define bias-awareness in this chapter as an individual's motivation and ability to relate and contrast perspectives present in news coverage, both to one another [276] and to the individual's own views [104].

6.2.2 Approaches

Following the thesis's research objective (Sect. 3.3.3), we briefly summarize approaches for the analysis and communication of media bias (also called bias diagnosis, measurement, and mitigation [276]). In this chapter, we exclude other means of tackling media bias, such as bias prevention during the production of news, because they are impractical for this purpose (see our discussion of the solution space in Sect. 3.3).

Our literature review on prior work concerned with the analysis or communication of media bias reveals that bias-sensitive visualizations can effectively increase news consumers' bias-awareness. Thus, such approaches may in principle support news consumers in making more informed choices [22]. However, we also find that the reviewed approaches suffer from one or more of the following shortcomings.

High Cost and Lack of Recency

Content analyses and frame analyses are among the most effective bias analysis tools. Decades of research in the social sciences have proven them effective and reliable, e.g., in capturing even subtle yet powerful biases (cf. [79]). However, because researchers need to conduct these analyses mostly manually, the analyses do not scale with the vast amount of news (Sect. 2.2.4). In turn, such studies are conducted only for few topics and only in retrospect; they do not deliver insights for the current day [228, 267]. Such timely insights would, however, be an effective means to support readers in critically assessing the news during daily news consumption (see Sect. 3.3.3).

Superficial Results

Many automated approaches for bias identification suffer from superficial results, especially when compared to the results of analyses as conducted in the social sciences (Sect. 2.5). Reasons include that the approaches treat media bias as a rather vaguely or broadly defined concept, e.g., “differences of [news] coverage” [278], “diverse opinions” [251], or “topic diversity” [252], and neglect social science bias models (see Chap. 2). Further, especially early approaches [252, 276] suffer from poor performance since word-, dictionary-, or rule-based methods as commonly employed in traditional machine learning fail to capture the “meaning between the lines” [126]. To improve performance, some approaches employ crowdsourcing [8, 277, 332], e.g., to gain bias ratings. Crowdsourcing can be an effective means to gather labeled data. However, such data is problematic if not carefully reviewed for biases [154], e.g., if users are not a representative sample or already biased through earlier exposure to systematically biased news coverage. Other approaches approximate biases by grouping news articles according to their news outlets’ respective political orientation [8]. Recent methods that employ deep learning or word embeddings can yield more substantial results, e.g., they identify framing categories that reflect meaningful patterns in the analyzed news texts. However, the creation of large-scale datasets required for their training is very costly (Sect. 5.1), and semi-automated approaches require careful manual revision of the automatically identified bias categories [193].

Inconsistency

The design of some approaches only facilitates the visibility of biases that might be in the data rather than identifying meaningful biases present in the data. Reasons for this inconsistency partially overlap with the previously mentioned reasons for automated approaches’ superficial results. Additional reasons include that approaches do not analyze the articles’ content to determine their biases but approximate potential biases using metadata, such as the political orientation of the articles’ outlets [8]. Others analyze the content but use only shallow or non-representative features, e.g., they analyze only headlines but not the remainder of articles [187].

Besides, many of the previously mentioned approaches are expert systems and thus not suitable for daily news consumption. While there are some easy-to-use systems and visualizations, especially outside the academic context, they either suffer from the previously mentioned shortcomings concerning bias identification [8] or are entirely bias-agnostic. For example, Fig. 6.1 depicts the bias-agnostic news overview provided by Google News. The main part in the center shows a list of current news events, where for each event, multiple articles reporting on it are shown. No information is available concerning how these articles are selected. However, we find that they are selected to represent the event and to fit the user's preferences, e.g., they are from the user's favorite news outlets or are geographically close.

Fig. 6.1 Screenshot of the bias-agnostic news overview provided by Google News

Besides its superficial bias analysis approach, AllSides provides an easy-to-use visualization for bias communication [8], which is intended to quickly reveal biases present in current event coverage. Figure 6.2 depicts the news overview provided by AllSides. Like the Google News overview, a list shows current events and, for each event, multiple articles reporting on the event. In contrast to Google News and other popular news aggregators, AllSides aims to inform users about the different perspectives present in the event coverage. Therefore, AllSides shows one event-representative article at the top and, below it, three articles, one each from a left-wing, a center, and a right-wing news outlet.

Fig. 6.2 Screenshot of the bias-sensitive news overview provided by AllSides

In sum, most prior studies confirm the effectiveness and benefits of communicating biases to news consumers. However, prior work suffers from various shortcomings, such as requiring manual analyses, yielding superficial results, only facilitating the visibility of media bias that might be in the data, or requiring training before their use.

6.3 System Description

This section introduces our prototype Newsalyze. The prototype integrates the methods devised previously in this thesis to implement the person-oriented framing analysis (PFA) approach.

Given a set of news articles reporting on the same political event, our system seeks to find and visualize groups of articles that similarly frame the persons involved in the event. The system does so in three phases: article gathering, bias analysis, and bias communication. This section summarizes our previous research concerning article gathering (Sect. 3.5) and bias analysis using PFA (Sect. 3.4). Section 6.4 then introduces our novel visualizations for bias communication.

For article gathering, we integrate our crawler and extractor specifically tailored for news articles (Sect. 3.5). Users provide the system with a set of URLs linking to news articles reporting on the same event to be analyzed by Newsalyze. The news crawler then extracts the required information from the articles’ web pages, i.e., title, lead paragraph, and main text. Alternatively, users can directly provide news articles to the system, e.g., by providing JSON files containing the previously mentioned information.
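For illustration, a directly provided article might look as follows, shown here as a Python dict representing the JSON content; the exact schema is our assumption, as the text specifies only the title, lead paragraph, and main text.

```python
# Hypothetical input format for directly provided articles; field names beyond
# title, lead paragraph, and main text are assumptions for illustration only.
article = {
    "url": "https://example.com/debt-ceiling-deal",
    "title": "Congress reaches deal to raise the debt ceiling",
    "lead_paragraph": "...",
    "main_text": "...",
}
```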

The bias identification using PFA consists of the three tasks depicted in Fig. 3.1 in Sect. 3.4. First, we perform NLP preprocessing as described in Sect. 4.3.3.1. We use our split preprocessing (Sect. 4.3.3.1) since it yields better coreference resolution performance than the standard preprocessing (Sect. 4.3.4.3).

In the following, we describe the subsequent tasks of PFA, i.e., target concept analysis and frame analysis.

Target concept analysis finds and resolves persons mentioned across the topic's articles, including highly event-specific coreferences as they frequently occur in person-targeting bias forms. As highlighted in Chap. 1 and Sect. 2.3.4, especially in the presence of bias by word choice and labeling, persons' mentions may be coreferential only in the coverage of a specific event, but otherwise they may not be coreferential or may even be opposing in meaning, such as "regime" and "government." To resolve such mentions, we use the method for context-driven cross-document coreference resolution described in Sect. 4.3. Specifically, we use the variant employing the first two sieves since they suffice to achieve the highest performance on individual persons (see concept type Actor in Sect. 4.3.4.3). The output of target concept analysis is the set of persons involved in the news coverage of the event and, for each person, all the person's mentions across all news articles.
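For illustration, this output might be represented as follows; the concrete structure is a hypothetical sketch, not the system's actual format.

```python
# Hypothetical output structure of target concept analysis: each resolved
# person maps to all of their mentions across the event's articles, including
# event-specific coreferences such as "government" vs. "regime" (see above).
persons = {
    "p0": {
        "mentions": [
            {"article": 0, "sentence": 3, "text": "the government"},
            {"article": 2, "sentence": 1, "text": "regime"},
        ],
    },
    # ... one entry per person involved in the event
}
```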

Frame analysis determines how the news articles portray the persons involved in the event and then finds groups of articles that similarly portray these persons. This task centers around our concept of person-targeting framing, which resembles (political) framing as defined by Entman [79], where a frame "promotes a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation" [79]. Our person-oriented frames resemble these political frames but are somewhat exploratory, e.g., implicitly defined and loosely structured. As we discuss in Sect. 3.3.2, identifying frames would approximate content analyses, the standard tool used in the social sciences to analyze media bias (Sect. 2.2.4). However, doing so would require infeasible effort since researchers in the social sciences typically create frames for a specific research question [45, 46, 79]. PFA, in contrast, is meant to analyze media bias in any coverage reporting on policy issues. Thus, we seek to determine a fundamental bias effect resulting from framing: the polarity toward individual persons, which we identify for each person mention (on the sentence level) and aggregate to the article level. To achieve state-of-the-art performance in target-dependent sentiment classification (TSC) on news articles, we use our RoBERTa-based [214] neural model trained on our dataset (Sect. 5.3).
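For illustration, such sentence-level, target-dependent polarity classification can be invoked as in the following sketch, which uses the publicly available NewsSentiment package; treating that package as interchangeable with the exact model of Sect. 5.3 is our assumption.

```python
# Hedged sketch of target-dependent sentiment classification (TSC): classify
# the polarity of a person mention given its left and right sentence context.
# Assumes the NewsSentiment package and its infer_from_text() API; whether it
# matches the exact model of Sect. 5.3 is an assumption.
from NewsSentiment import TargetSentimentClassifier

tsc = TargetSentimentClassifier()
# Example sentence from this chapter; "Trump" is the target mention.
result = tsc.infer_from_text("The Mueller report was tough on ", "Trump", ".")
print(result[0])  # most probable class, e.g., negative
```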

The last step of frame analysis is to determine groups of articles that similarly frame the event, i.e., the persons involved in it. We call the resulting groups framing groups. By definition, all articles of one framing group share the same person-oriented frame, i.e., they represent one perspective present in the event coverage. We propose two grouping methods (see the code sketch following Eq. (6.2)). (1) Grouping-MFA, a simple, polarity-based method, first determines the single person that occurs most frequently across all articles, called the most frequent actor (MFA). Then, the method assigns each article to one of three groups, depending on whether the article mentions the MFA mostly positively, ambivalently, or negatively. (2) Grouping-ALL considers the polarity of all persons instead of only the MFA. Specifically, it uses k-means with k = 3 on a set of vectors, where each vector a represents a single news article:

$$\displaystyle \begin{aligned} a = \begin{pmatrix} s_0 \\ \vdots \\ s_{|P|-1} \end{pmatrix}, \end{aligned} $$
(6.1)

where P is the set of all persons and s_i, i ∈ {0, …, |P| − 1}, is the sentiment polarity of the i-th person in a, given by

$$\displaystyle \begin{aligned} s_i = \sum_{m\in M_i}{\frac{w(m)\,s(m)}{m_{\mathrm{max},a}}}, \end{aligned} $$
(6.2)

where M_i is the set of the person's mentions in a, w(m) is a weight depending on the position of mention m (mentions at the beginning of an article are considered more important [53]), and s(m) yields the polarity score of m (1 for positive, −1 for negative, 0 otherwise). To account for the individual persons' frequency in an article during clustering, we normalize by m_max,a, the number of mentions of the most frequent person in a.
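To make the grouping concrete, the following minimal sketch implements both methods under assumed input structures; it is an illustration, not the actual Newsalyze code. The mention format, function names, and the MFAP threshold are our assumptions.

```python
# Minimal sketch of Grouping-MFA and Grouping-ALL (Eqs. 6.1 and 6.2). Each
# article is a list of mentions: dicts with 'person' (id), 'weight'
# (positional weight w(m)), and 'polarity' (s(m) in {-1, 0, 1}).
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def article_vector(mentions, persons):
    """Eq. 6.1: one polarity entry s_i per person, normalized by m_max,a."""
    counts = Counter(m["person"] for m in mentions)
    m_max = max(counts.values())  # mentions of the article's most frequent person
    return np.array([
        sum(m["weight"] * m["polarity"] for m in mentions if m["person"] == p) / m_max
        for p in persons  # Eq. 6.2 for each person p
    ])

def grouping_all(articles, persons, seed=0):
    """Grouping-ALL: k-means (k=3) over the articles' polarity vectors."""
    X = np.stack([article_vector(a, persons) for a in articles])
    return KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)

def grouping_mfa(articles, threshold=0.2):
    """Grouping-MFA: assign each article to positive / ambivalent / negative
    depending on its aggregated polarity toward the most frequent actor
    (MFA); the threshold is an assumed tuning parameter."""
    mfa = Counter(m["person"] for a in articles for m in a).most_common(1)[0][0]
    groups = []
    for a in articles:
        score = sum(m["weight"] * m["polarity"] for m in a if m["person"] == mfa)
        groups.append("positive" if score > threshold
                      else "negative" if score < -threshold else "ambivalent")
    return groups
```

For Grouping-ALL, the resulting cluster labels correspond to the three framing groups; for Grouping-MFA, the labels directly encode whether an article portrays the MFA mostly positively, ambivalently, or negatively.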

In addition to grouping, we calculate each article's relevance to the event and to the article's group using simple word-embedding scoring.
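The text does not detail this scoring; a minimal sketch of one common variant, cosine similarity between averaged word embeddings, could look as follows. The embed lookup is an assumed pretrained word-embedding table, and using this particular scoring is our assumption.

```python
# Hedged sketch of "simple word embedding scoring": score an article's
# relevance as the cosine similarity between its averaged token embeddings
# and those of the event (the group's relevance is computed analogously);
# `embed` is an assumed mapping from token to embedding vector.
import numpy as np

def avg_embedding(tokens, embed):
    # assumes at least one token is in the embedding vocabulary
    vecs = [embed[t] for t in tokens if t in embed]
    return np.mean(vecs, axis=0)

def relevance(article_tokens, event_tokens, embed):
    a = avg_embedding(article_tokens, embed)
    e = avg_embedding(event_tokens, embed)
    return float(np.dot(a, e) / (np.linalg.norm(a) * np.linalg.norm(e)))
```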

6.4 Visualizations

Our visualizations aim to aid in typical online news consumption, i.e., an overview enables users to first get a synopsis of news events and articles (Sect. 6.4.1) and an article view shows an individual news article (Sect. 6.4.2). We seek to devise visualizations that (1) are easy to understand, i.e., usable by non-experts without prior training, and that (2) reveal biases (see our research objective described in Sect. 3.3.3). To measure the effectiveness not only of our visualizations but also their constituents, we design them so that individual visual features can be altered or exchanged. Later, in our conjoint-based evaluation [122], we can measure the effects of each constituent, i.e., the individual visual clues. In the following, a “conjoint profile” refers to a specific combination of all visual clues. For example, one specific conjoint profile of the news overview would show certain visual clues with specific settings while not showing other visual clues. The conjoint design will be described in detail in Sect. 6.5.2.

To more precisely measure the change in bias-awareness concerning only the textual content, as required by our research objective (Sect. 3.3.3), we deviate in a few ways from typical news consumption. For example, the visualizations show the texts of articles (and information about biases in the texts) but no other content, e.g., no photos or outlet names. Further, in our study, the overview shows only a single event instead of multiple events.

6.4.1 Overview

The overview aims to enable users to quickly get a synopsis of a news event. We devise a modular, bias-sensitive visualization layout, which we use to implement and test specific visualizations. The comparative layout aims to support users in quickly understanding the frames present in coverage on the event. The layout is vertically divided into three parts, denoted as parts a, b, and c in Fig. 6.3. The main article (part a) shows the event's most representative article. The comparative groups part (b) shows up to three perspectives present in event coverage by showcasing each perspective's most representative article. It is designed to encourage users to contrast the articles and critically assess their content. When employing PFA, these perspectives are person-oriented frames. Since we also test baselines representing the state of the art, we generally refer to perspectives in this section. To determine the framing groups, the system uses one of the grouping methods described in Sect. 6.3, i.e., Grouping-MFA or Grouping-ALL. Finally, a list shows the headlines of further articles reporting on the event (part c).

Fig. 6.3 Newsalyze's overview consists of three parts: the main article (part (a)) shows the event's most representative article; the comparative groups (part (b)) showcase up to three perspectives in the event coverage; and the further articles (part (c)) are a list of additional articles reporting on the event

All visualizations show brief explanations for all features that users may not be familiar with. For example, the overview contains a brief explanation informing users about what the comparative groups represent and how they were derived (see “1” in Fig. 6.4 for the specific variant of Grouping-MFA and “1” in Fig. 6.5 for the generic variant used by any grouping).

Fig. 6.4 Excerpt of the news overview showing three perspectives of news coverage on a debt ceiling event

Fig. 6.5 A news overview where the specific explanations (e.g., how the grouping was performed) and labels (e.g., headline tags) are replaced with generic variants. The added labels ("1," "3," and "4") refer to the same elements as depicted in Fig. 6.4

In each overview, two types of visual clues conveying bias information can be enabled and altered depending on the conjoint profile (see Sect. 6.5.2). First, zero or more headline tags are shown next to each article's headline. They indicate the political orientation of the article's outlet (PolSides tags; see "2" in Fig. 6.4), the article's overall polarity regarding the MFA as determined by Grouping-MFA (MFAP tags; see "3"), and the article's group according to its polarity regarding all persons as determined by Grouping-ALL (ALLP tags), respectively.

Second, labels and explanations in the visualization are either generic or specific. The specific variants explain how the grouping was specifically performed (see “1” in Fig. 6.4) and provide specific group labels (see “4” in Fig. 6.4). In contrast, all visualizations employing the generic variant use the same universal explanation, e.g., only mentioning that our system automatically determined the three perspectives (see “1” in Fig. 6.5), and use the same generic coloring and labels, e.g., “Perspective 1” as shown close to “3” and “4” in Fig. 6.5.

6.4.2 Article View

The article view visualizes a single news article. It thus represents the second step in typical news consumption, i.e., after getting an overview of current events, users subsequently may want to read individual articles of interest. The layout of the view is vertically divided into three parts, denoted as parts a, b, and c in Fig. 6.6. The bias information part (a) at the top contains various visual elements that aim to inform newsreaders about the bias and positioning of the current article. The main part (b) shows the given article’s headline, lead paragraph, and main text. Lastly, a list shows the headlines of further articles reporting on the event (part c). Within these three parts, various visual clues to communicate bias information are enabled, disabled, or altered depending on the conjoint profile. We describe them in the following.

Fig. 6.6 Newsalyze's article view consists of three parts: the bias information part (a) shows bias-related information concerning the given article; the main part (b) contains the given article; and the further articles part (c) is a list of additional articles reporting on the event

The bias information part (“a” in Fig. 6.6) contains up to three visual clues to inform about potential slants of the current article. Specifically, the polarity context bar aims to enable users to quickly understand the overall slant concerning the event’s MFA of the current and other articles. The 1D scatter plot depicted in Fig. 6.7 represents each article as a circle. The polarity context bar places each circle depending on its article’s overall polarity regarding the MFA. To quickly assess how the current article’s polarity compares to the other articles’ slants, the current article is highlighted using a bold circle (see “1” in Fig. 6.7). Users can interactively, i.e., by hovering their cursor over the circles, view individual articles’ headlines (see “2”).

Fig. 6.7 Polarity context bar showing the current and other articles' polarity regarding the MFA and a tooltip of the headline of a hovered article

Also within the bias information part, bias indicators show the article's framing group, analogously to the headline tags, i.e., the outlet's political orientation (PolSides) and how the article reports on the MFA, called MFAP (as identified by Grouping-MFA; see Sect. 6.3), or on all persons, called ALLP (as identified by Grouping-ALL). In contrast to the headline tags, which are shown beside all headlines, each indicator is a component that prominently shows the framing group of only the current article. Depending on the conjoint profile (identically to the headline tags), individual indicators are shown or disabled. Figure 6.8 depicts the PolSides bias indicator; Fig. 6.9 depicts the MFAP bias indicator.

Fig. 6.8 PolSides bias indicator showing the current article's political orientation as identified by its outlet

Fig. 6.9 MFAP bias indicator showing the current article's overall polarity concerning the MFA (here, the MFA "Prime Minister Scott Morrison" is shown ambivalently)

Within the main part of the article view (“b” in Fig. 6.6), in-text polarity highlights aim to enable users to quickly comprehend how the individual sentences of the current news article portray the mentioned persons. To achieve this, we visually mark mentions of individual persons within the news article’s text. We test the effectiveness of the following modes: single-color (visually marking a person mention using a neutral color, i.e., gray, if the respective sentence mentions the person positively or negatively), two-color (using green and red colors for positive and negative mentions, respectively), three-color (same as two-color and additionally showing neutral polarity as gray), and disabled (no highlights are shown). For example, in the sentence “The Mueller report was tough on Trump,” the person mention “Trump” has negative polarity and would be highlighted red in the two- and three-color modes.
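For illustration, the following sketch maps a mention's polarity and the active mode to a highlight; the concrete colors and the HTML rendering are our assumptions, not the prototype's actual front-end code.

```python
# Illustrative sketch of the in-text highlight modes: map a mention's polarity
# and the active conjoint mode to an HTML <span> background color. The hex
# colors are assumptions chosen for readability.
COLORS = {"positive": "#c8e6c9", "negative": "#ffcdd2", "neutral": "#e0e0e0"}

def highlight(mention_text, polarity, mode):
    """polarity: 'positive' | 'negative' | 'neutral'; mode: one of
    'disabled', 'single-color', 'two-color', 'three-color'."""
    if mode == "disabled":
        return mention_text
    if mode == "single-color":
        # mark polar mentions in neutral gray only
        color = COLORS["neutral"] if polarity != "neutral" else None
    elif mode == "two-color":
        # green for positive, red for negative, nothing for neutral
        color = COLORS.get(polarity) if polarity != "neutral" else None
    else:  # "three-color": additionally show neutral mentions in gray
        color = COLORS[polarity]
    if color is None:
        return mention_text
    return f'<span style="background-color:{color}">{mention_text}</span>'

print(highlight("Trump", "negative", "two-color"))  # red-highlighted mention
```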

Within the further articles list ("c" in Fig. 6.6), headline tags are shown and have the same purpose and function as in the overview (see Sect. 6.4.1).

6.5 Study Design

In this section, we present our study design to measure the effectiveness of our system. In contrast to prior bias studies, our design allows us to pinpoint the effectiveness to individual components. Section 6.7 presents our results, and Sect. 6.8 discusses the study’s limitations to derive future work ideas. All survey data, including questionnaires and anonymized respondents’ information, is available freely (see Sect. 6.1).

6.5.1 Objectives and Questions

We base our study design on our definition of bias-awareness (Sect. 6.2) as the primary metric to investigate the effectiveness of an analyzed means to “reveal biases” as requested by our research question (Sect. 1.3). In particular, the definition of bias-awareness highlights the need to contrast perspectives in the news as an effective means to become aware of biases, which in turn are defined as just these perspectives.

We focus in our study on the overview visualizations, with their comparative groups being the primary means to enable contrasting perspectives and thus reveal biases.

Q1: How does a bias-sensitive, easy-to-understand news overview improve bias-awareness in non-expert news consumers?

Secondarily, we seek to explore how bias-awareness can be affected by revealing biases in individual articles and by characteristics of the respondents themselves.

Q2: How does a bias-sensitive, easy-to-understand article view improve bias-awareness in non-expert news consumers?

Q3: How do demographic factors of news consumers affect their bias-awareness?

6.5.2 Methodology

We propose to use a conjoint design [218], which is especially suitable for estimating the effects of individual components. Traditional survey experiments are limited to identifying only the "catch-all effect" [122] due to confounding of the treatment components. In contrast, conjoint experiments identify "component-specific causal effects by randomly manipulating multiple attributes of alternatives simultaneously" [122]. In a conjoint design, respondents are asked to rate so-called profiles, which consist of multiple attributes. In our study, such attributes are, for example, the overview variant, which topic it shows (or which article the article view shows), and whether and which tags or in-text highlights are shown.

Conjoint experiments rest on three core assumptions: (1) stability and no carry-over effects, (2) no profile-order effects, and (3) randomization of the profiles [122]. In our evaluation, (2) holds by design for all tasks except for the forced-choice question (see workflow step 6 in Sect. 6.5.6). We briefly describe our means to ensure (1) and (3) in the following.

To ensure (1), i.e., the absence of carry-over effects from one task set to another, we applied the diagnostics proposed by Hainmueller, Hopkins, and Yamamoto [122] during the study. We refer to a task set as all tasks shown to a respondent for a single topic; e.g., in our main study, we show respondents one overview and subsequently two article views for each topic. We then calculated whether there are meaningful differences across the task sets by building a sub-group for each task set. We found weak carry-over effects when comparing the individual attributes' effects (using our main overview question across the task sets) and when testing the effect of the task sets' order: for all overview questions combined (Est. = 3.28%, p = 0.018), respondents were on average more bias-aware in the second task set. Further, in our main study, when sub-grouping for the task set, the other attributes' effects differed. However, this is not necessarily problematic: a learning effect is expected and desirable in bias communication. Since we randomized the attributes within each task set, we can include the task sets in the analysis and thus measure the effects regardless of the task set.

We ensure (3) by randomly choosing the attributes independently of one another and for each respondent. To confirm that the randomization was successful in our experiments, we employed a Shapiro-Wilk test during the study [296].
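For illustration, such independent randomization can be sketched as follows; the attribute names and levels loosely follow Sects. 6.4 and 6.5, but their exact encoding is our assumption.

```python
# Illustrative sketch of assumption (3), randomization of profiles: each
# attribute is drawn independently and uniformly for every respondent and
# task set. Attribute names and levels are assumptions for illustration.
import random

ATTRIBUTES = {
    "overview": ["none", "plain", "polsides", "mfa", "all"],
    "explanations": ["specific", "generic"],
    "headline_tags": ["none", "polsides", "mfap", "allp"],
    "in_text_highlights": ["disabled", "single-color", "two-color", "three-color"],
}

def random_profile(rng=random):
    """Draw one conjoint profile, i.e., one level per attribute."""
    return {attr: rng.choice(levels) for attr, levels in ATTRIBUTES.items()}
```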

Since the three assumptions hold in each of our experiments, our design allows for estimating the relative influence of each component on bias-awareness, called the average marginal component effect (AMCE) [122]. An AMCE represents the effect of an attribute level, e.g., the two-color mode (attribute level) of our in-text highlights (attribute), compared to a pre-selected baseline of that attribute, e.g., not showing any in-text highlights. In simplified terms, the concept of AMCEs is to create two subsets, one for the current attribute level and one for the attribute baseline. Then, the respondents' answers, e.g., to our questions in the post-article questionnaire, are averaged in each subset. Lastly, by comparing the averaged answers of both subsets, the AMCE represents the increase or decrease due to an attribute's specific level compared to the attribute's baseline.
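The following toy sketch illustrates this simplified intuition (in practice, AMCEs are estimated with regression-based estimators [122]); all numbers are fabricated for illustration only and are not study results.

```python
# Toy illustration of the simplified AMCE intuition: average the
# bias-awareness answers under an attribute level and under the attribute's
# baseline, then compare. The ratings below are fabricated examples.
responses = [  # (in-text highlight mode, bias-awareness rating)
    ("disabled", 5.1), ("disabled", 4.8), ("disabled", 5.4),
    ("two-color", 6.0), ("two-color", 5.7), ("two-color", 6.3),
]

def mean(xs):
    return sum(xs) / len(xs)

baseline = mean([r for mode, r in responses if mode == "disabled"])
level = mean([r for mode, r in responses if mode == "two-color"])
print(f"AMCE estimate (two-color vs. disabled): {level - baseline:+.2f}")
```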

In our questionnaires, we employ discrete choice (DC) as well as rating questions (see Sect. 6.5.6) to measure bias-awareness on a behavioral as well as attitudinal level [303]. DC questions are widely used within the conjoint design and found to have high external validity in mimicking real-world behavior [121]. Additionally, DC questions elicit behavior, i.e., which news article respondents prefer to read or rely on for decision-making [287]. In contrast, rating questions capture attitudes and personal viewpoints better [380].

6.5.3 Data

We select four news topics with varying degrees of expected bias among the news articles reporting on the topic. To approximate the degree of bias, we use the topics' expected polarization. Specifically, we select three topics expected to be highly polarizing for US news readers: gun control (Orlando shooting in 2016), debt ceiling (discussions in July 2019), and abortion rights (Tennessee abortion ban in June 2020). To better approximate regular news consumption, where consumers typically are exposed to news coverage on single events, we select a single event for each of these topics (shown in parentheses). We add a fourth event, which we expect to be only mildly polarizing: the Australian bush fires, i.e., a foreign event without direct US involvement. As described later, we added the abortion topic during our pre-studies due to a negative influence of the debt ceiling topic on bias-awareness. In the pre-studies, we could trace this back qualitatively to respondents' critique of the topic being too "boring" and "complicated," which also manifested in lower average reading times.

For each event, we select ten articles from left-wing, center, and right-wing online US outlets (political orientation as self-identified by the outlets or taken from [8]). To ensure high quality, we manually retrieve the articles' content. Before the second pre-study, we shortened all articles to a similar length (300–400 words) to address the high reading times and noise in the responses, a key finding of the first pre-study (see Sect. 6.6). We consistently apply the same shortening procedure to preserve the perspectives of the original articles, for example, by maintaining the relative frequency of person mentions and by discarding only redundant sentences that do not contribute to the overall tone. In all experiments, we remove any non-textual content, such as images, to isolate the effects on bias-awareness due to the text content, our text-centric bias analysis, and the visualization.

6.5.4 Setup and Quality

We conduct our experiments as a series of online studies on Amazon Mechanical Turk (MTurk). To participate in any of our studies, crowdworkers have to be located in the USA. To ensure high quality, we further require that participants possess MTurk's "Masters" qualification, i.e., have a history of successfully completed, high-quality work. While we always compensate respondents, our study design includes discarding the data of any respondent who fails to meet all quality criteria, including a minimum study duration and correctly answering questions checking attention and seriousness [12]. Depending on the study's duration, participants receive an assignment compensation that approximates an hourly wage of $10.

6.5.5 Overview Baselines

To answer our primary study question Q1, our study design compares our system and the overview visualization variants with baselines that resemble news aggregators popular among news consumers and an established bias-sensitive news aggregator (cf. [276]).

Plain is an overview variant that resembles popular news aggregators (Fig. 6.10). Using a bias-agnostic design similar to Google News, this baseline shows article headlines and excerpts in a list sorted by the articles' relevance to the event. Section 6.3 describes the relevance calculation. Compared to our bias-sensitive overview design described in Sect. 6.4.1, Plain does not contain the comparative groups (part b in Fig. 6.3) but only the main article and the list of further articles (parts a and c). Besides each headline, headline tags as described in Sect. 6.4.1 can be shown, depending on the conjoint profile.

Fig. 6.10 The Plain news overview uses a list to show articles reporting on a topic

PolSides is an overview variant that aims to resemble the bias-sensitive news aggregator AllSides [8]. PolSides yields framing groups by grouping articles depending on their outlets' political orientation (left, center, and right, as self-identified by the outlets or taken from [8]). The visualization uses the same bias-sensitive layout consisting of three vertical parts as described in Sect. 6.4.1 and shown in Fig. 6.3. Figure 6.11 shows an excerpt of the main article and the comparative groups (parts a and b). Conceptually, PolSides employs the left-right dichotomy, a simple yet often effective means to partition the media into distinctive slants and one of the most commonly studied dimensions of bias. Since this concept is well known and easy to interpret, we also expect that users will initially understand PolSides's approach to determining the framing groups. However, the dichotomy is determined only on the outlet level. It thus may misclassify the biases actually present in coverage of a specific event (see Sect. 6.2.2), e.g., articles shown as having different slants may in fact share similar perspectives (and vice versa). We investigate this issue in our study (see Sect. 6.7).

Fig. 6.11 The PolSides news overview aims to resemble the bias-sensitive news aggregator AllSides, which groups articles depending on their outlets' political orientation

To our knowledge, the baselines exhaustively cover the relevant prior work, particularly concerning the communication of biases in news articles to non-expert news consumers (see Sect. 6.2). While we deem NewsCube another conceptually very relevant approach because of its similar research objective, the visualizations proposed by Park et al. [276] are designed to show only a single article rather than providing a news overview and thus do not allow comparison.

As stated in Sect. 6.4, we are interested in the effects on the bias-awareness due to textual means, i.e., bias forms at the text level. Thus, all visualizations, including the baselines, show the texts of articles and information about biases due to the textual bias forms defined in Sect. 3.3.3 but no other content, e.g., no photos or outlet names.

To understand how visualizations, including their layout and explanations, affect bias-awareness compared to the visualized content, e.g., the framing groups resulting from our analysis, we introduce two additional baseline concepts. First, for most overviews, including the previously mentioned baselines, we include generic variants (see Sect. 6.4.1). This single-blind setting helps to assess how respondents are affected by knowing (such as for the popular left-right dichotomy employed by PolSides) or not knowing (such as for our novel PFA approach) the employed grouping mechanism. Second, we test an overview with generic explanations that randomly assigns individual news articles to one of the three framing groups.

Concerning our secondary study question Q2, we test two headline tags (PolSides and MFAP) jointly with their respective indicators showing the article’s bias classification (PolSides and MFAP), each as described in Sect. 6.4.2. For example, if PolSides headline tags (showing the political orientation of each article in the list of further articles) are enabled, likewise is the bias indicator enabled (prominently showing the political orientation of the current article).

6.5.6 Workflow and Questions

Our study consists of seven steps. We refer to a task set as the sequence of steps associated with one topic, i.e., task set 1 refers to the first topic shown to a respondent, including the overview, the two article views, and the respective questionnaires (steps 2–6). The (1) pre-study questionnaire asks for demographic and background data [332], such as age, political orientation, education, news consumption, and attitudes toward the topics we used [93], e.g., "Generally, laws restricting abortion are [wrong–right]," "Generally, laws restricting the use of guns are [wrong–right]," and "Generally, laws restricting environmental pollution are [wrong–right]." For these and other questions concerning bipolar adjective pairs, we use 10-point Likert scales. We also ask respondents whether the mentioned topics are personally relevant or irrelevant to them. Lastly, we ask whether they generally perceive the media to be biased against their views, to better distinguish the treatment effects from prior skepticism (also called the hostile media effect [282]).

Afterward, we show an (2) overview as described in Sects. 6.4.1 and 6.5.5, including instructions shown prior to the overview. The (3) post-overview questionnaire then operationalizes bias-awareness in respondents (see Sect. 6.2.1) by asking about their perception of the diversity of and disagreement between viewpoints, whether the visualization encouraged them to contrast individual headlines, and how many perspectives of the public discourse were shown, e.g., "Do you think the coverage shown in the previous visualization represents all main viewpoints in the public discourse (independent of whether you agree with them or not) [not at all–very much]?" and "Overall, how did you perceive the articles shown in the previous visualization [very different–very similar; very opposing–very agreeing]?" To match our definition of bias-awareness, we use as our main question (cf. [277]): "When viewing the topic visualization, did you have the desire to compare and contrast articles [not at all–very much]?"

Afterward, we show an (4) article view as described in Sect. 6.4.2. A (5) post-article questionnaire operationalizes bias-awareness in respondents on the article level [332], i.e., "How did you perceive the presented news article? [very unfair–very fair; very partial–very impartial; very unacceptable–very acceptable; very untrustworthy–very trustworthy; very unpersuasive–very persuasive; very biased–very unbiased]." We also ask whether the article contains political bias and biases against persons mentioned in the article. Since we show two task sets, we repeat steps 2–5 twice; after each overview, we show two articles, i.e., steps 4 and 5 are repeated two times. To measure the effect of seeing an overview before an article, we also introduce a variant in which the overview steps (2 and 3) are skipped entirely. Afterward, a (6) discrete choice question asks respondents to choose between two articles, i.e., which one they consider to be more biased. In a (7) post-study questionnaire, respondents give feedback on the study, i.e., what they liked and disliked.

In the two pre-studies, where we tested the study design and the usability of the visualizations (see Sect. 6.6), we repeated the same procedure with only one article after each overview and excluded step 5. In the first pre-study, we also excluded steps 2–6, since we only tested bias-sensitive overviews.

6.6 Pre-studies

Before our main study, we conducted two pre-studies (E1 and E2). E1 consisted of 260 respondents recruited on MTurk (we discarded 3% from 268 respondents due to the quality criteria described in Sect. 6.5.4). E2 consisted of 98 respondents (we discarded 11% from 110).

The pre-studies aimed at testing the study design described in Sect. 6.5 and the usability of the visualizations described in Sect. 6.4. Further, we used the first pre-study to find a set of well-performing overviews, including representative baselines. This selection was necessary to satisfy the conjoint assumption "randomization of profiles." This assumption also requires that all profiles have the same set of attributes (Sect. 6.5.2). However, the number of attributes differs across our overviews (Sect. 6.5.5). For example, Plain has only two attributes (one for each headline tag), our bias-sensitive overview layout (Sect. 6.4.1) has an additional grouping attribute, and "no overview" naturally has no attributes. Thus, by determining which variants of the bias-sensitive layouts performed best in the pre-studies, we could fix these overviews' attributes and compare them in the second pre-study and our main study.

Our first pre-study, E1, aimed to confirm the overall study design and to collect effect data to make an informed selection of overview variants, both for the PFA approach and the baselines. For the latter purpose, we tested only variants using our bias-sensitive overview layout, where we randomly varied all attributes, i.e., the grouping and the two headline tags. We identified (primarily insignificant) trends that indicated well-performing variants. In E2, we then tested the same design as planned for the main study (see Sect. 6.5.6), including the article view and the other baselines (see below).

We also used the pre-studies to improve our design and visualizations. The partially mixed results in both pre-studies were caused by various usability issues interfering with the visualizations' effectiveness. For example, in E1, respondents reported that they wanted to know how the grouping was performed and by whom. Before conducting E2, we addressed these shortcomings, e.g., by adding explanations (specific and generic) about how our system derives the classifications. After addressing these issues, we found positive, significant effects of our bias-sensitive overviews in the second pre-study, confirming our research design concerning the overview.

E2 revealed that showing both headline tags was most effective in improving bias-awareness in the Plain baseline. In contrast, for the bias-sensitive overviews, bias-awareness remained unchanged or decreased if one or both tags were shown. We suspect that users might feel overwhelmed if many visual clues are present, due to higher cognitive load and potential visual clutter. Further, the effect of the bias-sensitive layout itself was stronger than that of the headline tags when they were employed in such a layout. In sum, headline tags seemed to be most effective when employed in an otherwise bias-agnostic visualization, such as Plain.

Using the pre-study findings, we defined the following overview variants for the main study: (1) No overview; (2) Plain as described in Sect. 6.5.5; (3) PolSides as described in Sect. 6.5.5 with PolSides headline tags enabled, to closely resemble the bias-sensitive news aggregator AllSides.com [8]; (4) MFA using the bias-sensitive layout (Sect. 6.4.1), Grouping-MFA (Sect. 6.3), and polarity headline tags enabled, which was the best-performing variant of MFA in our pre-studies; (5) PolSides-generic, identical to (3) but using generic explanations; (6) MFA-generic, identical to (4) but using generic explanations; (7) Random-generic using the bias-sensitive layout and random grouping; and (8) ALL-generic using the bias-sensitive layout, Grouping-ALL (Sect. 6.3), and cluster (ALLP) headline tags enabled. Note that we did not test a variant of Grouping-ALL with specific explanations.

In sum, we already found that the bias-sensitive overviews (PolSides and MFA) yielded significant, positive effects on bias-awareness, confirming the overall study design. We also identified weaknesses, e.g., respondents criticized the lack of transparency regarding how the framing groups were determined and by whom. Before our main study, we addressed the identified shortcomings, e.g., by adding explanations about how our system derives the classifications. The study design and the visualizations described in Sects. 6.4 and 6.5 are the results of our refinements using the pre-studies' findings. We also used the pre-studies to select a set of well-performing overview variants, including baselines, to be compared in the main study.

6.7 Evaluation

For our evaluation, we used the study design described in Sect. 6.5. In our main study, we recruited 174 respondents on MTurk, from which we discarded 8% using our quality measures. In sum, the n = 160 respondents (age: [23, 77], m = 45.5, gender (f/m/d): 72/88/0, all native speakers, political orientation (liberal (1)–conservative (10)): m = 4.83, sd = 2.98; see Appendix A.3) provided answers to 283 post-overview questionnaires (excluding “no overview”), 320 discrete choices on article views, and 640 post-article view questionnaires. The average study duration was 15 min (sd = 6.22). In the following, we present the results and discuss our findings for our primary study question regarding the overview and the secondary study questions concerning the article view and respondent factors (Sect. 6.5.1). If not noted otherwise, the reported effects were operationalized using the main question of the post-overview questionnaire and the additive score of all post-article questions (Sect. 6.5.6).

6.7.1 Overview

In our user study, the bias-sensitive overviews increased respondents’ bias-awareness compared to the Plain baseline. PolSides achieved the highest effect when shown with specific explanations (Est. = 21.34). In the single-blind setting, i.e., if shown with generic explanations, the PolSides baseline had no significant effect (Est. = 8.46, p = 0.17).

In contrast, the PFA approach strongly and significantly increased bias-awareness in both settings: when specific explanations were used, our grouping method achieved a strong effect (MFA: Est. = 17.80). In the single-blind setting, only the PFA approach consistently, significantly, and most strongly increased bias-awareness (MFA, 13.35; ALL, 17.54).

Discussion of the Approaches and Their Results

But why does PolSides lose its effectiveness in the single-blind setting, i.e., when generic explanations and labels are shown? Fully elucidating this question would require a larger sample of respondents and topics. However, we qualitatively and quantitatively identified three potential, partially related causes, which we outline in the following.

(1) Popularity and Intuition

The left-right dichotomy employed by PolSides is a well-known concept and easily understood by news consumers. No respondent reported not understanding the concept, and 20% of respondents exposed to PolSides praised that the bias concept, i.e., the grouping mechanism, was easy to understand, e.g., "I liked that it was laid out with left, center, right. It was intuitive." In contrast, PFA and its grouping techniques MFA and ALL are novel and somewhat technical, as are their (specific) explanations. For example, 40% of respondents exposed to MFA found its specific explanations (slightly) confusing and too "technical." In contrast, only 10% of respondents exposed to any of the generic variants, including MFA and ALL, reported comprehension issues. This improvement might lie in the generic explanations being less technical and more conceptual than the specific explanations of MFA.

These findings potentially indicate that a proportion of the bias-awareness effect in any visualization is due to encouraging users to look for frames and biases. Albeit not significant, the mild effects of the Random-generic overview (Est. = 6.73 as shown in Table 6.1) might serve as a rough approximation for the “base” effectiveness of bias-sensitive visualizations. This base effectiveness appears to be partially independent of the visualized content, such as the framing groups, and its meaningfulness. In practical terms, solely encouraging users to expect biases, e.g., in our study due to bias-sensitive layouts and explanations, can increase bias-awareness. Referring to intuitive or well-known bias concepts can improve this base effectiveness further, as indicated by the previously outlined effectiveness of PolSides that is only present with specific explanations. This effect of using well-known bias concepts is partially in line with our second finding (see afterward), i.e., the learning effect noticed for our novel bias and grouping concept.

Table 6.1 Shown are the effects on respondents’ bias-awareness after overview exposure. Column “Est.” shows the percentage increase in bias-awareness for the attributes CDCR, Overview, and Topic compared to their respective baselines, i.e., CoreNLP, Plain, and bushfire. Columns “SE,” “z,” and “p” refer to the standard error, z-score, and p-value, respectively. In column “p,” asterisks represent the significance level where weakly significant (“*”), significant (“**”), and strongly significant (“***”) refer to p < 0.05, p < 0.01, and p < 0.001, respectively

(2) Learning Effect

Our study indicates that the novel PFA approach might have benefited over the course of the study from respondents' increasing understanding of how the approach works. Tables 6.2 and 6.3 show the bias-awareness effects after overview exposure in the study's first and second task sets. While we notice an effect increase for all overview variants in the second task set compared to the first, there are key differences. First, while the PFA approach did not significantly increase bias-awareness in task set 1, its variants yielded the strongest, most significant effects in task set 2. Specifically, PFA with its MFA grouping achieved the strongest effect among all overviews (MFA, Est. = 28.12; PolSides, 23.54). Second, while in task set 1 only PolSides significantly increased bias-awareness, it could not benefit as much as the PFA approach from respondents' learning effects (effect increase from task set 1 to 2: MFA, 18.91 pp; PolSides, 1.99 pp).

Table 6.2 Effects on bias-awareness after overview exposure in the first task set
Table 6.3 Effects on bias-awareness after overview exposure in the second task set

These differences in effects across the two task sets are also in line with the previously mentioned popularity and intuition (cause 1). Specifically, since the PFA approach is novel and its explanations are somewhat "technical," as respondents reported, we can expect both PFA's lack of effects in task set 1 and the learning effect throughout the study. Simultaneously, for the well-known and easy-to-understand left-right dichotomy employed by PolSides, we can expect significant effects from the beginning and only a slight increase of PolSides's effects throughout the study. These findings are also in line with the framing groups' substantiality discussed afterward (cause 3).

(3) Substantiality of Framing Groups

We qualitatively analyzed the groups yielded by the individual grouping methods. We found that all methods, including PolSides, determined meaningful groups for most topics. For example, in the gun control topic, the groups yielded by any grouping method resemble the frames "gun control" and "gun rights," with subtle differences between the groups. Table 6.4 gives an overview of the frames we inductively identified when reading the headlines and articles of each group. Note that while the frames were often already apparent in the headlines shown in the table, in some cases, the groups' underlying frames emerged more clearly from reading the lead paragraph. Grouping-MFA yielded two "gun control" frames (one was argumentative; the other used factual language) and one "gun rights" frame (focusing on cruelty and the shooter). Also, PolSides and Grouping-ALL each yielded two "gun control" frames and one "gun rights" frame. Here, the "gun rights" frame determined by Grouping-ALL focused not on the shooter but on the victims and their right to defend themselves. In sum, all methods yielded framing groups of articles representing meaningful frames present in the coverage.

Table 6.4 Frames in the gun control event that we inductively identified when qualitatively analyzing the articles of individual framing groups determined by the grouping methods

However, for some topics, the framing groups determined by MFA and ALL seemed to be more substantial than those of PolSides. This finding is intuitive since PolSides determines the groups through the political orientation of the articles' outlets and thus is content-agnostic. In contrast, PFA analyzes an article's polarity toward individual persons, i.e., it uses in-text features for bias identification. Our respondent sample is too small to show significant effects when sub-grouping for topics. However, the debt ceiling topic employed in our pre-studies highlights this methodological difference. As shown in Table 6.5, MFA yielded coherent groups that framed the deal positively by focusing on positive effects for the economy (frame 1) and countries' safety through the military (frame 2) or negatively, e.g., as political hypocrisy (frame 3). In contrast, PolSides's groups represent rather superficial frames, despite the topic's assumed left-right polarization. Specifically, all of PolSides's groups frame the deal positively: two groups frame the issue highly similarly, while the third focuses on the overall implications of the deal. The issue of non-substantial frames is also present when analyzing coverage on events where the political left and right do not have distinct positions, as the "bushfire" event discussed below shows.

Table 6.5 Frames in the debt ceiling event that we inductively identified when qualitatively analyzing the articles of individual framing groups determined by the grouping methods

Of course, the reliability of our inductive frame analysis is lower than that of a deductive frame analysis performed by multiple annotators with high inter-annotator reliability. However, our qualitative findings concerning the substantiality are in line with the quantitative effects measured in our study. The effects in the single-blind setting show that PFA increased bias-awareness more strongly than PolSides, both when looking at the overall study and at the second task set. Since all approaches "look" identical in the single-blind setting, effect differences can only be explained by the framing groups the approaches determined. That PolSides achieved the lowest increase in bias-awareness is thus intuitive. Moreover, Grouping-ALL achieved a larger increase than Grouping-MFA. This finding is also intuitive: we expect Grouping-ALL to be more reliable since it uses more features (all persons) to determine framing groups, whereas Grouping-MFA uses only the MFA's polarity.

None of the tested approaches yielded consistently meaningful framing groups and frames in coverage on the "bushfire" event. This finding is expected for the PFA approach, which identifies frames based on how persons are portrayed; in the "bushfire" event, however, much coverage focused on the fire's consequences for the economy or the environment. The finding is also intuitive for the PolSides approach since the political left and right do not have distinct positions on the "bushfire" event.

Criticism and Comments by Respondents

Lastly, we explored the comments respondents provided in the post-study questionnaire to identify issues and other trends. Table 6.6 shows a summary of the identified trends. Note that these qualitative findings are not representative but can serve to get a preliminary understanding of the advantages and issues respondents noticed.

Table 6.6 Non-representative, qualitative trends concerning advantages and issues of the overviews reported by respondents. The columns PolSides and MFA (Grouping-MFA) include the specific and generic visualization variants. The column ALL refers to the generic visualization variant employing the Grouping-ALL mechanism. All figures are in percent

Overall, 90% of the respondents who were exposed at least once to any bias-sensitive overview, including PolSides, explicitly mentioned finding the overview helpful for critically reviewing news coverage (“It also allows me to decide where I stand and learn new beliefs as I think that is important to see all sides of an article and to be able to understand it better” and “I like a trimmed down view that allows readers to efficiently compare articles.”), compared to only 30% of those exposed to the Plain overview. For the MFA overview, 20% reported that they did not agree with the classification. However, 60% of respondents exposed to MFA reported that they liked quickly knowing the stance of an article, e.g., “I like that it gave you an idea on what stance the article had, whether it was pro, contra, or ambivalent.” and “I liked that I could see different perspectives without having to go to different sources. I haven’t seen anything much like that in any other app or online news sites. I think that this helps encourage critical thinking.” The qualitative findings are further in line with the previously identified three causes for effect differences across the approaches. For example, the better substantiality of PFA’s frames is reflected in the trend “Lack of substantiality or balance,” which was 20% for PolSides but only 5% for Grouping-MFA and Grouping-ALL. Further, the technicality of PFA’s explanations and the simplicity of the left-right dichotomy are reflected in “Confusion about grouping mechanism,” which was 0%, 40%, and 10% for PolSides, Grouping-MFA, and Grouping-ALL, respectively. Note that respondents reported less confusion for Grouping-ALL, which we tested only in the generic variant with less technical explanations, than for Grouping-MFA, which we also tested with specific explanations.

Summary

Overall, our study shows that bias-sensitive news overviews significantly and strongly improved bias-awareness in news consumers compared to popular, bias-agnostic news aggregators. Both the PFA approach and the PolSides baseline achieved positive effects. The PolSides approach even achieved the strongest effect when considering both task sets (Est. = 21.34), demonstrating the practical strength of this prior approach. However, PolSides lost its effectiveness entirely when employed in a single-blind setting. In contrast, the PFA approach achieved significant, consistent, and strong effectiveness, both when employed in the single-blind setting (17.54) and otherwise (17.80). Moreover, in the second half of the study, the PFA approach was by far the most effective means to increase bias-awareness in respondents (best PFA, Est. = 28.12; PolSides, 23.54; and in the single-blind setting: best PFA, 26.46; PolSides (insignificant), 12.41).

The respondents’ comments and our effect comparison from the first to the second task set suggested that the bias-sensitive PolSides baseline initially benefited from its well-known and easy-to-understand bias identification method. While all approaches benefited from a learning effect during the study, the PolSides baseline, perhaps because it was already well known, benefited only mildly. In contrast, the PFA approach was by far the most effective means to increase bias-awareness in the second half of the study.

Lastly, the results of our inductive frame analysis of the approaches’ framing groups strengthened the previous findings. Albeit not representative, our qualitative analysis suggested that the groups determined by PFA more consistently represented meaningful and substantial frames than those yielded by PolSides. In the conclusion of this thesis, Sect. 7.1 demonstrates the findings of our evaluation using the example news coverage that we introduced in Sect. 2.6 to illustrate the research gap.

6.7.2 Article View

In the article view, only the in-text highlights significantly increased bias-awareness, and only if between 5 and 9 of them were shown. Otherwise, the article view did not significantly increase respondents’ bias-awareness. This section discusses potential causes for this lack of significant effects.

Table 6.7 shows the overall lack of bias-awareness effects measured using the rating questions of the post-article questionnaire. There was a weak but significant increase in bias-awareness for the highly polarizing “abortion law” event compared to the baseline event “bushfire” (Est. = 3.92).

Table 6.7 Shown are the effects on respondents’ bias-awareness after article view exposure operationalized using the additive score of all rating questions in the post-article questionnaire. Column “Est.” shows the percentage increase in bias-awareness for the attributes CDCR, in-text highlights, polarity context bar, MFAP headline tags (jointly with the MFAP article indicator as described in Sect. 6.5.5), PolSides headline tags (jointly with the PolSides article indicator), and the article’s topic compared to their respective baselines, i.e., CoreNLP, disabled (four times), and bushfire. Columns “SE,” “z,” and “p” refer to the standard error, z-score, and p-value, respectively. In column “p,” asterisks represent the significance level where weakly significant (“*”), significant (“**”), and strongly significant (“***”) refer to p < 0.05, p < 0.01, and p < 0.001, respectively

Table 6.8 shows the article view’s effects when operationalizing bias-awareness using the forced-choice question. Here, in-text highlights achieved stronger but still insignificant effects, e.g., showing in-text highlights in the two-color mode yielded Est. = 5.08. Table 6.9 shows the same data but modeled with the count of in-text highlights instead of their color mode.Footnote 8 This analysis likewise yielded insignificant effects, with one exception: if the article view showed between 5 and 9 in-text highlights, respondents’ bias-awareness increased significantly (Est. = 13.43).
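To illustrate how estimates such as those in Tables 6.7–6.9 can be obtained, the following minimal sketch fits a linear model with treatment-coded attributes. All file paths and column names are hypothetical, and the study’s actual estimation procedure (e.g., model family, coding scheme, or standard errors) may differ.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical post-article responses; column names are illustrative,
# not those of the published study materials.
df = pd.read_csv("post_article_responses.csv")

# Bin the number of shown in-text highlights (cf. Table 6.9);
# "disabled" (no highlights) serves as the baseline level.
df["highlight_bin"] = pd.cut(
    df["num_highlights"],
    bins=[-1, 0, 4, 9, 1000],
    labels=["disabled", "1-4", "5-9", "10+"],
)

# Linear model of the bias-awareness score on the highlight bins and
# the article's topic, with "bushfire" as the baseline topic.
model = smf.ols(
    "bias_awareness ~ C(highlight_bin, Treatment('disabled'))"
    " + C(topic, Treatment('bushfire'))",
    data=df,
).fit()

# Coefficient table with estimates, standard errors, test statistics,
# and p-values, analogous to the columns Est., SE, z, and p.
print(model.summary())
```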

Table 6.8 Effects on respondents’ bias-awareness after article view exposure operationalized using the forced-choice question and distinguishing the color modes for in-text highlights
Table 6.9 Effects on respondents’ bias-awareness after article view exposure operationalized using the forced-choice question and distinguishing the count of in-text highlights

We were also interested in whether showing an overview before a single article affected bias-awareness in the article view compared to showing none. However, likely due to the issues discussed in the following, including a too-small respondent sample, we cannot answer this question conclusively. Specifically, showing an overview before the individual news articles had inconclusive effects. We found a mild but significant effect for only the MFA overview (Est. = 4.64, p = 0.003). Other overviews had no significant effects compared to not showing an overview.

Discussion of the Article View and Its Results

We discuss four factors, which are partially related to one another, that may explain the overall lack of significant effects of the article view.

(1) Lack of Multiple Perspectives in the Article View

Bias is context-dependent and thus depends at least to some degree on relating and contrasting perspectives (Sect. 6.2.1). This underlying relativity of bias could be one reason why the overview (showing multiple articles and perspectives) achieved strong effects while the article view (showing primarily a single article) achieved no significant effects overall, despite both employing the same techniques for bias identification. Most components of the article view communicate information about the given article. Only the headline tags and the polarity context bar allow users to contrast articles to some degree.

(2) User Experience (UX) Issues

Besides the previously mentioned potential conceptual issue of the article view, we identified various infrequent UX issues in respondents’ feedback that we had not noticed in previous tests, including the pre-studies. Table 6.10 shows a summary of the identified, non-representative trends.Footnote 9 For example, respondents reported that there were too many in-text highlights (13% of respondents, e.g., they felt “overwhelmed”) or that, in their opinion, relevant mentions were missing (17%). This is in line with the lack of significant effects of these attributes (see, e.g., Table 6.9).

Table 6.10 Non-representative, qualitative trends concerning article view issues as reported by respondents. All figures in percent

Overall, respondents did not criticize the polarity context bar, but we noticed two UX issues with it. In some cases, the bar placed the circles of multiple articles at the same position. In these cases, respondents could not see and compare the overlapping articles, rendering the bar’s comparison functionality ineffective. The issue occurred mainly when Grouping-MFA was used because this grouping relies on a single feature, i.e., an article’s polarity toward the MFA. Despite the MFA being the most frequently mentioned person across all articles, a few articles did not mention the MFA often or at all. Such articles had an increased likelihood of being placed at the same position in the polarity context bar. Further, some respondents might not have been aware of the hover functionality to view the articles’ headlines. Those respondents only saw each article’s polarity toward the event’s MFA. As a consequence, they could not contrast the articles’ headlines, which typically allow for a first understanding of an article’s main slant and content.

(3) Inaccurate In-Text Highlights

While also a UX issue, we discuss inaccurate in-text highlights separately since their root cause lies in PFA’s methods. The methods employed by PFA yield incorrect results in some cases despite their technically high classification performances (coreference resolution employed in target concept analysis, F1m = 88.7; target-dependent sentiment classification employed in frame analysis, F1m = 83.1). The consistent effects achieved by the overview visualizations demonstrated that the methods’ classification performances are sufficiently high to reliably classify biases at the article level. At the same time, however, users were likely much more sensitive to misclassified in-text highlights when viewing them individually in the article view (20% of respondents reported that the highlights were inaccurate; see Table 6.10). Figure 6.12 shows an example of in-text highlights with misclassified sentiment polarity.

Fig. 6.12 Screenshot of misclassified sentiment shown by in-text highlights (second sentence: “Morrison […]”)

(4) Study Design and Sample Size

In typical news consumption, newsreaders actively choose which articles they want to read.Footnote 10 However, to adhere to the conjoint requirements, we had to present randomly chosen articles to them. This difference is not harmful per se, since the conjoint design also assumes that the effects of respondents’ mixed interest in articles would cancel each other out given a large enough sample. The lack of significant effects shown in Table 6.7, however, might indicate that the conjoint design is not optimal for evaluating the article view’s effectiveness, or that the sample size was too small given the diversity of articles and how respondents interacted with them.

Summary

In contrast to the news overviews, the article view only increased respondents’ bias-awareness if between 5 and 9 in-text highlights were shown (Est. = 13.43). Figure 6.13 depicts an example of such in-text highlights using the two-color mode. Respondents’ comments also indicated the principal usefulness of the visualization, e.g., “it [the article view and its in-text highlights] got me to think about the content of the article and the parts of the story it chose to focus on.” Lastly, the results suggested that, for the article view showing only a single perspective, the forced-choice question operationalizes bias-awareness better than the rating questions.

Fig. 6.13 Screenshot of automatically classified sentiment in the article view’s in-text highlights

Besides the in-text highlights, the other elements of the article view did not increase bias-awareness. In our view, the most likely reasons for the overall lack of effects include the small sample of respondents and events, minor UX issues, and not showcasing multiple perspectives in the article view. In Sect. 6.8, we propose ideas to address the previously discussed issues.

6.7.3 Other Findings

Our study showed no significant effects of respondents’ demographic and background factors on bias-awareness measured after overview or article view exposure. This finding contrasts with prior studies, which indicate the influence of people’s political orientation [20, 148], education, age, and other factors [183]. As for the article view, one likely reason for the lack of significant effects in our study is the small respondent sample. Further, the distributions of most demographic and background attributes were imbalanced in our respondent sample (Sect. 6.7). While the distributions in our sample very roughly approximated those of the US population (cf. [183]), some attribute levels, such as in education, occurred too infrequently in our sample, even after dividing them into bins.

Where the resulting bins were sufficiently large, however, we found significant effects for some sub-groups, bins, and questions. For example, Table 6.11 shows the relative effects of a selection of respondents’ demographic and other background attributes measured on the fourth post-overview question (“Overall, how do you think the coverage in the overview’s articles compares to each other [very opposing ↔ very agreeing]”; see Sect. 6.5.6). Respondents with the lowest education level were significantly less aware of differences in the articles and perspectives compared to the baseline of respondents with a bachelor’s degree (Est. = −15.49). However, Table 6.11 shows no effects for other education levels, and there were likewise only mixed effects for other post-overview and post-article questions. Thus, given the lack of effects similarly expected for other attributes and levels, the study’s findings concerning respondents’ demographic and background factors are only exploratory.

Table 6.11 Effects of respondents’ demographic and background attributes on post-overview question Q4

Another finding was related to the hostile media effect (Sect. 2.2.3). The sub-group of respondents who generally perceived news as rather biased against their viewsFootnote 11 particularly benefited from seeing a bias-sensitive news overview with Grouping-MFA (Est. = 2.86, p = 0.002).

Our conjoint evaluation showed only a mildly positive, insignificant effect of using context-driven cross-document coreference resolution compared to the CoreNLP-based baseline in the overview (Table 6.1) and mixed, insignificant effects in the article view (Tables 6.7 and 6.8). In our view, potential reasons for the lack of an effect are as follows. First, the overview relies on aggregated polarity information rather than individual mentions. Table 6.1 suggests that both the CoreNLP-based baseline and our method suffice to find and resolve the mentions needed to represent the articles’ overall slant toward the most frequently mentioned persons. Second, while highlighting the polarity of individual mentions in the article view directly relies on high-quality individual mentions, the article view suffers from the UX issues discussed in Sect. 6.7.2. Once these UX issues are addressed, we expect that a larger sample size will yield a statistically significant effect of in-text highlights when using our coreference resolution, due to its higher performance (F1m = 88.7 compared to 81.9; see Sect. 4.3.6).

Summary

Our study showed no conclusive, significant effects of respondents’ attributes, such as their political orientation and education level. This finding contrasts with prior studies. We think that the main reason for this lack of expected influence is the small respondent sample size. While sufficient to show consistent, strong, and significant effects across the news overviews, the sample size is too small for the relatively fine-grained attributes we queried in the background questionnaire. Moreover, the distributions of some of the respondents’ attributes are imbalanced since they roughly approximate the distributions of the US population. In conjunction with the small sample size, some attribute levels occur too infrequently to yield statistically sound results.

6.8 Future Work

We present ideas for future work to address limitations and issues concerning our approach and its evaluation. Specifically, we first discuss the limitations of our study design and the generalizability of the results. Afterward, we discuss the technical issues identified in our study.

Generalizability and Study Design

In our view, the main limitations of our experiments and results concern their representativeness and generalizability, mainly due to three partially related factors.

(1) Study Design

For example, respondents had to view given events and articles rather than deciding what to read. An interactive design and a long-term observational study might more closely resemble real-world news consumption and address further issues that we did not explicitly notice but may have faced in our experiments, such as study fatigue. While our study’s duration is well below the duration at which one would expect study fatigue [313], users on MTurk often work long hours on many online tasks. Moreover, rather than querying respondents for the subjective concept of bias-awareness, a long-term observational study could directly measure the approaches’ effects on news consumption, for example, whether respondents read more articles portraying events from different perspectives when using bias-sensitive visualizations [276].

(2) Respondent Sample

While our sample roughly approximates the US distribution concerning dimensions important to this study, such as political orientation [282, 285], the sample contains selection biases, e.g., because we recruited respondents on only one platform and only from the USA. Thus, we cannot generalize our findings to news consumers in countries with systematically different political or media landscapes. For example, while the two-party system in the USA has been shown to lead to more polarizing news coverage, countries with multi-party systems typically have more diversified media landscapes [395].

Further, we propose to increase the respondent sample to address the inconclusive or insignificant effects of the article view, of respondents’ attributes, and of sub-group analyses. Our sample size is larger than suggested by Cochran’s formula [58, 385].Footnote 12 However, at the same time, the number of respondents was often too small in sub-groups or when analyzing respondents’ demographic or other imbalanced attributes. Increasing the respondent sample would improve overall statistical soundness and allow for sub-grouping while retaining statistical significance.
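For reference, a standard form of Cochran’s formula estimates the minimum sample size n₀ from the z-score of the desired confidence level, the estimated proportion p, and the margin of error e; the parameter values below are a common illustrative choice, not those used for this study:

```latex
n_0 = \frac{z^2 \, p\,(1-p)}{e^2}
\qquad\text{e.g.}\qquad
n_0 = \frac{1.96^2 \cdot 0.5 \cdot (1-0.5)}{0.05^2} \approx 385
```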

(3) Event and Article Sample

Our event and article sample yielded limitations similar to those described previously for the respondent sample, due to the event sample’s small size and systematic creation. We thus propose increasing the number of events and articles per event. If both the event sample and the respondent sample are increased, we could also use a random sample of events, which would reduce selection biases compared to a systematically selected sample. Lastly, our study did not relate bias-awareness to the articles’ content, but only to our approach, respondents’ attributes, and an event’s expected degree of polarization as an approximation of the articles’ amount of bias. To more precisely measure how articles’ content and inherent biases influence bias-awareness, we propose conducting a manual frame analysis and relating the identified frames to changes in bias-awareness.

The results of a manual frame analysis would also enable another way to evaluate bias identification approaches. Specifically, comparing the frames identified manually to those predicted automatically would allow for assessing the overall accuracy of an automated approach (Sect. 2.4). This research direction could even go as far as compiling benchmark collections of datasets from content analyses and frame analyses concerning media bias. Similar to the GLUE collection [373] and other benchmarks, such collections could be used to evaluate and directly compare the framing detection performance of individual approaches. Moreover, such benchmarks would also improve the representativeness and generalizability of the approaches’ evaluations.

Technical Future Work Ideas

An idea that could benefit the effectiveness and UX of both the overview and the article view is to extract and show a distinct summary for each framing group. As a substitute for such a summary, the prototype currently shows the headline of a group’s most representative article, allowing users to get an overall impression of that article’s content and framing. However, a headline does not necessarily provide a concise summary of an article, and even less so of the framing group that the article represents (Sect. 2.6). For example, one respondent reported that “[the] Overviews weren’t long enough to get full scope of agreed-facts.”

Besides the previously mentioned small respondent sample, other potential reasons for the lack of conclusive effects of the article view (except for the in-text highlights) are rather technical. Addressing the view’s UX issues is relatively straightforward. However, we think another reason for the lack of effects is the relativity of bias. On the one hand, our and other researchers’ definitions of bias imply that bias requires contrasting information. On the other hand, the article view does not allow such comparison since it primarily shows a single article (with visual cues adding bias information). To address this conceptual issue, another line of research would need to be investigated: how can users be enabled to efficiently contrast individual “facts” presented in an article with matched facts from other articles (cf. [276])? We had excluded this idea when devising article visualizations for the study for various reasons. Most importantly, we think a visualization allowing for contrasting facts contradicts our ease-of-use objective (Sect. 6.4). For example, showing individual facts and alternative presentations of the same facts taken from other articles would increase the complexity of the article view. Further, this line of research requires the development of methods for news-specific semantic text similarity (Sect. 2.3.2). We have already taken first steps toward this idea by proposing an exploratory system and visualization to explore how news articles use and reuse information from other sources [77, 139]. However, the preliminary evaluations of our exploratory approaches indicated the expected difficulty of this task.
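To make the fact-matching idea concrete, the following minimal sketch matches sentences across two articles via a general-purpose sentence embedding model. The model name and the example sentences are assumptions for illustration; the news-specific similarity methods this line of research calls for would go beyond such off-the-shelf embeddings.

```python
# Minimal sketch: match "facts" (here approximated as sentences) across
# two articles by cosine similarity of sentence embeddings.
from sentence_transformers import SentenceTransformer, util

# General-purpose model chosen for illustration only.
model = SentenceTransformer("all-MiniLM-L6-v2")

article_a = [
    "The deal averts a default on US debt.",
    "Critics call the agreement political hypocrisy.",
]
article_b = [
    "Lawmakers reached an agreement to avoid a default.",
    "The bill increases military spending.",
]

emb_a = model.encode(article_a, convert_to_tensor=True)
emb_b = model.encode(article_b, convert_to_tensor=True)

# For each sentence in article A, pick the most similar sentence in
# article B as its candidate matched fact.
sim = util.cos_sim(emb_a, emb_b)
for i, row in enumerate(sim):
    j = int(row.argmax())
    print(f"A: {article_a[i]}")
    print(f"B: {article_b[j]} (similarity: {float(row[j]):.2f})\n")
```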

To identify meaningful framing groups in non-person-centric coverage, such as on the “bushfire” event, we propose extending our analysis to further semantic concept types, such as groups of persons, countries, and objects. Our method for target concept analysis is already capable of resolving these types, so we would only need to extend the target-dependent sentiment classification method. Instead of analyzing framing effects, an automated approach could also investigate topic-independent frames or their derivatives (Sect. 5.1). Another idea to improve the distinctiveness of the resulting framing groups is to use a dynamic clustering technique in the frame analysis component, such as affinity propagation. In our study, we fixed the number of framing groups to three using k-means to allow for a direct comparison with the three groups determined by PolSides. However, allowing for a variable number of groups would let the groups determined by PFA more closely match the characteristics of the data, as the sketch below illustrates.
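The following minimal sketch contrasts the fixed-k clustering used in the study with a dynamic alternative. The feature matrix is synthetic and merely stands in for article representations such as per-person polarity scores; it is not the study’s actual data or pipeline.

```python
# Sketch contrasting fixed-k clustering with a dynamic alternative for
# forming framing groups. X is a hypothetical article-by-feature matrix,
# e.g., each row holding an article's polarity scores toward the
# mentioned persons (as in Grouping-ALL).
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))  # 30 articles, 5 person-polarity features

# Fixed number of groups (three, matching the PolSides comparison).
fixed_groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dynamic number of groups determined by the data itself.
dynamic_groups = AffinityPropagation(random_state=0).fit_predict(X)

print("k-means group sizes:", np.bincount(fixed_groups))
print("affinity propagation found", len(set(dynamic_groups)), "groups")
```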

Outlook

The effectiveness of revealing biases in news articles raises crucial questions beyond the traditional scope of computer science. For example, one respondent of our study asked, “How do you trust the ratings/categorization?” Of course, communicating how the individual “rating,” i.e., perspective, was derived is crucial, and we aimed to do so using the explanations added to all visualizations. However, the underlying issue is much more complex and vital. Who is to decide which sources and articles should be contained in the input set analyzed by automated approaches? Should extreme or alternative outlets be included in the analysis and visualization to increase the number of distinct perspectives? We think that answering these questions through further interdisciplinary research is a crucial prerequisite before using automated bias identification methods at scale in real-world news consumption.

6.9 Key Findings

This chapter presented the first system to identify person-oriented frames and reveal corresponding framing groups of articles in political event coverage. Earlier, such frames could reliably be identified only using high-effort methods, such as frame analysis as conducted in social science research or media literacy practices. We demonstrated the effectiveness of our person-oriented framing analysis (PFA) approach in a large-scale user study.

In the study’s single-blind setting, we found that our approach and overview most strongly, significantly, and consistently increased respondents’ bias-awareness. In particular, the PFA approach found biases that were indeed present in news articles reporting on person-centric events. In contrast, the tested prior work rather facilitated the visibility of potential biases, e.g., by distinguishing between left- and right-wing outlets. Other prior approaches suffer from analyzing single or shallow features (Sect. 6.2). Using such simple techniques can result in superficial or meaningless framing groups, as noted by multiple respondents, e.g., “I am not sure I agree (others, too) with many of the ‘left/center/right’ designations of the sources, as many were quite ambiguous and/or the topics at hand are not always agreed or disagreed upon concretely by members of the same political party.” In sum, by using our methods for context-driven cross-document coreference resolution, target-dependent sentiment classification, and frame clustering on all persons mentioned in person-centric coverage, the PFA approach reliably identified frames and in turn achieved the strongest effectiveness.

Further, we discussed the limitations of our study, for example, regarding the findings’ generalizability. Reasons include selection bias, e.g., the respondent sample consisted only of people located in the USA, and the event sample only of 30 articles in 3 news events. We propose to address these limitations by increasing and diversifying both samples. Another promising idea for future work is to adapt the study design to more closely resemble daily news consumption. Here, we propose to conduct a long-term study to observe how revealing biases affects news consumption directly.

The most substantial technical improvement would be to enable the PFA approach to identify meaningful framing groups also in non-person-centric coverage. To achieve this, we propose extending PFA to analyze other concept types, such as groups of persons and countries. Since the other methods in PFA are already capable of analyzing these types, only the target-dependent sentiment classification method would need to be extended. Further, to address the overall lack of conclusive and significant effects of the article view, we propose investigating how the relative concept of media bias can be communicated more effectively by the article view while maintaining ease of use.

6.10 Summary of the Chapter

The work described in this chapter allows us to answer questions raised in earlier chapters. Further, we are now able to investigate the effectiveness of the individual components concerning the overall research question.

How does our method for coreference resolution employed in the target concept analysis component contribute to the overall success of our system?

Our conjoint evaluation showed only a mildly positive, insignificant effect of using context-driven cross-document coreference resolution compared to the CoreNLP-based baseline (Tables 6.1 and 6.8). However, we expect to see increased effectiveness after addressing the identified UX issues in visualization components that directly rely on high-quality individual mentions, such as the in-text highlights. Further, when extending PFA in the future to also analyze concept types other than individual persons, our coreference resolution can be used directly since it is already capable of resolving these types, partially with much higher performance than prior methods (Sect. 4.3.4.3).

How does our target-dependent sentiment method employed in the frame analysis component contribute to the overall success of our system? More specifically (as hypothesized in Sect. 3.3.2), does our research objective’s focus on only person-targeting forms of bias suffice to address the overall research question, which seeks to effectively reveal substantial biases in daily news consumption? Likewise, does analyzing framing effects on a one-dimensional polarity scale instead of, for example, nuanced political frames suffice to tackle our overall objective (as questioned in Sects. 2.3.4 and 5.2.1)?

We found that focusing the analysis on persons, specifically on person-oriented polarity, generally suffices to effectively identify substantial perspectives present in coverage on policy issues. In the study, our approach achieved overall high effectiveness. Once respondents got used to the concept of bias or to our approach, i.e., in the second task set, the PFA approach led to the strongest increase in bias-awareness. Additionally, the qualitative investigation of all approaches’ resulting framing groups suggested that our approach found meaningful person-oriented frames in the analyzed articles. In contrast, prior approaches only facilitate the visibility of potential perspectives, as we also demonstrated for one approach in our study and in the practical demonstration of the research gap in Sect. 2.6.

However, PFA’s effectiveness is limited by characteristics of the analyzed news coverage, especially whether news articles report on persons. Whereas the PFA approach increased bias-awareness if news coverage focused on the persons involved in an event, we found that it yielded groups of articles representing indistinct frames when applied to the “bushfire” topic. In this topic, news articles reported less on individual persons. Instead, the news articles primarily reported on the consequences for society, economy, and nature. The low performance for such non-person-oriented topics is expected due to the design of the approach, which relies fundamentally on mentions of persons (Sect. 3.3). Fully elucidating this question requires a larger and more diverse event sample.

One idea to detect meaningful frames and respective framing groups in non-person-oriented topics is to extend our target-dependent sentiment classification to classify the sentiment of additional target types, such as groups of persons, countries, and objects. Our method for cross-document coreference resolution is already capable of resolving such types (see Sect. 4.3.4). Another line of research, which resembles frame analyses used in the social sciences to systematically analyze media bias, is to classify frames or their more practice-driven derivatives, such as micro frames [193], our frame properties (see Sect. 5.2), or frame types [45, 46]. However, each of them has its own limitations and challenges, typically requiring high annotation effort as we discussed in Sect. 3.3.2.