3.1 Introduction

The previous chapters showed that media bias and its extreme form, fake news, are pressing issues that can influence how, and even whether, societies make decisions. Only rarely are news consumers aware of biases in the news. Revealing biases as such, for example by showing the frames present in coverage on the same topic, can help news consumers become aware of bias and make more informed decisions. This idea is at the heart of the research presented in this thesis.

Automatically Identifying Framing in News Articles to Reveal Bias

As the literature review in Chap. 2 showed, media bias is highly complex and—albeit often analyzed in computer science as a single, rather broadly defined concept—consists of a broad spectrum of forms. Many of these forms are rather subtle and difficult to identify. During daily news consumption, sophisticated critical assessment of news coverage is nearly impossible unless one is trained to recognize such bias forms and can afford to invest significant time, e.g., for researching facts and contrasting coverage. In sum, identifying framing or media bias using, for example, media literacy practices and social science frame analyses requires in-depth expertise and time-consuming work. Automating these effortful but effective means to enable bias-sensitive news consumption is the objective of this thesis.

The current chapter proposes person-oriented framing analysis (PFA) to address our research question. In the following, we first provide a definition of media bias (Sect. 3.2). Then, we discuss the solution space to tackle media bias (Sect. 3.3). Using the findings of these discussions, we propose the PFA approach to reveal biases in news articles (Sect. 3.4). Lastly, we introduce a side contribution of this thesis, a system for news crawling and extraction (Sect. 3.5). The system collects news articles from online news outlets. The news extractor can be used before PFA to gather articles for analysis and has also demonstrated its usefulness in other use cases throughout the research described in this thesis.

After this chapter, Chaps. 4 and 5 introduce the individual methods that constitute PFA, and Chap. 6 then introduces our prototype and evaluation to demonstrate the effectiveness of the PFA approach.

3.2 Definition of Media Bias

As our literature review shows, many definitions of media bias exist (Sect. 2.2.1). In sum, researchers in the social sciences have proposed various task-specific and in part overlapping or disagreeing definitions of bias. In contrast, automated approaches typically treat bias as a single, holistic, or superficially defined concept. In the remainder of this thesis, we use a definition of bias reflecting the shared, conceptual understanding established by our literature review.

Definition of bias

We define bias as the effect of framing, i.e., the promotion of “a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation” [79], that arises from one or more of the bias forms defined by the news production process.

Bias can exist on various levels due to various means. For example, individual sentences and news articles can be slanted toward a specific attitude due to word choice and labeling, source selection, and other means of bias defined by the news production process (Sect. 2.2.3).

When comparing our definition to the various bias definitions devised in the social sciences, we identify the following commonalities and differences. First, our definition entails both intentional and unintentional bias. Some social science studies distinguish whether bias is intentionally implemented or unintentionally “exists” (cf. [327, 382]). Second, to allow for timely identification of bias, our definition allows bias to emerge from single incidents. In contrast, researchers in the social sciences analyze bias as a systematic tendency, i.e., an effect of multiple observations on extended time frames, since they are typically interested in the effects or implications of (biased) coverage, e.g., on society or policy decisions (cf. [382]).

Third, our definition is task-specific. As in social science research, which specific forms of bias are analyzed, and how, depends on the task at hand and the research question. For example, in the social sciences, researchers devise features, such as frames, which they then quantify in the data, e.g., using content analysis. In our definition, the analyzed features are frames due to specific bias forms as described later in this chapter (Sect. 3.3.2). Fourth, identical to social science research, our definition is fundamentally based on the relativity of bias. Bias can only emerge from comparing multiple pieces of information (cf. [78, 288], Sect. 2.3). Newsreaders may have the sensation of bias (whether factually founded or not) when comparing news items with one another or with their own attitudes. Social science researchers compare news articles, e.g., with one another, over a time frame, or with other information sources, such as police reports.

3.3 Discussion of the Solution Space

As our literature review shows, the spectrum of means to tackle media bias is as diverse as the complex issue of media bias itself. This section discusses questions that guide us toward a specific solution to address our research question. We summarize the findings of our literature review and discuss them in the context of our research question. Section 3.3.1 discusses when media bias can be tackled and the broad spectrum of means to tackle bias. Section 3.3.2 discusses approaches to address our research question specifically. Both discussions also strengthen the brief reasoning of our research question from Chap. 1.

3.3.1 Tackling Media Bias

Before discussing how media bias can be tackled, we need to discuss when it can be tackled.

Tackling Media Bias During News Production or After

Park et al. [276] distinguish two cases to tackle bias: during the production of news and afterward. The various means during news production aim to prevent media bias in the first place or to expose it explicitly. Such means range from setting the goal to write and publish “objective” news coverage [381] to news formats that contrast media perspectives (so-called press reviews) or that are intended to explicitly convey the journalist’s opinion, such as columns, commentaries, or reviews.

According to Park et al. [276], all such means are impractical or inefficient. For example, defining “objective” news coverage is difficult or even impossible in a meaningful way. This can, for example, trivially be seen when looking at bias by event selection. Given the myriad events happening every day, journalists have to select a tiny subset to report on. Albeit necessary, how could this event selection be objective? Even when allowing “some degree of tolerable bias” [382], the fundamental issue remains: any approach aiming at objective—or tolerably biased—coverage will fail as long as there is no objectively measurable definition of such coverage.

There are various other reasons why news is more often biased than not. Perhaps most importantly, the news is meant to put events into context and assess the events’ meaning for individuals and society (see Chap. 2). Press standards, for example, often do not set objectivity as a higher-level goal but instead fairness or human dignity [360]. News publishers also have a financial incentive to slant their coverage at least slightly toward the ideology of their target audience (see Sect. 2.2.3).

Lastly, bias-sensitive news formats, such as press reviews or commentaries, are an interesting means to tackle bias but suffer from the following issues. For example, a press review can naturally only be created after other coverage of an event has already been published. Similarly, a commentary can only serve as a source for an expert’s opinions and thoughts regarding an issue. While valuable as such, commentaries are by definition far from factual reporting. Thus, neither of these news formats can be a primary or universal form of news coverage. Instead, they can only complement up-to-date news formats, such as event coverage.

We conclude that avoiding slanted coverage or generally tackling media bias during news production seems infeasible or at least impractical. Because of the term’s vagueness, defining “objective” coverage is problematic in the first place, as is adhering to it if set as a goal for news production. Press standards and research from the social sciences suggest that biases are structurally inherent to news coverage. In principle, a diversity of opinions and slants in the news can even be considered desirable. Consequently, investigating post-production means to tackle media bias seems more suitable concerning our research question and also in general.

Post-production Means to Tackle Biases After the Production of News

We identify three conceptual categories of means to address media bias: bias analysis, bias correction, and bias communication.Footnote 1 The first category, bias analysis, includes both manual and automated approaches to identify and analyze biases. As we discuss in Sect. 2.3, to date, the most effective methods rely on costly manual analysis, e.g., systematic reading and annotation as part of content analysis. Scalable, automated approaches exist but yield less substantial or inconclusive results, especially when comparing their results to manual approaches from the social sciences. For example, they focus on quantitative properties, thereby missing the “meaning between the lines.” Or they analyze only vaguely defined instances of media bias, such as “topic diversity” [252] or “differences of [news] coverage” [278] (Sect. 2.2.4). In practical terms, automated approaches find technically significant biases, which, however, are often not meaningful or do not represent all frames of an event’s coverage.

Second, approaches for bias correction aim to identify biases and then “correct” [276] them, e.g., by removing biases or replacing biased statements with (more) neutral statements conveying the same information. The category represents a relatively young line of computer science research. Bias correction lacks an equivalent in social science research on media bias, possibly because of the previously mentioned characteristics (biases are inherent to the news and may—in principle—even be desirable to facilitate a rich diversity of opinions). The few automated approaches are mostly exploratory and yield mixed results [276]. For example, one recent approach aims to identify and then flip the slant of news headlines, e.g., from having a left stance to being right-slanted [49]. The poor results indicate the complexity of this task: 63% of the generated headlines with flipped slant were not even understandable, and in only 42% of cases could the bias be flipped while still reporting on the initial headline’s event. Other approaches rely heavily on user-provided feedback (cf. [276]). Here, the lack of a ground truth comparison can be very problematic since users bring their own biases. The two fundamental issues of current approaches for bias correction are as follows. First, defining unbiased news is practically impossible, as stated previously [276]. Second, current natural language generation methods do not suffice to reliably produce “corrected” texts from biased news texts, e.g., due to the lack of training datasets (cf. [35, 214]).

Lastly, approaches for bias communication aim to inform news consumers about biases, e.g., by showing different slants present in news coverage on a given event. Previous studies find that bias-sensitive visualizations can effectively communicate biases and help news consumers to become aware of these biases. For example, users of NewsCube’s bias-sensitive visualizations read more articles than users of a bias-agnostic baseline. Thereby, the users actively exposed themselves to more diverse perspectives because many articles conveyed perspectives not aligning with the individual users’ ideology. In sum, the users of bias-sensitive visualizations developed “more balanced views” [276] on the news events. Similarly, the evaluation of our matrix-based news aggregation finds that users exposed to bias-sensitive visualizations more effectively and more efficiently became aware of the various perspectives present in the news coverage [128]. Besides such academic efforts, other approaches exist to communicate biases during everyday news consumption. For example, AllSides is a bias-aware news aggregator that shows for each topic one article from a left-wing, center, and right-wing news outlet, respectively [8]. In contrast to popular news aggregators, this approach facilitates showing diverse perspectives.

In addition to this confirmed effectiveness, studies concerned with bias communication found positive effects on individuals and society. For example, bias communication supports news consumers in making more informed choices, e.g., in elections [22]. However, communicating biases effectively still suffers from the high effort of manual bias identification techniques or the superficial results yielded by automated approaches.

In sum, we conclude that devising a post-production approach for bias identification (category “bias analysis”) and subsequent communication (“bias communication”) is the most promising research direction to address the issues caused by media bias. The previous discussion thus also strengthens the brief reasoning for our research question described in Chap. 1.

3.3.2 Addressing Our Research Question

One key finding of our literature review is that interdisciplinary research can improve the effectiveness and efficiency of prior work conducted separately in each discipline (see Sect. 2.5). The largely manual methods from the social sciences are effective, e.g., they yield substantial results. However, they are far less efficient than automated approaches. Simultaneously, while automated approaches are highly efficient, they are often not as effective as methods employed in social science research. One reason for the often superficial or inconclusive results is the discrepancy in how bias is defined and analyzed in automated approaches compared to the practice-proven models established in the social sciences.

We aim to combine the relevant methodologies of both disciplines in this thesis. In particular, we propose an automated approach that roughly resembles the manual process of frame analysis established in bias research in the social sciences. By following social science methodology, we can address the previously mentioned discrepancy that is a fundamental cause of the comparably low performance of automated approaches. Automated approaches prevalently analyze media bias as a single holistic or vaguely defined concept (Sect. 1.2). In contrast, the news production process (Sect. 2.2.3) defines nine strongly differing forms of media bias, each with different causes and each affecting different objects. Not distinguishing the individual forms and instead analyzing just one holistic “bias” must lead to superficial, meaningless, or inconclusive results.

So, instead of analyzing vaguely defined biases, such as “subtle differences” [210], we seek to identify meaningful frames in order to reveal biases. Following our definition of media bias (Sect. 3.2), we seek to identify substantial frames by analyzing specific forms defined by the news production process described in Sect. 2.2.3.

When selecting which forms to identify, we need to balance two goals: representativeness, i.e., covering a broad range of bias forms, and low cost, i.e., covering only a few forms. Our literature review shows that in-depth analysis of individual forms requires exacting effort to achieve substantial and reliable results. Of course, an automated approach would spare much or all of the repetitive effort caused by manual analyses. However, devising a reliable approach for analyzing a particular form still entails high initial cost. For example, because of the forms’ differing characteristics, individual methods would need to be devised for each form, each also requiring the creation of a sufficiently large, high-quality dataset for training or at least testing. Thus, on the one hand, focusing on a subset of bias forms seems more feasible than devising analysis methods for all forms individually. On the other hand, focusing on too few forms may cause the approach to miss relevant means of bias in a given news article or coverage. Thus, we aim to cover a sufficiently large set of impactful bias forms while maintaining high specificity and effectiveness by focusing on a set as small as possible. We expect that a well-balanced trade-off between both goals will allow us to identify substantial and meaningful frames.

We propose to identify a fundamental effect resulting from multiple bias forms emerging at the text level rather than analyzing them individually: effects of person-targeting framing, i.e., how individual persons are portrayed in the news. Person-targeting framing yields person-oriented frames, which roughly resemble the political frames proposed by [79] and used in our definition of bias (Sect. 3.2). However, our person-oriented frames are somewhat exploratory, e.g., implicitly defined and loosely structured. Person-oriented frames emerge, in particular, from the following bias form.Footnote 2

  • Word choice and labeling: how the word choice affects the perception of individual persons, e.g., due to how a text describes a person, actions performed by the person, or causes of these actions.

More indirectly, person-oriented frames also emerge from the following two forms of media bias.

  • Source selection: which sources are used when writing a news article and how their content and language affect the perception of individual persons (see Sect. 2.3.2).

  • Commission and omission of information: which information, such as actions and causes thereof, is included in the article (or left out) from these sources and how this affects the portrayal of individual persons (see Sect. 2.3.3).

3.3.3 Research Objective

As a conclusion of the discussion in Sect. 3.3, we define the following research objective, which we seek to address in this thesis:

Devise an approach to reveal substantial biases in English news articles reporting on a given political event by automatically identifying text-based, person-oriented frames and then communicating them to non-expert news consumers. Implement and evaluate the approach and its methods.

Focusing our research objective on person-targeting framing logically misses biases not related to individual persons. However, persons are especially important in news articles reporting on policy topics, e.g., because decisions are made by politicians and affect individuals in society. Further, according to the news production process (Sect. 2.2.3), the three bias forms jointly represent all means on the text level to directly affect the perception of persons. Thus, we hypothesize that focusing our research on the identification of person-targeting framing has high potential to effectively identify and communicate a significant share of the biases real-world news coverage consists of. We investigate this hypothesis in our prototype evaluation (Sect. 6.7).

3.4 Overview of the Approach

We propose person-oriented framing analysis (PFA), an approach to reveal biases by identifying and communicating how persons are portrayed in individual news articles. The PFA approach identifies person-targeting forms of bias, most importantly word choice and labeling, source selection, and commission and omission of information.

This section gives a brief conceptual overview of the analysis workflow and the individual methods. Chapters 4 and 5 then detail the respective methods and evaluate them individually. Chapter 6 introduces our prototype system that integrates the individual methods and subsequently reveals media bias to news consumers. Chapter 6 also presents our large-scale user study findings, demonstrating the effectiveness of the PFA approach in increasing bias-awareness in non-expert news consumers.

Our analysis seeks to find articles that similarly frame the persons involved in given political event coverage. As shown in Fig. 3.1, the analysis consists of three components: preprocessing, target concept analysis, and frame analysis. The analysis takes as input a set of articles reporting on the same political event and first performs natural language preprocessing. Second, target concept analysis aims to find which persons occur in the event and identify each person’s mentions across all news articles. Third, automated frame analysis aims to identify how each article portrays the individual persons, both at the article and sentence levels. Afterward, frame analysis clusters those articles that similarly portray the persons involved in the event. The output of our analysis is thus the set of news articles enriched with:

  • the set of persons that occur in the news coverage on the event,

    Fig. 3.1 Shown is the three-component analysis workflow as it preprocesses news articles, extracts and resolves phrases referring to the same persons, and groups articles reporting similarly on these persons. Afterward, users can view the analysis results using our visualizations. Adapted from [123]

  • for each such person, all of its mentions resolved across the set of articles,

  • for each such mention, weighted framing categories representing how the local context of the mention portrays the person,

  • further information derived from the former types of information, including groups of articles with similar perspectives.
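The enriched output described above can be thought of as a nested data structure. The following sketch illustrates one possible shape of that structure; the class and field names are purely illustrative and not the thesis’ actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Mention:
    """One phrase in one article referring to a person (names are hypothetical)."""
    article_id: int
    sentence: int
    text: str                                            # e.g., "the former president"
    framing_weights: dict = field(default_factory=dict)  # framing category -> weight

@dataclass
class Person:
    """A person occurring in the event coverage, with all resolved mentions."""
    name: str
    mentions: list = field(default_factory=list)

@dataclass
class AnalysisResult:
    """The enriched output: persons plus groups of similarly slanted articles."""
    persons: list         # list[Person]
    article_groups: list  # e.g., [[0, 2], [1]] -> article indices per framing group
```

For example, a single mention with a dominant negative framing category would carry a `framing_weights` entry such as `{"negative": 0.8}`, and `article_groups` would list which articles ended up in the same framing group.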

In the following, we briefly present each component involved in our analysis workflow. The subsequent chapters of this thesis then provide methodological details.

The input to PFA is a set of news articles written in English reporting on a single event related to politics. We define an event as something that happens at a specific and, more importantly, single point in time, typically at a single (geographic) location (cf. [201]). In contrast, we refer to a topic (also called issue) broadly as the “subject of a discourse” [233]. In the context of news coverage, a topic may consist of multiple news events. For example, a news topic might be the 2020 US presidential election. An individual event related to this topic (and of course related to also other topics) is the storming of the US Capitol on January 6, 2021.

The first component of PFA is preprocessing. Downstream analysis components, i.e., methods in target concept analysis and frame analysis, use the information extracted during preprocessing. Our preprocessing includes part-of-speech (POS) tagging, dependency parsing, full parsing, named entity recognition (NER), and coreference resolution [56, 57]. We use Stanford CoreNLP with neural models where available, otherwise using the defaults for the English language [224]. Section 4.3.3.1 details our preprocessing.

The second component of PFA is target concept analysis. Its objective is to identify which persons are mentioned in the news articles passed to the analysis. More specifically, the output of this component is the set of persons mentioned in the news coverage on the event and, for each person, the set of its mentions across all news articles. Therefore, the component performs two tasks (see Fig. 3.1). First, candidate extraction extracts any phrase that might refer to a person. Second, candidate merging resolves these individual mentions, i.e., finds mentions that refer to the same person. Chapter 4 describes our research and methods for target concept analysis.
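To make the input and output of candidate merging concrete, the following toy sketch groups person mentions that share their final token (typically the surname). Real cross-document coreference resolution, as devised in Chap. 4, is far more involved; this only illustrates the shape of the task.

```python
from collections import defaultdict

def merge_candidates(mentions):
    """Toy candidate merging: group extracted person mentions by their
    last token (usually the surname). Purely illustrative; actual
    coreference resolution must also handle pronouns, titles, and
    ambiguous or figurative references."""
    groups = defaultdict(list)
    for mention in mentions:
        key = mention.split()[-1].lower()
        groups[key].append(mention)
    return dict(groups)
```

For instance, `merge_candidates(["Joe Biden", "Biden", "Donald Trump", "Trump"])` yields two groups, one per person, each containing both surface forms.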

The third component of our analysis is frame analysis. This component aims to find groups of articles that similarly frame the event. The component performs two tasks to identify the framing. First, frame analysis determines how each news article portrays the individual persons identified earlier. More specifically, frame analysis determines for each mention how the mention’s local context, e.g., the surrounding sentence, portrays the person referred to by the mention. Chapter 5 details our research and methods for this part of the frame analysis component. Second, frame analysis uses clustering techniques so that articles similarly portraying the individual persons are part of the same framing group (Sect. 6.3).
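The clustering step can be pictured as follows: each article is represented by a vector of per-person portrayal scores, and articles whose vectors are similar end up in the same framing group. This is a minimal sketch under simplifying assumptions (one polarity score per person, greedy single-pass grouping), not the method used in Sect. 6.3.

```python
import math

def similarity(a, b):
    """Cosine similarity between two article framing vectors
    (mapping person -> portrayal polarity in [-1, 1])."""
    persons = set(a) | set(b)
    dot = sum(a.get(p, 0.0) * b.get(p, 0.0) for p in persons)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_articles(vectors, threshold=0.5):
    """Greedy grouping: assign each article to the first group whose
    first member it sufficiently resembles; otherwise open a new group.
    The threshold is illustrative."""
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if similarity(vectors[group[0]], vec) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

Two articles portraying the same persons with the same polarity land in one group; an article with inverted portrayals opens a new group.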

Lastly, our prototype system for bias identification and communication reveals the identified framing groups of articles. We devise visualizations intended to aid in the typical news consumption workflow, i.e., first to get an overview of current events and second to get more details on an event, e.g., by reading one or more individual news articles reporting on it. Chapter 6 details our prototype, visualizations, and large-scale user study to demonstrate the effectiveness of the PFA approach.

In addition to these core components of our analysis, we provide an optional component for data gathering, which can be used before the analysis to collect relevant news articles conveniently. Specifically, we present a web crawler and extractor for news articles. The system, named news-please, takes, for example, a set of URLs pointing to article web pages and extracts structured information, such as title, lead paragraph, and main text. Subsequently, this information can be passed to the PFA analysis. The following section describes the crawler and extractor in more detail.

3.5 Before the Approach: Gathering News Articles

This section details our method and system for crawling and extracting news articles from online news outlets. Besides the need for such a system in this thesis, e.g., to conveniently acquire news coverage on a specific event, there is also a general need in the research community that motivated devising the system. For example, while news datasets such as RCV1 [205] are freely available, researchers often need to compile their own dataset, e.g., to include news published by specific outlets or in a certain time frame. Due to the lack of a publicly available, integrated crawler and extractor for news, researchers often implement such tools redundantly. The process of gathering news data typically consists of two phases: (1) crawling news websites and (2) extracting information from news articles.

Crawling news websites can be achieved using many web crawling frameworks, such as scrapy for Python [188]. Such frameworks traverse the links of websites and hence need to be tailored to the specific use case.

Extracting information from news articles is required to convert the raw data that the crawler retrieves into a format that is suitable for further analysis tasks, such as natural language processing. Information to be extracted typically includes the headline, authors, and main text. Website-specific extractors, such as those used in [234, 270], must be tailored to the individual websites of interest. These systems typically achieve high precision and recall for their extraction task but require significant initial setup effort to customize the extractors to a set of specific news websites. Such website-specific extractors are most suitable when high data quality is essential but the number of different websites to process is low.

Generic extractors are intended to obtain information from different websites without the need for adaption. They use heuristics, such as link density and word count, to identify the information to be extracted. Our literature review and experiments show that Newspaper [396] is currently one of the most sophisticated and best-performing news extractors. It features robust extraction of all major news article elements and supports more than ten languages. Newspaper includes basic crawling, but lacks full website extraction, auto-extraction of new articles, and news content verification, i.e., determining whether a page contains a news article. The extraction performance of other frameworks, such as boilerpipe [185], Goose [185], and readability [114], is lower than that of the Newspaper tool. Furthermore, these latter tools do not offer support for crawling websites.
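The heuristics mentioned above, such as link density and word count, can be illustrated with a toy content filter: blocks of running text tend to be long and contain few links, whereas navigation and boilerplate blocks are short and link-heavy. The function names and thresholds below are illustrative assumptions, not those of any named extractor.

```python
def link_density(text_len, linked_text_len):
    """Fraction of a block's characters that sit inside hyperlinks."""
    return linked_text_len / text_len if text_len else 1.0

def looks_like_content(block_text, linked_chars, min_words=15, max_density=0.3):
    """Toy boilerplate heuristic in the spirit of generic extractors:
    keep a block only if it is long enough and not dominated by links.
    Thresholds are made up for illustration."""
    words = block_text.split()
    if len(words) < min_words:
        return False
    return link_density(len(block_text), linked_chars) <= max_density
```

A navigation bar like "Home News Sports Login", consisting almost entirely of link text, is rejected, while a full paragraph with no links passes.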

To our knowledge, no available tool fully covers both the crawling and extraction phase for news data. Web crawler frameworks require use case-specific adaptions. News extractors lack comprehensive crawling functionality. Existing systems lack several key features, particularly the capability (1) to extract information from all articles published by a news outlet (full website extraction) and (2) to auto-extract newly published articles. With news-please, we provide a system that addresses these two weaknesses using a generic crawling and extraction approach. The following section details the processing pipeline of news-please.

3.5.1 Method

news-please is a news crawler and extractor developed to meet five requirements: (1) broad coverage—extract news from any outlet’s website; (2) full website extraction; (3) high quality of extracted information; (4) ease of use, simple initial configuration; and (5) maintainability. Where possible, news-please combines prior tools and methods, which we extended with functionality to meet the outlined requirements. This section describes the processing pipeline of news-please as shown in Fig. 3.2.

Fig. 3.2 Pipeline for news crawling and extraction. Source: [136]

Root URLs

Users provide URLs that point to the root of news outlets’ websites, e.g., https://nytimes.com/. For each root URL, the following tasks are performed.

Web Crawling

news-please performs two sub-tasks in this phase. (1) The crawler downloads articles’ HTML, using the scrapy framework. (2) To find all articles published by the news outlet, the system supports four techniques: RSS (analyzing RSS feeds for recent articles), recursive (following internal links in crawled pages), sitemap (analyzing sitemaps for links to all articles), and automatic (tries sitemaps and falls back to recursive in the case of an error). The approaches can also be combined, e.g., by starting two news-please instances in parallel, one in automatic mode to get all articles published so far and another instance in RSS mode to retrieve recent articles.
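The mode selection described above, in particular the automatic fallback from sitemap to recursive crawling, can be sketched as simple dispatch logic. The function and parameter names here are illustrative, not the actual news-please API; the discovery strategies are passed in as caller-supplied stubs.

```python
def discover_articles(fetch_sitemap, crawl_recursively, mode="automatic"):
    """Sketch of the discovery strategies described above. In 'automatic'
    mode, try the sitemap first and fall back to recursive link traversal
    if the sitemap cannot be retrieved or parsed. fetch_sitemap and
    crawl_recursively are hypothetical callables returning article URLs."""
    if mode == "sitemap":
        return fetch_sitemap()
    if mode == "recursive":
        return crawl_recursively()
    # automatic mode: sitemap with recursive fallback
    try:
        return fetch_sitemap()
    except Exception:
        return crawl_recursively()
```

Running two such discovery strategies in parallel, e.g., automatic for the archive and RSS for recent articles, mirrors the combined setup mentioned above.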

Extraction

We use multiple news extractors to obtain the desired information, i.e., title, lead paragraph, main content, author, date, main image, and language. In preliminary tests (see Sect. 3.5.2), we evaluated the performance of four extractors (boilerpipe, Goose, Newspaper, and readability). Newspaper yielded the highest extraction accuracy for all news elements combined, followed by readability. Thus, we integrated both extractors into news-please. Because both Newspaper and readability performed poorly for extracting publication dates, we employ a regex-based date extractor [101]. Because none of the extractors is able to determine the language an article is written in, we employ a library for language detection [66]. Our component-based design allows easily adding or removing extractors in the future. Currently, news-please combines the results of the extractors using rule-based heuristics. We discard pages that are likely not articles using a set of heuristics, such as the link-to-headline ratio, and metadata filters.
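To give an idea of regex-based date extraction, the following toy sketch recovers a publication date from a URL path of the form `/YYYY/MM/DD/`, a pattern common in news URLs. It is a deliberately minimal stand-in for the dedicated date extractor cited above, which handles many more date representations.

```python
import re

# Illustrative pattern only: matches /YYYY/MM/DD/ segments in a URL path.
DATE_IN_URL = re.compile(r"/(20\d{2})/(\d{1,2})/(\d{1,2})/")

def date_from_url(url):
    """Return an ISO-formatted publication date guessed from the URL,
    or None if no date-like path segment is found."""
    match = DATE_IN_URL.search(url)
    if not match:
        return None
    year, month, day = (int(g) for g in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"
```

Production date extractors additionally inspect metadata tags and the article text itself, since many outlets omit dates from their URLs.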

Data Storage

news-please currently supports writing the extracted data to JSON files and to an Elasticsearch interface.

Besides the crawling and extraction workflow outlined previously, news-please supports two further use cases. First, users can directly use the extraction functionality (without the crawling part) on single URLs pointing to individual online news articles to retrieve the articles’ content as structured information, e.g., consisting of the title and the other categories outlined previously. Second, news-please allows users to conveniently access the Common Crawl News Archive [256], which, as of writing, consists of over 400M potential articles gathered from more than 50k potential news sources.Footnote 3 Especially the extraction functionality for Common Crawl has been used frequently during the individual research parts summarized in this thesis (see Sect. 3.5.3).

3.5.2 Evaluation

We conducted a preliminary, quantitative evaluation in which we asked four assessors (computer science students aged between 19 and 25; three male, one female) to rate the quality of the information extracted by news-please and the four extractors described in Sect. 3.5, i.e., Newspaper, readability, Goose, and boilerpipe. We selected 20 articles from 20 news websites (the top 15 news outlets by global circulation and 5 major outlets in Germany) and manually assessed the quality of the extracted information using a 4-point Likert scale. Our multi-graded relevance assessment comprises four categories: (A) perfect; (B) good, i.e., the beginning of an element is extracted correctly, but later information is missing or information from other elements is wrongfully added; (C) poor, i.e., in addition to (B), the beginning of an element is not extracted entirely correctly; and (D) unusable, i.e., much information is missing or stems from other elements. After two to three training iterations with the individual assessors, each including a discussion of their previous annotations, the inter-rater reliability was sufficiently high (IRR = 0.78, measured as average pairwise percentage agreement).
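Average pairwise percentage agreement can be computed as follows: for every pair of assessors, take the fraction of items they rated identically, then average over all pairs. The ratings below are invented for illustration and do not reproduce the actual study data:

```python
from itertools import combinations

def pairwise_agreement(ratings_by_assessor):
    """Mean, over all assessor pairs, of the fraction of identically rated items."""
    pair_scores = []
    for a, b in combinations(ratings_by_assessor, 2):
        matches = sum(x == y for x, y in zip(a, b))
        pair_scores.append(matches / len(a))
    return sum(pair_scores) / len(pair_scores)

# Four assessors rating five items on the A-D scale (invented data).
ratings = [
    ["A", "B", "A", "C", "A"],
    ["A", "B", "A", "B", "A"],
    ["A", "B", "B", "C", "A"],
    ["A", "A", "A", "C", "A"],
]
irr = pairwise_agreement(ratings)
```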

Table 3.1 shows the mean average generalized precision (MAgP), a score suitable for multi-graded relevance assessments [168]. The MAgP of news-please was 70.6 over all dimensions, assessors, and articles. Moreover, news-please yielded the best extraction performance both overall and for each individual extraction dimension, except for main image and author. It performed particularly well for title (MAgP = 82.0), description (70.0), date (70.0), and main image (76.0). For main content (63.6) and author (30.3), performance was worse but still better than or similar to that of the other approaches. Overall, news-please performed better than each of the integrated extractors individually.

Table 3.1 MAgP performance of news-please and other news extractors. “Desc.” refers to the description, i.e., the lead paragraph; “Img.” to the article’s main image; and “Lang.” to the language the article is written in

3.5.3 Conclusion

We presented news-please, the first integrated crawler and information extractor designed explicitly for news articles. The system is designed to crawl all articles of a news outlet, including articles published during the crawling process, and combines the results of three state-of-the-art extractors. For high maintainability and extensibility, news-please allows for the inclusion of additional extractors and for adaptation to use-case-specific requirements, e.g., by adding an SQL result writer.

Our quantitative evaluation with four assessors found that news-please overall achieves a higher extraction quality than the individual extractors. By integrating both the crawling and the extraction task, researchers can gather news faster and with less initial and long-term effort.

Within the context of this thesis, the system provides a convenient way to collect news articles that can then be analyzed for bias and subsequently visualized. We find that the system effectively reduces the amount of manual work required in many use cases of this thesis, i.e., during the creation of our datasets for event detection (Sect. 4.2), coreference resolution and frame properties (Sect. 4.3), target-dependent sentiment classification (Chap. 5), and, finally, the user study (Chap. 6). In all cases, we use news-please to gather news articles and extract structured information from them, which we then manually revise for extraction errors. Due to the high average extraction performance, the amount of manual work required for revising the data is much lower than for manually extracting the articles’ data from their respective web pages. Other researchers have used the output of news-please without manual verification. For example, Liu et al. [214] used our system to create part of the large-scale dataset with which they pre-trained the widely used deep language model RoBERTa.

The system and code are available at

https://github.com/fhamborg/news-please.

3.6 Summary of the Chapter

This chapter proposed person-oriented framing analysis (PFA), our approach to reveal biases in news articles. By discussing the findings of our literature review in the context of our research question, we narrowed down our intentionally broadly defined and open research question to a specific research objective. Specifically, we concluded that of the three conceptual means to address media bias, the following two are the most promising to address our research question effectively: bias analysis, which aims to identify biases present in news coverage, and bias communication, which aims to inform news consumers about such biases.

Methodologically, we narrowed down our research question to identifying person-oriented frames as the effects of person-targeting bias forms, especially bias by source selection, commission and omission of information, and word choice and labeling. According to the news production process, these three bias forms jointly represent all means on the text level to affect a person’s portrayal directly. Focusing on persons seems promising since news coverage on policy issues is fundamentally about persons, such as individuals in society affected by political decisions or politicians making such decisions. Thus, we hypothesize that our design can cover a wide range of substantial biases while avoiding the issues that would arise if we were to analyze all bias forms, e.g., infeasibly high annotation cost or unreliable methods. In Chap. 6, we will investigate the strengths and limitations resulting from these design decisions.

In sum, the PFA approach takes a set of news articles written in English reporting on the same policy event. The analysis employed by PFA consists of three components. First, we employ preprocessing. Second, we perform target concept analysis to identify and resolve persons mentioned in the news articles. Third, we perform automated frame analysis to identify how each news article portrays the individual persons, also on the sentence level.
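Structurally, the three components form a sequential pipeline over a set of same-event articles. The sketch below illustrates only this structure; the function bodies are invented placeholders, not the actual methods introduced in the following chapters:

```python
def preprocess(texts):
    """Placeholder preprocessing: e.g., split each article into sentences."""
    return [{"text": t, "sentences": t.split(". ")} for t in texts]

def target_concept_analysis(articles):
    """Placeholder: identify and resolve persons mentioned across the articles.
    The real component performs cross-document coreference resolution."""
    for a in articles:
        a["persons"] = ["Person X"]  # invented result
    return articles

def frame_analysis(articles):
    """Placeholder: determine how each article portrays each person,
    also on the sentence level in the real component."""
    for a in articles:
        a["portrayal"] = {p: "neutral" for p in a["persons"]}  # invented result
    return articles

def pfa(texts):
    """Run the three PFA components in sequence."""
    return frame_analysis(target_concept_analysis(preprocess(texts)))

result = pfa(["Person X announced a new policy. Critics disagreed."])
```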

A side contribution of this chapter is a system for news crawling and extraction, designed to conveniently gather news articles, e.g., so that they can subsequently be analyzed using the PFA approach. We also use this news extractor throughout this thesis to create our training and test datasets (see Sect. 3.5.3).

In the following chapters, we will introduce the individual analysis components of PFA. For each analysis component, we will explore different methods to tackle the respective component’s goals. Afterward, we will demonstrate the effectiveness of the PFA approach in a large-scale user study.