
1 Introduction

The impact of news on our daily affairs is greater than it has ever been. Fabrication and dissemination of falsehood have become politically lucrative endeavors, thereby harming public discourse and worsening political polarization [1]. These motivations have led to a complex and continuously evolving phenomenon mainly characterized by dis- and misinformation, commonly collectively referred to as fake news [2]. This denotes various kinds of false or unverified information, which may vary based on their authenticity, intention, and format [3]. Shu et al. [4] define it as “a news article that is intentionally and verifiably false.”

The dissemination of fake news is increasing, and because it appears in various forms and is self-reinforcing [1, 3], it is difficult to counteract. There is therefore an urgent need for more research on understanding and curbing it. This paper considers fake news that appears in the form of long online articles and explores the extent of internal consistency within fake news vis-à-vis legitimate news. In particular, we run experiments to determine whether thematic deviations—i.e., a measure of how dissimilar the topics discussed in different parts of an article are—between the opening and remainder sections of texts can be used to distinguish between fake and real news across different news domains.

1.1 Motivation

A recent study suggests that some readers may skim through an article instead of reading the whole content because they overestimate their political knowledge, while others may hastily share news without reading it fully, seeking emotional affirmation [5]. This presents bad actors with the opportunity to deftly intersperse news content with falsehoods. Moreover, the production of fake news typically involves the collation of disjoint content and lacks a thorough editorial process [6].

Topics discussed in news pieces can be studied to ascertain whether the article thematically deviates between its opening and the rest of the story, or if it remains coherent throughout. Thematic analysis is useful here for two reasons. First, previous studies show that the coherence between units of discourse (such as sentences) in a document is useful for determining its veracity [6, 7]. Second, analysis of thematic deviation can identify general characteristics of fake news that persist across multiple news domains.

Although topics have been employed as features [8,9,10], they have not been applied to study the unique characteristics of fake news. Research efforts in detecting fake news through thematic deviation have thus far focused on spotting incongruences between pairs of headlines and body texts [11,12,13,14]. Yet, thematic deviation can also exist within the body text of a news item. Our focus is to examine these deviations to distinguish fake from real news.

To the best of the authors’ knowledge, this is the first work that explores thematic deviations in the body text of news articles to distinguish between fake and legitimate news.

2 Related Work

The coherence of a story may be indicative of its veracity. For example, [7] demonstrated this by applying Rhetorical Structure Theory [15] to study the discourse of deceptive stories posted online. They found that a major distinguishing characteristic of deceptive stories is that they are disjunctive. Also, while truthful stories provide evidence and restate information, deceptive ones do not. This suggests that false stories may tend to thematically deviate more due to disjunction, while truthful stories are likely to be more coherent due to restatement. Similarly, [6] investigated the coherence of fake and real news by learning hierarchical structures based on sentence-level dependency parsing. Their findings also suggest that fake news documents are less coherent.

Topic models are unsupervised algorithms that aid the identification of themes discussed in large corpora. One example is Latent Dirichlet Allocation (LDA), a generative probabilistic model that aids the discovery of latent themes or topics in a corpus [16]. Vosoughi et al. [17] used LDA to show that false rumor tweets tend to be more novel than true ones. Novelty was evaluated using three measures: Information Uniqueness, Bhattacharyya Distance, and Kullback-Leibler Divergence. Likewise, [18] used LDA to assess the credibility of Twitter users by analyzing the topical divergence of their tweets from those of other users. They also assessed the veracity of users’ tweets by comparing the topic distributions of new tweets against historically discussed topics. Divergence was computed using the Jensen-Shannon Divergence, Root Mean Squared Error, and Squared Error. Our work primarily differs from these two in that we analyze full-length articles instead of tweets.

3 Research Goal and Contributions

This research aims to assess the importance of internal consistency within articles as a high-level feature for distinguishing between fake and real news stories across different domains. We set out to explore whether the opening segments of fake news thematically deviate from the rest of the article significantly more than in authentic news. We experiment with seven datasets which collectively cover a wide variety of news domains, from business to celebrity to warfare. Deviations are evaluated by calculating the distance between the topic distribution of the opening part of an article and that of its remainder. We take the first five sentences of an article as its opening segment.

Our contributions are summarized as follows:

  1.

    We present new insights towards understanding the underlying characteristics of fake news, based on thematic deviations between the opening and remainder parts of news body text.

  2.

    We carry out experiments on seven cross-domain datasets. The results demonstrate the effectiveness of thematic deviation for distinguishing fake from real news.

4 Experiments

We hypothesize that the opening sentences of a false news article tend to thematically deviate more from the rest of the article than those of an authentic article do. To test this hypothesis, we carried out experiments following the procedure shown in Algorithm 1. We use Python and open-source packages for all computations.

Procedure.

All articles (\( S_{bg} \)) are split into two parts: their first \( x \) sentences and the remaining \( y \) sentences. Next, \( N \) topics are obtained from \( x \) and \( y \) using an LDA model trained with Gensim on the entire dataset. For topics \( i = 1, \ldots, N \), let \( p_{x} = \left( p_{x1}, \ldots, p_{xN} \right) \) and \( p_{y} = \left( p_{y1}, \ldots, p_{yN} \right) \) be two vectors of topic distributions, where \( p_{xi} \) and \( p_{yi} \) denote the prevalence of topic \( i \) in the opening text \( x \) and the remainder \( y \) of an article, respectively. The following metrics are used to measure the topical divergence between parts \( x \) and \( y \) of an article:

  • Chebyshev (\( D_{Ch} \)):

    $$ D_{Ch}\left( p_{x}, p_{y} \right) = \max_{i} \left| p_{xi} - p_{yi} \right| $$
    (1)
  • Euclidean (\( D_{E} \)):

    $$ D_{E}\left( p_{x}, p_{y} \right) = \left\| p_{x} - p_{y} \right\| = \sqrt{\sum\nolimits_{i = 1}^{N} \left( p_{xi} - p_{yi} \right)^{2}} $$
    (2)
  • Squared Euclidean (\( D_{SE} \)):

    $$ D_{SE}\left( p_{x}, p_{y} \right) = \sum\nolimits_{i = 1}^{N} \left( p_{xi} - p_{yi} \right)^{2} $$
    (3)

Intuitively, Chebyshev distance is the greatest difference found between any two topics in \( x \) and \( y \). The Euclidean distance measures how “far” the two topic distributions are from one another, while the Squared Euclidean distance is simply the square of that “farness”.
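To make the three measures concrete, the following is a minimal NumPy sketch (our own illustration, not code released with this work) that computes Eqs. (1)–(3) for two topic-distribution vectors of equal length; the example vectors at the end are hypothetical:

```python
import numpy as np

def chebyshev(p_x: np.ndarray, p_y: np.ndarray) -> float:
    # Eq. (1): the largest absolute difference between any pair of topic probabilities
    return float(np.max(np.abs(p_x - p_y)))

def euclidean(p_x: np.ndarray, p_y: np.ndarray) -> float:
    # Eq. (2): L2 distance between the two topic distributions
    return float(np.linalg.norm(p_x - p_y))

def squared_euclidean(p_x: np.ndarray, p_y: np.ndarray) -> float:
    # Eq. (3): square of the Euclidean distance
    return float(np.sum((p_x - p_y) ** 2))

# Hypothetical 5-topic distributions for the opening (p_x) and remainder (p_y) of one article
p_x = np.array([0.40, 0.30, 0.10, 0.10, 0.10])
p_y = np.array([0.05, 0.25, 0.30, 0.20, 0.20])
print(chebyshev(p_x, p_y), euclidean(p_x, p_y), squared_euclidean(p_x, p_y))
```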

Finally, the average and median values of each distance are calculated across all fake (\( S_{f} \)) and real (\( S_{r} \)) articles. We repeated these steps with varying values of \( N \) (from 10 to 200 topics) and \( x \) (from 1 to 5 sentences).
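As a sketch of how the per-article deviations might be obtained and aggregated, Gensim's get_document_topics can return a full topic distribution for each part of an article, which is then passed to the distance functions sketched above. The names lda, dictionary, num_topics, fake_articles, and real_articles are placeholders assumed for illustration; the paper does not publish its implementation:

```python
import numpy as np
from gensim.models import LdaModel

def topic_vector(lda: LdaModel, bow, num_topics: int) -> np.ndarray:
    # Dense topic distribution for one bag-of-words document part;
    # minimum_probability=0.0 makes Gensim return every topic, not only the dominant ones.
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

def article_deviation(lda, dictionary, opening_tokens, remainder_tokens, num_topics):
    # Compare the first x sentences (opening) with the rest of the article (remainder).
    p_x = topic_vector(lda, dictionary.doc2bow(opening_tokens), num_topics)
    p_y = topic_vector(lda, dictionary.doc2bow(remainder_tokens), num_topics)
    return chebyshev(p_x, p_y), euclidean(p_x, p_y), squared_euclidean(p_x, p_y)

# fake_articles / real_articles: lists of (opening_tokens, remainder_tokens) pairs (placeholder names)
fake_cheb = np.array([article_deviation(lda, dictionary, op, rem, num_topics)[0] for op, rem in fake_articles])
real_cheb = np.array([article_deviation(lda, dictionary, op, rem, num_topics)[0] for op, rem in real_articles])
print(np.mean(fake_cheb), np.median(fake_cheb), np.mean(real_cheb), np.median(real_cheb))
```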

Algorithm 1. Experimental procedure (figure not reproduced here).

Pre-processing.

Articles are split into sentences using the NLTK package. Each sentence is tokenized and lowercased to form a list of words, from which stop words are removed. Bigrams are then formed and added to the vocabulary. Next, each document is lemmatized using spaCy, and only noun, adjective, verb, and adverb lemmas are retained. A dictionary is formed by applying these steps to \( S_{bg} \). Each document is converted into a bag-of-words (BoW) format, which is used to create an LDA model (\( {\mathcal{M}}_{bg} \)). Fake and real articles are subsequently pre-processed in the same way (i.e., from raw text to BoW format) before topics are extracted from them.
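The pipeline below is a minimal, self-contained sketch of these pre-processing steps with NLTK, spaCy, and Gensim. It is an approximation under stated assumptions: all_articles and num_topics are placeholder names, and the exact ordering of bigram detection relative to lemmatization, as well as thresholds such as min_count, are our guesses rather than details given in the text:

```python
import nltk
import spacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

nltk.download("punkt", quiet=True)                   # sentence tokenizer models
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
KEEP_POS = {"NOUN", "ADJ", "VERB", "ADV"}            # retain only these lemma types

def preprocess(article_text: str) -> list:
    # Split into sentences, tokenize and lowercase, drop stop words.
    sentences = nltk.sent_tokenize(article_text)
    tokens = [t for s in sentences for t in simple_preprocess(s) if t not in STOPWORDS]
    # Lemmatize with spaCy, keeping noun, adjective, verb, and adverb lemmas only.
    return [tok.lemma_ for tok in nlp(" ".join(tokens)) if tok.pos_ in KEEP_POS]

docs = [preprocess(text) for text in all_articles]   # all_articles: raw texts of S_bg (placeholder)
bigram = Phrases(docs, min_count=5)                  # detect frequent bigrams
docs = [doc + [t for t in bigram[doc] if "_" in t] for doc in docs]  # add bigrams to the vocabulary

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]   # BoW representation of S_bg
num_topics = 50                                      # e.g., one value from the 10-200 range explored
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=0)
```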

We consider the opening sentences of an article to be sufficient for capturing the lead or “opening theme” of the story, which will typically fall within the first paragraph. The first paragraph itself may in some cases be either too short or too long for this purpose, especially in fake articles, which often lack a proper structure. Overly short or lengthy texts would affect topic extraction more adversely than a fixed number of sentences does.

Datasets.

Table 1 summarizes the datasets (after pre-processing) used in this study and lists the domains (as stated by each dataset provider) covered by each. Average article length (Avg. length) is measured by the number of words that remain after pre-processing, while the maximum article length (Max. length) is measured in number of sentences. We use the following datasets:

  • BuzzFeed-Webis Fake News Corpus 2016 (BuzzFeed-Web) [19]

  • BuzzFeed Political News Data (BuzzFeed-Political) [20]

  • FakeNewsAMT + Celebrity (AMT + C) [21]

  • Falsified and Legitimate Political News Database (POLIT)

  • George McIntire’s fake news dataset (GMI)

  • University of Victoria’s Information Security and Object Technology (ISOT) Research Lab dataset [22]

  • Syrian Violations Documentation Centre (SVDC) [23]

Table 1. Summary of datasets after pre-processing (f – fake, r – real).

Evaluation.

We evaluate differences in the coherence of fake and real articles using the T-test at the 5% significance level. The null hypothesis is that the mean coherence of fake and real news is equal. The alternative hypothesis is that the mean coherence of real news is greater than that of fake news. We expect that there will be greater topic deviation in fake news and thus that its coherence will be lower than that of real news.
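A sketch of how this one-sided test could be run with SciPy (version 1.6 or later for the alternative argument) is shown below; fake_cheb and real_cheb are assumed to hold the per-article \( D_{Ch} \) deviations computed earlier, and the use of the independent two-sample variant is our assumption rather than a detail stated here:

```python
from scipy import stats

# H0: mean deviation of fake and real news is equal (equivalently, equal mean coherence).
# H1: fake news deviates more on average, i.e., it is less coherent than real news.
t_stat, p_value = stats.ttest_ind(fake_cheb, real_cheb, alternative="greater")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 5% level: fake news shows significantly greater thematic deviation.")
else:
    print("Fail to reject H0 at the 5% level.")
```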

5 Results and Discussion

Results of the experimental evaluation using the different divergence measures are shown in Table 2. We observe that fake news is generally likely to show greater thematic deviation (lower coherence) than real news in all datasets. Table 3 shows the mean and median \( D_{Ch} \) deviations of fake and real articles across \( N = \{10, 20, 30, 40, 50, 100, 150, 200\} \) topics. Although the results for AMT+C and BuzzFeed-Web are not statistically significant according to the T-test and therefore do not meet our expectations, the results for all other datasets are. Nonetheless, the mean and median values for fake news are lower than those of real news for these datasets. Figure 1 shows mean and median results for comparing topics in the first five sentences with those in the remaining sentences. Results for values of \( N \) not shown are similar (with \( D_{Ch} \) gradually decreasing as \( N \) increases).

Table 2. Results of T-test evaluation based on different measures of deviation used.
Table 3. Mean and median \( D_{Ch} \) deviations across \( N = \{10, 20, 30, 40, 50, 100, 150, 200\} \) topics, combined for fake and real news (f – fake, r – real).
Fig. 1. Average and median Chebyshev distances in fake and real news when comparing topics in the first five sentences to those in the rest of each article. Error bars show 95% confidence intervals.

We found that comparing the first five sentences to the rest of the article yielded the best results (i.e., the greatest disparity between fake and real deviations) for most datasets and measures. This is likely because five sentences carry more information than shorter openings: five successive sentences are likely to build on one another and contribute more towards a topic than a single sentence does.

It is worth highlighting the diversity of the datasets used here, in terms of domain, size, and the nature of the articles. For example, the fake and real news in the SVDC dataset have a very similar structure. Both types of news were mostly written with the motivation to inform the reader about conflict-related events that took place across Syria. However, fake articles are labeled as such primarily because the reportage in them (e.g., on locations and numbers of casualties) is insufficiently accurate.

To gain insight into possible causes of the greater deviation in fake news, we qualitatively inspected the five most and least diverging fake and real articles (according to \( D_{Ch} \)). We also compared results for small and large numbers of topics (\( N \le 30 \) and \( N \ge 100 \)). We observed that fake openings tend to be shorter, vaguer, and less congruent with the rest of the text. By contrast, real news openings generally give a better narrative background to the rest of the story.

Although the writing style in fake news is sometimes unprofessional, this is an unlikely reason for the higher deviations in fake news. Both fake and real news try to expand on the opening lead of the story with more context and explanation. Indeed, we observed that real news tends to have longer sentences, which give more detailed information about a story and are more narrative. It can be argued that this is because fake articles are designed to grab readers’ attention, whereas legitimate ones are written to inform the reader. For instance, social media posts that include a link to an article are sometimes displayed with a short snippet of the article’s opening text or its summary. This snippet can be crafted to capture readers’ attention.

It is also conceivable that a bigger team of people working to produce a fake piece may contribute to its vagueness. They may introduce different perspectives that diversify the story and make it less coherent. By comparison, real news is typically written by one or two professional writers and is therefore more coherent.

6 Conclusion

Fake news and deceptive stories tend to open with sentences that may be incoherent with the rest of the text. It is worth exploring whether this internal consistency can be used to distinguish fake from real news. Accordingly, we investigated the thematic deviations of fake and real news in seven cross-domain datasets using topic modeling. Our findings suggest that the opening sentences of fake articles topically deviate more from the rest of the article than those of real news. The next step is to find possible reasons behind these deviations through in-depth analyses of topics. In conclusion, this paper presents valuable insights into thematic differences between fake and authentic news, which may be exploited for fake news detection.