1 Introduction

The rise of online social media such as Twitter has created ever-growing opportunities for the production, dissemination and consumption of fake news [1, 2]. Fake news refers to stories that are false: the story itself is fabricated, with no verifiable facts, sources or quotes [3]. It often aims to mislead public opinion, disturb social order, and damage the credibility of social media [4]. With the COVID-19 pandemic, related fake news has spread widely across social media [5]. Fake news not only threatens the physical health of the public but also affects their mental health by inducing anxiety and fear [6]. Therefore, it is urgent to develop effective methods to prevent fake news from causing serious negative effects.

Current research on fake news detection aims to determine the truthfulness of a given claim [7]. Among existing methods, content-based methods verify the validity of news from its headlines and body content [8], and learn latent textual or visual feature representations of fake news from contextual information [9]. However, learning the representation of fake news from contextual information faces two major challenges:

  • Challenge 1: Feature sparsity of text. News items, such as tweets, headlines and online news, have become an important form of information sharing [10]. News text usually consists of one or a few short sentences, which lack syntactic and contextual structure [11, 12]. Recently, Graph Neural Networks (GNNs) have been widely used to learn text representations and alleviate the sparsity of text features in text classification [13]. Heterogeneous Graph ATtention networks (HGAT) embed the Heterogeneous Information Network (HIN) for short text classification based on node-level and type-level attention [14]. The Short Text Graph Convolutional Network (STGCN) uses word and sentence representations generated from word co-occurrence, document-word relations and text topic information as classification features [15]. These GNNs capture semantic and sequential features in local consecutive word sequences well, but they ignore global relations between the words of a news item and fail to capture the relations among its features [16].

  • Challenge 2: Ineffectiveness in capturing non-consecutive and long-range word dependencies. Most current deep learning methods only capture local features (messages or word-level syntax) within a small window, and therefore fail to capture non-consecutive and long-range semantic relations among words [13]. They also ignore global word co-occurrence in a corpus, which carries non-consecutive and long-distance semantics [17]. Recently, GNNs have attracted wide attention; they treat text not as a sequence but as a set of co-occurring words [18]. The Text Graph Convolutional Network (TextGCN) builds a heterogeneous word-document graph for the whole corpus and turns document classification into node classification [17]. Tensor Graph Convolutional Networks (TensorGCN) build a text graph tensor from contextual information and apply two kinds of propagation learning [19]. However, GCNs perform worse when integrating high-order information by stacking convolutional layers [7].

To address the aforementioned challenges, this paper proposes the Intra-graph and Inter-graph Joint Information Propagation Network (abbreviated as IIJIPN) with a Third-order Text Graph Tensor for fake news detection, which exploits a variety of relations between words and allows information to be disseminated first within and then among graphs. In particular, data augmentation is used to address data imbalance and the shortage of labeled data in the corpus. We present the Third-order Text Graph Tensor (abbreviated as TTGT) with sequential, syntactic, and semantic features to describe contextual information. We then propose Intra-graph and Inter-graph Joint Information Propagation (abbreviated as IIJIP) to propagate homogeneous and heterogeneous information, respectively. Finally, the news representation is generated by an attention mechanism consisting of graph-level and node-level attention, and the resulting representation is fed into a fake news classifier. The main contributions of our work can be summarized as follows:

  • Third-order Text Graph Tensor (abbreviated as TTGT). A Third-order Text Graph Tensor built from three features is proposed to capture contextual information reflecting different language properties. Sequential-based, syntactic-based, and semantic-based text graphs are constructed to form a text graph tensor. Sequential features are extracted with point-wise mutual information to depict local word co-occurrence; syntactic features are described with word dependencies to depict the grammatical structure prescribed by a formal grammar; semantic features are represented with topics and topic-related key words to depict word meaning. Compared with similar works that only incorporate independent sequential or semantic information for word embedding learning, our method jointly learns sequential, semantic and syntactic information over heterogeneous graphs. In addition, our work proposes novel constructions of the text graph edge weights for the sequential, syntactic and semantic features, and a novel method of semantic information extraction based on topics and topic-related key words.

  • Intra-graph and Inter-graph Joint Information Propagation (abbreviated as IIJIP). To encode the heterogeneous information from multiple graphs, two levels of information propagation are proposed in IIJIP: intra-graph information propagation is first performed in each graph to realize homogeneous information interaction, where a graph is built for each text property and high-order homogeneous interaction is achieved by stacking propagation layers; inter-graph information propagation is then performed among the text graphs to realize heterogeneous information interaction, and by connecting nodes across graphs, the heterogeneous information in one graph can be gradually fused into the others. To the best of our knowledge, this is the first work to perform high-order homogeneous information interaction within each graph followed by heterogeneous information interaction among graphs.

The organization of this paper is as follows. Section 2 provides a brief review of existing work on fake news detection. Section 3 introduces our method: first, data normalization and data augmentation are described; then the overall architecture of our method is introduced in detail. Section 4 presents the experimental results on four datasets with detailed analysis. Section 5 summarizes our work and briefly describes future work.

2 Related work

Fake news detection can be considered a binary text classification task that judges whether a piece of news is true or not [20]. In this section, we first review related work on text classification and then detail the methods for fake news detection.

2.1 Text classification

2.1.1 Machine learning methods

Text classification refers to the process of assigning predefined labels to documents [21]. Numerous approaches have been proposed. Traditional classification methods are based on statistical learning models [21]. A Bag-Of-Words (BOW) model has been proposed that enriches rare information with related terms [22], and a Vector Space Model (VSM) improves the term weighting schemes for short text classification [23]. However, these traditional methods disregard the contextual information in textual data and ignore the semantic information of words.

2.1.2 Deep learning methods

Compared with traditional classification methods, deep learning methods automatically learn semantic representations from text. TextCNN uses convolution layers with filters to extract local features from text for classification [24]. However, it focuses only on the locality of words and thus misses long-distance and non-consecutive word interactions. The Bidirectional Gated Recurrent Unit can make up for the deficiency of the Convolutional Neural Network (CNN) in extracting semantic information from long text. Self-Attention-Based BiLSTM uses Bi-directional Long Short-Term Memory (Bi-LSTM) to obtain context-related semantic features for classifying short texts [25]. The attention mechanism can strengthen the learning of long-distance dependencies by assigning different weights to different words. Attention-based bidirectional long short-term memory with a convolution layer (AC-BiLSTM) was proposed to address the high dimensionality and sparsity of text data; it captures both the local features of phrases and global sentence semantics [26]. Heterogeneous Graph Attention networks (HGAT) embed the HIN for short text classification with both node-level and type-level attention to address semantic sparsity [14]. The shortcoming of these deep learning methods is that they cannot distinguish the contribution of each word to the entire sentence.

2.1.3 Graph neural networks

Graph Neural Networks have achieved better performance on text classification by encoding the syntactic structure of sentences. TextGCN constructs a single graph for the whole corpus with global relations between documents and words [17]. However, the context-aware word relations within each document are neglected. As an improvement, TextING [18] builds an individual graph for each document based on sequential information and performs text-level word interactions. In addition, hypergraph attention networks (HyperGAT) model each document as a document-level hypergraph with node-level and edge-level attention to capture high-order word interactions [27]; its inductive text representation learning removes the need to access test documents during training and reduces computational cost. SHINE is another GNN for short text classification; it models the text dataset as a hierarchical heterogeneous graph and learns the document graph by dynamically propagating labels among texts [16]. However, these methods mainly consider sequential information rather than semantic information.

2.2 Fake news detection

2.2.1 Deep learning methods

Deep learning methods such as Recurrent Neural Networks (RNN), CNN and LSTM have been used for fake news detection. A deep CNN extracts discriminative features at each layer for fake news detection [28]. Semantic information in the text is extracted by an improved RNN for fake news detection [29]. However, polysemy is pervasive in natural language, and these deep learning methods cannot precisely distinguish the concrete meaning of a word in its context.

2.2.2 Pre-trained language models

Pre-trained language models, such as Bidirectional Encoder Representations from Transformers (BERT) [30], can accurately learn the meaning of a word depending on its context. Better performance can be achieved by fine-tuning pre-trained models for fake news classification. FakeBERT, a BERT-based deep learning approach, combines different parallel blocks of a single-layer deep CNN to address the issue of ambiguity [31]. BERT-BiLSTM-Capsule uses BERT to obtain word encodings for sentence-level propaganda classification [32]. ALBERT, which increases the training speed of BERT, incorporates factorized embedding parameterization and cross-layer parameter sharing to reduce parameters [33]. An ensemble of three transformer models (BERT, ALBERT, and XLNet) has been proposed to analyze and detect COVID-19 fake news [34]. However, pre-trained methods perform poorly in specialized domains because domain-specific phrases are rare in general corpora.

2.2.3 Graph neural networks

Compared with pre-trained methods, GNNs can successfully describe complex relationships and inter-dependencies with graphs [35]. Propagation patterns and social context have been introduced to identify fake news [2, 8, 13, 36]. A propagation-based method integrates GNNs to distinguish the propagation patterns of fake news from those of real news [8]. Explainable FakE News Detection (dEFEND) on social media exploits news contents and user comments jointly, capturing explainable user comments with a sentence-comment co-attention sub-network [36]. The Knowledge-driven Multimodal Graph Convolutional Network (KMGCN) exploits multi-modal content information and additional background knowledge for fake news detection [13]. The Temporally evolving Graph Neural network for Fake news detection (TGNF) captures dynamic evolution patterns of news propagation from the perspective of continuous time [2]. However, these methods mainly focus on multi-modal posts and are not appropriate for plain-text news in social media.

3 Method

Given a tweet about a news event, the main goal of our approach is to infer whether it is fake news based only on its text. In this section, we propose the Intra-graph and Inter-graph Joint Information Propagation Network (abbreviated as IIJIPN) with a Third-order Text Graph Tensor for fake news detection. IIJIPN jointly explores text feature extraction, information propagation and attention mechanisms. The overall architecture of IIJIPN is shown in Fig. 1 and includes four parts:

  1. Third-order Text Graph Tensor (abbreviated as TTGT). Sequential, syntactic, and semantic features are utilized to describe contextual information.

  2. Intra-graph and Inter-graph Joint Information Propagation (abbreviated as IIJIP). It realizes homogeneous and heterogeneous information interaction, respectively.

  3. Graph-level and node-level attention mechanism. It captures the importance of text graphs and of nodes, respectively.

  4. Classifier. It is used to generate the label of the news.

Fig. 1

Architecture of Intra-graph and Inter-graph Joint Information Propagation Network (abbreviated as IIJIPN) with Third-order Text Graph Tensor for fake news detection

Given a news item x = (e1,e2,⋯ ,en) containing n words, the sequential, syntactic, and semantic features of x are first extracted, and the sequential graph Gseq, syntactic graph Gsyn, and semantic graph Gsem are built. Information propagation is then performed within each graph and between graphs, realizing the interaction of homogeneous and heterogeneous context information and updating the graph representations to obtain \(G_{seq}^{\prime \prime }\), \(G_{syn}^{\prime \prime }\), and \(G_{sem}^{\prime \prime }\). The graph attention mechanism then computes the final graph representation Gf and feeds it to the node attention mechanism, which produces the node representation wf. Finally, wf is sent to the classifier to predict the classification label \(\hat {y}\) of x.

3.1 Text normalization

Since the dataset contains noisy content such as punctuation and irrelevant characters, we first normalize the text as follows (a minimal code sketch is given after the list):

  1. Text cleaning: removing HTML tags, punctuation, special symbols and any other irrelevant characters.

  2. Tokenization: splitting the whole text into individual words.

  3. Stop word removal: removing prepositions, pronouns, conjunctions, and auxiliary words that occur frequently but contribute little to the meaning of the text.

  4. Stemming and lemmatization: identifying the common root form of a word by removing or replacing word suffixes (e.g. “played” is stemmed to “play”, and “apples” to “apple”).

  5. Lowercasing: converting all characters to lowercase.
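The following sketch traces these five steps for a single news text. The tool choices (regular expressions for cleaning and NLTK for tokenization, stop words and lemmatization) are our own illustrative assumptions; the paper does not specify its preprocessing toolkit.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resources: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

def normalize(text):
    """Minimal sketch of the normalization pipeline in Sec. 3.1 (tooling assumed, not the authors' code)."""
    text = re.sub(r'<[^>]+>', ' ', text)              # 1. text cleaning: strip HTML tags
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)       #    remove punctuation and special symbols
    tokens = nltk.word_tokenize(text)                 # 2. tokenization
    stop = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.lower() not in stop]    # 3. stop word removal
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]       # 4. lemmatization ("apples" -> "apple")
    return [t.lower() for t in tokens]                # 5. lowercase
```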

3.2 Data augmentation

In this work, experiments and analysis are based on four datasets: Liar, Constraint, Twitter15, and Twitter16. One problem when training the model on Liar is data imbalance, where at least one target class has fewer instances than the others [37]; the ratio of true news to fake news in Liar is 2.6:1. Another problem, in Twitter15 and Twitter16, is the small corpus size and the resulting shortage of training data: Twitter15 has 742 news items and Twitter16 only 412. The performance of a model is affected by the size and quality of its data [38]. We therefore adjust the number of news items to address both issues: increasing the number of fake news items until it equals the number of true news items addresses the data imbalance in Liar, and increasing the total number of news items addresses the small corpus size.

A method for data augmentation is proposed. It consists of two text editing operations: synonym replacement (SR, replacing a word with one of its synonyms chosen at random) and random insertion (RI, finding a random synonym of a random word and inserting it at a random position in the sentence). Given a sentence, each operation generates a variant sentence. Examples of augmented sentences are shown in Table 1. During model training, SR and RI are used to generate two variant sentences for each original sentence, adjusting the size of the training set to address the data imbalance in Liar and the small corpora of Twitter15 and Twitter16.

Table 1 The original sentence and generated sentences in data augmentation
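As a sketch of how SR and RI could be implemented, the following code draws synonyms from WordNet through NLTK; the synonym source and the helper names are our assumptions, since the paper only defines the two operations.

```python
import random
from nltk.corpus import wordnet   # requires nltk.download('wordnet')

def synonyms(word):
    """Collect WordNet synonyms of a word, excluding the word itself."""
    syns = {lemma.name().replace('_', ' ')
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()}
    syns.discard(word)
    return sorted(syns)

def synonym_replacement(tokens):
    """SR: replace one randomly chosen word with a random synonym."""
    candidates = [i for i, t in enumerate(tokens) if synonyms(t)]
    if not candidates:
        return tokens[:]
    i = random.choice(candidates)
    variant = tokens[:]
    variant[i] = random.choice(synonyms(tokens[i]))
    return variant

def random_insertion(tokens):
    """RI: insert a random synonym of a random word at a random position."""
    candidates = [t for t in tokens if synonyms(t)]
    if not candidates:
        return tokens[:]
    variant = tokens[:]
    variant.insert(random.randint(0, len(variant)),
                   random.choice(synonyms(random.choice(candidates))))
    return variant
```

During training, each original sentence would then be paired with one SR variant and one RI variant, matching the two generated sentences per news item described above.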

3.3 Third-order text graph tensor

A tensor is a multi-dimensional array of numbers, also called an n-way or n-mode array [39]. Vectors and matrices are first-order and second-order tensors, respectively. \(A \in R^{n_{1} \times n_{2} \times n_{3}}\) is a third-order tensor, where the order is the number of ways or modes of the tensor.

A series of graphs can be used to depict a text in terms of different properties. We focus on three features of the text: the sequential, syntactic, and semantic features. Sequential-based, syntactic-based, and semantic-based text graphs are constructed to create a Third-order Text Graph Tensor T, denoted T = (G1,G2,G3), where Gi (i = 1,2,3) is the i-th text graph. Each Gi = (Vi,Ei,Ai,Xi), where \(V_{i}(\lvert V_{i} \rvert =n)\) is the node set of the i-th graph, Ei is its edge set, Ai is its adjacency matrix, and Xi is its feature matrix. All graph adjacency matrices are packed into the graph adjacency tensor Λ = (A1,A2,A3) ∈ \(R^{3 \times n \times n}\). All graph feature matrices are packed into a graph feature tensor Υ = (X1,X2,X3) ∈ \(R^{3 \times n \times m}\), where m is the dimension of the feature vector of each node.

All graph feature matrices Xi are uniformly initialized with GloVe. The following subsections focus on how to build Ei and Ai for each Gi.

3.3.1 Sequential-based graph

Sequential information depicts the local co-occurrence property among words and has been widely used for text representation learning. In this study, Point-wise Mutual Information (PMI) [19] is used to describe the sequential context with a sliding-window strategy. In the Sequential-based Graph Gseq, the edge weight of each pair of words is estimated by

$$ d_{sequential} (e_{i},e_{j} )=log \frac{p(e_{i},e_{j} )}{p(e_{i})p(e_{j})} $$
(1)

where p(ei,ej) is the probability that the word pair (ei,ej) co-occurs in a news record x, calculated as \( \frac {N_{co-occurrence} (e_{i},e_{j} )}{N_{windows}}\), where Nwindows is the total number of sliding windows in x and Nco-occurrence(ei,ej) is the number of sliding windows in which the word pair (ei,ej) co-occurs. \( p(e_{i})=\frac {N_{occurrence}(e_{i})}{N_{windows}}\) is the probability that the word ei occurs in x, and Noccurrence(ei) is the number of sliding windows in x that contain ei.
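As an illustration of Eq. (1), the sketch below computes the PMI edge weights of Gseq for a single news item; the helper name pmi_edges and the decision to keep only positive weights are our assumptions rather than details given in the paper.

```python
from collections import Counter
from itertools import combinations
from math import log

def pmi_edges(words, window_size=3):
    """Edge weights of the sequential graph (Eq. 1), estimated over sliding windows of one news item."""
    windows = [words[i:i + window_size]
               for i in range(max(1, len(words) - window_size + 1))]
    n_windows = len(windows)
    word_count, pair_count = Counter(), Counter()
    for window in windows:
        unique = set(window)
        word_count.update(unique)
        pair_count.update(combinations(sorted(unique), 2))

    edges = {}
    for (wi, wj), n_co in pair_count.items():
        pmi = log((n_co / n_windows) /
                  ((word_count[wi] / n_windows) * (word_count[wj] / n_windows)))
        if pmi > 0:   # common practice; the paper does not state how non-positive PMI is handled
            edges[(wi, wj)] = pmi
    return edges
```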

3.3.2 Syntactic-based graph

Syntactic analysis, also called syntax analysis or parsing, is the process of analyzing natural language according to the rules of a formal grammar. We use the word dependencies produced by Stanford CoreNLP [40] to describe the syntactic features. A word dependency represents a grammatical relation between two words in a sentence. Figure 2 shows an example of the word dependencies in one sentence. Several word dependencies can be extracted from a sentence, and each is a directed triplet consisting of the name of the relation, a governor and a dependent. In our model, we mainly focus on the word pair (governor, dependent). In the Syntactic-based Graph Gsyn, we define the weight of each such pair of words as E(governor, dependent) = 1.

Fig. 2

An example of the word dependencies in one sentence. Each word dependency is a directed triplet consisting of the name of the relation, a governor and a dependent. In our work, we mainly focus on the word pair (governor, dependent). In the Syntactic-based Graph, we define the weight of each pair of words as E(governor, dependent) = 1, for example E(tested, COVID-19) = 1, E(stay, away) = 1, and so on
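A sketch of the syntactic edge extraction is shown below. The paper uses Stanford CoreNLP [40]; here spaCy is used as a stand-in dependency parser purely for illustration, and the helper name is hypothetical.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # stand-in parser; the authors use Stanford CoreNLP

def syntactic_edges(sentence):
    """Collect (governor, dependent) pairs; every extracted pair receives weight 1."""
    edges = {}
    for token in nlp(sentence):
        if token.dep_ != "ROOT":                     # skip the artificial root relation
            edges[(token.head.text, token.text)] = 1
    return edges

# e.g. syntactic_edges("Stay away from people who tested positive for COVID-19")
```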

3.3.3 Semantic-based graph

Furthermore, in order to enrich the context of each word, we extract semantic features by capturing topic-related correlations between words. The implementation steps are as follows:

  1. We mine the latent topics τ = (t1,t2,⋯ ,tk) (k is the total number of topics) from the whole corpus using Latent Dirichlet Allocation (LDA) [41].

  2. Each topic ti can be represented by a probability distribution over words \(t_{i} = ({e^{i}_{1}},{e^{i}_{2}},{e^{i}_{3}}, \cdots )\).

  3. To further represent the topic ti, we keep the top M key words with the largest probabilities, \(t_{i} = ({e^{i}_{1}},{e^{i}_{2}}, \cdots , {e^{i}_{M}})\).

The extracted topics and their key words are then applied to each text to numerically represent the semantic relations. Given a news item x = (e1,e2,⋯ ,eh,⋯ ,en) corresponding to topic \(t_{j} = ({e^{j}_{1}},{e^{j}_{2}}, \cdots , {e^{j}_{M}})\), when constructing the Semantic-based Graph Gsem, if word \(e_{h} \in ({e^{j}_{1}},{e^{j}_{2}}, \cdots , {e^{j}_{M}})\), the weight of the word pair (ei,ej) is defined as:

$$ E_{i j}= \left\{\begin{array}{ll} 1 & \mathrm{i}=\mathrm{h},\ \mathrm{j} \in(1,2, \cdots, n)\ \text{ and }\ \mathrm{j} \neq \mathrm{i} \\ 0 & \text{else} \end{array}\right. $$
(2)

Figure 3 shows an example of how to build the semantic feature graph for a news item.

Fig. 3

The key-word distribution of topics in the Constraint dataset. We select a word (e.g. “death”) belonging to the topics in Fig. 3(a) from the sentence “Suffering COVID-19 may leads to death”. Then, we construct the relationship between “death” and the other words in this sentence. When building the semantic graph for this news item, the edge weights include E(death, Suffering) = 1, E(death, COVID-19) = 1, E(death, may) = 1, E(death, leads) = 1, E(death, to) = 1
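The sketch below illustrates the two stages: mining topics and their top-M key words from the corpus with LDA, and then wiring each key word to every other word of the news item as in Eq. (2). We use gensim for LDA and assume a news item is matched to its most relevant topic; both choices, and the helper names, are illustrative assumptions.

```python
from gensim import corpora
from gensim.models import LdaModel

def topic_keywords(token_lists, k=10, top_m=20):
    """Mine k topics from the whole corpus and keep the top-M key words of each topic."""
    dictionary = corpora.Dictionary(token_lists)
    bows = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = LdaModel(bows, num_topics=k, id2word=dictionary, passes=10)
    keywords = [[w for w, _ in lda.show_topic(t, topn=top_m)] for t in range(k)]
    return lda, dictionary, keywords

def semantic_edges(words, keywords_of_topic):
    """Eq. (2): a word that appears in the topic's key-word list is linked to every other word."""
    keyword_set = set(keywords_of_topic)
    return {(w, other): 1
            for w in words if w in keyword_set
            for other in words if other != w}
```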

3.4 Intra-graph and inter-graph joint information propagation

Third-order Text Graph Tensor T is utilized to perform Intra-graph and Inter-graph Joint Information Propagation. Formula (3) indicates the learning process:

$$ \begin{array}{@{}rcl@{}} T \xrightarrow{f_{intra}} T_{intra} \xrightarrow{f_{inter}} T_{inter} \end{array} $$
(3)

where T = (Gseq,Gsyn,Gsem), fintra and finter denote the intra-graph propagation and inter-graph propagation, respectively. The learning details of the two kinds of propagation are described next.

3.4.1 Intra-graph information propagation

Intra-graph information propagation means that nodes exchange messages with other nodes and update their representations within a graph (see Fig. 4). A Gated Graph Neural Network [42] is first used to learn the word feature vectors. Each node receives information from its neighbours and merges it into its own vector to update itself. Since a propagation layer only operates on first-order neighbours, such layers can be stacked t times to achieve high-order feature interaction, so that a node can receive information from nodes up to t hops away. The formulas of the interaction are:

$$ \begin{array}{@{}rcl@{}} a^{t} &=& {\Lambda} w^{t-1}W_{a} \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} z^{t} &=& \sigma(W_{z} a^{t}+U_{z} w^{t-1}+b_{z}) \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} r^{t} &=& \sigma(W_{r} a^{t}+U_{r} w^{t-1}+b_{r}) \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} l^{t} &=& tanh(W_{h} a^{t}+U_{h} (r^{t} \odot w^{t-1})+b_{h}) \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} w^{t} &=& l^{t}\odot z^{t}+w^{t-1}\odot (1-z^{t}) \end{array} $$
(8)

where the initial state w(0) is Υ, σ is the sigmoid function, and all W, U and b are trainable weights and biases. z and r act as the update gate and reset gate, respectively, determining to what degree the neighbour information contributes to the current node embedding.

Fig. 4

Intra-graph information propagation is first conducted in each graph to exchange homogeneous information: every word node receives information from its neighbours and merges it with its own representation to update itself, and high-order neighbour information interaction in each text graph is achieved by stacking propagation layers

After intra-graph information propagation, T = (Gseq,Gsyn,Gsem) is transformed into \(T_{intra} = ({G_{seq}^{\prime },G_{syn}^{\prime },G_{sem}^{\prime }})\).
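For concreteness, the following PyTorch sketch implements one intra-graph propagation stack following Eqs. (4)-(8). Folding W_z a + U_z w + b_z into a single linear layer over the concatenation [a, w] is an equivalent reformulation; the layer sizes and number of steps are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class IntraGraphPropagation(nn.Module):
    """Gated intra-graph propagation (Eqs. 4-8), stacked t times; a sketch, not the authors' code."""
    def __init__(self, dim, steps=2):
        super().__init__()
        self.steps = steps
        self.W_a = nn.Linear(dim, dim, bias=False)
        self.gate_z = nn.Linear(2 * dim, dim)   # W_z, U_z and b_z folded into one layer
        self.gate_r = nn.Linear(2 * dim, dim)   # W_r, U_r and b_r
        self.cand = nn.Linear(2 * dim, dim)     # W_h, U_h and b_h

    def forward(self, adj, w):
        # adj: (n, n) adjacency of one text graph; w: (n, dim) node features, initially GloVe vectors
        for _ in range(self.steps):
            a = adj @ self.W_a(w)                                    # Eq. (4): aggregate neighbours
            z = torch.sigmoid(self.gate_z(torch.cat([a, w], -1)))    # Eq. (5): update gate
            r = torch.sigmoid(self.gate_r(torch.cat([a, w], -1)))    # Eq. (6): reset gate
            l = torch.tanh(self.cand(torch.cat([a, r * w], -1)))     # Eq. (7): candidate state
            w = l * z + w * (1 - z)                                  # Eq. (8): gated update
        return w
```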

3.4.2 Inter-graph information propagation

Different graphs capture different properties of the given data. To describe the data comprehensively, it is necessary to learn across graphs. Inter-graph information propagation aggregates information between different graphs; its input is Tintra. In this way, the information from one graph can be gradually fused into the other graphs (see Fig. 5). Specifically, after intra-graph information propagation, pairs of updated graphs are combined to achieve inter-graph word interaction. The formulas of the interaction are as follows:

$$ \begin{array}{@{}rcl@{}} G_{seq}^{\prime\prime} &=& G_{seq}^{\prime} + G_{syn}^{\prime} \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} G_{syn}^{\prime\prime} &=& G_{syn}^{\prime} + G_{sem}^{\prime} \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} G_{sem}^{\prime\prime} &=& G_{sem}^{\prime} + G_{seq}^{\prime} \end{array} $$
(11)
Fig. 5

Inter-graph information propagation harmonizes heterogeneous information using the output of intra-graph propagation. By connecting the nodes across the graphs, the heterogeneous information in one graph can be gradually fused into the other graphs

After inter-graph information propagation, \(T_{intra} = ({G_{seq}^{\prime },G_{syn}^{\prime },G_{sem}^{\prime }})\) is transformed into \(T_{inter} = ({G_{seq}^{\prime \prime },G_{syn}^{\prime \prime },G_{sem}^{\prime \prime }})\).
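Reading the “+” in Eqs. (9)-(11) as element-wise addition of the node-feature matrices produced by intra-graph propagation, the inter-graph step reduces to a cyclic pairing of the three graphs, sketched below; this interpretation is ours.

```python
def inter_graph_propagation(g_seq, g_syn, g_sem):
    """Eqs. (9)-(11): each graph absorbs the node features of one other graph (cyclic pairing)."""
    return g_seq + g_syn, g_syn + g_sem, g_sem + g_seq
```

Because the pairing is cyclic, every graph ends up carrying information from a second language property, and the subsequent graph-level attention decides how much each fused graph contributes.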

3.5 Attention mechanism

We also propose the graph attention mechanism and node attention mechanism to measure the importance of text graphs and nodes. The details of the two kinds of attention mechanisms are described next.

3.5.1 Graph attention mechanism

The graph attention mechanism is introduced to learn weights that measure the importance of each graph \(G_{t} \in T_{inter}=({G_{seq}^{\prime \prime },G_{syn}^{\prime \prime },G_{sem}^{\prime \prime }})\), and the final graph Gf is computed as follows:

$$ G_{f} = \sum\limits_{t=1}^{3} \beta_{t} G_{t} $$
(12)

where βt measures the importance of the t-th graph for the final graph Gf, and βt is calculated as follows:

$$ \begin{array}{@{}rcl@{}} u_{t} &=& tanh(W_{t} G_{t} + b_{t} ) \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} \beta_{t} &=& \frac{exp(u_{t})}{{\sum}_{k=1}^{3} exp(u_{k})} \end{array} $$
(14)

where ut is the hidden representation of Gt obtained by feeding Gt into a fully connected embedding layer, and Wt and bt are trainable weights and biases.
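A sketch of the graph-level attention (Eqs. 12-14) is given below. Since ut in Eq. (13) is a matrix, we pool it to a scalar score per graph before the softmax; this pooling, the shared scoring layer, and the tensor shapes are our assumptions.

```python
import torch
import torch.nn as nn

class GraphAttention(nn.Module):
    """Graph-level attention (Eqs. 12-14): weight the three updated graphs and sum them."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # plays the role of W_t and b_t (shared across graphs here)

    def forward(self, graphs):
        # graphs: (3, n, dim) tensor stacking G''_seq, G''_syn, G''_sem
        u = torch.tanh(self.score(graphs))             # Eq. (13): hidden scores
        beta = torch.softmax(u.mean(dim=(1, 2)), 0)    # Eq. (14): one weight per graph
        return (beta.view(3, 1, 1) * graphs).sum(0)    # Eq. (12): final graph G_f
```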

3.5.2 Node attention mechanism

Once the final text graph Gf is obtained, the word nodes have been sufficiently updated and can be used to produce the final prediction. Because not all nodes play an equally important role in classification, we want the key words to contribute more explicitly. We define the node attention mechanism as:

$$ \begin{array}{@{}rcl@{}} e_{i} &=& \sigma (f_{1} (e_{i})) \odot tanh(f_{2} (e_{i} )) \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} w_{f} &=& \frac{1}{n} \sum\limits_{i=1}^{n} e_{i} + Maxpooling(e_{1},e_{2},\cdots, e_{n}) \end{array} $$
(16)

where f1 and f2 are two multilayer perceptrons: the former produces a soft attention weight and the latter a non-linear feature transformation. In addition to averaging the weighted word features, we also apply a max-pooling function when forming the word representation wf.
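The node-level attention and readout of Eqs. (15)-(16) can be sketched as follows; single linear layers stand in for the two multilayer perceptrons f1 and f2, which is a simplification of the description above.

```python
import torch
import torch.nn as nn

class NodeAttentionReadout(nn.Module):
    """Node-level attention (Eq. 15) followed by mean- plus max-pooling readout (Eq. 16)."""
    def __init__(self, dim):
        super().__init__()
        self.f1 = nn.Linear(dim, dim)   # soft attention weight
        self.f2 = nn.Linear(dim, dim)   # non-linear feature transformation

    def forward(self, g_f):
        # g_f: (n, dim) node features of the final graph G_f
        gated = torch.sigmoid(self.f1(g_f)) * torch.tanh(self.f2(g_f))    # Eq. (15)
        return gated.mean(dim=0) + gated.max(dim=0).values                # Eq. (16): w_f
```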

3.6 Final prediction

Finally, the label of the text is predicted by feeding the word representation wf into a softmax layer. We minimize the loss using the cross-entropy function:

$$ \begin{array}{@{}rcl@{}} \hat{y} &=& softmax(Ww_{f} + b) \end{array} $$
(17)
$$ \begin{array}{@{}rcl@{}} \mathcal{L} &=& -{\Sigma}_{i} y_{i} log(\hat{y_{i}}) \end{array} $$
(18)

where W and b are weights and bias, and yi is the i-th element of the one-hot label.

4 Experiments

In this section, we evaluate the performance of our model.

4.1 Datasets

We use four datasets including Liar, Constraint, Twitter15, and Twitter16 (see Table 2) to perform experiments and analysis.

Table 2 Summary statistics of the datasets

Liar [43]: A public dataset that includes over 12,000 labeled short news statements. The news in Liar is classified into six truthfulness ratings: pants-fire, false, barely-true, half-true, mostly-true, and true. In our work we focus on classifying news as real or fake, so we merge these ratings into two classes: pants-fire and false are categorized as fake, while barely-true, half-true, mostly-true, and true are categorized as true. Although the dataset also provides additional meta-data such as subject, speaker and job, only the textual content is used in our work.

Constraint [5]: The English dataset of the Constraint@AAAI2021 COVID-19 Fake News Detection shared task, provided by the organizers on the competition website. It contains fake and real news about COVID-19 in the form of social media posts from platforms such as Twitter, Facebook and Instagram. Of the 10,700 posts in the dataset, 8,560 are used for training, 1,070 for validation, and the remaining 1,070 for testing.

Twitter15 [44]: The news in this dataset is crawled from two rumor tracking websites, snopes.com and emergent.info. The dataset consists of 942 events in total, but in our work we only use the verified 370 fake and 372 true news items.

Twitter16 [45]: The news is divided into rumors and non-rumors based on www.snopes.com, an online rumor debunking service. The authors collected 778 reported events during March-December 2015, of which 64% are rumors. As with Twitter15, we only use the verified 205 fake and 207 true news items.

4.2 Baselines

To evaluate the performance of the proposed method, we compare our model with the following state-of-the-art methods:

1) Machine learning methods:

TF-IDF+SVM: Term frequency-inverse document frequency (TF-IDF) is used for feature extraction, and a Support Vector Machine (SVM) serves as the classifier.

2) Deep learning Methods:

TextCNN [24]: Applies Convolutional Neural Networks to natural language processing for text classification.

Bi-LSTM: Bi-directional Long Short-Term Memory, a variant of Recurrent Neural Networks that handles long-term dependencies well.

3) Pre-trained Methods:

Albert [33]: A variant of BERT that incorporates factorized embedding parameterization and cross-layer parameter sharing to reduce parameters.

Roberta [46]: A Robustly Optimized BERT Pretraining approach (RoBERTa), a variant of BERT that follows a ‘pre-training’ + ‘post-training’ + ‘fine-tuning’ three-stage paradigm.

4) Topic Methods:

TMN [47]: Topic Memory Networks for text classification, which extract topics and topic words with the support of a Neural Topic Model and a topic memory mechanism.

5) Graph-based Methods:

HGAT [14]: Based on dual-level attention, the Heterogeneous Graph ATtention network (HGAT) embeds the heterogeneous information network for text classification.

TextGCN [17]: A Graph Convolutional Network for text classification that models the whole corpus as a heterogeneous graph and learns word and document embeddings.

HyperGAT [27]: Hypergraph Attention Networks for inductive text classification, in which sequential and semantic hyperedges are used to build the hypergraphs.

TextING [18]: Graph Neural Networks for inductive text classification, where a graph is built for each document and word interactions are performed within it.

4.3 Experimental settings

Since the above methods have different structures, we tune the hyperparameters according to performance on the validation set. Table 3 shows the hyperparameters of IIJIPN. We also show the configuration differences between IIJIPN and the baseline models in Table 4.

Table 3 The hyperparameters of IIJIPN
Table 4 Differences of the configuration of IIJIPN and baseline models

4.4 Evaluation indicators

In this work, Accuracy (Acc), Precision (P), Recall (R) and F1-score (F1) are used to evaluate the performance of the model. Since Liar is an imbalanced dataset, we also use Balanced Accuracy (BAcc) to measure the performance of the models on Liar.
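These metrics can be computed with scikit-learn as sketched below; the choice of binary averaging for P, R and F1 is our assumption, as the paper does not specify the averaging scheme.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    """Acc, BAcc, P, R and F1 as used in Sec. 4.4; BAcc averages per-class recall,
    so it is more informative than Acc on the imbalanced Liar dataset."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
    return {'Acc': accuracy_score(y_true, y_pred),
            'BAcc': balanced_accuracy_score(y_true, y_pred),
            'P': p, 'R': r, 'F1': f1}
```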

4.5 Results

We conduct comprehensive experiments to evaluate performance on fake news classification. Overall, our model achieves state-of-the-art results, which demonstrates its capability for detecting fake news.

Experimental results on the Liar dataset

As shown in Table 5, IIJIPN achieves better performance than the baseline models. It can also be seen from Table 5 that IIJIPN and other graph-based models, such as TextING, achieve better results than non-graph models, such as TextCNN. For graph-based models, a single graph with nodes and edges is first built for each news item. Each node represents a different word carrying different information, and each node can capture information from the whole graph instead of only from its neighbours. Information propagation is conducted in each graph, realizing long-distance and discontinuous word interaction, capturing the associations between words, and transferring the text semantics to each node to maintain the continuity of context information. However, graph-based methods such as TextING and TextGCN only consider information propagation within the graph. In addition to propagation within the graph, we also consider propagation between graphs, which realizes the interaction of heterogeneous information. The experimental results show that considering both intra-graph and inter-graph information propagation yields better results than considering intra-graph propagation alone.

Table 5 The performance on the Liar dataset

Experimental results on the Constraint dataset

To verify that the proposed model does not perform well only on Liar, we also conduct a second experiment on the Constraint dataset. Table 6 shows the performance on Constraint. Although IIJIPN does not achieve the best performance, it still performs better than most models. Albert and Roberta obtain better performance: pre-trained knowledge matters for fake news detection, especially for Constraint (composed of COVID-19 events) and Liar (composed of political news), because pre-trained language models are trained on large-scale corpora and have sufficient semantic understanding of domain-specific knowledge. In our model, both pre-trained GloVe word embeddings and random initialization are used to represent words, which not only fully represents familiar vocabulary but also handles new words without pre-trained tokens.

Table 6 The performance on Constraint Dataset

Experimental results on the Twitter15 dataset

To verify the performance of our model on datasets that are not tied to a particular event, we further conduct experiments on the Twitter15 and Twitter16 datasets. The experimental results on Twitter15 are shown in Table 7: IIJIPN obtains the highest Acc. TMN also achieves a good Acc, exceeding some graph-based models such as TextING and HGAT. Both TMN and our model extract semantic information when obtaining text features, but our model additionally extracts sequential and syntactic information. The experimental results show that our model outperforms TMN, indicating that better performance cannot be achieved by considering semantic information alone. Different methods describe text documents in terms of different language properties: TextGCN and TextING describe sequential context information, and HyperGAT describes both sequential and semantic context information, whereas our model describes sequential, semantic, and syntactic context information together. Short news consists of only a few words, so considering a single kind of information is insufficient for fake news classification.

Table 7 The performance on Twitter15 dataset

Experimental results on the Twitter16 dataset

Table 8 shows the performance on the Twitter16 dataset; IIJIPN achieves the highest value on all evaluation indicators. Among the deep learning methods, TextCNN performs well on fake news detection, with a Precision of 0.9375, yet it does not achieve the best overall results. The potential reason is that TextCNN fails to address the loss of global features caused by the pooling layer, and long-term dependencies cannot be learned given the limited convolution kernel size. Stacking convolution layers multiple times can enlarge the receptive field and increase feature learning ability, but excessive convolution layers lead to a sharp increase in computation. To address the loss of global features, two measures are adopted in this work. Firstly, for each graph carrying a single kind of information, nodes receive information from their neighbours and merge it into their own representations through the propagation layer; as the propagation layer only works on first-order neighbours, it is stacked several times to achieve high-order feature interaction. Secondly, to describe the data comprehensively, we learn across different graphs to aggregate information between them.

Table 8 The performance on Twitter16 dataset

Twitter15 and Twitter16 suffer from a shortage of news: Twitter15 has 742 news items and Twitter16 only 412. These corpora are too small to effectively train deep learning models, which cannot achieve good performance without sufficient training data. Unlike deep learning models, SVM is a powerful supervised algorithm that works well on smaller datasets. We use a simple text classification pipeline consisting of stop word removal, TF-IDF feature extraction, and an SVM classifier. In Tables 7 and 8, this SVM achieves excellent performance, since particular words in the datasets allow the SVM to identify fake news.

4.6 Ablation analysis

Ablation analysis is conducted to investigate the contribution of components in our model.

4.6.1 Data augmentation analysis

We first analyze the effectiveness of data augmentation. The results with and without data augmentation are presented in Table 9. Data augmentation clearly helps classification on Liar, Twitter15, and Twitter16, especially on Liar.

Table 9 The results of the data augmentation analysis. IIJIPN(-Aug) denotes IIJIPN without data augmentation

Data augmentation for Liar

The Liar dataset is imbalanced, and BAcc is an indicator of model performance on imbalanced data. For the baseline models, the obtained BAcc values lie within [0.50, 0.57]. The value for IIJIPN is 0.8587, which is larger than that of the baseline models, and the BAcc of IIJIPN is close to its Acc.

Because there are more true news items than fake news items in Liar, IIJIPN can easily learn the features of true news but struggles to learn the features of fake news. Figure 6 shows that many news items are predicted to be true. Data augmentation mitigates the data imbalance in Liar: after augmentation, the number of fake news items is roughly the same as that of true news items, and IIJIPN can sufficiently learn the characteristics of both. In Fig. 7, the number of news items predicted with the correct label increases significantly, especially for fake news.

Fig. 6

The confusion matrix of the Liar results generated by IIJIPN(-Aug). In the figure, 0 represents the Fake label and 1 the True label on the horizontal and vertical axes. A darker color represents a higher value

Fig. 7

The confusion matrix of the Liar results generated by IIJIPN

Data augmentation for Twitter15 and Twitter16

For Twitter15 and Twitter16, the shortage of training news is the main bottleneck leading to the poor results of IIJIPN(-Aug). After applying data augmentation to Twitter15 and Twitter16, the total number of news items increases, and IIJIPN can learn sufficient text features from the enlarged corpus generated by data augmentation.

Data augmentation for Constraint

Data augmentation is also applied to the Constraint dataset. Table 9 shows that the difference in Acc between IIJIPN(-Aug) and IIJIPN is only 0.0008. The improvement from data augmentation is not obvious because the Constraint dataset, with 10,700 posts, is already balanced and relatively large. Taken together with the results on the other datasets, it can be concluded that data augmentation effectively addresses data imbalance and the shortage of training data.

4.6.2 Analysis of third-order text graph tensor

We also analyze the effectiveness of the Third-order Text Graph Tensor with sequential, syntactic and semantic information. We generate three variant models, each of which omits one text feature.

Sequential information

As shown in Fig. 8, when sequential information is not considered in IIJIPN, the Acc decreases to different degrees. Among all datasets, the impact on Twitter15 and Twitter16 is greater than on Liar and Constraint. Twitter15 and Twitter16 are small corpora, so word co-occurrence frequencies in these two datasets are low. Word co-occurrence indicates that words share associated semantics, which can be used to capture context information; therefore, when sequential information is removed, little context information remains for learning the word representations. Liar and Constraint are large corpora, so even without sequential information, the word representations can still be learned from syntactic and semantic information.

Fig. 8

Analysis of the effectiveness of sequential, syntactic, and semantic features; three variants are produced. For example, IIJIPN(-Seq) means IIJIPN considering only the syntactic and semantic features rather than the sequential feature

Syntactic information

As shown in Fig. 8, when syntactic information is not considered, the Acc on Liar, Twitter15 and Twitter16 decreases greatly, whereas on Constraint it does not. For Liar and Twitter16, the variant without syntactic information achieves the worst result among the three variants, showing that syntactic information has a great impact on IIJIPN. Syntactic information is obtained by extracting the grammatical dependencies between words. Both sequential and syntactic information describe relationships between words; the difference is that sequential information captures the relationship between a word and its neighbors within a limited range (a sliding window), whereas syntactic information can capture the grammatical dependency between two words that are far apart in the text.

Semantic information

As shown in Fig. 8, when semantic information is not considered, the Acc on Twitter15 is greatly affected, whereas the other three datasets are not. When acquiring semantic information, topics and topic-related words are extracted from the datasets, so each word has a different relevance to each topic. When distinguishing true news from fake news, a certain kind of news is usually related to specific topics and words; therefore, the semantic information of a sentence can be obtained by integrating its words with the topic distribution.

As shown in Fig. 8, IIJIPN always performs better than the other three variants. Considering only two kinds of text features weakens performance on fake news detection, which illustrates that the sequential, syntactic and semantic features are complementary for fake news detection. Moreover, different kinds of text information affect the performance on different datasets to different degrees. For example, the Acc on the Liar and Twitter16 datasets is worse without syntactic information, meaning that syntactic dependency plays an important role in these two datasets. Similarly, not considering sequential information weakens the classification performance on the Constraint dataset, and the Acc on Twitter15 is worse without semantic information. It can be concluded that each kind of text information influences the classification performance, and the best results are obtained when all features are considered.

4.6.3 Analysis of Intra-graph and Inter-graph information propagation

Finally, we examine the effectiveness of intra-graph and inter-graph information propagation. We develop two variants of IIJIPN, IIJIPN(-intra) and IIJIPN(-inter), which are IIJIPN without intra-graph information propagation and without inter-graph information propagation, respectively.

Intra-graph information propagation

The comparison results in Fig. 9 show that IIJIPN has the best performance, and IIJIPN(-inter) always performs better than IIJIPN(-intra). The worse results of IIJIPN(-intra) can be attributed to the fact that words cannot exchange information with their neighbours within a graph. This lack of information interaction makes it difficult to transfer the text information of each word to the others, resulting in discontinuous information across the entire text.

Fig. 9

Analysis of the effectiveness of the information propagation methods in IIJIPN. IIJIPN(-Intra) means IIJIPN with only inter-graph propagation rather than intra-graph propagation

Inter-graph information propagation

From Fig. 9, it can be seen that IIJIPN(-inter) achieves better performance than IIJIPN(-intra), but worse than IIJIPN. Without inter-graph information propagation, IIJIPN cannot exchange heterogeneous information, which harms the updating of the word representations. When both intra-graph and inter-graph information propagation are considered simultaneously, the homogeneous and heterogeneous information can be sufficiently learned.

5 Conclusion

The spread of fake news on social media has become a serious social problem. In this work, we propose the Intra-graph and Inter-graph Joint Information Propagation Network (abbreviated as IIJIPN) with a Third-order Text Graph Tensor for fake news detection. In particular, we first apply data augmentation to the corpus to address data imbalance and the shortage of labeled data. To comprehensively describe contextual information, the Third-order Text Graph Tensor with sequential, syntactic, and semantic features is proposed. After constructing the graphs for each news item, intra-graph and inter-graph information propagation are used to realize homogeneous and heterogeneous information interaction, respectively. The experimental results demonstrate that the proposed method is a powerful tool for effectively detecting fake news.

During the experiments, our model incurred a large memory and time cost, so we will address this problem in future work. Moreover, additional information about the news, such as user profiles, comments, or propagation paths, will be incorporated into our model.