1 Introduction

Mobile social networks have made it far easier to acquire information. Inevitably, they also spread a vast amount of useless or misleading content, such as spam emails, clickbait links, and false health information, which can deceive readers into actions with harmful consequences. Table 1 gives two examples from the Webis-Clickbait-17 dataset showing how the meaning of a post can mislead readers and how the posts are categorized. Because misleading information is deliberately deceptive, positive and negative posts are hard to tell apart, which makes effective detection challenging. Developing an efficient, high-performing approach to misleading information detection is therefore particularly essential.

Table 1 Content from different websites may carry normal or misleading information

Existing work on misleading information detection falls into two categories: machine learning-based approaches and deep learning-based approaches. Machine learning-based approaches typically build document representations through hand-crafted feature engineering [10, 26, 35], and algorithms such as Labeled-LDA [35] and GBDT [2] further improve detection accuracy. Unfortunately, these approaches rely heavily on manually designed features and degrade in complex contexts. Deep learning-based approaches address these problems by extracting semantic features from content through stacks of non-linear units; convolutional neural networks [1, 17], recurrent neural networks [23], and combinations of the two [22] are commonly used frameworks. Still, these approaches capture only local semantic information and severely lack interpretability owing to their complex structures.

To address the above limitations, we propose a novel model called Topic-Aware BiLSTM (TA-BiLSTM) that incorporates corpus-level topic relatedness and enhances interpretability. Specifically, TA-BiLSTM is decomposed into two parts: a neural topic model module and a text classification module. Assuming that a multi-layer neural network can approximate the document's topic distribution, we model topics with a Wasserstein autoencoder (WAE) [37]. The neural topic model module constructs the topic distribution in latent space and reconstructs the document representation; the topic distribution is simultaneously transformed into a topic embedding that is provided to the attention mechanism. Unlike previous variational autoencoder-based approaches [29, 36], our model minimizes a Maximum Mean Discrepancy regularizer [15], grounded in Optimal Transport theory [39], to reduce the Wasserstein distance between the topic distribution and a Dirichlet prior.

Furthermore, the text classification module utilizes a two-layer bidirectional LSTM with the Topic-Aware attention mechanism to extract semantic features. This attention mechanism incorporates topic relatedness while computing the document representation, which is then fed to the classifier for misleading information detection. To balance the two learning tasks, we leverage a dynamic strategy that controls the importance of the two objectives: we first concentrate on the neural topic model, then train the classification and topic modeling objectives jointly.

The main contributions of our work are as follows:

  • We propose a novel end-to-end framework Topic-Aware BiLSTM for misleading information detection.

  • We introduce a new Topic-Aware attention mechanism to encode the document’s local semantic and global topical representation.

  • Experiments are conducted on three public datasets to verify the effectiveness of our Topic-Aware BiLSTM model in terms of topic coherence measures and classification metrics.

  • We select representative cases from different datasets for visualization, demonstrating that the Topic-Aware BiLSTM offers better interpretability than traditional approaches.

The remainder of the paper is organized as follows: Section 2 reviews the relevant work, and Section 3 introduces preliminary techniques. Section 4 presents the methodology of the Topic-Aware BiLSTM model. Experiments and result analysis are given in Section 5. Lastly, Section 6 concludes the paper.

2 Related Work

Our work relates to three lines of research: misleading information detection, topic modeling, and attention mechanisms.

2.1 Misleading Information Detection

Misleading information detection models can be divided into two streams based on implementation techniques: machine learning-based approaches and deep learning-based approaches.

Generally, machine learning-based approaches need to design specific representations of texts. For example, Liu et al. [26] employs both local and global features via Latent Dirichlet Allocation and utilizes Adaboost to detect spammers. Likewise, Chakraborty et al. [7] uses multinomial Naive Bayes classifiers on pruned features of clickbait data. Different models in this branch can also yield different detection performance. Song et al. [35] proposes labeled latent Dirichlet allocation to mine latent topics from user-generated comments and filter social spam. Biyani et al. [2] uses Gradient Boosted Decision Trees [11] to detect clickbait in news streams. Similarly, Elhadad et al. [10] detects misleading information about COVID-19 by constructing a voting mechanism. However, approaches in this branch often require sophisticated feature engineering and cannot capture deep semantic patterns.

Thanks to the rapid development of deep representation learning, approaches such as convolutional and recurrent neural networks have been applied to extract semantic representations from text directly. Agrawal [1] and Hai-Tao et al. [17] utilize a convolutional neural network to detect misleading information in clickbait. Kumar et al. [23] adopts a bidirectional LSTM with an attention mechanism to learn how each word contributes to the clickbait score. Jain et al. [22] constructs a deep learning architecture based on convolutional layers and long short-term memory layers. Nevertheless, deep learning-based approaches often have complex structures and severely lack interpretability. Thus, we integrate a neural topic model to provide corpus-level semantic information and enhance interpretability.

2.2 Topic Modeling

Given a collection of documents, each document discusses a mixture of topics. Topic modeling is an efficient technique for mining latent semantic patterns from a corpus.

Latent Dirichlet Allocation (LDA) [3] is the most widely used traditional probabilistic generative model for topic mining. Unlike traditional graphical topic models, Miao et al. [29] proposes a neural topic model, NVDM, based on variational autoencoders (VAE). Variational autoencoders use the KL divergence to measure the distance between the topic distribution and a Gaussian prior. ProdLDA [36] utilizes an approximated Dirichlet prior obtained through Laplace approximation and improves topic quality. On the other hand, Wang et al. proposes ATM [43], BAT, and Gaussian-BAT [44], which are trained in an adversarial manner. Wang et al. [42] also extends the ATM model for open event extraction. Inspired by the ATM model, Hu et al. [20] attempts to improve topic modeling with cycle-consistent adversarial training and names this approach ToMCAT. Zhou et al. [49] extends this line of work by treating documents and words as nodes in a graph. Further, autoencoders can be trained stably and reduce the dimensionality of document representations [25] to extract the most salient information [48]. Accordingly, Nan et al. [31] incorporates adversarial training into the Wasserstein autoencoder framework and proposes the W-LDA model for unsupervised topic extraction.

2.3 Attention Mechanism

The attention mechanism was originally inspired by human visual processing. When we look at a picture, our brain prioritizes the main content of the image and ignores the background and other irrelevant information.

Inspired by this mechanism of the human brain, various attention mechanisms have achieved success in natural language processing tasks such as sentiment analysis [45] and machine translation [27]. The typical attention mechanism only attends to word-level dependencies and assigns weights so that the model can highlight key elements of sentences [18]. Further, the hierarchical attention mechanism [47] applies two layers of attention, successively at the word level and the sentence level, to generate a document representation with rich semantics. Besides, Vaswani et al. [38] proposes a self-attention mechanism to handle increasingly long texts; self-attention directly calculates associations between words in a sentence. Previous work [16, 41] has shown that topic information can improve the semantic representation of text with the help of attention mechanisms. Nevertheless, to the best of our knowledge, no such work has addressed misleading information detection, which is what we explore in this paper.

3 Preliminaries

3.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is the most commonly used generative model for topic extraction. It assumes that each document can be represented by a probability distribution over topics and that each topic can be represented by a probability distribution over words. To learn better topics, LDA places a Dirichlet prior over the latent space.

LDA uses 𝜃d to denote the topic distribution of a document d and zn to represent a topic allocation of the word wn. Thus, the generative process of documents is shown in Algorithm 1.

Algorithm 1 The generative process of LDA

Here, \(Dir(\boldsymbol {\alpha }^{\boldsymbol {\prime }})\) is the Dirichlet prior distribution, \(\boldsymbol {\alpha }^{\boldsymbol {\prime }}\) is the hyper-parameter of the Dirichlet prior, and 𝜃d is the topic distribution of document d sampled from this prior. zn denotes the topic allocation of each position n in the document, and wn is a word randomly generated from the multinomial distribution. φi is the topic-word distribution of the i-th topic, and \(\varphi _{z_{n}}\) is one column of the matrix. LDA infers these parameters in an unsupervised manner. After model training, we can obtain representative high-probability words for each topic, and these words convey the semantic meaning of the topic.
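For reference, the generative process summarized in Algorithm 1 can be written compactly with the notation above (a standard formulation of LDA; the topic-word distributions φi are treated here as model parameters):

$$ \begin{aligned} &\text{for each document } d \text{ in the corpus } C_{d}\text{:}\\ &\qquad \text{draw a topic distribution } \boldsymbol{\theta}_{d}\sim Dir(\boldsymbol{\alpha}^{\boldsymbol{\prime}})\\ &\qquad \text{for each position } n \text{ in } d\text{:}\\ &\qquad\qquad \text{draw a topic } z_{n}\sim \text{Multinomial}(\boldsymbol{\theta}_{d})\\ &\qquad\qquad \text{draw a word } w_{n}\sim \text{Multinomial}(\varphi_{z_{n}}) \end{aligned} $$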

3.2 Long Short-Term Memory

Text is sequential data, and small changes in word order can alter the meaning of an entire sentence. However, traditional feedforward neural networks cannot directly capture contextual word dependencies. Researchers therefore developed sequential models such as Recurrent Neural Networks (RNN) to extract sequential and contextual features from such data [21]. An RNN comprises an input layer, a hidden layer, and an output layer. However, as sentences grow longer, training suffers from vanishing and exploding gradients. The Long Short-Term Memory (LSTM) [19] adds a cell state that stores long-term memory [13] to deal with this problem.

Assume that \(\textbf x_{j}\in \mathbb {R}^{D_{w}}\) represents the word embedding of the j-th word in the content and Dw is the dimension of the word embeddings. The LSTM feeds in the word embeddings as a sequence and computes a hidden state \(\textbf {h}_{j}\in \mathbb {R}^{D_{h}}\) for each word, where Dh is the dimension of the hidden states. The calculation proceeds as follows:

$$ \textbf f_{j}=\sigma\left( \textbf{W}_{f}\cdot [\textbf{h}_{j-1},\textbf x_{j}]+\textbf b_{f}\right) $$
(1)
$$ \textbf i_{j}=\sigma\left( \textbf{W}_{i}\cdot [\textbf{h}_{j-1},\textbf x_{j}]+\textbf b_{i}\right) $$
(2)
$$ \textbf{C}_{j}^{\prime}=\tanh\left( \textbf{W}_{C}\cdot [\textbf{h}_{j-1},\textbf x_{j}]+\textbf b_{C}\right) $$
(3)
$$ \textbf C_{j}=\textbf f_{j}\cdot \textbf C_{j-1}+\textbf i_{j}\cdot \textbf{C}_{j}^{\prime} $$
(4)
$$ \textbf o_{j}=\sigma\left( \textbf{W}_{o}\cdot [\textbf{h}_{j-1},\textbf x_{j}]+\textbf b_{o}\right) $$
(5)
$$ \textbf{h}_{j}=\textbf o_{j}\cdot \tanh(\textbf C_{j}) $$
(6)

where Wf, Wi, WC, Wo, bf, bi, bC and bo are learnable parameters, and σ(⋅) is the sigmoid function. The forget gate fj determines which information to retain from the cell state Cj− 1. The input gate ij controls the proportion of new information stored in the candidate state \(\textbf{C}_{j}^{\prime}\). Lastly, the LSTM constrains the hidden state of the current step through the output gate oj. This carefully designed gating structure enables the LSTM to learn longer dependencies and better semantic representations.
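For illustration, the gate computations in Eqs. (1)-(6) can be sketched in PyTorch as follows. This is a minimal single-step cell for exposition only; in practice the built-in nn.LSTM module is used, and the class and argument names here are our own assumptions.

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """A minimal sketch of the LSTM cell in Eqs. (1)-(6)."""

    def __init__(self, d_w: int, d_h: int):
        super().__init__()
        # One linear map per gate, acting on the concatenation [h_{j-1}, x_j]
        self.W_f = nn.Linear(d_h + d_w, d_h)   # forget gate
        self.W_i = nn.Linear(d_h + d_w, d_h)   # input gate
        self.W_C = nn.Linear(d_h + d_w, d_h)   # candidate cell state
        self.W_o = nn.Linear(d_h + d_w, d_h)   # output gate

    def forward(self, x_j, h_prev, C_prev):
        hx = torch.cat([h_prev, x_j], dim=-1)
        f_j = torch.sigmoid(self.W_f(hx))       # Eq. (1)
        i_j = torch.sigmoid(self.W_i(hx))       # Eq. (2)
        C_tilde = torch.tanh(self.W_C(hx))      # Eq. (3)
        C_j = f_j * C_prev + i_j * C_tilde      # Eq. (4)
        o_j = torch.sigmoid(self.W_o(hx))       # Eq. (5)
        h_j = o_j * torch.tanh(C_j)             # Eq. (6)
        return h_j, C_j
```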

4 Methodology

In this section, we first introduce the Topic-Aware BiLSTM (TA-BiLSTM) model. As depicted in Fig. 1, the proposed TA-BiLSTM consists of two parts: a neural topic model and a text classification model. The topic module employs a neural topic model to discover latent topics from the text corpus. The text classification module utilizes a two-layer BiLSTM network based on the Topic-Aware attention mechanism to detect misleading information from text.

Fig. 1 The overall architecture of TA-BiLSTM: (a) Neural Topic Model on the left; (b) Text Classification Model on the right. MLP and fMLP are multilayer perceptrons, vt denotes the topic embedding, and vd denotes the document's representation, which is computed through the attention weights a

4.1 Neural Topic Model

As shown in the left panel of Fig. 1, its structure is composed of an encoder and a decoder. (1) The encoder takes the V-dimensional xbow of the document as input and transforms it into a K-dimensional topic distribution 𝜃 through two fully connected layers. (2) The decoder takes the encoded topic distribution 𝜃 as input and reconstructs the document \(\hat {\textbf {x}}_{bow}\) through the reconstruction distribution xre. The topic embedding vt is collected after the first decoder layer. Besides, to ensure the quality of the extracted topics, we use the Wasserstein distance to match the latent topic space to the prior.

4.1.1 Encoder Network

For each document d = {w1,w2,...,wm} in the corpus Cd = {d1,d2,...,dn}, the encoder utilizes its bag-of-words representation xbow as input, where the weights are calculated by TF-IDF formulation:

$$ tf_{ij}=\frac{c_{ij}}{{\sum}_{k}c_{kj}},\qquad idf_{i}=\log\frac{|C_{d}|}{\left|\left\{ j:w_{i}\in d_{j}\right\}\right| +1} $$
(7)

where cij indicates the number of the word wi appearing in document dj, and \({\sum }_{k}c_{kj}\) is the total number of words in document dj. |Cd| indicates the total number of documents in the corpus, and \(\left |\left \{ j:w_{i}\in d_{j}\right \}\right |\) represents the number of documents containing word wi.

$$ x_{bow}^{(i)}=tf_{ij}\times idf_{i} $$
(8)

where \(x_{bow}^{(i)}\) is the TF-IDF weight of the i-th vocabulary word in document dj, reflecting its semantic relevance to the document.

According to Eqs. 7 and 8, each document could be represented as \(\textbf x_{bow}\in \mathbb {R}^{V}\), where V indicates the vocabulary size.
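A minimal sketch of how the TF-IDF bag-of-words vectors in Eqs. (7)-(8) could be built is shown below; the function and variable names are hypothetical, and tokenization is assumed to be done beforehand.

```python
import math
from collections import Counter

def tfidf_bow(corpus_tokens, vocab):
    """Sketch of Eqs. (7)-(8): TF-IDF bag-of-words vectors.
    corpus_tokens: list of tokenized documents; vocab: word -> index."""
    n_docs = len(corpus_tokens)
    # document frequency: number of documents containing each word
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    # idf_i = log(|C_d| / (|{j : w_i in d_j}| + 1))
    idf = {w: math.log(n_docs / (df[w] + 1)) for w in vocab}

    x_bows = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        x_bow = [0.0] * len(vocab)
        for w, c in counts.items():
            if w in vocab:
                x_bow[vocab[w]] = (c / total) * idf[w]   # tf_ij * idf_i
        x_bows.append(x_bow)
    return x_bows
```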

The encoder first maps xbow into the Ds-dimensional semantic space through the following transformation:

$$ \textbf{h}_{s}=\text{BN}(\textbf{W}_{s}\textbf x_{bow}+\textbf b_{s}) $$
(9)
$$ \textbf o_{s}=\max(\textbf{h}_{s}, leak * \textbf{h}_{s}) $$
(10)

where \(\textbf {W}_{s}\in \mathbb {R}^{D_{s}\times V}\) and \(\textbf b_{s}\in \mathbb {R}^{D_{s}}\) are the weight matrix and bias term of the fully connected layer, hs is the hidden state normalized by batch normalization BN(⋅), leak denotes the hyper-parameter of LeakyReLU activation, and os represents the output of the layer.

Subsequently, the encoder projects the output vector os into a K-dimensional document-topic distribution 𝜃e:

$$ \boldsymbol{\theta}_{e} = \text{softmax}\left( \text{BN}(\textbf{W}_{o}\textbf o_{s}+\textbf b_{o})\right) $$
(11)

where \(\textbf {W}_{o}\in \mathbb {R}^{K\times D_{s}}\) and \(\textbf b_{o}\in \mathbb {R}^{K}\) are the weight matrix and bias term of the fully connected layer, 𝜃e denotes the topic distribution corresponding to the input xbow and the k-th (k ∈{1,2,...,K}) dimension \(\theta _{e}^{(k)}\) means the proportion of k-th topic in the document.

We add noise to the document-topic distribution to draw more coherent topics. We randomly sample a noise vector 𝜃n from the Dirichlet prior and mix it with 𝜃e. The calculation is defined as:

$$ \boldsymbol{\theta}=(1-\eta)\boldsymbol{\theta}_{e}+ \eta\boldsymbol{\theta}_{n} $$
(12)

where η ∈ [0,1] denotes the mixing proportion of noise.

In this way, the encoder transforms the bag-of-words representation into a topic distribution that captures the document's semantic information in latent space.
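The encoder described by Eqs. (9)-(12) could be sketched in PyTorch roughly as follows. The module and argument names (d_s, leak, eta, and the prior concentration 0.001 taken from the experimental setup) are assumptions of this sketch, not the authors' released code.

```python
import torch
import torch.nn as nn

class TopicEncoder(nn.Module):
    """A minimal sketch of the encoder in Eqs. (9)-(12)."""

    def __init__(self, vocab_size, n_topics, d_s=256, leak=0.1, eta=0.1):
        super().__init__()
        self.fc1 = nn.Linear(vocab_size, d_s)
        self.bn1 = nn.BatchNorm1d(d_s)
        self.fc2 = nn.Linear(d_s, n_topics)
        self.bn2 = nn.BatchNorm1d(n_topics)
        self.act = nn.LeakyReLU(leak)
        self.eta = eta
        # Dirichlet prior used both for noise mixing and for prior matching
        self.prior = torch.distributions.Dirichlet(torch.full((n_topics,), 0.001))

    def forward(self, x_bow):
        o_s = self.act(self.bn1(self.fc1(x_bow)))                 # Eqs. (9)-(10)
        theta_e = torch.softmax(self.bn2(self.fc2(o_s)), dim=-1)  # Eq. (11)
        theta_n = self.prior.sample((x_bow.size(0),))             # noise from prior
        theta = (1 - self.eta) * theta_e + self.eta * theta_n     # Eq. (12)
        return theta
```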

4.1.2 Decoder Network

The decoder takes the topic distribution 𝜃 as input. Two fully connected layers then reconstruct the document's word representation \(\hat {\textbf {x}}_{bow}\). After the transformation of the first layer, vt serves as the topic embedding of the input document and is provided to the attention mechanism.

The decoder firstly transforms the topic distribution 𝜃 into the Dt-dimensional topic embedding space:

$$ \textbf{h}_{t} = \text{BN}(\textbf{W}_{t}\boldsymbol{\theta}+ \textbf b_{t}) $$
(13)
$$ \textbf v_{t} = \max(\textbf{h}_{t}, leak* \textbf{h}_{t}) $$
(14)

where \(\textbf {W}_{t}\in \mathbb {R}^{D_{t}\times K}\) and \(\textbf b_{t}\in \mathbb {R}^{D_{t}}\) are the weight matrix and bias of the fully connected layer, and ht is the hidden vector normalized by batch normalization BN(⋅). vt is activated by LeakyReLU and then used in the Topic-Aware attention mechanism.

Subsequently, the decoder transforms the hidden vector ht into V -dimensional reconstruction distribution:

$$ \textbf x_{re} = \text{softmax}\left( \text{BN}(\textbf{W}_{r}\textbf{h}_{t}+\textbf b_{r})\right) $$
(15)

where \(\textbf {W}_{r}\in \mathbb {R}^{V\times D_{t}}\) and \(\textbf b_{r}\in \mathbb {R}^{V}\) are the weight matrix and bias, and xre is the reconstruction distribution.

The decoder is an essential part of the neural topic model: after training, it can generate the words corresponding to each topic. We input one-hot vectors into the decoder to obtain the word distribution of each topic and use the 10 words with the highest probability in each topic to represent its semantic meaning. Based on the topic distribution and the semantics of the topics, interpretable word-level information can be provided for classifying documents in the detection process.
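A corresponding sketch of the decoder in Eqs. (13)-(15) is given below, again with assumed names; the trailing comment illustrates how topic words could be read off by decoding one-hot topic vectors in evaluation mode.

```python
import torch
import torch.nn as nn

class TopicDecoder(nn.Module):
    """A minimal sketch of the decoder in Eqs. (13)-(15)."""

    def __init__(self, n_topics, vocab_size, d_t, leak=0.1):
        super().__init__()
        self.fc1 = nn.Linear(n_topics, d_t)
        self.bn1 = nn.BatchNorm1d(d_t)
        self.fc2 = nn.Linear(d_t, vocab_size)
        self.bn2 = nn.BatchNorm1d(vocab_size)
        self.act = nn.LeakyReLU(leak)

    def forward(self, theta):
        h_t = self.bn1(self.fc1(theta))                          # Eq. (13)
        v_t = self.act(h_t)                                      # Eq. (14), topic embedding
        x_re = torch.softmax(self.bn2(self.fc2(h_t)), dim=-1)    # Eq. (15)
        return v_t, x_re

# After training (decoder.eval()), topic words can be read off by decoding
# one-hot topic vectors: topic_word_dist = decoder(torch.eye(n_topics))[1],
# then take the top-10 indices of each row.
```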

4.1.3 Prior Distribution Matching

Since the Dirichlet distribution is the conjugate prior of the multinomial distribution, choosing this prior has substantial advantages [40]. To match the encoded topic distribution to the Dirichlet prior, we add a regularizer to TA-BiLSTM. The training process minimizes this regularization term, based on the Maximum Mean Discrepancy (MMD) [15], to reduce the Wasserstein distance; the term measures the divergence between the topic distribution 𝜃 and random samples \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\) drawn from the prior.

Given a kernel function \(\boldsymbol {\mathrm {k}}:{\Theta }\times {\Theta }\rightarrow \mathfrak {R}\), the MMD-based regularizer is defined as:

$$ \begin{aligned} \mathcal{D}_{{\Theta}}&=\text{MMD}_{\boldsymbol{\mathrm{k}}}(Q_{{\Theta}},P_{{\Theta}})\\ &=\left\lVert {\int}_{{\Theta}}\boldsymbol{\mathrm{k}}(\boldsymbol{\theta},\cdot)dP_{{\Theta}}(\boldsymbol{\theta})- {\int}_{{\Theta}}\boldsymbol{\mathrm{k}}(\boldsymbol{\theta},\cdot)dQ_{{\Theta}}(\boldsymbol{\theta}) \right\rVert_{\mathcal{H}_{\boldsymbol{\mathrm{k}}}} \end{aligned} $$
(16)

where \({\mathscr{H}}\) is the Reproducing Kernel Hilbert Space (RKHS) of real-valued functions mapping Θ to \(\mathfrak {R}\). k(⋅,⋅) is the kernel function of this space, and k(𝜃,⋅) maps 𝜃 to features in the high-dimensional space.

As the distributions in the latent space are matched with the Dirichlet prior on the simplex, we choose the information diffusion kernel [24] as the kernel function. This kernel is sensitive to points near the simplex boundary and works better on sparse data. The detailed calculation is:

$$ \boldsymbol{\mathrm{k}}(\boldsymbol{\theta},\boldsymbol{\theta}^{\boldsymbol{\prime}})=\exp\left( -\arccos^{2}\left( {\sum}_{i=1}^{K}\sqrt{\theta^{(i)}{\theta}^{\prime(i)}} \right)\right) $$
(17)
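The information diffusion kernel of Eq. (17) could be implemented roughly as below (a sketch; the clamping constant eps is an assumption added only for numerical stability of arccos).

```python
import torch

def information_diffusion_kernel(theta, theta_prime, eps=1e-8):
    """A minimal sketch of Eq. (17) for points on the probability simplex.
    theta: (M, K), theta_prime: (N, K); returns an (M, N) kernel matrix."""
    # sum_i sqrt(theta_i * theta'_i) is the Bhattacharyya coefficient in [0, 1]
    bc = torch.sqrt(theta).matmul(torch.sqrt(theta_prime).t())
    bc = bc.clamp(-1 + eps, 1 - eps)          # keep arccos numerically stable
    return torch.exp(-torch.acos(bc) ** 2)
```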

When performing distribution matching, we employ the Dirichlet distribution with hyper-parameter \(\boldsymbol {\alpha }^{\boldsymbol {\prime }}\), so \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\) can be sampled by the following equation:

$$ p(\boldsymbol{\theta}^{\boldsymbol{\prime}} \mid \boldsymbol{\alpha}^{\boldsymbol{\prime}}) =Dir(\boldsymbol{\theta}^{\boldsymbol{\prime}}\mid \boldsymbol{\alpha}^{\boldsymbol{\prime}}) \triangleq \frac{1}{\text{B}(\boldsymbol{\alpha}^{\boldsymbol{\prime}})}{\prod}_{i=1}^{K} \left( {\theta}^{\prime(i)}\right)^{{\alpha^{\prime}}^{(i)}-1} $$
(18)

where \({\theta }^{\prime (i)}\) denotes the value of the i-th dimension of \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\), \({\alpha ^{\prime }}^{(i)}\) is the hyper-parameter of the i-th dimension of the Dirichlet distribution, \(\boldsymbol {\theta }^{\boldsymbol {\prime }}\) represents a sample drawn from the Dirichlet prior, and \(\text {B}(\boldsymbol {\alpha }^{\boldsymbol {\prime }})=\frac {{\prod }_{i=1}^{K}{\Gamma }({\alpha ^{\prime }}^{(i)})} {\Gamma \left ({\sum }_{i=1}^{K}{\alpha ^{\prime }}^{(i)}\right )}\).

Given M encoded samples and M samples drawn from the Dirichlet prior, MMD can be estimated by the following unbiased estimator:

$$ \begin{aligned} \widehat{\text{MMD}}(Q_{{\Theta}},P_{{\Theta}})& = \frac {1} {M(M - 1)} \sum\limits_{i \neq j} \boldsymbol{\mathrm{k}} (\boldsymbol{\theta}_{i},\boldsymbol{\theta}_{j}) + \frac {1} {M(M - 1)} \sum\limits_{i \neq j} \boldsymbol{\mathrm{k}}(\boldsymbol{\theta}_{i}^{\boldsymbol{\prime}},\boldsymbol{\theta}_{j}^{\boldsymbol{\prime}})\\ &\quad - \frac{2} {M^{2}} \sum\limits_{i,j} \boldsymbol{\mathrm{k}} (\boldsymbol{\theta}_{i},\boldsymbol{\theta}_{j}^{\boldsymbol{\prime}}) \end{aligned} $$
(19)

where \(\{ \boldsymbol {\theta }_{1},\boldsymbol {\theta }_{2},...,\boldsymbol {\theta }_{M}\}\sim Q_{{\Theta }}\) are the samples collected from the encoder, and QΘ is the encoded distribution of samples. \(\{ \boldsymbol {\theta }_{1}^{\boldsymbol {\prime }},\boldsymbol {\theta }_{2}^{\boldsymbol {\prime }},...,\boldsymbol {\theta }_{M}^{\boldsymbol {\prime }}\}\sim P_{{\Theta }}\) are sampled from the prior distribution PΘ.
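Using the kernel sketched above, the unbiased MMD estimate of Eq. (19) could be computed as follows (function and argument names are assumptions).

```python
def mmd_unbiased(theta_q, theta_p, kernel=information_diffusion_kernel):
    """A minimal sketch of Eq. (19): unbiased MMD estimate between M encoded
    samples theta_q ~ Q and M prior samples theta_p ~ P, both of shape (M, K)."""
    m = theta_q.size(0)
    k_qq = kernel(theta_q, theta_q)
    k_pp = kernel(theta_p, theta_p)
    k_qp = kernel(theta_q, theta_p)
    # drop diagonal terms for the two within-sample sums (i != j)
    sum_qq = (k_qq.sum() - k_qq.diag().sum()) / (m * (m - 1))
    sum_pp = (k_pp.sum() - k_pp.diag().sum()) / (m * (m - 1))
    return sum_qq + sum_pp - 2.0 * k_qp.sum() / (m * m)
```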

4.2 Text Classification Model

In this subsection, we introduce the text classification module. As depicted in the right panel of Fig. 1, we utilize a two-layer BiLSTM based on the Topic-Aware attention mechanism. Because misleading information appears in complex contexts, this mechanism incorporates corpus-level topic features to obtain a richer semantic representation. Then, we use a classifier with two fully connected layers to detect misleading information.

4.2.1 BiLSTM

The bag-of-words representation is sparse, and a typical solution to this sparsity problem is computational intelligence [46] such as word embedding technology. Word2vec [30] and GloVe [32] treat words as the smallest training unit, while fastText [4] splits words into n-gram subwords to construct vectors.

Considering that misleading information contains many out-of-vocabulary words, we use an embedding layer initialized with pre-trained fastText vectors. Suppose the word sequence of a document is d = {w1,w2,...,wm}, where wi represents the i-th word in the content. After transforming each word into a one-hot vector, the embedding layer maps words to their corresponding vectors \(\textbf x_{embed} \in \mathbb {R}^{D_{w}}\), where Dw is the dimension of the embedding space.

Then, we utilize a two-layer BiLSTM to extract semantic features, where each layer contains bidirectional LSTM units. This bidirectional structure yields a contextual semantic representation of the misleading information. The network takes the xembed vectors in document order as input and computes each word's hidden state. Simplifying the LSTM unit as LSTM(⋅), the hidden state \(\textbf {h}^{\prime }\) of each word is calculated by:

$$ \textbf{h}_{f1} = \overrightarrow{\text{LSTM}}(\textbf x_{embed}),\quad \textbf{h}_{b1} = \overleftarrow{\text{LSTM}}(\textbf x_{embed}) $$
(20)
$$ \textbf{h}_{fb}=[\textbf{h}_{f1},\textbf{h}_{b1}] $$
(21)
$$ \textbf{h}_{f2} = \overrightarrow{\text{LSTM}}(\textbf{h}_{fb}),\quad \textbf{h}_{b2} = \overleftarrow{\text{LSTM}}(\textbf{h}_{fb}) $$
(22)
$$ \textbf{h}^{\boldsymbol{\prime}} = \text{BN}\left( [\textbf{h}_{f2},\textbf{h}_{b2},\textbf x_{embed}]\right) $$
(23)

where \(\textbf {h}_{f1},\textbf {h}_{f2} \in \mathbb {R}^{D_{h}}\) are vectors calculated by the forward LSTM, and \(\textbf {h}_{b1},\textbf {h}_{b2} \in \mathbb {R}^{D_{h}}\) are vectors calculated by the backward LSTM. \(\textbf {h}^{\boldsymbol {\prime }} \in \mathbb {R}^{2\times D_{h}+D_{w}}\) is the hidden state that combines the word embedding and the bidirectional LSTM.
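The two-layer BiLSTM of Eqs. (20)-(23) could be sketched in PyTorch as follows. The layer sizes and dropout value follow the experimental setup, but the module itself is an illustrative assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoLayerBiLSTM(nn.Module):
    """A minimal sketch of Eqs. (20)-(23)."""

    def __init__(self, d_w=300, d_h=50, dropout=0.3):
        super().__init__()
        self.lstm1 = nn.LSTM(d_w, d_h, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(2 * d_h, d_h, bidirectional=True, batch_first=True)
        self.bn = nn.BatchNorm1d(2 * d_h + d_w)
        self.drop = nn.Dropout(dropout)

    def forward(self, x_embed):                      # x_embed: (batch, L, D_w)
        h_fb, _ = self.lstm1(self.drop(x_embed))     # Eqs. (20)-(21): [h_f1, h_b1]
        h_2, _ = self.lstm2(self.drop(h_fb))         # Eq. (22): [h_f2, h_b2]
        h = torch.cat([h_2, x_embed], dim=-1)        # concatenate with word embeddings
        # BatchNorm1d expects (batch, channels, L)
        h_prime = self.bn(h.transpose(1, 2)).transpose(1, 2)   # Eq. (23)
        return h_prime                               # (batch, L, 2*D_h + D_w)
```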

4.2.2 Topic-Aware Attention Mechanism

Generally, the attention mechanism resembles how humans read a sentence: it evaluates how important each word is by assigning a weight to each part [50]; the higher the value, the more important the word. In a typical attention-based model, the alignment score of each word is calculated as:

$$ f(\textbf{h}^{\boldsymbol{\prime}})=\textbf{q}^{T}\tanh(\textbf{W}_{q}\textbf{h}^{\boldsymbol{\prime}}+\textbf b_{q}) $$
(24)

where \(\textbf {q}\in \mathbb {R}^{D_{h}}\), Wq and bq are learnable parameters.

However, typical attention mechanisms cannot utilize external information, so we design the Topic-Aware attention mechanism to incorporate topic features while calculating the misleading information representation. In this way, the neural topic module and the text classification module are integrated so that the entire model can be trained end-to-end.

The attention weights a for each word are calculated from the similarity between the topic embedding vt and the hidden states \(H = \{\textbf {h}_{1}^{\boldsymbol {\prime }}, \textbf {h}_{2}^{\boldsymbol {\prime }}, ..., \textbf {h}_{L}^{\boldsymbol {\prime }}\}\) of the last BiLSTM layer, where L represents the maximum sentence length in the batch.

Specifically, TA-BiLSTM computes the attention weight ai based on the alignment score between the hidden state \(\textbf {h}_{i}^{\boldsymbol {\prime }}\) and the topic embedding vt, where i ∈ {1,2,...,L}. We set Dt = Dh and use the following equation to calculate the alignment score:

$$ f(\textbf{h}_{i}^{\boldsymbol{\prime}},\textbf v_{t})=\textbf{v}_{t}^{\textbf{T}}\tanh(\textbf{W}_{a}\textbf{h}_{i}^{\boldsymbol{\prime}}+\textbf b_{a}) $$
(25)

where \(\textbf {W}_{a}\in \mathbb {R}^{D_{h}\times D_{h}}\) and \(\textbf b_{a}\in \mathbb {R}^{D_{h}}\) are learnable parameters. The larger the value of \(f(\textbf {h}^{\boldsymbol {\prime }},\textbf v_{t})\), the greater the probability that the corresponding word signals misleading information. The document representation is then summarized from the alignment scores above:

$$ a^{(i)}=\frac{\exp\left( f(\textbf{h}_{i}^{\boldsymbol{\prime}},\textbf v_{t})\right)}{{\sum}_{j=1}^{L}\exp\left( f(\textbf{h}_{j}^{\boldsymbol{\prime}},\textbf v_{t})\right)} $$
(26)
$$ \textbf v_{d}=\sum\limits^{L}_{i=1}a^{(i)}\textbf{h}_{i}^{\boldsymbol{\prime}} $$
(27)

where a(i) is the weight of the hidden state \(\textbf {h}_{i}^{\boldsymbol {\prime }}\) of the i-th word, and \(\textbf v_{d}\in \mathbb {R}^{D_{h}}\) contains both semantics of hidden states and topic information embedded by the neural topic model.
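A minimal sketch of the Topic-Aware attention in Eqs. (25)-(27) is given below; d_in denotes the dimension of the hidden state h′ and d_t the topic-embedding dimension, both assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TopicAwareAttention(nn.Module):
    """A minimal sketch of Eqs. (25)-(27)."""

    def __init__(self, d_in, d_t):
        super().__init__()
        self.W_a = nn.Linear(d_in, d_t)

    def forward(self, h_prime, v_t):
        # h_prime: (batch, L, d_in), v_t: (batch, d_t)
        scores = torch.einsum("bld,bd->bl",
                              torch.tanh(self.W_a(h_prime)), v_t)   # Eq. (25)
        a = torch.softmax(scores, dim=-1)                           # Eq. (26)
        v_d = torch.einsum("bl,bld->bd", a, h_prime)                # Eq. (27)
        return v_d, a
```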

4.2.3 Classifier

In this paper, text that contains misleading information is taken as a positive example. We apply two fully connected layers and a sigmoid activation function to convert the document representation vd into a classification probability: the higher the output value, the more likely the document contains misleading information. The prediction process is defined as:

$$ \textbf{h}_{d}=\text{BN}(\textbf{W}_{d}\textbf v_{d}+\textbf b_{d}) $$
(28)
$$ \textbf o_{d}=\max(\textbf{h}_{d}, leak* \textbf{h}_{d}) $$
(29)
$$ \hat{y} = \sigma(\textbf{W}_{c}\textbf o_{d}+b_{c}) $$
(30)

where \(\textbf {W}_{d}\in \mathbb {R}^{D_{m}\times D_{h}}\), \(\textbf b_{d}\in \mathbb {R}^{D_{m}}\), \(\textbf {W}_{c}\in \mathbb {R}^{D_{m}}\) and \(b_{c}\in \mathbb {R}\) are learnable parameters, and \(\hat {y}\) is the predicted probability.
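The classifier in Eqs. (28)-(30) could be sketched as follows (class and argument names are assumptions; D_m = 64 follows the experimental setup).

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """A minimal sketch of Eqs. (28)-(30)."""

    def __init__(self, d_h, d_m=64, leak=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_h, d_m)
        self.bn = nn.BatchNorm1d(d_m)
        self.act = nn.LeakyReLU(leak)
        self.fc2 = nn.Linear(d_m, 1)

    def forward(self, v_d):
        o_d = self.act(self.bn(self.fc1(v_d)))             # Eqs. (28)-(29)
        return torch.sigmoid(self.fc2(o_d)).squeeze(-1)    # Eq. (30), predicted probability
```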

4.3 Training Objective

In a multi-task learning framework, models are optimized for multiple objectives jointly. Our proposed framework has two main training objectives: the neural topic modeling objective and the misleading information detection objective.

For neural topic modeling, the objective includes a reconstruction term and the MMD-based regularization term. It is defined as follows:

$$ \begin{aligned} \mathcal{L}_{t}&=\mu \cdot\mathbb{E}_{P_{\textbf x_{bow}}}\mathbb{E}_{Q(\theta\mid \textbf x_{bow})}\text{c}(\textbf x_{bow}, \textbf x_{re})+\mathcal{D}_{{\Theta}} \\ &=-\mu \cdot \sum\limits_{i=1}^{V}x_{bow}^{(i)}\log x_{re}^{(i)}+\widehat{\text{MMD}}(Q_{{\Theta}},P_{{\Theta}}) \end{aligned} $$
(31)

where c(xbow,xre) is the reconstruction loss, \(x_{bow}^{(i)}\) denotes the weight of the i-th word in the vocabulary, and \(x_{re}^{(i)}\) denotes the probability of the i-th word in the reconstruction distribution. In our implementation, following W-LDA, we multiply the reconstruction term by a scaling factor \(\mu =1/(l\log V)\) to balance the two terms, where l is the average sentence length in each batch and V is the vocabulary size.
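Putting the pieces together, the topic modeling objective of Eq. (31) could be computed roughly as below, reusing the mmd_unbiased and kernel sketches from Section 4.1.3; the argument names are assumptions.

```python
import math
import torch

def topic_loss(x_bow, x_re, theta, theta_prior, vocab_size, avg_len,
               kernel=information_diffusion_kernel):
    """A minimal sketch of Eq. (31): scaled reconstruction loss plus MMD regularizer."""
    mu = 1.0 / (avg_len * math.log(vocab_size))                    # scaling factor
    recon = -(x_bow * torch.log(x_re + 1e-10)).sum(dim=-1).mean()  # cross-entropy term
    return mu * recon + mmd_unbiased(theta, theta_prior, kernel)
```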

For classification objective, we measure the binary cross-entropy between the target label and the predicted output:

$$ \mathcal{L}_{c}=-\frac{1}{N}\sum\limits_{i=1}^{N}\left( y_{i}\log\hat{y_{i}}+(1-y_{i})\log(1-\hat{y_{i}})\right) $$
(32)

where yi is the ground truth, \(\hat {y_{i}}\) is the predicted probability of the i-th document, and N is the total number of documents in the corpus.

To balance the two task-specific objectives, we adopt a dynamic strategy to control their weights. The neural topic model is the main concern in the early stage, after which we pay more attention to the classification objective. Thus, the total training objective is formed as:

$$ \mathcal{L}_{total}=\lambda\cdot\mathcal{L}_{c}+\mathcal{L}_{t} $$
(33)

where λ is a hyper-parameter that dynamically balances the two objectives.

We set λ to a small value in the early stage, allowing the framework to train the neural topic model preferentially. Later, we change λ to 1, shifting the focus to multi-task learning and training the classifier and the neural topic model jointly.
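The dynamic weighting of Eq. (33) could then be sketched as follows, with the schedule values (1e-8, switching to 1 for the last 20 epochs) taken from the experimental setup in Section 5.1.2; the function and argument names are hypothetical.

```python
def total_loss(l_c, l_t, epoch, total_epochs, warmup_lambda=1e-8):
    """A minimal sketch of Eq. (33) with the dynamic lambda schedule described above."""
    lam = warmup_lambda if epoch < total_epochs - 20 else 1.0
    return lam * l_c + l_t   # L_total = lambda * L_c + L_t
```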

5 Experiments and Results Analysis

5.1 Experimental Setup

5.1.1 Datasets

We conduct experiments on three public datasets about misleading information to evaluate the effectiveness of the proposed TA-BiLSTM model.

Enron Spam

[28] is a public English spam dataset compiled in 2006. Ham emails are collected from the mailboxes of six Enron Corporation employees. Spam messages are obtained from four sources: the SpamAssassin corpus, the Honeypot project, the spam collection of Bruce Guenter, and spam collected by third parties. These emails were sent and received between 2001 and 2005. The dataset consists of six sub-datasets, which we combine into a single dataset for the experiments.

2007 TREC Public Spam

[9]. The Text Retrieval Conference (TREC) is a series of workshops focusing on problems and challenges in information retrieval research. The 2007 TREC conference held a spam filtering competition and published this dataset. The dataset includes complete email information such as sending and receiving addresses, time, and HTML code. In the experiments, we retain the content of the main body and ignore the other information.

Webis-Clickbait-17

[33] contains a total of 19,538 Twitter posts with links from 27 major news publishers in the United States. These posts were published between November 2016 and June 2017. Five annotators from Amazon Mechanical Turk marked whether articles in these links were misleading information. We use the content of articles linked in the post for detection.

After removing noisy data such as blank and duplicate documents from the three datasets, the statistics of the preprocessed datasets are listed in Table 2. We use 2/3 of the data as the training set and 1/3 as the test set.

Table 2 Statistics of three preprocessed datasets

5.1.2 Model Configuration

In the experiments, the spelling of words in all datasets is checked with the package enchant. Each word is lemmatized to its base form, with no inflectional suffixes, using the en_core_web_lg model of the package spacy. We utilize the package gensim to obtain the word embedding matrix and initialize the embedding layer.

For the neural topic model, we set the number of topics K to 50 and the dimension Ds of the fully connected layer in the encoder to 256. The dimension Dt of the topic embedding equals the dimension Dh of the hidden state \(\textbf {h}^{\boldsymbol {\prime }}\). We make the Dirichlet prior as sparse as possible and set the Dirichlet hyper-parameter \(\boldsymbol {\alpha }^{\boldsymbol {\prime }}\) to 0.001. The proportion of noise η added to the topic distribution is set to 0.1.

For the text classification model, we apply 300-dimensional pre-trained fastText word embeddings [14], that is, Dw is set to 300. The dropout of the BiLSTM layers is 0.3, and the dimension Dm in the classifier is 64. The weight matrices in the BiLSTM are initialized with orthogonal initialization, and the parameters in the Topic-Aware attention mechanism are initialized with uniform initialization.

During model training, the hyper-parameter λ is initially set to 1e-8 and is changed to 1 for the last 20 epochs. We use the Adam optimizer with a learning rate of 1e-4 for the parameters of the neural topic model and 5e-5 for the other parameters. The batch size is 16. The CPU is an Intel Xeon (Skylake) Platinum 8163, and the operating system is Ubuntu 20.04 64-bit. All models are implemented in PyTorch and run on an NVIDIA V100 32G graphics card.

5.1.3 Baselines

We choose four machine learning models for comparison: Naive Bayes, Support Vector Machine, Decision Tree, and Random Forest.

Naive Bayes

[28] is a probabilistic model. By learning the joint probability distribution of the inputs and outputs of the training data, the model predicts the label with the largest posterior probability for a test sample.

SVM

[8] is a linear binary classification model defined in the feature space. It uses a kernel function to find a hyperplane that separates the two categories while maximizing the margin between the data and the hyperplane.

Decision Tree

[6] adopts a tree structure and classifies data through layered inference, so it has good interpretability.

Random Forest

[5] is an ensemble learning method containing multiple decision trees. Each decision tree is trained independently, and the final result is the category output by the majority of the trees.

Besides, we also compare our model with following deep learning-based baselines.

BiLSTM

uses a BiLSTM network without an attention mechanism. The hidden states of the words in the document are averaged as the classifier's input.

Attention-BiLSTM

uses a BiLSTM network based on a traditional attention mechanism; the weighted sum of the words' hidden states is fed to the classifier.

In the aspect of topic modeling, we compare our model with the following neural topic models.

LDA

Footnote 1 [3] extracts topics based on the co-occurrence information of words in the document. We use package gensim to implement this model.

NVDM

Footnote 2 [29] comprises an encoder network and a decoder network, inspired by the variational autoencoder based on Gaussian prior distribution.

W-LDA

Footnote 3 [31] is the prototype of our model, which uses Wasserstein autoencoder and Dirichlet prior distribution to mine topic information.

BAT

[44] applies bidirectional adversarial training with Dirichlet prior for neural topic modeling.

The last three neural topic models mentioned above adopt a neural network structure similar to our model.

5.1.4 Evaluation Metrics

In the experiments, we mainly evaluate the classification performance of the text classification model and the topic quality of the neural topic model.

For classification, we compare three widely used performance metrics: accuracy, precision, and F1-score. Accuracy refers to the proportion of correctly classified samples among all samples. The calculation is:

$$ Accuracy=\frac{1}{N}\sum\limits_{i=1}^{N}\mathbb{I}[(\hat y_{i}=y_{i})] $$
(34)

where N is the total number of samples, and \(\mathbb {I}(\cdot )\) is the indicator function, which equals 1 when its argument is true and 0 otherwise. In binary classification, the combinations of predicted labels and ground truths fall into four types: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). True or False indicates whether the prediction is correct, and Positive or Negative indicates whether the predicted result is a positive or a negative sample. Each of the four categories counts the number of samples that meet the corresponding condition, so the four values sum to N. Based on the above, precision and recall are defined as:

$$ Precision = \frac{TP}{TP+FP} $$
(35)
$$ Recall=\frac{TP}{TP+FN} $$
(36)

Precision is the number of correctly predicted positives divided by the number of all predicted positives, and recall is the fraction of true positive samples that are predicted to be positive, so precision and recall are a pair of trade-off measures. To consider precision and recall jointly, we also evaluate the effectiveness with the F1-score, defined below:

$$ F1=\frac{2\times Precision \times Recall}{Precision + Recall} = \frac{2 \times TP}{N+TP-TN} $$
(37)

Under the same experimental conditions, higher values of these metrics indicate better classification performance.
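For completeness, Eqs. (34)-(37) can be computed from raw predictions as in the following sketch (equivalent library routines would normally be used; the function name is hypothetical).

```python
def classification_metrics(y_true, y_pred):
    """A minimal sketch of Eqs. (34)-(37) from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n                                            # Eq. (34)
    precision = tp / (tp + fp) if tp + fp else 0.0                      # Eq. (35)
    recall = tp / (tp + fn) if tp + fn else 0.0                         # Eq. (36)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                               # Eq. (37)
    return accuracy, precision, recall, f1
```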

For topic quality, we utilize two standard topic coherence metrics, CV and CA (Footnote 4) [34]. We choose the 10 most representative words of each topic as its word set; CV measures the semantic support for each word within its set, while CA compares pairs of single words in each topic's set to evaluate the coherence between words. Together, the two metrics quantify the quality of topic modeling comprehensively.

5.2 Results and Analysis

In this section, we present the experimental results and the corresponding analysis of the proposed TA-BiLSTM model in terms of classification performance and topic quality.

5.2.1 Classification Performance

Table 3 lists the classification results on the three public datasets compared with the different baselines. We observe that the TA-BiLSTM model obtains better results in accuracy, precision, and F1-score.

Table 3 Misleading information detection performance on the three datasets

Specifically, the bag-of-words representation limits the traditional machine learning approaches. The precision of Random Forest on the Clickbait-17 dataset is higher because the model selects only confident positive samples, which minimizes the number of FP. As a result, the accuracy of Random Forest is not high, and its F1-score is lower than those of the other approaches.

Moreover, we conduct an ablation study comparing BiLSTM and Attention-BiLSTM to verify the advantage of the Topic-Aware attention mechanism. Their results are better than those of the machine learning-based approaches, indicating that richer semantic feature representations, especially context information, improve classification performance. Compared with BiLSTM, the results of Attention-BiLSTM show slight improvements, indicating that the attention mechanism assigns more weight to specific words and thus provides a more suitable document representation.

Furthermore, comparing Attention-BiLSTM with TA-BiLSTM, the accuracy of the latter increases by 0.64%, 1.12%, and 3.11%, and its F1-score by 0.63%, 0.99%, and 4.95% on the three datasets, respectively. These significant improvements show that the Topic-Aware attention mechanism incorporates topic information into the classification module, and that this topic information indeed helps TA-BiLSTM provide more suitable representations for misleading information detection.

5.2.2 Topic Quality Comparison

The calculation of the attention mechanism incorporates a supervision signal from the document, which is helpful for mining latent semantic patterns during topic modeling. Thus, we also evaluate the quality of topics in this subsection. Table 4 presents the topic coherence metrics CA and CV compared with the other topic modeling baselines on the three datasets.

Table 4 Topic coherence scores of various topic models on the three datasets, a higher value means more coherent topics

Compared with the topics extracted by W-LDA on the Enron Spam dataset, the CA of TA-BiLSTM increases by 5.81% and the CV by 11.53%. On the 2007 TREC dataset, CA is almost the same as that of W-LDA, but CV increases by 13%. We also compare with BAT: it scores slightly higher than W-LDA and LDA on Clickbait-17, but our model improves CA and CV by 2.31% and 3.06%, respectively.

Ignoring NVDM, which performs poorly, Table 5 lists the top-10 words with the highest probability for each topic on the three datasets, allowing an intuitive comparison of topic quality. Generally, compared with the other models, the topics generated by TA-BiLSTM contain fewer irrelevant words and show higher semantic coherence.

Table 5 Topic models top-10 words of five same topics on the three datasets, where italics indicate irrelevant words to the topic

The topic words of NVDM are not very coherent because it employs a Gaussian prior to mimic the Dirichlet in the topic distribution space. As the proposed TA-BiLSTM utilizes a Dirichlet prior in the topic space, it obtains more coherent topics than NVDM. Meanwhile, the supervision signal also helps TA-BiLSTM surpass LDA, W-LDA, and BAT in the topic modeling evaluation.

5.2.3 Hyper-Parameter Analysis

To further validate the robustness of TA-BiLSTM, we conduct a hyper-parameter analysis in this subsection. Concretely, we analyze three parameters: the number of topics K, the dimension of the hidden states \(\textbf {h}^{\boldsymbol {\prime }}\), and the proportion of noise η.

Firstly, the number of topics K is set to 30, 50, 80 and 100, respectively. The quantitative results on three datasets are reported in Table 6 and visualized in Fig. 2.

Table 6 Parameter analysis of the number of topics K on three datasets
Fig. 2
figure 2

Illustration of different numbers of topics K on three datasets. Each of these subfigures is constituted by four components. The first one depicts how TA-BiLSTM performance varies with different numbers of topics and others depict the comparison with baselines on three classification metrics Accuracy, Precision and F1-score

For the Enron Spam and 2007 TREC datasets, we observe that TA-BiLSTM performs fairly stably on the three metrics. For the Clickbait-17 dataset, classification performance is more sensitive to changes in K, which may be caused by the complexity of the dataset. It is worth mentioning that the optimal number of topics differs across datasets (50 on Enron Spam, 80 on 2007 TREC, and 50 on Clickbait-17). If the number is too large, the model becomes less interpretable; if it is too small, model training is negatively affected [12]. Thus, we set the number of topics K to 50 in our experiments.

Similarly, we analyze the dimension of the hidden states \(\textbf {h}^{\boldsymbol {\prime }}\), setting it to 25, 50, 75, 100, and 150. The corresponding statistics are listed in Table 7. Comparing the results, we observe that simpler models perform better on the Enron Spam and 2007 TREC datasets, while on Clickbait-17 classification performance improves as model complexity increases. This may also be caused by the complexity of the Clickbait-17 dataset, which requires a more complicated model to fit the data.

Table 7 Parameter analysis of the dimension of hidden states \(\textbf {h}^{\boldsymbol {\prime }}\) on three datasets

We further investigate the impact of different proportions of noise η on performance. In detail, we compute the classification and topic modeling metrics separately with five proportion settings [0, 0.1, 0.2, 0.3, 0.4]. The detailed comparison is shown in Table 8. We conclude that adding a proper proportion of noise to the topic distribution improves the quality of topic modeling on all datasets. However, the optimal setting for topic mining does not have the same effect on classification performance: topic coherence is better when the proportion is set to 0.1 or 0.2, while less noise helps the Topic-Aware attention mechanism preserve topic features for prediction. Hence, we set the proportion of noise to 0.1 for better overall results in the experiments.

Table 8 Parameter analysis of the proportion of noise η on three datasets

5.2.4 Case Study and Visualization

To validate that the proposed TA-BiLSTM indeed improves model interpretability, we conduct a case study and visualization in this subsection.

Figure 3a shows an advertising email for an online pharmacy from the Enron Spam dataset. As Topic 8 represents drugs, we can infer that this email discusses related topics, and indeed various drug names appear in its text content. Likewise, Fig. 3b depicts web page content from Clickbait-17 that entices people to buy cosmetics. We can also find relevant words in Topic 15 and Topic 45, such as 'carpet', 'fashion', 'beauty', and 'makeup'.

Fig. 3
figure 3

Case study of two misleading examples from the test sets of Enron Spam (subfigure (a)) and Clickbait-17 (subfigure (b)). Color shade indicates the proportion of topic distribution. A higher proportion in topic distribution will result in a darker color in the figure. Representative top-10 words for crucial topics are listed below the bar

Thus, the above two examples show that corpus-level topic relatedness can indeed improve model interpretability.

6 Conclusion

In this paper, we proposed the Topic-Aware BiLSTM (TA-BiLSTM) model, an end-to-end framework. TA-BiLSTM contains a neural topic model and a text classification model and explores corpus-level topic relatedness to enhance misleading information detection. Meanwhile, the supervision signal is incorporated into the topic modeling process to further improve topic quality. Experiments on three English misleading information datasets demonstrate the superiority of TA-BiLSTM over the baseline approaches. Additionally, we analyzed multiple hyper-parameters in detail and selected representative topic examples for visualization. Classification and topic modeling on short texts remain challenging tasks, so our future work will pay more attention to detecting misleading information in short texts on social media platforms.