Rumour detection using deep learning and filter-wrapper feature selection in benchmark twitter dataset

Microblogs have become a customary news media source in recent times. But as synthetic text or ‘readfakes’ scale up online disinformation operations, unsubstantiated pieces of information on social media platforms can cause significant havoc by misleading people. It is essential to develop models that can detect rumours and curtail their cascading effect and virality. Undeniably, quick rumour detection during the initial propagation phase is desirable for subsequent veracity and stance assessment. Linguistic features are easily available and act as important attributes during the initial propagation phase. At the same time, the choice of features is crucial for both interpretability and performance of the classifier. Motivated by the need to build a model for automatic rumour detection, this research proffers a hybrid model for rumour classification using deep learning (convolution neural network, CNN) and a filter-wrapper (information gain-ant colony optimization, IG-ACO) optimized Naïve Bayes classifier, trained and tested on the PHEME rumour dataset. The textual features learnt by the CNN are combined with the optimized feature vector generated using the filter-wrapper technique, IG-ACO. The resultant combined vector is then used to train the Naïve Bayes classifier for rumour classification at the output layer of the CNN. The proposed classifier shows improved performance over existing works.


Introduction
As social media is a fertile ground for the origin and spread of rumours, it is imperative to detect and deter rumours. A rumour is any chunk of information circulated in the public domain without adequate awareness and confirmation to support its legitimacy [13]. It spreads like wildfire and is readily believed, especially during a crisis situation. Undeniably, the economics of social media favors rumours, hate-speech, pseudo-news, alternative facts or fake news [2,23,25]. The wave of misinformation and rumour pertaining to COVID-19 on social media and other digital platforms is a testimony to this rising infodemic. Figure 1 depicts a sample rumour that emerged during the recent COVID-19 impelled India Lockdown 1.0 regarding a cut in pension disbursement of Indian citizens. The news was totally baseless and false, and the Ministry of Finance, Government of India had to bust the fake news and give a clarification.
To ensure information credibility, many social networking sites such as Facebook, WhatsApp, Twitter and Instagram employ strategies and tools dedicated to identifying rumours and improving online accountability. These follow obligatory regulations or standard guidelines and rely on a combination of artificial intelligence, user reporting, and content moderators to implement rubrics for reliable and apposite content filtering. But the strategies and code of practices are opaque to the users whereas the moderators are overwhelmed by the sheer volume of content and the ordeal that comes from sifting through vexing posts. Furthermore, the aggressive virality of rumours is an added nuisance [26]. Often, despite debunking a rumour, a re-posting of the same claim emerges. Thus, automated debunking of rumours and combating their viral spread is the need of the hour.
Typically, debunking and resolving a rumour entails four sub-tasks: detecting the rumour, tracking the rumour, classifying the stance and veracity [33]. An absolute rumour resolution framework involves integration and interaction of all these sub-tasks. Detecting rumours as a primary step may facilitate effective evaluation of the subsequent sub-tasks. Typically rumours in microblog posts are associated with a set of events and their lifecycle depends on their temporal characteristic [34]. The rumours can span over a longer period of time typifying an unrelenting and long-standing character or unfold in chorus to an occurrence of a breaking news event, which is with no history. Studies report a substantial interval between the first appearance of a rumour and its resolution [28]. The time period for determining the truth value of a breaking-news rumour may take three to eleven hours (Fig. 2).
Timely detection of the legitimacy of rumours is a strategic aspect in averting their viral spread [32]. Early-stage debunking is necessary as the intensification is at its peak at the inception of an event. Computationally intelligent models with self-learning and generalization capabilities can facilitate automatic rumour detection. The use of different content-, user- and network-based features has been reported in the literature [13,33]. Linguistic semiotic features can remarkably assist in identifying rumours during the initial propagation phase. These include the vocabulary, structure, and grammar of oral/written language, which are the fundamental extractable textual features and can contribute considerably, as breaking news rumours are mostly circulated as trending stories and hashtags. It is thus important to automatically learn new, hidden features in natural language and their correlations from the input text itself for a real-time unfolding news story or event. Correspondingly, automatic rumour detection depends on the feature set and the learning model.
Rumour detection is quintessentially a text classification task which intends to categorize an incoming social media post as rumourous or not [12]. Feature engineering is a crucial sub-task in text classification, which is essential for the conversion of data into a machine learning-ready format. The primary problem pertaining to text-based rumour detection is the low detection rate, wherein the classifiers fail to classify the misinformation accurately, which degrades the accuracy of the overall system. Deep learning models have attained state-of-the-art results on many natural language processing tasks. Previous studies also confirm that feature selection allows faster training of machines and increases the accuracy of trained models, as a few relevant features train the model better than huge numbers of irrelevant and redundant features [24]. It also helps avoid the curse of dimensionality and overfitting, thus improving the accuracy and efficiency of the classifier. Feature selection techniques are broadly classified into filter methods, wrapper methods and embedded methods [3,8]. Filter methods rely on characteristics of the data itself, considering each feature individually and assessing its importance for prediction. Wrapper techniques, on the other hand, use a 'greedy' approach to generate a feature set by evaluating candidate feature combinations against the learning algorithm. Embedded methods perform selection during model fitting. While filter methods in general are fast and model agnostic, they tend not to select the optimal feature subset. Wrapper methods, by examining model performance on all (or several) feature subsets, tend to find the best subspace for a given algorithm.
But, as they build many models, they tend to be very computationally expensive, and with an increased number of features the training time surges rapidly and the risk of overfitting increases too. This research seeks to combine the strengths of deep learning, to maximize utilization of unstructured data, and of feature selection techniques, as a solution for such computational problems. A hybrid model for rumour detection is proposed, where deep neural learning (convolution neural network) is used in convergence with a filter-wrapper (information gain-ant colony) optimized Naïve Bayes classifier for enhanced accuracy in prediction results. CNNs are best at feature extraction, with improved capabilities of representation and learning [11]. At the same time, Naïve Bayes is an efficient classifier which is very simple to build and robust to outliers and irrelevant features. The proposed deep neural architecture consists of a convolution neural net (CNN) with four archetypical layers: embedding, convolution, activation and down-sampling (pooling) layers, with a Bayesian classifier at the output layer. That is, the rumour classification in the output layer is done using a Naïve Bayes classifier. This classifier is trained using a combination of two sets of features: the features learnt by the CNN and an optimized feature vector generated using the filter-wrapper technique, information gain (IG) [21] and ant colony optimization (ACO) [7]. This combined feature vector is used to train the Naïve Bayes classifier to predict the rumour. Thus, the proposed deep neural model, CNN-IG-ACO NB, has two primary design components:
• Firstly, instead of the softmax regression which is normally used in the output layer of a CNN, we use a Naïve Bayes classifier. Naïve Bayes is a generative model, as compared to the discriminative softmax regression (logistic regression) model which is routinely used in the output layer of the CNN. Logistic regression tends to overfit if the training data is scarce. Alternatively, a Naïve Bayes classifier performs well even with less training data and has a faster training time. This abets quick learning, which is favorable in the case of rumour detection, where early prediction can curb further streaming and virality.
• Secondly, the Bayesian classifier can suffer from oversensitivity to redundant and/or irrelevant attributes, which leads to a decline in the performance of the classifier. Feature selection filters such features, leading to reduced dimensionality of the feature space. The selection of a relevant feature subset is essential for both interpretability and predictive accuracy of the classifier. This research makes use of the hybrid feature selection technique, IG filter-swarm based ACO wrapper, reported by Bhatia and Sangwan [3] for detecting and predicting anomalies for IoT-based real-time abuse. This hybrid helps to select features with maximum relevance and minimal redundancy to train the prediction model.
Hence, the proposed CNN-IG-ACO NB model utilizes CNN as an automatic feature learner and NB as a rumour classifier. The Naïve Bayes classifier is trained by combining the CNN-generated feature vector with the IG-ACO optimized feature vector. The performance of the CNN-IG-ACO NB model is validated on the PHEME rumour dataset [36]. A comparison is done against the state-of-the-art conditional random field (CRF) classifier [35].

Rumour: taxonomy and tasks
The newfound social media landscape for communication, disseminating information and voicing opinions brings to us substantial risks of fabricated information. Much of the discourse on 'online information fabrication' conflates three notions: misinformation (honest mistakes), disinformation (rumours, fake news and manipulated content) and malinformation (information leaks, harassment and hate speech), with each playing its part in contributing to the pollution of our information streams. These vary according to the truth value of the content and the intent with which the information is created, produced or distributed (Fig. 3). That is, disinformation encompasses absolute lies with no truth and is intentionally created to harm an individual, group, organization or country. Comparatively, misinformation is an honest mistake: the information is false, but it is not created with the intention of causing harm. Malinformation is grounded in reality but is misrepresented, misquoted or manipulated, with malicious intent to inflict harm on a person, organization or country.
Undeniably, these 'information disorders' [4] that affect the social web have exposed us to the relentless virtual transgressions of lies, falsehoods and hate-crimes on the Web. The ease in online account creation, posting accessibility, broad latitude and virality makes social media an ideal and seamless choice for perpetrators as they tend to hide behind fake or hacked profiles to spread gossip or misleading stories. The economics of social media too favors rumours, hate-speech, pseudo-news, alternative facts or fake news [15,31,33].
Simply put, a rumour is an assertion whose truth value is unverified. The rumour resolution process consists of four phases or subtasks as shown in Fig. 4.  The detection of rumour origin is also a subtask that tracks the original or the first user who posted that content.

Literature review
Detecting rumours is essential, keeping in mind the volume and velocity of user-generated information on social media. Social media allows information propagation regardless of the source verification status and truth value. Forwarding and sharing content, combined with the lack of validation, fuels rumours as it permits exchange and broadcasting at an unmatched level. This can be harmful when users are exposed to damaging or undesirable content. Also, most social media platforms allow users to form groups based on their shared interests; however, such virtual alignments may lead to the creation of echo-chambers in which participants' own views are amplified and reinforced. Such echo-chambers also make unconfirmed posts appear more trustworthy. When a group member receives a certain piece of information, they might think that the information is truthful because it is from their "own" people.
Automatic rumour detection in social media data, especially on Twitter and Sina Weibo, has been reported in various studies. In 2016, Zubiaga et al. [36] A comprehensive survey was given in 2018 by Zubiaga et al. [33]. The authors discussed the existing literature with respect to the various sub-tasks of rumour resolution. Various machine learning and deep learning models have been used to detect rumourous posts in microblogs. In 2018, Kumar and Sangwan [12] performed an analysis using a variety of machine learning (ML) algorithms for rumour detection. A range of features, including text-based, user-based and network-based features, have been used to train the learning models for detection and prediction of rumours [5,16,17,27,29,30]. Recently, deep learning models have also been used for rumour detection in the textual modality. RNN [18], attention-based RNN [6], a hybrid of CNN with RNN [20] and LSTM with RNN [1] have reported superlative results. Multimodal rumour detection using LSTM and RNN with attention has also been proposed by Jin et al. [10]. The sequential classifier model, CRF, was given by Zubiaga et al. [35]. This research suggests building a

The proposed CNN-IG-ACO NB model
The proposed hybrid of a deep neural model and filter-wrapper feature selection entails the following components (Fig. 5):
• CNN for automatic feature learning
• IG + ACO for optimized feature selection
• NB for rumour classification
The first component defines, initializes and trains the CNN, which is seeded using the ELMo 5.5B embeddings [22]. ELMo generates the vector for a word based on its context. It is a character-based model using character convolutions and can handle out-of-vocabulary words. This suits the vector representation of breaking news rumours, which contain words that are not seen in training. The textual features are converted into numerical data that can further be used for performing convolutions. The model uses a three-layer convolutional architecture with a total of 100 convolution filters each for window size (3,3). The dropout regularization is set to 0.5 to ensure that the model does not overfit. The ReLU activation function is used for introducing nonlinearities into the model, which generates a rectified feature map. Max-pooling is used as the down-sampling strategy. The output layer in our model has a Naïve Bayes classifier. This classifier takes a concatenated feature set obtained by combining the learned vector representations from the CNN (the output of the top hidden layer) and a set of optimal features generated by applying IG + ACO on the training data set. Finally, the NB classifier categorizes the post as rumour or non-rumour.
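The concatenation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `concat_features`, the toy vectors and the list of kept indices are all hypothetical, and a real pipeline would take the CNN vector from the top hidden layer and the indices from the IG + ACO selection.

```python
def concat_features(cnn_features, tfidf_features, selected_indices):
    """Join CNN-learned features with the TF-IDF features kept by feature selection."""
    selected = [tfidf_features[i] for i in selected_indices]
    return list(cnn_features) + selected


# Hypothetical toy values: a 4-dimensional CNN representation of one post
# and a 6-term TF-IDF row for the same post.
cnn_vec = [0.12, 0.0, 0.87, 0.45]
tfidf_vec = [0.0, 0.31, 0.0, 0.58, 0.0, 0.09]
kept = [1, 3]  # indices assumed to survive the IG + ACO selection

hybrid_vec = concat_features(cnn_vec, tfidf_vec, kept)
print(hybrid_vec)
```

The resulting vector is what the output-layer classifier would be trained on, one such vector per post.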

PHEME dataset
The benchmark PHEME dataset used in this research for rumour detection [36] has tweets related to five breaking news events, annotated for the 'rumour' and 'non-rumour' categories by expert journalists. These events are:
• #charliehebdo: Around noon on 7th January 2015, two gunmen forced themselves into the offices of the French satirical weekly newspaper Charlie Hebdo in Paris and killed 12 people and wounded 11. The dataset contains 458 rumours and 1621 non-rumours.
The label distribution for events within the dataset is shown in Fig. 6.

Preprocessing
Preprocessing is the task of preparing the data in a form that is easier for the machine learning model to comprehend. The raw data is transformed into clean data, which is then used as input to the model.
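A minimal sketch of such a cleaning step is given below, assuming the two operations the paper names for its posts (stop-word removal and stemming). The tiny stop-word list, the crude suffix-stripping stemmer, the lowercasing and the URL removal are illustrative assumptions; a real pipeline would more likely use an established stemmer and stop-word list.

```python
import re

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "at", "of", "to", "and"}
# Assumed suffixes for a crude stemmer (not Porter stemming).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, drop URLs and punctuation, remove stop words, stem the rest."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9#@\s]", " ", text)  # keep hashtags and mentions
    return [crude_stem(tok) for tok in text.split() if tok not in STOP_WORDS]


tokens = preprocess("Gunmen are attacking the offices in Paris! http://t.co/xyz #CharlieHebdo")
print(tokens)  # ['gunmen', 'attack', 'offic', 'pari', '#charliehebdo']
```

The over-aggressive stems ("offic", "pari") show why a crude stemmer is only a placeholder here.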

Automatic feature learning-convolution neural network (CNN)
CNNs are usually applied in computer vision tasks; more recently, however, their application to various NLP tasks has shown encouraging results [9]. Just as images are represented as an array of pixel values, text can be represented as an array of vectors over which 1-dimensional convolutions are performed to pick up patterns in sequential data. CNNs are also much faster to train because of batching. Typically, CNN architectures for text classification are either character-level or word-level. In a character-level CNN, the input text is represented as a k × n matrix of one-hot encodings of the characters, whereas in a word-level CNN the input text is represented as an n × k matrix using word embeddings. In this research, a word-level CNN is used. Figure 7 depicts the architecture of a typical CNN model. The posts from the dataset are extracted individually and pre-processed by removing the stop words and converting the words into their stems. These pre-processed posts are given as input to the embedding layer, where the ELMo 5.5B model is used as the word vector learning technique to seed the classifier. Word vectors represent the relationships across words, sentences, and documents; they are simply vectors of numbers that map word meanings into a form the model can use. The vector representation of the text is provided to the convolution (filtering) layer, which has 128 filters of size 8 each, after applying a randomized function over the vectors. The filtered feature vectors are given as input to a non-linear activation function that is piecewise linear; this makes the model easy to train with stochastic gradient descent and avoids the saturation problem of other activation functions. The rectified linear activation function is used in the third layer of the CNN, which is therefore named the ReLU activation layer.
The ReLU function $f$ used in the model is $f(x) = \max(0, x)$. The ReLU layer provides a half-rectified feature map, which is passed on to the fourth layer to perform downsampling by applying a max-pool operation on the output feature matrix of the ReLU layer over a window of 2 × 2, $c_{\max} = \max(c)$. The max operation identifies the maximum value in each 2 × 2 sample of the feature map and converts the input feature matrix of shape 8 × 8 × 1 to 3 × 3. The output of the downsampling layer is a max-pooled feature map representation of the input tweet, which gives a similar result for a slightly modified tweet. In a typical CNN that uses a fully connected output layer with softmax activation, this representation is used as the input to finally classify the tweet as positive (+1) or negative (-1) for rumour. But in the proposed model, the softmax output layer of the CNN is replaced by an easy-to-interpret and scalable Naïve Bayes classifier. NB learns a concatenated feature set generated by combining the high-level features obtained by the CNN and the optimized feature set obtained using the hybrid IG-ACO technique to achieve the task of rumour classification, as already shown in Fig. 6.
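The convolution, ReLU and max operations can be illustrated on a toy input. The sketch below is an assumption-laden simplification: it uses a single 1-dimensional filter over a five-token sequence with 2-dimensional embeddings and takes the maximum over the whole rectified feature map, rather than reproducing the paper's filter counts or 2 × 2 pooling window. All values and the helper names `relu` and `conv1d` are hypothetical.

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def conv1d(embeddings, kernel, bias=0.0):
    """Slide a kernel spanning len(kernel) word vectors over the sequence (no padding)."""
    h = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - h + 1):
        total = bias
        for vec, k_vec in zip(embeddings[start:start + h], kernel):
            total += sum(v * k for v, k in zip(vec, k_vec))
        feature_map.append(total)
    return feature_map


# Toy input: 5 tokens with 2-dimensional embeddings, one filter covering 3 tokens.
embeddings = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, -3.0], [-1.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

feature_map = conv1d(embeddings, kernel)      # [2.0, 1.0, -1.0]
rectified = [relu(c) for c in feature_map]    # negative responses become 0
c_max = max(rectified)                        # strongest response of this filter
print(feature_map, rectified, c_max)
```

With many filters, each contributes one such pooled value, and together they form the learned representation of the post.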

Feature selection using filter-wrapper (IG-ACO)
In this work, to create the initial feature matrix, term frequency-inverse document frequency (TF-IDF) is used. TF-IDF is a weighting scheme for measuring the importance of a word with respect to a complete document [3]. It also checks the relevance of the keyword throughout the corpus. Feature selection techniques facilitate a reduction in the number of input variables based on their usefulness for target prediction [3,11,13,24]. Common categories of feature selection techniques include:
• Filter techniques attempt to assess the merits of attributes from the data, ignoring the learning algorithm.
• Wrapper techniques select the attribute subset using the learning algorithm as a black box.
• Embedded techniques perform automatic feature selection during training.
Figure 8 summarizes this hierarchy of feature selection techniques. In this work, feature selection is done using the information gain (IG) filter and the swarm-based ant colony optimization (ACO) wrapper [3]. IG is calculated as:
$$IG(t) = -\sum_{i} p(c_i)\log p(c_i) + p(t)\sum_{i} p(c_i \mid t)\log p(c_i \mid t) + p(t')\sum_{i} p(c_i \mid t')\log p(c_i \mid t') \qquad (1)$$
where $c_i$ indicates the $i$th class; $p(c_i)$ indicates the probability of the $i$th class; $p(t)$ and $p(t')$ are respectively the probabilities of presence and absence of the feature $t$; $p(c_i \mid t)$ and $p(c_i \mid t')$ are the conditional probabilities given the presence and absence of the feature $t$ respectively. In contrast to the filter methods, wrapper methods are based on the "usefulness" of features with respect to the classifier performance. Given by Dorigo [7] in 1992, ACO is inspired by the communication process used by ants. The algorithm for ACO is given as:

Algorithm 1: Ant Colony Optimization
Begin
  Initialize pheromone and other parameters
  Generate a population of n ants
  for (ant i)
    Calculate fitness value
    Determine best position
  Determine best global ant (solution)
  Update pheromone trail
  Check stopping criterion
End
The algorithm of the proposed CNN-IG+ACO NB model is given next.
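As an illustration of the filter-wrapper idea in this section, and not the authors' implementation, the sketch below computes information gain for each term and then runs a simplified pheromone-guided subset search in the spirit of Algorithm 1. Several things are assumptions: documents are modelled as sets of tokens, the ACO variant is a basic binary one (each ant keeps a feature with probability proportional to pheromone times IG), and `toy_fitness` is a hypothetical stand-in for the classifier performance that a real wrapper would measure.

```python
import math
import random


def plogp(p):
    """p * log2(p), with the usual convention that 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def information_gain(docs, labels, term):
    """Class entropy minus the entropy remaining after splitting on the term."""
    n = len(docs)
    classes = sorted(set(labels))
    ig = -sum(plogp(labels.count(c) / n) for c in classes)
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    for part in (present, absent):
        if part:
            ig += (len(part) / n) * sum(plogp(part.count(c) / len(part)) for c in classes)
    return ig


def aco_select(heuristic, fitness, n_ants=8, n_iter=15, rho=0.2, seed=0):
    """Simplified ACO wrapper: returns the best feature subset found and its fitness."""
    rng = random.Random(seed)
    n = len(heuristic)
    pheromone = [1.0] * n                       # initialize pheromone
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):                     # stopping criterion: fixed iterations
        for _ in range(n_ants):                 # each ant builds one candidate subset
            weights = [pheromone[j] * (heuristic[j] + 1e-9) for j in range(n)]
            top = max(weights)
            subset = [j for j in range(n) if rng.random() < 0.9 * weights[j] / top]
            if not subset:                      # never evaluate an empty subset
                subset = [weights.index(top)]
            score = fitness(subset)
            if score > best_score:              # track the best global solution
                best_subset, best_score = subset, score
        pheromone = [(1 - rho) * p for p in pheromone]   # evaporation
        for j in best_subset:                   # reinforce features of the best ant
            pheromone[j] += rho
    return best_subset, best_score


# Hypothetical toy corpus: 1 = rumour, 0 = non-rumour.
docs = [
    {"breaking", "shooting", "unconfirmed"},
    {"unconfirmed", "reports", "hostage"},
    {"police", "confirm", "shooting"},
    {"official", "statement", "police"},
]
labels = [1, 1, 0, 0]
vocab = sorted(set().union(*docs))
ig_scores = [information_gain(docs, labels, t) for t in vocab]


def toy_fitness(subset):
    """Stand-in fitness: reward informative terms, penalize subset size."""
    return sum(ig_scores[j] for j in subset) - 0.4 * len(subset)


best_subset, best_score = aco_select(ig_scores, toy_fitness)
print([vocab[j] for j in best_subset], round(best_score, 3))
```

In this toy corpus "unconfirmed" and "police" separate the classes perfectly (IG of 1 bit), while "shooting" appears once in each class and carries no information (IG of 0).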

Naïve bayes classifier
Naïve Bayes is a linear classifier based on Bayes' theorem and a strong independence assumption among features. Given a data set with $n$ features represented by $F_1, F_2, \ldots, F_n$, Naïve Bayes states that the probability of the output $Y$ given the features $F_i$ is
$$P(Y \mid F_1, \ldots, F_n) \propto P(Y)\prod_{i=1}^{n} P(F_i \mid Y)$$
This requires that the features $F_i$ are conditionally independent given $Y$. From Bayes' theorem:
$$P(Y \mid F_1, \ldots, F_n) = \frac{P(Y)\,P(F_1, \ldots, F_n \mid Y)}{P(F_1, \ldots, F_n)}$$
Softmax regression (or multinomial logistic regression) is a generalization of logistic regression which can handle multiple classes, as compared to the binary classes in the latter case. The learning mechanism differs between Naïve Bayes (a generative model) and logistic regression (a discriminative model).
• Generative model: Naïve Bayes models the joint distribution of the features X and the target Y, and then predicts the posterior probability P(y|x).
• Discriminative model: Logistic regression directly models the posterior probability P(y|x) by learning the input-to-output mapping through error minimization.
In 2001, Ng and Jordan [19] provided a mathematical proof of the error properties of both logistic regression and Naïve Bayes models. Their study concluded that as the training size approaches infinity, the discriminative model (logistic regression) performs better than the generative model (Naïve Bayes). However, the generative model reaches its asymptotic error faster (O(log n)) than the discriminative model (O(n)), i.e., the generative model (Naïve Bayes) reaches its asymptotic solution with fewer training examples than the discriminative model (logistic regression). Naïve Bayes classifiers require only a small amount of training data to estimate the necessary parameters and are extremely fast. As the PHEME rumour dataset is also small, Naïve Bayes was a fitting choice.
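A minimal sketch of a Naïve Bayes classifier over real-valued feature vectors is shown below. The paper does not state which Naïve Bayes variant is used, so the Gaussian likelihood, the small variance floor and the class name `TinyGaussianNB` are assumptions made purely for illustration; the toy vectors are hypothetical.

```python
import math


class TinyGaussianNB:
    """Naive Bayes with an assumed per-feature Gaussian likelihood P(F_i | Y)."""

    def fit(self, X, y):
        self.priors, self.stats = {}, {}
        for c in set(y):
            rows = [x for x, label in zip(X, y) if label == c]
            self.priors[c] = len(rows) / len(X)             # P(Y = c)
            cols = list(zip(*rows))
            means = [sum(col) / len(col) for col in cols]
            # A small floor keeps the variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-6
                for col, m in zip(cols, means)
            ]
            self.stats[c] = (means, variances)
        return self

    def log_posterior(self, x, c):
        """log P(Y = c) + sum_i log P(F_i = x_i | Y = c), up to a shared constant."""
        means, variances = self.stats[c]
        score = math.log(self.priors[c])
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    def predict(self, x):
        return max(self.stats, key=lambda c: self.log_posterior(x, c))


# Hypothetical 2-dimensional feature vectors: 1 = rumour, 0 = non-rumour.
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]

model = TinyGaussianNB().fit(X, y)
print(model.predict([0.85, 0.7]), model.predict([0.15, 0.3]))
```

Training only requires per-class means, variances and priors, which is why such a model can be fitted quickly on a small dataset.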

Results and discussion
The performance of the proposed model was evaluated for individual events within the PHEME dataset. The confusion matrix using the following values was computed for each event:
• True Positives (TP): number of rumours correctly identified
The confusion matrices for each individual event are shown in Fig. 9, with the actual class on the horizontal axis and the predicted class on the vertical axis. Two events, Charlie Hebdo and Ferguson unrest, suffer from the class imbalance problem. Data skewness or class imbalance proves to be a major limitation in a classification task. In the case of class imbalance, the accuracy score is not used as an evaluation metric as it often leads to incorrect interpretation of performance. Thus, we have used F1 score, precision, and recall as evaluation metrics to correctly represent performance on the PHEME dataset. We evaluate our model for individual events over five iterations with a leave-one-event-out approach [14]. Other than precision, recall, and F1 score, we have also used AUC-ROC curves to judge the performance of our model. The AUC scores of the five events present in the PHEME dataset range from 0.75 to 0.90. Figure 10 shows the ROC curve for each individual event.
Fig. 10 AUC-ROC for individual events in PHEME
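The evaluation protocol can be sketched as follows, under stated assumptions: the helper names are hypothetical, the confusion-matrix counts in the usage line are made up for illustration, and the event identifiers are shorthand for the five PHEME events named in this paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the rumour class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def leave_one_event_out(events):
    """Yield (training events, held-out event) pairs, one fold per event."""
    for held_out in events:
        yield [e for e in events if e != held_out], held_out


events = ["charliehebdo", "ferguson", "germanwings", "ottawashooting", "sydneysiege"]
folds = list(leave_one_event_out(events))
for train_events, test_event in folds:
    print("train on", train_events, "-> test on", test_event)

# Hypothetical counts for one fold: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because each fold tests on an event never seen in training, the scores reflect how well a model transfers to a new breaking-news story, and F1 remains informative even when rumours are the minority class.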
The experiments show that our proposed model, which is a hybrid network, is able to handle multiple input types using the distinct classifiers to maximize the potential of each feature type. It achieves significant improvements over traditional methods of binary text classification. Table 1 depicts the results for each individual event in PHEME evaluated for the proposed CNN + IG-ACO NB model and the state-of-the-art CRF Classifier [35]. The proposed model achieves higher precision compared to the state-of-the-art classifier for three events out of the five, the exceptions being Ottawa Shooting and Sydney Siege events. The recall scores achieved by the proposed model are also superior for all the five events. The model also manages to strike an equilibrium between recall and precision, a qualitative improvement over the state-of-the-art.
Also, quite clearly, the proposed model does not let the events with class imbalance impede the classifier performance, which is another improvement over the CRF classifier. CRF has F1 scores of 0.636 and 0.465 for the Charlie Hebdo and Ferguson unrest events respectively, whereas the proposed model yields scores of 0.848 and 0.857 for the same events. The same trend was observed for the remaining events, which were free from the class imbalance problem, i.e., the proposed model outperforms the CRF for the Germanwings crash, Sydney Siege, and Ottawa Shooting events in terms of F1 scores. It is worth mentioning that the CNN + IG-ACO NB model performs exceptionally well in terms of recall and achieves an overall superiority over the CRF classifier.
The overall effectiveness of the CNN + IG-ACO NB classifier is shown in Table 2. The results of using the hybrid IG-ACO technique were also evaluated on three different classifiers, namely, random forest (RF), Naïve Bayes (NB) and decision tree (DT), for all five events of the PHEME dataset. We also compared IG-ACO with the IG-Cuckoo search algorithm and trained the dataset on the same three classifiers (NB, RF and DT). The results are given in Table 3.
Though the best average results over the entire dataset were given by TF-IDF + IG + Cuckoo, IG + ACO performed better with the NB classifier, which was therefore chosen as the output classifier in the proposed model. A reduction of 77% in the feature set was observed using IG + ACO, which was comparable to the 79% reduction in features observed using IG-Cuckoo, as shown in Fig. 11. For both TF-IDF + IG + ACO and TF-IDF + IG + Cuckoo, NB took 0.30 s for model building, whereas RF took the maximum time (10 times more than NB).

Conclusion
Rumours proliferate in times of crisis. The uncertainty and significance of the situation, combined with the lack of information, fuels rumours in the virtual social world. It is thus imperative to question the tangibility of information. As a solution to debunk online rumours, this study proffered a novel model for real-time rumour classification which learns from a combination of the high-level features from the CNN and the optimized features obtained using the hybrid information gain filter and meta-heuristic ant colony optimization wrapper feature selection technique. The classifier is an easy-to-interpret Naïve Bayes classifier that replaces the final logistic regression (softmax) layer in the CNN architecture to classify tweets into the binary categories of rumour and non-rumour. The model is evaluated on the PHEME benchmark dataset and compared with the existing state-of-the-art. The results show a superior F1 score of 0.732 using the proposed CNN + IG-ACO NB rumour classifier. Our approach uses only text-based features, whereas meta-features such as re-tweet count and user-based features can be learned separately to build a robust model to uncover rumours. Further, the use of country-specific content written in native languages compounds the linguistic issues in detecting rumours. As a future direction, the next stages of the rumour resolution pipeline can be explored using the hybrid model. Also, as this work presents text-based rumour detection, context modelling can be done to improve the detection and debunking of rumourous stories. Further, it is also imperative to leverage information from different media platforms and different languages to verify rumours automatically and to analyze multimedia rumour verification datasets, such as CCMR and MediaEval 2015's Verifying Multimedia Use (VMU 2015).