1 Introduction

Social networks have played a significant role in the dissemination of knowledge in the last few years. In particular, with the facility of short-length messages, microblogs have become a common platform for expressing views and sharing opinions. Twitter is one of the most popular microblogging websites; it allows users to share their views and opinions in the form of tweets with a maximum length of 140 characters [45]. Generally, the dissemination of reliable information is one of the objectives of a social network. Yet among all the positive aspects of social networks, there are downsides: fake news and rumours can be spread to millions of people in a short time [36]. Such false information not only causes serious problems for social media websites but also creates disasters for governments and economies. For example, in 2013 fake news spread that two bombs had exploded in the White House and that the US president had been injured. This rumour created panic on a large scale and caused a dramatic crash in the stock market. Therefore, the propagation of fake/rumour information on social networks is highly undesirable [10].

A rumour is defined as unverified or unproven information [48]. Among the several definitions of rumour, "unverified and instrumentally relevant information statements in circulation" is commonly used. The main difference between fake news and rumour is that fake news is false information, while a rumour is unverified: it is not necessarily false and may turn out to be either false or true [35]. Rumour and fake-content identification are technically similar because they share the majority of features. During COVID-19, new situation-specific rumours circulated that frightened some people while misleading others, for instance, that drinking alcohol prevents COVID-19 or that holding one's breath for 10 s tests for COVID-19 [42, 43]. The dissemination of unverified information causes mistrust in social networks; e.g., Facebook was declared a "dust and cloud of nonsense" in 2016 when it was unable to track the spread of rumours about the US presidential election [11]. Unfortunately, such rumours can be found in any area of life. Therefore, it is necessary to identify rumours so that people will not be misled [46].

Early identification of rumours enables us to stop their spread: people can be protected from the threat and panic a rumour causes when it is identified at an early stage. Identifying rumours is difficult but in high demand, and it is challenging for three reasons: (1) the demand for real-time detection, (2) the confusing nature of rumours, and (3) the large volume of information to process. Many studies on rumour identification exist in the literature; the majority present text or image content-based methods [26, 27, 40], while a few exploit propagation features [23].

More specifically, several types of features have been proposed and various models developed for rumour identification, for instance, influence potential [36], network characteristics [13, 18, 20, 36, 47], textual features [1, 2, 13], personal interest [36], and temporal, semantic, and structural features [3, 17, 19, 38]. New characteristics are needed to identify rumours efficiently. Therefore, positive and negative discrete emotions and linguistic and metadata characteristics are utilized as content-based features, while word2vec [28] and BERT are selected as context-based features. To our knowledge, these features have not previously been used for rumour detection in the health and related domains. In addition, a powerful machine learning model is always necessary to build an effective detection/identification system.

The rest of the study is organized as follows: Section 2 describes the research questions and contributions, Section 3 reviews related work, Section 4 presents the methodology, and Section 5 presents and examines the experimental results. Discussion and implications are presented in Section 6, and concluding remarks are provided in Section 7.

2 Research questions and contributions

In this case study, three research questions are addressed:

RQ1: Which ML model is the most robust for building an effective rumour detection model?

RQ2: Does a subset of features exist that is most influential and has a strong relationship to rumours?

RQ3: Which type of characteristic (word2vec, BERT, discrete emotions, metadata, or linguistics) contributes the most to Twitter rumour identification?

The objective of this case study is to examine the impact of the proposed discrete emotion, linguistic, metadata, word2vec, and BERT features for the identification of rumours at the tweet level. Feature selection is employed, and the best subset is selected using the wrapper method. Four well-known Twitter datasets, four popular machine learning models, and five evaluation measures are used in the experimental setup. The proposed framework is evaluated using each feature type as a standalone model and using their hybrid combinations. The findings of this case study provide new insights for rumour detection in the Zika virus, Ottawa Shooting, and Germanwings Crash event domains. To sum up, the main highlights of the paper are:

1. An effective rumour detection system for the Zika virus, Ottawa Shooting, and Germanwings Crash events is developed using novel features and the random forest model.

2. The BERT model and word2vec embeddings are used to examine the language context of a tweet for rumour identification.

3. A two-step feature selection framework is employed to shortlist the best features.

4. A comparison of four ML models reveals that the random forest model demonstrates the best performance.

5. The proposed framework operates at the single-tweet level rather than at the topic level.

6. The experimental results demonstrate that our model outperforms the three state-of-the-art baselines on all evaluation metrics.

7. The proposed framework is validated on four real-life Twitter event datasets and achieves a maximum accuracy of approximately 97%.

8. The findings reveal that URLs, Trust emotion, Verbs, Adjectives, and Prepositions are the top-5 textual features for detecting rumours at the tweet level.

3 Related work

In recent years, rumour detection has become an active research topic, and several approaches have been presented in the literature. Some concentrate on proposing new characteristics, while others attempt to apply robust machine learning models. Most approaches are supervised, and the most common are content-based.

In 2011, Castillo, et al. presented a method to evaluate the information credibility of news articles on the Twitter network [7]. They used distinct-word counts, total word counts, and sentiments as features, and their system achieved precision and recall in the range of 70-80%. Later in 2016, Ma, et al. developed a method for learning continuous representations of Twitter events to address rumour detection [22]. In 2017, Kwon, et al. developed a rumour detection model using temporal, structural, and linguistic features [19]. A robust model was then presented to identify rumours between 3 and 56 days old at varying time slots [20]; four types of features were investigated, and predictive performance varied across time windows. Their results showed that user and linguistic indicators are significant in the short term, whereas temporal and structural features perform well over the long term.

Next, in 2018, Sicilia, et al. proposed a tweet-level rumour detection model exploiting influence potential measures, personal interest, and network characteristics [36]. Their method achieved an accuracy of 89%. Then, a framework for identifying users spreading rumours was developed by Ruchansky, et al. [34]; the model used a recurrent neural network and outperformed four standard baselines. Similarly, Vijeev, et al. [39] developed a system in the same year to identify rumours on the Twitter microblog. Content-based and user-based features were used, three machine learning models were tested, and the random forest classifier performed best.

Recently, in 2019, a detection model was proposed to detect rumours early in time [37]; the method reduces the prediction time span by 85%, which is better than the state-of-the-art baseline. Next, Hamidian, et al. [13] derived a two-step model that first addresses rumour detection and then classification, exploiting network-specific, n-gram, and pragmatic features. Similar efforts include a text-based fusion neural network model by Chen, et al. [8], a graph convolutional network-based method by Huang, et al. [16], and rumour veracity detection by Kumar, et al. [18]. Then a novel method for detecting rumours in the Arabic language using semi-supervised expectation maximization was presented [1], using user-level and content-level features. Wang, et al. presented a rumour detection method [41] that exploits dynamic propagation structures and content characteristics in combination; their method is very effective at capturing the dynamic structure. The details of the literature are also presented in tabular form in Table 1.

Table 1 Features and ML models used in the literature

More recently, in 2020, a probabilistic model was developed [47] that uses not only retweeting behavior but also intent; the proposed system is effective in detecting malicious users. Then Bai, et al. proposed a stochastic attention convolutional neural network-based system to detect rumours using fine-grained and coarse-grained features [2]. Similarly, the identification of retweeting behavior for rumours was presented by Tian, et al. [38]. They used reaction time, retweeting frequency, and TF-IDF features for model construction, and their system achieved an accuracy of 88%. Bian, et al. [3] developed a propagation- and dispersion-based bi-directional graph convolutional network method to detect rumours; according to the authors, their method is more effective than the state-of-the-art baseline. Then Huang, et al. [17] proposed a heterogeneous graph attention network framework to identify rumours, building a tweet-word-user graph using semantic features of the Twitter network. In 2021, a graph convolutional network-based rumour detection model was developed by Lotfi, et al. [21]. A reply tree and user graph are extracted for each conversation; they claimed that their model outperformed the baseline, but its time and space complexity is very high. Spatiotemporal graph and attention-based neural networks have also been used in citywide crowd flow prediction problems [5, 12, 15, 33]. Later, in 2022, He, et al. [14] proposed another model, for propaganda detection, using a lifelong machine learning technique. They used sentiment, content relevance, and user attention rate features, but the time and space complexity of their model is also very high.

Most of the aforementioned literature belongs to supervised learning. The review presented so far shows that the majority of approaches detect rumours at the topic/conversation level; rumour identification at the post/tweet level needs more attention. Moreover, different topics do not share the same sentence structure and word semantics, so methods based on such features, as well as characteristics computed at the topic level, may not be directly applicable to detecting rumours for a specific topic. Therefore, available solutions cannot be applied directly at the tweet level. In addition, prior tweet-level contributions used influence potential [36], network characteristics [13, 18, 20, 36, 47], textual features [1, 2, 13], personal interest [36], and temporal, semantic, and structural features [3, 17, 19, 38] for rumour detection. To the best of our knowledge, no one has used discrete emotions, tweet-related metadata, word2vec, and BERT embedding techniques as characteristics for rumour identification at the tweet/post level. Inspired by these observations, we propose a novel framework that exploits linguistic, metadata, discrete emotion, word2vec, and BERT features to address rumour detection at the tweet level, with the aim of detecting rumours more accurately.

4 Methodology

The components of our proposed framework are presented here. The pipeline of the rumour detection framework is shown in Fig. 1. First, four real-life, publicly available Twitter datasets are collected; additional required information is crawled from the Twitter microblog. The datasets are then pre-processed (cleaning and removal of irrelevant information), which leads to the extraction of five types of features (discrete emotion, linguistic, metadata, word2vec, and BERT). Feature normalization (min-max normalization) and feature selection are applied to the representative features. Four popular machine learning (ML) models, five evaluation measures, and 20-fold cross-validation are used in the experiments. As an outcome, the system classifies tweets into the rumour or not-rumour class.

Fig. 1: Flow of steps in the research methodology

4.1 Problem formulation

Let \(\chi_{m,n}\) be a feature matrix with m rows and n columns, where m is the number of tweets and n is the number of features. \(\mathcal{T} = \{\tau_1, \tau_2, \tau_3, \dots, \tau_m\}\) is the collection of tweets in the dataset, and \(\chi_i \in \mathbb{R}^n\) is the feature vector of tweet \(\tau_i\). Every tweet \(\tau_i\) is an instance/sample consisting of the components {D, L, M, W, B, C}, where D denotes the discrete emotions, L the linguistic features, M the metadata features, W the word2vec embeddings, and B the BERT embeddings of the tweet, and C is the target class label, i.e. rumour or not-rumour.

Let Y be the vector of predicted class labels for all tweets, where \(y_i\) is the predicted class label for \(\tau_i\) (i.e. rumour or not-rumour). To classify whether a tweet is a rumour or not, we define the following predictive function.

$$y_i = F\left(\tau_i \mid \chi_i\right)$$
(1)

Where

$$F\left(\tau_i \mid \chi_i\right) = \begin{cases} \ge 0 & \text{if } y_i = +1 \quad \text{(rumour)} \\ < 0 & \text{if } y_i = -1 \quad \text{(not rumour)} \end{cases}$$
(2)

Our aim here is to develop a predictive model that minimizes the prediction error of \(y_i\) given \(\chi_i\).

4.2 Datasets

In this case study, four real-life Twitter datasets are used. The first dataset (DS1) is publicly available and was built by extracting tweets from 111 events on Twitter [20]; every tweet is annotated as either rumour or not-rumour. From the 111 events, each containing several tweets, we selected 12 health-related events, of which 4 are non-rumour events and the remaining 8 are rumour events. After preprocessing, we have 653 instances in total, of which 359 are rumour (positive) and 294 are non-rumour. The second dataset (DS2) was built by crawling tweets related to the health domain; the Zika virus is the only topic, and related tweets were selected using #Zikavirus and Zika microcephaly [36]. After preprocessing, we have 693 instances, as shown in Table 2, of which 58% belong to the rumour class and 42% to the non-rumour class. The third dataset (DS3) is also publicly available and contains tweets collected from the breaking news of the Ottawa Shooting event. It has 890 instances, of which 470 are rumours (52.8%) and 420 are non-rumours (47.2%). The fourth dataset (DS4), also publicly available, consists of tweets related to the breaking news of the Germanwings Crash event. It contains 469 instances, of which 238 are rumours (50.7%) and 231 are non-rumours (49.3%).

Table 2 Description of datasets

4.3 Machine learning models and evaluation metrics

Four machine learning models are used in the experiments to classify tweets as rumour or non-rumour: Gradient Boosting Classifier (GBC) [12], Multilayer Perceptron (MLP) [15], Support Vector Machine (SVM) [33], and Random Forest (RF) [5]. Furthermore, 20-fold cross-validation and five evaluation measures are used to evaluate performance: precision, accuracy, recall, f1-score, and area under the curve (AUC). The Python programming language is used to implement the models [32].
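For concreteness, this evaluation setup can be sketched as follows. This is a minimal illustration assuming scikit-learn; the placeholder arrays stand in for the extracted feature matrix and class labels.

```python
# Minimal sketch of the evaluation setup, assuming scikit-learn.
# X (m-by-n feature matrix) and y (rumour/not-rumour labels) are
# placeholders for the output of the paper's feature extraction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(42)
X = rng.random((653, 15))            # placeholder features
y = rng.integers(0, 2, 653)          # placeholder labels

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
scores = cross_validate(
    RandomForestClassifier(random_state=42), X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)
for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    print(metric, round(scores[f"test_{metric}"].mean(), 4))
```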

4.4 Feature extraction

In this case study, five types of features are extracted. The objective is to find the set of influential features that can accurately distinguish rumours from non-rumours at the tweet level. The feature types are (1) word2vec embeddings, (2) the BERT model, (3) discrete emotions, (4) linguistic features, and (5) metadata. A detailed description of each is provided next.

4.4.1 Word2Vec embedding model

To capture the semantics of a word, word embedding is one of the most popular text representations, and word2vec is one method for generating word embeddings; it can be used to gain insights for rumour detection from tweet data. Word2vec is an unsupervised, shallow two-layer neural network that can be trained to generate high-quality, distributed, continuous dense vector representations of words [28]. It captures contextual and semantic similarity and comes with two learning algorithms: continuous bag-of-words (CBOW) and continuous skip-gram. The architectures of both algorithms are shown in Fig. 2.

Fig. 2: The CBOW and skip-gram architectures of word2vec [29]

In the continuous bag-of-words model, the target word is predicted from the context words, whereas the skip-gram model predicts the context words given the target word. We used the skip-gram model to generate embeddings of up to 100 dimensions from the DS1 and DS2 Twitter datasets; each dimension captures information about one aspect of a particular word. The objective of using word2vec is to capture context words so as to accurately identify rumours in tweet text.
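A minimal sketch of training such skip-gram embeddings with the gensim library follows; the tokenized tweets are illustrative rather than drawn from the actual datasets.

```python
# Minimal sketch: 100-dimensional skip-gram embeddings with gensim.
# `tokenized_tweets` is an illustrative stand-in for the corpus.
from gensim.models import Word2Vec

tokenized_tweets = [
    ["zika", "virus", "spreads", "via", "mosquito", "bites"],
    ["holding", "your", "breath", "does", "not", "test", "for", "covid"],
]

model = Word2Vec(
    sentences=tokenized_tweets,
    vector_size=100,   # 100 dimensions, as used in the paper
    sg=1,              # 1 = skip-gram, 0 = CBOW
    window=5,
    min_count=1,
)

vector = model.wv["zika"]   # 100-d vector for one word
# A tweet-level representation can then be derived, e.g. by averaging
# the vectors of the words the tweet contains.
```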

4.4.2 BERT model

BERT is a transformer-based ML approach designed by Jacob Devlin and his colleagues in 2018 [9]. It was developed for learning tasks in natural language processing and can be employed for various language tasks such as sentiment analysis, next-sentence classification, question answering, and named entity recognition. Also, Google has been using BERT to understand users' searches since 2019 [31]. BERT comes in two pre-trained variants: (1) BERT-base and (2) BERT-large. BERT-base uses 12 encoder layers with 12 bidirectional self-attention heads, whereas BERT-large uses 24 encoder layers with 16 bidirectional self-attention heads. The architecture of BERT for natural language processing is presented in Fig. 3.

Fig. 3: Architecture of the BERT model for natural language processing [9]

It utilizes an attention mechanism (the transformer) that learns contextual relations among words and sub-words in a text. The transformer consists of two modules: an encoder that reads the text input and a decoder that predicts the desired output. BERT is bidirectional, or more precisely non-directional: directional models read the input sequentially, whereas the encoder reads the entire input sequence at once. We are the first to use the BERT model for rumour detection at the tweet level. Since our task is a natural language task, BERT is expected to be particularly beneficial.
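A minimal sketch of extracting a tweet-level embedding from a pre-trained BERT model follows, assuming the Hugging Face transformers library; using the [CLS] token's hidden state as the tweet representation is one common convention, not necessarily the exact pooling used in this study.

```python
# Minimal sketch: tweet-level embedding from pre-trained BERT.
# Assumes the Hugging Face `transformers` library and PyTorch.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

tweet = "Holding your breath for 10 seconds does not test for COVID-19."
inputs = tokenizer(tweet, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Final hidden state of the [CLS] token, shape (1, 768) for BERT-base.
tweet_embedding = outputs.last_hidden_state[:, 0, :]
```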

4.4.3 Discrete emotions

Discrete emotions are a type of textual feature. According to theory, discrete emotions are biologically determined emotional responses whose recognition is the same for all persons regardless of cultural differences [44]. The eight discrete emotions are classified as discrete positive and discrete negative: anticipation, joy, surprise, and trust are discrete positive, whereas anxiety, sadness, anger, and disgust are discrete negative [25]. These emotions can be extracted using the NRC lexicon [30] provided by the National Research Council Canada; the lexicon contains 8265 words. The mathematical formula for computing each discrete emotion is the same, e.g. for the trust emotion:

$$\text{Trust-emotion} = \frac{\#\,\text{trust-related words} \times 100}{\text{total words in a tweet}}$$
(3)

The details of the NRC emotion lexicon, the list of emotion dimensions, and the number of words related to each emotion dimension are described in Table 3. The aim is to investigate the influence of the discrete emotions embedded in tweet text on rumour identification. To our knowledge, we are the first to use these emotions for rumour detection at the tweet level. Utilizing discrete emotions uncovers the significance of each positive and negative emotion.

Table 3 Details of emotion lexicon provided by NRC [30]
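A minimal sketch of this computation follows; the tiny in-line lexicon is illustrative, and in practice the word-to-emotion mapping would be built from the published NRC lexicon file.

```python
# Minimal sketch: per-tweet discrete-emotion percentages (Eq. 3).
# `LEXICON` is an illustrative fragment; the real mapping comes from
# the NRC emotion lexicon (8265 words).
EMOTIONS = ["anticipation", "joy", "surprise", "trust",
            "anxiety", "sadness", "anger", "disgust"]

LEXICON = {
    "hope": {"anticipation", "trust"},
    "panic": {"anxiety"},
}

def emotion_features(tokens):
    total = len(tokens) or 1
    counts = {emotion: 0 for emotion in EMOTIONS}
    for word in tokens:
        for emotion in LEXICON.get(word.lower(), ()):
            counts[emotion] += 1
    # Eq. (3): percentage of emotion-related words in the tweet.
    return {e: 100.0 * c / total for e, c in counts.items()}

print(emotion_features(["hope", "this", "is", "not", "panic"]))
```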

4.4.4 Linguistic features

Linguistic features of tweet text are important predictors that can influence rumour identification [25]. Part-of-speech is one type of linguistic characteristic: part-of-speech tags group words that have similar grammatical properties and follow the same linguistic rules [6]. Thirty-five part-of-speech tags are provided by the Natural Language Toolkit (NLTK). These features can be easily extracted using the NLTK part-of-speech tagger [4], and their percentages can then be computed from the tweet text. These tag-based characteristics may play a significant role in detecting rumours at the tweet level. The list of all extracted linguistic features is presented in Table 4.
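A minimal sketch of this extraction with NLTK follows; the percentage computation mirrors Eq. (3), the example tweet is illustrative, and resource names can vary slightly across NLTK versions.

```python
# Minimal sketch: per-tweet part-of-speech percentages with NLTK.
from collections import Counter

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_features(tweet):
    tokens = nltk.word_tokenize(tweet)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]  # e.g. 'VB', 'JJ', 'IN'
    total = len(tags) or 1
    return {tag: 100.0 * n / total for tag, n in Counter(tags).items()}

print(pos_features("Officials confirm the shooting near the parliament."))
```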

4.4.5 Metadata features

Prior studies have demonstrated that metadata characteristics play a significant role in natural language processing tasks such as helpfulness prediction and rumour detection [24, 36]. These features influence the task in an intangible or indirect way. They consist of the properties related to a user's Twitter account, such as followings, number of followers, age of the user's account, and the presence of question marks and URLs in the user's tweet. The set of proposed metadata features has not been used in prior rumour detection studies; utilizing them should improve the classification performance of the rumour detection model. The list of extracted metadata features is presented in Table 4.

Table 4 List of extracted discrete emotions, linguistic and metadata features
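A minimal sketch of deriving such metadata features follows, assuming a classic Twitter API v1.1 tweet JSON object; the subset of features shown is illustrative.

```python
# Minimal sketch: metadata features from a v1.1-style tweet object.
from datetime import datetime, timezone

def metadata_features(tweet):
    user = tweet["user"]
    created = datetime.strptime(
        user["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "followers": user["followers_count"],
        "followings": user["friends_count"],
        "account_age_days": (datetime.now(timezone.utc) - created).days,
        "has_question_mark": int("?" in tweet["text"]),
        "num_urls": len(tweet.get("entities", {}).get("urls", [])),
    }

example = {
    "text": "Is the airport really closed? http://t.co/example",
    "user": {"followers_count": 120, "friends_count": 80,
             "created_at": "Wed Aug 27 13:08:45 +0000 2008"},
    "entities": {"urls": [{"url": "http://t.co/example"}]},
}
print(metadata_features(example))
```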

4.4.6 Feature selection

The selection of the extracted features is an important task in the feature engineering process. The objective of this section is to examine which feature combination is most significant and to test whether the proposed features are significant for the classification task. We designed a two-step strategy to select the most significant combination of proposed features so that an effective rumour detection model can be built. Candidate feature sets are compared using the random forest classifier, and every feature is evaluated using the accuracy metric. At the start, we have 44 features in total. In the first step, every feature's individual performance is evaluated using the accuracy metric, and the features are ranked in descending order. We selected the top-23; their performances are shown in Fig. 4.

Fig. 4: Performance of the top-23 features using the accuracy metric (step 1)

In the second step, a customized elimination method is applied to the features selected in step 1. The impact of each feature is evaluated by removing it from the feature set and measuring performance with the remaining features, with random forest as the classifier. The elimination method works as follows. First, the accuracy of all 23 features combined is computed and denoted \(\text{Accuracy}_{\text{base feature set}}\). Then every feature is removed one at a time, and performance is computed with the remaining features, denoted \(\text{Accuracy}_{\text{drop f from base set}}\). Each feature's impact is the difference between the two, as described by Eq. (4). If I(f) is zero or positive, eliminating that particular feature does not hurt accuracy, so it can be eliminated without loss; if I(f) is negative, accuracy decreases when the feature is eliminated, so it is retained. We dropped the features with I(f) ≥ 0 and kept the rest, leaving the 15 best features. Their I(f) values are presented in Fig. 5, and the list of selected features is presented in Table 5.

$$I(f) = \text{Accuracy}_{\text{drop f from base set}} - \text{Accuracy}_{\text{base feature set}}$$
(4)
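A minimal sketch of this elimination step follows, assuming scikit-learn and a pandas DataFrame holding the 23 step-1 features; the names and cross-validation settings are illustrative.

```python
# Minimal sketch of the step-2 elimination method (Eq. 4).
# `X` is a DataFrame with the 23 step-1 features; `y` holds the labels.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y):
    clf = RandomForestClassifier(random_state=42)
    return cross_val_score(clf, X, y, cv=20, scoring="accuracy").mean()

def eliminate(X: pd.DataFrame, y):
    acc_base = cv_accuracy(X, y)          # Accuracy_base feature set
    selected = []
    for f in X.columns:
        acc_drop = cv_accuracy(X.drop(columns=f), y)
        impact = acc_drop - acc_base      # I(f), Eq. (4)
        if impact < 0:                    # dropping f hurts accuracy: keep f
            selected.append(f)
    return selected
```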
Fig. 5: Performance of the top-15 features (step 2)

Table 5 List of selected features

4.5 Baselines

For comparison, we selected three prior studies. These approaches were chosen because they also used the Twitter platform for dataset construction.

1. Sicilia, et al. [36] used influence potential measures, personal interest, and network characteristics.

2. Kumar, et al. [18] used content-based, pragmatic, and network-specific features for rumour detection on Twitter.

3. Huang, et al. [17] used a heterogeneous graph attention network framework for rumour detection.

5 Results and analysis

In this section, three types of experiments are conducted to evaluate the effectiveness of the proposed framework for rumour detection on the specific health, Ottawa shooting, and Germanwings crash events. We used Python for feature extraction and for training and testing the ML models. In addition, the Weka tool is used for the feature selection and feature normalization tasks.

5.1 Prediction performance

In this section, we look for the best-performing ML classifier using the proposed set of features for rumour detection on Twitter. For this purpose, a hybrid combination of discrete emotion, metadata, and linguistic features is used to compare the performance of the four classifiers. The four popular machine learning models (Section 4.3) are implemented in Python [32] with 20-fold cross-validation. As a result, four rumour detection models are built and evaluated using the five evaluation measures: precision, accuracy, recall, f1-score, and area under the curve. This mechanism is employed for all datasets (DS1, DS2, DS3, and DS4), and the results are shown in Tables 6, 7, 8, and 9, where the classifiers are listed in ascending order of performance. For all datasets, random forest outperformed the three other classifiers on all five evaluation metrics, which establishes its efficacy on Twitter datasets. In addition, random forest and SVM have been the most used models in the literature for rumour detection [7, 36].

Table 6 Comparison of classifiers using Dataset 1

On the other hand, the significance of the hybrid combination of discrete emotion, linguistic, and metadata features for rumour identification is also tested, and we obtain the best values of all performance indexes (accuracy, precision, recall, f1-score, AUC) with the random forest classifier, as shown in Tables 6, 7, 8 and 9. In addition, compared to DS1, DS3, and DS4, we obtain better performance indexes with DS2. The AUC measures (93.96% and 86.88%) are very effective on DS2 and DS3 respectively. We obtain 83% accuracy on DS2, which demonstrates the significance of the rumour detection model with random forest.

Table 7 Comparison of classifiers using Dataset 2

In addition, 82.32% precision, 88.60% recall, and 85.09% f1-score on DS2 are also effective indexes. On DS2, DS3, and DS4, the GBC classifier delivered the second-best performance, with SVM and MLP in third and fourth place; in contrast, the classifiers on DS1 rank as RF, SVM, MLP, and GBC. Thus, we conclude that with random forest we obtain at least 77% on every index on DS1, at least 82.32% on DS2, at least 80.18% on DS3, and at least 73.19% on DS4.

Table 8 Comparison of classifiers using Dataset 3
Table 9 Comparison of classifiers using Dataset 4

5.2 Feature-wise performance comparison

Exhaustive experiments are conducted to evaluate the significance of each proposed feature type as a stand-alone model and to compare against the three state-of-the-art baselines for rumour detection on Twitter. The random forest classifier is selected because it outperformed the others in the prior experiments. For the experimental setup, five evaluation measures, 20-fold cross-validation, and four datasets are used. From Table 10, it is evident that the BERT model outperformed the other feature types as a standalone model on dataset 1. We obtain 96.7% accuracy and 99.5% AUC, which are very effective and validate the significance of the bidirectional encoder for rumour detection. In addition, all performance indexes are very promising with the BERT model. Likewise, the word2vec embedding model also performed better than the linguistic, discrete emotion, and metadata features as a standalone model, and its performance is comparable with the BERT model. Thus, both contextual models outperform the three textual models.

Among the textual features, the linguistic model performed better than the discrete emotion and metadata models; in short, linguistic features lead among textual models. The performance of the three state-of-the-art baselines is also reported in Table 10. The method of Huang, et al. demonstrated better performance than the other two baselines. It is also observed that the BERT and word2vec models, as standalone models, performed much better than the three baselines. In hybrid combination, the textual models (metadata + discrete + linguistic) also outperformed the three baselines. In addition, the hybrid combination of word2vec + BERT, as well as of all proposed features, demonstrated much better performance indexes than the three standard baselines. This proves the significance of the proposed BERT, word2vec, and linguistic features both as standalone models and in hybrid models on dataset 1. Hence, both textual and contextual feature-based rumour detection models are robust.

Table 10 Feature-wise performance using dataset 1

Using dataset 2, the BERT model once again outperformed all other feature types as a standalone model, although its performance indexes on DS2 are lower than on DS1. In addition, the word2vec model presented the second-best performance, as shown in Table 11; hence, contextual features again outperformed the textual features. Among the textual features, linguistic features again outperformed the discrete emotion and metadata features. Therefore, the strong performance of linguistic features among textual characteristics is consistent across datasets 1 and 2, and the prominent performance of the BERT model is likewise consistent across both datasets among contextual features.

Table 11 Feature-wise performance using dataset 2

With dataset 2, it is observed that the hybrid textual features also outperformed the three state-of-the-art baseline approaches, as shown in Table 11. The combination of the word2vec and BERT models again performed better than the hybrid textual features, and the best performance is achieved using the hybrid combination of textual and contextual features. This proves the significance of the proposed textual and contextual features for rumour detection on Twitter across the five performance indexes on DS1 and DS2.

On dataset 3 and dataset 4, the same pattern of strong performance is observed as on datasets 1 and 2 (Tables 12 and 13). The BERT model delivered the best standalone performance, although the values on dataset 4 are comparatively lower than on the first three datasets. Similarly, word2vec delivered the second-best standalone performance and outperformed the three standard baselines. Hence, contextual features are more significant for identifying rumours at the tweet level than textual features, as evidenced by the results on all four datasets. The hybrid combination of linguistic, discrete emotion, and metadata features performed better than the three baselines, and the best performance is again observed with the hybrid combination of contextual and textual features. This proves the significance of contextual and textual features for the identification of rumours.

Table 12 Feature-wise performance using dataset 3
Table 13 Feature-wise performance using dataset 4

5.3 Feature importance

The importance of individual textual features for rumour detection is evaluated in this section. For the experimental setup, the random forest classifier is run with 20-fold cross-validation using the accuracy measure on the four datasets (DS1, DS2, DS3, and DS4). The fifteen textual features are evaluated individually, and their performance is presented in Figs. 6, 7, 8, and 9 respectively.

For dataset 1, ‘URLs’ is observed to be the most effective feature for detecting rumours at the tweet level; ‘Trust emotion’ is the second best and ‘Verbs’ the third best. The prominence of the ‘URLs’ feature reveals that rumour tweets contain more URLs than non-rumour tweets. In the same vein, rumour tweets use comparatively more trust-related emotional words. ‘Adjectives’ and ‘Prepositions’ are the fourth- and fifth-best features, which uncovers that tweets containing more adjectives and prepositions have a higher probability of being rumours. ‘#Full stop’ and ‘Sadness emotion’ are the next best characteristics for rumour identification on dataset 1.

Fig. 6: Importance of the fifteen content features using the accuracy measure (dataset 1)

Using DS2, we again find ‘URLs’ to be the best feature for identifying a tweet as a rumour, as shown in Fig. 7; this feature also performed best on DS1, so its effectiveness is consistent across both datasets. The second-best feature is ‘Trust’, whereas ‘Verbs’ and ‘Adjectives’ are the third- and fourth-best features. The top-4 features perform consistently on dataset 1 and dataset 2 (Figs. 6 and 7). Some differences are also observed: ‘#Full Stop’ is at position 5 on DS1 but shifts to position 6 on DS2, where ‘Sadness’ takes position 5; the ‘#hashes’ and ‘Anticipation’ features also switch positions between DS1 and DS2. Overall, the performance of the fifteen features is consistent across both datasets.

Fig. 7: Importance of the fifteen content features using the accuracy measure (dataset 2)

Using DS3 and DS4, we observe individual-feature performance similar to that on DS1 and DS2. The performance of the top-5 features is fully consistent. However, the ‘Joy’ and ‘Disgust’ emotion features swap the 12th and 13th positions. In addition, the ‘#Full Stop’ feature is at position 8 on DS3 and DS4, whereas it is at position 7 on DS2 and position 6 on DS1; the ‘Anticipation’ and ‘#hashes’ features switch positions on DS3 and DS4 compared to DS2. Thus, after experiments on four datasets, we can conclude that ‘URLs, Trust, Verbs, Adjectives, Prepositions, Sadness, Adverbs, and #Full Stop’ are the top-8 textual features for identifying rumours at the tweet level, and their performance is almost consistent across the four datasets.

Fig. 8: Importance of the fifteen content features using the accuracy measure (dataset 3)

Fig. 9: Importance of the fifteen content features using the accuracy measure (dataset 4)

6 Discussions and implications

This research has improved the accuracy of rumour detection for specific health-domain, Ottawa shooting, and Germanwings crash events on social microblog platforms and presented a robust detection model with 96.7% accuracy. Five types of features are investigated: the BERT model, word2vec embeddings, discrete emotions, linguistics, and metadata. The textual features are further considered for feature selection, and the two-step method is adopted to identify the most important features. The final rumour detection model, based on the random forest classifier, outperformed the three latest standard baselines. Like a few prior studies, our research uses a single observation window to generate the results; however, some studies in the literature address rumour behaviour that changes over time. Using a single observation window is therefore one of the limitations of our research, and the findings cannot be generalized to all cases.

From a theoretical perspective, this research reduces the training time and complexity of the rumour detection model compared to prior models. In addition, our research uses features that enhance the accuracy of the model and are stable at the tweet level. These objectives are achieved using an influential set of features and a robust ML model. Network features for rumour identification, by contrast, are comparatively difficult to extract and are dynamic in nature: they change over time, and extracting a user network graph is complex. Our methodology thus provides a more efficient solution in less time.

Practically, this research is most applicable to social media platforms where everyone can share opinions freely, which allows rumours to spread. Such platforms should have a system for detecting fake/rumour content at the post or tweet level. News agencies, which often gather information from popular social media platforms, also need such a system to detect rumour/fake information at an early stage. This rumour/fake news often not only targets news agencies but also causes losses at the national and worldwide levels. This research delivers a solid mechanism to detect such rumours efficiently at the tweet/post level.

7 Conclusions

This case study developed a framework for rumour detection that works at the tweet level for specific health, Ottawa shooting, and Germanwings crash events. The framework differs from other approaches in the literature in that it does not incorporate topic information as a feature and thus avoids any prior domain-related assumptions. Two types of contextual and three types of textual characteristics are proposed, and their impact on the detection of rumours is investigated both as standalone models and as a hybrid model. The performance of four classifiers is compared using 20-fold cross-validation on four real-life datasets. Our model achieved 97% accuracy on dataset 1, 85% on dataset 2, 85% on dataset 3, and 80% on dataset 4, far better than the three latest state-of-the-art baselines. The BERT model performed best among the applied contextual features, and linguistic features performed best among the applied textual features. Moreover, the best textual features are selected using the two-step feature selection method. Overall, the BERT model presented the best standalone performance. The findings indicate that ‘URLs, Trust emotion, Verbs, Adjectives, and Prepositions’ are the five best textual features for rumour detection.

In the future, several extensions can be made. First, these experiments are restricted to four specific datasets; the framework could be applied to datasets from other domains. Second, the proposed system could be applied in other areas, such as fraud detection and security. Third, new social and semantic characteristics could be incorporated to improve the detection model's accuracy. Fourth, evolutionary algorithms or ensemble models could be applied to build a more robust detection model.