1 Introduction

Disease prediction is one of the most important aspects of a health emergency (Hirose and Wang 2012). During the COVID-19 pandemic, it is difficult to distinguish disease-related content from non-disease-related content across the available sources. Respiratory sounds, chest X-ray images, and social media content are the primary sources analysed for the current COVID-19 outbreak. In this paper, social media content is used to analyse posts about the disease. People use Twitter to share their feelings through tweets related to coronavirus disease and related issues. Sample informative and uninformative tweets about the COVID-19 pandemic are shown in Table 1. Most past research has focused on tweets describing low-grade sickness; this article addresses the detection of tweets related to the COVID-19 health situation.

Table 1 Informative and uninformative Covid-19 tweets

All online social networking sites have been trending with corona-related topics during the current COVID-19 pandemic. Many people continue to post and exchange COVID-19 messages on these websites. Twitter is a popular microblogging and social networking service in which users post and interact with messages known as "tweets." Tweets arrive in massive volumes and at high speed, span many categories of information, and include both disease-related and non-disease-related content (Goodwin et al. 2009). This information includes symptoms of illness such as a cold, fever, headache, runny nose, and body pains (Quincey et al. 2016). Resources from social and government organisations or departments, including social media, should be leveraged to raise awareness and to deliver medical kits and treatment for various ailments. Classified tweet information provides a unique resource that helps suffering and suspected users learn the status of the pandemic (Loey et al. 2021). In this setting, tweets related to various diseases must be categorised so that the authorities can provide medical facilities and keep people from becoming ill.

Diagnosing such situations requires a standard automated system that can pick up critical tweets before an individual reaches a severe respiratory stage. Disease prediction is a critical component of health emergencies (Rudra et al. 2017). Automatic systems (Chew and Eysenbach 2010) are required in pandemic situations to detect informative tweets among Twitter posts. It is difficult to distinguish between INFORMATIVE and UNINFORMATIVE (Jagadeesh and Alphonse 2020) tweets from sources involved in the COVID-19 outbreak. The majority of previous studies focused on identifying disease-related tweets using low-level methods. In this paper, we address the problem of detecting disease-related tweets during a health emergency. To that end, we propose the IPSH method (Informative words, Parts of speech, Statistical features, and High-frequency terms as features). Existing methods classify tweets using two types of features: informative words and statistical features (Chen et al. 2016). To improve classification accuracy, the proposed IPSH method uses richer and more efficient features. Random Forest and AdaBoost classifiers are used as the base-level and second-level classifiers, respectively. The proposed method is compared with two baseline methods, SVM and BOW (Han et al. 2013).

The key contributions of this paper are as follows:

  1. Identify informative COVID-19 tweets, covering symptoms, prevention, precautions, and related topics.

  2. Use four types of features to identify INFORMATIVE tweets with greater accuracy.

  3. Observe improved results on the latest COVID-19 data set.

  4. Compare results using combinations of AdaBoost, Random Forest, and other models.

The remainder of this paper is organized as follows: related work is reviewed in Sect. 2, disease tweet detection is described in Sect. 3, the proposed IPSH method is presented in Sect. 4, the model analysis and results are reported in Sects. 5 and 6, and the conclusion is given in Sect. 7.

2 Related work

During a health crisis, a massive number of short, critical messages are posted on Twitter (Agarwal et al. 2011). Victims are identified by detecting the target tweets on Twitter at the time of the crisis. The automated detection of such target tweets is a challenging task.

Analysing social media user messages can support several population-level services, including public health interventions. Reviews of tweet content reveal that users record typical and often extreme symptoms as well as medicinal uses. Denecke (2014) applied two existing tools for extracting clinical terms from natural language, originally developed for analysing clinical records (MetaMap, cTAKES), to a real-world collection of medical blog posts. Such tools can accurately extract medical concepts that are stated explicitly in medical social media texts, but the abstraction misses essential details that are paraphrased or expressed in lay language. An interactive framework that identifies blogs using advanced features describing their knowledge content has been proposed to illustrate the medical content of blog posts. The findings show that the characteristics of different health-related Web services differ substantially (Denecke and Nejdl 2009): weblogs and reaction portals mainly deal with diseases and medicines, while wikis and encyclopedias provide more information on anatomy and procedures. While patients and nurses discuss personal aspects of their lives, their blog posts tend to convey health-related information to doctors. Data and resources have been made available to the research community to enable richer text analysis of Twitter and related social media data sets (Fox 2011). A modern approach to answering TREC-CDS medical questions first identifies the answer and then selects and ranks research documents containing it (Gimpel et al. 2010). AIDR classifies messages posted during disasters into a set of user-defined information categories (Imran et al. 2014). Imran et al. (2016) release word2vec embeddings trained on 52 million crisis-related tweets and provide human-annotated, standardized lexical resources that address the lexical variation found in tweet language.

The identification and classification of health-related tweets (Madichetty and Sridevi 2019) and of tweets about the current COVID-19 pandemic have been discussed in (Malla and Alphonse 2021) and (Babu and Eswari 2020). AI-based techniques for text mining, chest X-ray image classification, and respiratory-sound analysis are discussed in (Aggarwal and Zhai 2012; Wiysobunri et al. 2020; Kumar and Alphonse 2021). Text classification has been performed with machine learning models such as Naive Bayes and k-NN classifiers using top-frequency word features and low-level lexical features.

3 Disease tweets detection

The working methodology for training and testing the model is shown in Figs. 1 and 2. The different phases of INFORMATIVE tweet detection are (Bethard et al. 2016) tweet collection, pre-processing, feature extraction, training, and testing. The following sections explain the flow of the IPSH method.

Fig. 1 Training Phase of the Proposed IPSH Method

Fig. 2 Testing Phase of the Proposed IPSH Method

3.1 Collection of health-related tweets

INFORMATIVE Tweets are collected from the COVID-19 data set, as shown in Table 2.

Table 2 Detecting Informative Features in the Disease-Related Tweets

3.1.1 Informative tweets

Tweets describing symptoms such as a runny nose, cough, sore throat, headache, or a fever lasting a couple of days fall into this category. Messages reporting public death information also belong here.

3.1.2 Uninformative tweets

Tweets that do not deal with symptoms of the disease are treated as UNINFORMATIVE. Such tweets may mention the disease without reporting symptoms, or may concern entirely different topics.

3.2 Tweet data pre-processing

Pre-processing is a very important phase of tweet analysis (Krouska et al. 2016). The following steps are involved in the pre-processing of tweets; a minimal Python sketch of these steps is given after the list.

  1. Normalization: convert all upper-case letters to lower-case letters (Kouloumpis et al. 2011).

  2. Tokenization: split tweets into small parts (tokens).

  3. Stop-word removal: discard words that carry little information, such as an, are, a, of, from, etc.

  4. Noise removal: @ user mentions, URLs, # hashtags, tokens of three or fewer characters, and numerical values are removed from the tweet data.
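The sketch below illustrates these steps; the use of NLTK for tokenization and stop-word removal is an assumption, since the paper does not name specific libraries for this phase.

```python
# A minimal pre-processing sketch (assumption: NLTK is used).
# Requires: nltk.download("punkt"); nltk.download("stopwords")
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess_tweet(tweet: str) -> list:
    """Normalize, tokenize, and clean a single tweet."""
    text = tweet.lower()                                  # 1. normalization
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                  # remove user mentions and hashtags
    text = re.sub(r"\d+", " ", text)                      # remove numerical values
    tokens = word_tokenize(text)                          # 2. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop-word removal
    return [t for t in tokens if len(t) > 3 and t.isalpha()]  # 4. drop short/non-word tokens

# Example (hypothetical tweet):
# preprocess_tweet("@WHO 1200 new #COVID19 cases today, see https://t.co/xyz")
# -> ['cases', 'today']
```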

3.3 Feature extraction from tweets

The model mainly depends on four features, namely,

  1. Informative words

  2. POS tagging

  3. Statistical features

  4. High-frequency words

3.3.1 Informative words

The TF-IDF technique (Sreenivasulu and Sridevi 2017) is used to extract informative words for better results. It weights each term by its term frequency (TF) and inverse document frequency (IDF).
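The sketch below shows how such informative-word features can be obtained with scikit-learn's TfidfVectorizer (the paper states that Scikit is used for the implementation); the parameter values and sample tweets are illustrative assumptions.

```python
# A minimal sketch of TF-IDF-based informative-word features.
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "fever and dry cough for three days, tested positive for covid",
    "enjoying the weekend with family and friends",
]

tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
X_informative = tfidf.fit_transform(tweets)   # sparse matrix: tweets x vocabulary

# Terms with the highest TF-IDF weight in a tweet act as its informative words.
print(tfidf.get_feature_names_out())
```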

3.3.2 POS tagging

Parse trees help in creating NERs (most named entities are nouns) and in obtaining relations among words, and these parse trees are built from part-of-speech (POS) tags. POS tags also support reducing a word to its root form. POS tagging labels each term in a corpus with its corresponding part-of-speech tag based on its meaning and context. This task is not straightforward because a given word can take a different part of speech depending on the context in which it is used, so POS tagging is a significant step.
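The following sketch shows one way to produce POS tag features; the choice of NLTK's default tagger is an assumption, as the paper does not specify a tagger.

```python
# A minimal POS-tagging sketch using NLTK.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
from nltk import word_tokenize, pos_tag

tweet = "new covid cases reported in the city hospital today"
tags = pos_tag(word_tokenize(tweet))
# e.g. [('new', 'JJ'), ('covid', 'NN'), ('cases', 'NNS'), ('reported', 'VBD'), ...]

# The tag sequence (or counts of nouns, verbs, etc.) can be turned into a
# feature vector, for example by joining the tags and applying a count vectorizer.
pos_string = " ".join(tag for _, tag in tags)
```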

3.3.3 Statistical features

The statistical feature extraction process includes all the data pre-processing steps except the third one (stop-word removal). Statistical features (as shown in Table 2), such as the frequencies of numbers, hashtags (#), user mentions (@), URLs, and wh-words, are extracted in this stage. Algorithm 1 describes the procedure involved in the statistical feature extraction; a Python sketch of this extraction is given after Table 3. Some example tweets are tabulated in Tables 2 and 3, respectively.

Table 3 Tweets related to statistical features
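The sketch below is an illustrative reconstruction of the described statistical-feature extraction, not a reproduction of Algorithm 1; the feature order and the wh-word list are assumptions.

```python
# Counts of numbers, hashtags, user mentions, URLs, and wh-words per tweet.
import re

WH_WORDS = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}

def extract_statistical_features(tweet: str) -> list:
    """Return [num_numbers, num_hashtags, num_mentions, num_urls, num_wh_words]."""
    text = tweet.lower()
    n_numbers = len(re.findall(r"\d+", text))
    n_hashtags = len(re.findall(r"#\w+", text))
    n_mentions = len(re.findall(r"@\w+", text))
    n_urls = len(re.findall(r"https?://\S+|www\.\S+", text))
    n_wh = sum(1 for w in re.findall(r"[a-z]+", text) if w in WH_WORDS)
    return [n_numbers, n_hashtags, n_mentions, n_urls, n_wh]

# Example (hypothetical tweet):
# extract_statistical_features("@CDC how many died? 3200 deaths #covid https://t.co/x")
# -> [1, 1, 1, 1, 1]
```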

3.3.4 High-frequency words

High-frequency words (Bharti et al. 2015) are words that repeat frequently in the corpus. To analyse word frequencies, the tweets should satisfy the following conditions; a sketch of the extraction is given after the list.

  1. They are free of URLs.

  2. Words are in a single case (generally lower case).

  3. They contain only valid words (words that carry no useful information are removed).
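A possible implementation of this step is sketched below; the limit of 50 words follows the CountVectorizer setting with max features set to 50 mentioned in Sect. 4, while the other parameter choices and sample tweets are assumptions.

```python
# A minimal sketch of high-frequency word features.
from sklearn.feature_extraction.text import CountVectorizer

cleaned_tweets = [
    "fever cough sore throat reported",
    "fever cases rising hospital reports fever",
]

hf_vectorizer = CountVectorizer(lowercase=True, stop_words="english", max_features=50)
X_high_freq = hf_vectorizer.fit_transform(cleaned_tweets)

# The retained vocabulary is the set of the most frequent words in the corpus.
print(hf_vectorizer.get_feature_names_out())
```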

4 Proposed IPSH method

We propose the IPSH method for detecting tweets associated with the current pandemic outbreak (COVID-19). The method aims to improve the quality of training through two levels of classifiers (base-level and second-level). First, the base-level classifiers take as input the vectors derived from the four feature types (informative words, POS tagging, statistical features, and high-frequency words) and pass their outputs to the second-level classifier. The second-level classifier corrects the improper training of the base-level classifiers. Performance depends mainly on the diversity of the model, and the IPSH model attains high diversity by feeding the four features into separate base-level classifiers, as listed below; a sketch of this two-level arrangement follows the list.

  1. Informative word features (Wang et al. 2011) are converted to a vector using the "CountVectorizer" function together with handcrafted features, and the vector is fed to one of the ensemble classifiers at the base level.

  2. POS features (Xia et al. 2011) are converted to a vector using the "POS" function and fed to one of the ensemble classifiers at the base level.

  3. Statistical features (Guo et al. 2010) are converted to a vector using the "Extract_features" function and fed to one of the ensemble classifiers at the base level.

  4. High-frequency words (Jin et al. 2009) are converted to a vector using the "CountVectorizer" function with max features set to 50 and fed to one of the ensemble classifiers at the base level.
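The sketch below illustrates this two-level arrangement: one base-level classifier per feature view, whose predicted probabilities are stacked and passed to a second-level classifier. The specific pairing shown (Random Forest at the base level, AdaBoost at the second level) is only one of the combinations explored in Table 9, and training the second level on in-sample probabilities rather than out-of-fold predictions is a simplification.

```python
# A minimal two-level (base + second level) stacking sketch.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

def fit_two_level(feature_views, y):
    """feature_views: list of 2-D arrays, one per feature type (same rows)."""
    base_models, meta_inputs = [], []
    for X in feature_views:
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y)                                     # base-level classifier
        base_models.append(clf)
        meta_inputs.append(clf.predict_proba(X)[:, 1])    # probability of INFORMATIVE
    meta_X = np.column_stack(meta_inputs)
    second_level = AdaBoostClassifier(random_state=0)
    second_level.fit(meta_X, y)                           # second-level classifier
    return base_models, second_level

def predict_two_level(base_models, second_level, feature_views):
    meta_X = np.column_stack(
        [m.predict_proba(X)[:, 1] for m, X in zip(base_models, feature_views)]
    )
    return second_level.predict(meta_X)
```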

If a tweet contains no informative-word features, those features alone cannot help in detecting the target tweet. The remaining features are therefore stacked with the help of the base-level classifiers so that the target tweet can still be flagged; even in the absence of any one of the four features, a tweet can be identified from the remaining ones. Choosing an accurate classifier for the selected features is essential for the ensemble model. Therefore, different state-of-the-art models such as AdaBoost, Random Forest, Bagging, and SVM (Kim et al. 2003) are used as base-level and second-level classifiers in the IPSH model. The combinations of classifiers used in the experiments are shown in Table 9.

Algorithm 1 Extraction of statistical features from tweets

4.1 Dataset

The COVID-19 data set is used for classification. The following subsections explain the details of the data set, as shown in Table 4.

Table 4 COVID-19 data set details

4.2 COVID-19 labeled data set

The COVID-19 tweets are taken from the informative COVID-19 English tweets of Nguyen et al. (2020). The data set is labelled as INFORMATIVE and UNINFORMATIVE and is split into training, validation, and test subsets in CSV format.
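A minimal loading sketch is given below; the file names and the column names "Text" and "Label" are hypothetical, since the released files may use different identifiers.

```python
# A minimal data-loading sketch (hypothetical file and column names).
import pandas as pd

train_df = pd.read_csv("covid19_train.csv")
valid_df = pd.read_csv("covid19_valid.csv")
test_df = pd.read_csv("covid19_test.csv")

X_train = train_df["Text"]
y_train = (train_df["Label"] == "INFORMATIVE").astype(int)  # 1 = INFORMATIVE
```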

4.3 Training phase

As shown in Fig. 1, the four selected features from the data set are trained individually, each with one of the ensemble classifiers at the base level, and their outputs are forwarded to a second-level classifier to obtain a better score.

5 IPSH model analysis

This section presents the experimental results of the proposed model with different combinations of base-level and second-level classifiers and compares the IPSH model's performance (F1-score, recall, precision, AUC_ROC score, and accuracy) with existing baseline models. The IPSH model is implemented in Python with the machine learning package "Scikit" (Pedregosa et al. 2011). The experiments are conducted in two different phases (in-domain and cross-domain), as shown in the following sections. The detailed results of the IPSH model are analysed and discussed in the next section.

5.1 Informative words as features

A collection of informative words from the tweets is used as features. These features are used as input parameters to train and test the various machine learning models in order to select the best baseline model, as shown in Table 5 and Fig. 3.

Table 5 Various classifiers performance (for informative words as features) on COVID-19 dataset
Fig. 3 Machine Learning Classifiers Performance for Informative Words as Features

5.2 POS tags

A collection of part-of-speech tags from the tweets is used as features. These features are used as input parameters to train and test the various machine learning models in order to select the best baseline model, as shown in Table 6 and Fig. 4.

Table 6 Various classifiers performance for (POS tags as features) on COVID-19 dataset
Fig. 4 Machine Learning Classifiers Performance for POS Words as Features

5.3 Statistical features

A collection of statistical features from the tweets is used. These features are used as input parameters to train and evaluate the various machine learning models in order to identify the best baseline model, as shown in Table 7 and Fig. 5.

Table 7 Various classifiers performance (for statistical words as features) on COVID-19 dataset
Fig. 5 Machine Learning Classifiers Performance for Statistical Words as Features

5.4 High-frequency words as features

A collection of high-frequency words from the tweets is used as features. These features are used as input parameters to train and test the various machine learning models in order to select the best baseline model, as shown in Table 8 and Figs. 6 and 7.

Table 8 Various classifiers performance (for high-frequency words as features) on COVID-19 dataset
Fig. 6 Machine Learning Classifiers Performance for High-Frequency Words as Features

Fig. 7 Results Comparison of IPSH Model with Different Machine Learning Models

As shown in Tables 5, 6, 7, and 8, the proposed model is trained with these features (informative words, POS tags, statistical features, and high-frequency words) as inputs using 10 classifiers, and among all of them the AdaBoost classifier performed best in predicting tweet content.

Table 9 Selection of base and second-level classifiers by their combined performance

5.5 Performance measures

The IPSH model used the following parameters to identify tweet data in both scenarios (in-domain and cross-domain).

Precision

Precision is described in Eq. 1.

$$ Precision = \frac{TP}{{TP + FP}} $$
(1)

where TP (True Positive) is the number of correctly detected disease tweets and FP (False Positive) is the number of wrongly detected disease tweets.

Recall

Recall is calculated using Eq. 2.

$$ Recall = \frac{TP}{{TP + FN}} $$
(2)

F1-Score

The F1-score is computed using Eq. 3.

$$ F1 - Score = \frac{2*Precision*Recall}{{Precision + Recall}} $$
(3)

Accuracy

Accuracy is computed by using Eq. 4.

$$ Accuracy = \frac{TP + TN}{{TP + TN + FP + FN}} $$
(4)

AUC_ROC score

The AUC_ROC score is defined in Eq. 5.

$$ AUC\_ROC Score = \frac{{T_{0} - n_{0} *\left( {n_{0} + 1} \right)*0.5}}{{n_{0} *n_{1} }} $$
(5)

where T0 is the sum of the ranks of the disease tweets, n0 is the number of disease tweets, and n1 is the number of non-disease tweets.
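These measures can be computed with scikit-learn as sketched below; the label arrays are toy examples, not values from the experiments.

```python
# A minimal sketch of the reported performance measures.
# y_true/y_pred are gold and predicted labels (1 = INFORMATIVE);
# y_score is the predicted probability used for the AUC_ROC score.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("AUC_ROC:  ", roc_auc_score(y_true, y_score))
```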

The experiments are carried out with different base-level classifiers such as Bagging, SVM, AdaBoost, and Random Forest; their scores are reported in Tables 5 and 6. Random Forest dominates the other variants such as AdaBoost, SVM, and Bagging at the base level. The performance depends mainly on the base-level classifiers and remains largely unchanged when the second-level classifier changes. Random Forest achieves a substantial improvement compared with Bagging, SVM, and AdaBoost; the model's efficiency is determined mainly by the base-level classifiers and their properties, and Random Forest is the best-performing ensemble algorithm for the selected features. The IPSH method also surpasses both the SVM and BOW baselines. Informative words relevant to the disease are extracted with the TF-IDF technique to classify disease tweets, further tweet features are extracted based on POS tagging, and disease-related tweets also contain statistical features (such as counts of affected or dead people) and web links (URLs). The large vocabulary causes a sparsity problem for the baseline-1 model (SVM with the BOW model), and baseline-2 (BOW with informative and statistical features) fails to classify the targeted tweets. In addition, cross-domain tests on the data set are carried out to verify the reliability of the IPSH method.

6 Results

In this experiment, the COVID-19 data set is used for training and testing, and the proposed IPSH model is compared against state-of-the-art machine learning models such as Bag of Words, AdaBoost, SVM, and Random Forest in terms of accuracy, F1-score, precision, and recall, as shown in Table 10. Figure 7 shows that the proposed ensemble machine learning model is exceptional compared with the other machine learning models. The proposed IPSH model achieves 83.06% accuracy, an 83.73% F1-score, 83.84% precision, and 83.67% recall compared with the other learning techniques, as shown in Table 10.

Table 10 Results comparison of the proposed model with existing methods on COVID-19 dataset using performance metrics

Referring to Table 10, it is clear that our proposed model achieves better accuracy and F1-score than the existing models mentioned. This clearly indicates that the model succeeds in differentiating informative tweets related to the COVID-19 disease outbreak from the combined pool of tweets.

7 Conclusion

Users often discuss disease information on Twitter, and a massive volume of data is shared before and after a pandemic situation. The proposed IPSH method recognizes INFORMATIVE tweets effectively by using AdaBoost at the base level and Random Forest as the second-level classifier. The experimental results show that the recommended IPSH method succeeds as an ensemble technique. This paper emphasizes the importance of combining informative words, POS features, statistical properties, and high-frequency words for detecting INFORMATIVE tweets during the COVID-19 pandemic. The proposed research can be extended to identify other disease outbreaks through microblog sites and to develop methods for identifying different disease symptoms. Further, we plan to use deep learning models to predict informative tweets and then determine whether organizations or the affected users benefit from the discovered tweets.