In this section, we present the two-step sentiment classification and quantification method. As discussed earlier, the goal here differs from that of classic sentiment classification of tweets. Many News tweets are re-tweeted on Twitter. Classifying tweets into Personal and News tweets in the first step allows the sentiment analysis in the next step (Negative vs. Non-Negative classification) to consider only Personal tweets. Since we are interested in studying the correlation between the timeline trends of sentiments and of News, the detection of News tweets needs to be seamlessly integrated. Thus, our approach of classifying a tweet into one of three classes—Personal Negative, Personal Non-Negative (including neutral and positive), and News—allows not only the classification but also correlation studies. An overview of our method is shown in Fig. 1.
Only English tweets, which were automatically detected during the data collection phase (see Table 5 for the datasets), are considered. As shown in Fig. 1, the sentiment classification problem is approached in two steps. First, for all English tweets, we separated Personal from News (Non-Personal) tweets. Second, the Personal tweets extracted by the most successful Personal vs. News Machine Learning classifier were used as input to another Machine Learning classifier that identifies Negative tweets. After News tweets, Personal Negative tweets, and Personal Non-Negative tweets were extracted, they were used to compute the correlation between the sentiment trend and the News trend. The details of each “box” in Fig. 1 are introduced in the rest of this section.
Pre-processing of features
For disease surveillance on Twitter, the classical division of sentiments into positive and negative is inappropriate, because the topic of disease itself is generally perceived as negative. Positive emotions could arise from relief about an epidemic subsiding, but we ignore this possibility. Thus, a two-point “Likert scale” with the points positive and negative would not cover this spectrum well. Rather, we started with an asymmetric four-point Likert scale of “strongly negative”, “negative”, “neutral”, and “positive”. We then combined “strongly negative” and “negative” into one category, and “neutral” and “positive” into another, named “Negative” and “Non-Negative”, respectively. Thus, the problem reduces to a two-class classification problem, and a Personal tweet can be either a Negative tweet or a Non-Negative tweet.
Some features need to be removed or replaced. To avoid duplication, we first deleted tweets starting with “RT”, which indicates re-tweets without comments. For the remaining tweets, special characters were removed. URLs were replaced by the string “url”, and Twitter’s special character “@” was replaced by “tag”. As for punctuation, “!” and “?” were substituted by “excl” and “ques”, respectively, and any of “.,:;−|+=/” were replaced by “symb”. Twitter messages were transformed into vectors of words, such that every word was used as one feature, and only unigrams were utilized for simplicity.
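The substitution rules above can be sketched as follows. This is a minimal illustration; the exact ordering of the substitutions is our assumption, since the text does not specify it:

```python
import re

def preprocess(tweet: str):
    """Return the unigram feature list for a tweet, or None for
    re-tweets, which are dropped to avoid duplication."""
    if tweet.startswith("RT"):
        return None                                   # drop re-tweets
    tweet = re.sub(r"http\S+", "url", tweet)          # URLs -> "url"
    tweet = tweet.replace("@", " tag ")               # "@" -> "tag"
    tweet = tweet.replace("!", " excl ")              # "!" -> "excl"
    tweet = tweet.replace("?", " ques ")              # "?" -> "ques"
    tweet = re.sub(r"[.,:;\-|+=/]", " symb ", tweet)  # listed punctuation -> "symb"
    tweet = re.sub(r"[^\w\s]", " ", tweet)            # strip remaining special characters
    return tweet.lower().split()                      # unigram features
```

For example, the tweet “Flu outbreak! http://t.co/abc” would map to the feature list [flu, outbreak, excl, url].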
Tweet sentiment classification
In the following, we present the Personal vs. News classifiers and the Negative vs. Non-Negative classifier.
Clue-based tweet labeling
The clue-based classifier parses each tweet into a set of tokens and matches them against a corpus of Personal clues. Since no corpus of clues for Personal vs. News classification is available, we used the subjectivity corpus MPQA (Riloff and Wiebe 2003) instead, on the assumption that if the numbers of strongly subjective and weakly subjective clues in a tweet exceed certain thresholds (e.g., two strongly subjective clues and one weakly subjective clue), the tweet can be regarded as a Personal tweet; otherwise it is a News tweet. The MPQA corpus contains a total of 8221 words, including 3250 adjectives, 329 adverbs, 1146 any-position words, 2167 nouns, and 1322 verbs. As for sentiment polarity, among all 8221 words, 4912 are negative, 570 are neutral, 2718 are positive, and 21 can be both negative and positive. In terms of strength of subjectivity, 5569 are strongly subjective words, and the other 2652 are weakly subjective words.
Twitter users tend to express their personal opinions in a more casual way than other documents, such as News, online reviews, and article comments. We therefore expect that the presence of any profanity indicates that a tweet is a Personal tweet. We added a set of 247 selected profanity words (Ji 2014a) to the corpus described in the previous paragraph. US law, enforced by the Federal Communications Commission, prohibits the use of a short list of profanity words in TV and radio broadcasts (FederalCommunicationsCommittee 2014). Thus, any word from this list in a tweet clearly indicates that the tweet is not a News item.
We counted the number of strongly subjective terms and the number of weakly subjective terms, checked for the presence of profanity words in each tweet and experimented with different thresholds. A tweet is labeled as Personal if its count of subjective words surpasses the chosen threshold; otherwise it is labeled as a News tweet.
In clue-based classification, if the threshold is set too low, precision may suffer; if it is set too high, recall decreases. The advantage of a clue-based classifier is that it can automatically extract Personal tweets with higher precision when the threshold is set to a higher value.
Because only the tweets fulfilling the threshold criteria are selected for training the “Personal vs. News” classifier, we would like to make sure that the selected tweets are indeed Personal with high precision. Thus, the threshold that leads to the highest precision in terms of selecting Personal tweets is the best threshold for this purpose.
The performance of the clue-based approach with different thresholds on human-annotated test datasets is shown in Table 1. More detailed information about the human-annotated dataset is given in Sect. 4.3.2.2. Among all thresholds, s3w3 (3 strong, 3 weak) achieves the highest precision on all three human-annotated datasets. In other words, when the threshold requires at least 3 strongly subjective terms and at least 3 weakly subjective terms, the clue-based classifier classifies Personal tweets with the highest precision of 100 % but with low recall (15 % for epidemic, 7 % for mental health, 1 % for clinical science).
Table 1 Results of Personal tweets classification with different thresholds (Precision/Recall)
Machine learning classifiers for personal tweet classification
To overcome the drawback of low recall in the clue-based approach, we combined the high precision of clue-based classification with Machine Learning-based classification for the Personal vs. News classification, as shown in Fig. 2. Suppose that the collection of Raw Tweets of a unique type (e.g., tuberculosis) is T. After the pre-processing step, which filters out non-English tweets, re-tweets, and near-duplicate tweets, the resulting tweet dataset is \(T^{\prime} = \{tw_{1}, tw_{2}, tw_{3}, \ldots, tw_{n}\}\), a subset of T, which is used as the input for the clue-based method that automatically labels datasets for training a Personal vs. News classifier, as shown in Fig. 2.
In the clue-based step for labeling training datasets, each \(tw_{i}\) of \(T^{\prime}\) is compared with the MPQA dictionary (Riloff and Wiebe 2003). If \(tw_{i}\) contains at least three strongly subjective clues and at least three weakly subjective clues, \(tw_{i}\) is labeled as a Personal tweet. Similarly, \(tw_{i}\) is compared with a News stopword list (Ji 2014b) and a profanity list (Ji 2014a). The News stopword list contains 20+ names of highly influential public health News sources, and the profanity list has 340 commonly used profanity words. If \(tw_{i}\) contains at least one word from the News stopword list and does not contain any profanity word, \(tw_{i}\) is labeled as a News tweet. For example, the tweet “Atlanta confronts tuberculosis outbreak in homeless shelters: By David Beasley ATLANTA (Reuters)—Th… http://yhoo.it/1r88Lnc #Atlanta” is labeled as a News tweet, because it contains at least one word from the News stopword list and does not contain any profanity word. We mark the set of labeled Personal tweets as \(T_{p}^{\prime}\) and the set of labeled News tweets as \(T_{n}^{\prime}\); note that \((T_{p}^{\prime} \cup T_{n}^{\prime}) \subseteq T^{\prime}\).
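As a sketch, the two labeling rules above (the s3w3 subjectivity threshold and the News stopword/profanity rule) could be implemented as follows; the lexicon arguments are placeholders for the MPQA, News stopword, and profanity lists, not the actual lists:

```python
def build_training_sets(tweets, strong, weak, news_sources, profanity):
    """Split tokenized tweets into clue-labeled Personal (T_p') and
    News (T_n') training sets; tweets matching neither rule stay
    unlabeled and are excluded from training."""
    t_p, t_n = [], []
    for toks in tweets:
        n_strong = sum(t in strong for t in toks)
        n_weak = sum(t in weak for t in toks)
        has_profanity = any(t in profanity for t in toks)
        if n_strong >= 3 and n_weak >= 3:
            t_p.append(toks)            # labeled Personal
        elif any(t in news_sources for t in toks) and not has_profanity:
            t_n.append(toks)            # labeled News
    return t_p, t_n
```

Note that tweets satisfying neither rule are simply excluded, which is what keeps the automatically generated training data high-precision at the cost of recall.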
The next step is the Machine Learning-based method. The two classes of data \(T_{p}^{\prime}\) and \(T_{n}^{\prime}\) from the clue-based labeling are used as training datasets to train the Machine Learning models. We used three popular models: Naïve Bayes, Multinomial Naïve Bayes, and polynomial-kernel Support Vector Machine. After the Personal vs. News classifier is trained, it is used to make predictions on each \(tw_{i}\) in \(T^{\prime}\), the preprocessed tweet dataset. The goal of Personal vs. News classification is to obtain the Label for each \(tw_{i}\) in the tweet database \(T^{\prime}\), where the Label \(O(tw_{i})\) is either Personal or NT (News Tweet). Label was introduced in Definition 5, whereby Personal can be PN or PNN.
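As an illustration, the training and prediction step could look as follows in scikit-learn (the paper used Weka’s implementations; the model mirrors one of the three listed above, and all data in the usage below are toy placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_personal_vs_news(personal_tweets, news_tweets):
    """Train a Multinomial Naive Bayes model on the clue-labeled
    sets T_p' and T_n', using unigram features only."""
    texts = personal_tweets + news_tweets
    labels = ["Personal"] * len(personal_tweets) + ["News"] * len(news_tweets)
    vec = CountVectorizer(ngram_range=(1, 1))   # unigrams, as in the paper
    clf = MultinomialNB().fit(vec.fit_transform(texts), labels)
    return vec, clf

def predict_label(vec, clf, tweet):
    """Predict Personal or News for one tweet."""
    return clf.predict(vec.transform([tweet]))[0]
```

Usage with toy data:

```python
vec, clf = train_personal_vs_news(
    ["i feel so sick today", "my head hurts so bad"],
    ["outbreak reported by reuters", "cdc issues flu warning reuters"])
predict_label(vec, clf, "i feel sick")            # -> "Personal"
```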
Negative sentiment classifier
As shown in Fig. 1, after a classifier for Personal tweets is built in step 1, the second step in the sentiment classification is to classify the set of Personal tweets \(T^{\prime\prime} = \{tw_{i} : O(tw_{i}) = \text{Personal},\; tw_{i} \in T^{\prime}\}\) into Personal Negative (PN) or Personal Non-Negative (PNN) tweets. Figure 3 shows the process of classification in this second step. In the rest of this section, Negative refers to Personal Negative and Non-Negative refers to Personal Non-Negative.
To train the classifier for Negative vs. Non-Negative classification, the ideal training dataset must be large and contain little noise. Manual annotation of a training dataset is possible, but this process usually requires several annotators to independently label each tweet and to calculate their degree of agreement, which limits the fast generation of large training datasets. Pang and Lee (2008) listed a few annotated corpora used in previous work in the field of sentiment analysis. These corpora cover topics such as customer reviews of products and restaurants. However, to the best of our knowledge, there is no disease-related annotated corpus that can be used as a training dataset to distinguish Negative tweets from Non-Negative tweets.
In order to build the training datasets for Negative versus Non-Negative classification (TR-NN), we formed a whitelist and blacklist of stopwords using predefined emoticons. An emoticon is a combination of characters that form a pictorial expression of one’s emotions. Emoticons have been used as important indicators of sentiments in previous research. We combined the emoticon lists used by Go et al. (2009), Pak and Paroubek (2010), and Agarwal et al. (2011). A partial list of emoticons is in Table 2.
Table 2 Partial list of the emoticons used
The whitelist and blacklist of stopwords for building TR-NN are described in Table 3. The whitelist is used for extracting information, while the blacklist is used for eliminating it. A tweet is extracted as a Negative tweet if and only if it contains at least one stopword (or emoticon) from the Negative whitelist and does not contain any stopword (or emoticon) from the Negative blacklist. A tweet is extracted as Non-Negative analogously, using a Non-Negative whitelist and a corresponding blacklist. For example, the tweet “They are going to take fluid from around the spinal cord to see if she has meningitis… :(” is extracted as a Negative tweet, because it contains at least one stopword from the Negative whitelist and no words from the Negative blacklist.
Table 3 Whitelist and blacklist of stop words for building TR-NN
As shown in Fig. 3, the emoticons contained in the tweets are used to generate the training dataset TR-NN. Tweets were labeled as PN or PNN based on the emoticons they contained. More specifically, if a tweet contains at least one negative emoticon or at least one word from the profanity list of 247 selected profanity words (Ji 2014a), it is labeled as PN. If a tweet contains at least one non-negative emoticon or at least one positive emoticon, it is labeled as PNN. These two categories (PN and PNN) of labeled tweets were combined into the training dataset TR-NN for Negative vs. Non-Negative classification. Table 4 shows examples of tweets in TR-NN. The set of labeled PN tweets is marked as \(T_{ne}^{\prime\prime}\) and the set of labeled PNN tweets as \(T_{nn}^{\prime\prime}\), with \((T_{ne}^{\prime\prime} \cup T_{nn}^{\prime\prime}) \subseteq T^{\prime}\). \(T_{ne}^{\prime\prime}\) and \(T_{nn}^{\prime\prime}\) are used to train the Negative vs. Non-Negative classifier, which then makes predictions on each \(tw_{i}\) in \(T^{\prime\prime}\), the set of Personal tweets. The goal of Negative vs. Non-Negative classification is to obtain the Label for each \(tw_{i}\) in the tweet database \(T^{\prime\prime}\), where the Label \(O(tw_{i})\) is either PN or PNN. (There are no News tweets at this stage.)
Table 4 Examples of Personal Negative and Personal Non-Negative tweets in training dataset TR-NN
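The emoticon/profanity labeling rule that builds TR-NN might be sketched as follows; the emoticon and profanity sets passed in are tiny illustrative stand-ins for the full lists of Table 2 and (Ji 2014a):

```python
def label_for_trnn(tweet, neg_emoticons, pos_emoticons, profanity):
    """Label a tweet PN or PNN for the TR-NN training set, or return
    None if it matches neither rule (or both) and is excluded."""
    has_neg = any(e in tweet for e in neg_emoticons) or \
              any(w in tweet.lower().split() for w in profanity)
    has_pos = any(e in tweet for e in pos_emoticons)
    if has_neg and not has_pos:
        return "PN"
    if has_pos and not has_neg:
        return "PNN"
    return None    # ambiguous or neutral: not included in TR-NN
```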
After step 1 (Personal tweets classification) and step 2 (sentiment classification), for a unique type of tweets (e.g., tuberculosis), the Raw Tweet dataset T is transformed into a series of Tweet Label datasets \(TS_{i}\). Recall from the definition section that \(TS_{i}\) is the Tweet Label dataset for time i, and \(TS_{i} = \{ts_{1}, ts_{2}, ts_{3}, \ldots, ts_{n}\}\), where \(O(ts_{i})\) is either PN, PNN, or NT.
Experimental results of the classification approach
Data collection and description
We implemented a data collector using the Twitter API version 1.1 and the Twitter4J library (Twitter4J 2014) to collect real-time tweets containing certain specified health-related keywords (e.g., listeria), along with associated user profile information for subsequent analysis. The overall data collection process can be described as an “ETL” (Extract-Transform-Load) approach, as widely used in Data Warehousing. The data were collected in JSON format from the Twitter Streaming API (the Extract step). Then the raw JSON data were parsed into relational data, such as tweets, tweet_mentions, tweet_place, tweet_tags, tweet_urls, and users (the Transform step). Finally, the relational data were stored in our MySQL relational database (the Load step).
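The Transform step could be sketched as below. The field names follow Twitter’s v1.1 Streaming API payload, and only a few representative columns of the relational tables are shown:

```python
import json

def transform(raw_json: str) -> dict:
    """Flatten one raw Streaming-API JSON record into rows for
    (a subset of) the relational tables named in the text."""
    t = json.loads(raw_json)
    return {
        "tweets": {"id": t["id"], "text": t["text"], "lang": t.get("lang")},
        "users": {"id": t["user"]["id"],
                  "screen_name": t["user"]["screen_name"]},
        "tweet_tags": [h["text"]
                       for h in t.get("entities", {}).get("hashtags", [])],
        "tweet_urls": [u["expanded_url"]
                       for u in t.get("entities", {}).get("urls", [])],
    }
```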
The current prototype system has collected a total of 15+ million tweets in 12 datasets. These datasets cover six infectious diseases: listeria, influenza, swine flu, measles, meningitis, and tuberculosis; four mental health problems: major depression, generalized anxiety disorder, obsessive–compulsive disorder, and bipolar disorder; one crisis: an air disaster; and one clinical science issue: a melanoma experimental drug. The core component uses the Twitter Streaming API for collecting epidemic-related real-time tweets. The tweets were collected from March 13, 2014 to June 29, 2014. The statistics of the collected datasets are shown in Table 5.
Table 5 The statistics of the collected dataset
For each tweet type, tweets were collected according to the keywords of the dataset, which are listed in the “Appendix”. The language of a tweet is automatically identified by the Twitter4J library during the data collection phase: for example, if the value of the tweet attribute “lang” is “en”, the tweet is an English tweet; if it is “fr”, the tweet is a French tweet. Only English tweets are used in our experiments. As shown in Table 5, some datasets, for example influenza, swine flu, and tuberculosis, have a larger portion of non-English tweets than the others.
The pre-processing step filters out re-tweets and near-duplicate tweets. Two tweets are considered near-duplicates if they contain the same tokens (words) in the same order, even if they differ in capitalization, URLs, or special characters such as @ and #. For example, the two tweets (1) “SEVEN TONS OF #HUMMUS RECALLED OVER LISTERIA FEARS… http://t.co/IUU5SiJgjG” and (2) “seven tons of hummus recalled over @listeria fears—http://t.co/dBgAk1heo4.” are near-duplicates; thus only one tweet (randomly chosen) is kept in the database.
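A sketch of this near-duplicate test, under the assumption that normalization means lower-casing and dropping URLs and special characters before comparing the ordered token sequences:

```python
import re

def dedup_key(tweet: str) -> tuple:
    """Normalize a tweet into its ordered token sequence, ignoring
    case, URLs, and special characters such as @ and #."""
    text = re.sub(r"http\S+", "", tweet.lower())   # drop URLs
    return tuple(re.findall(r"[a-z0-9']+", text))  # ordered word tokens

def filter_near_duplicates(tweets):
    """Keep only the first tweet of each near-duplicate group."""
    seen, kept = set(), []
    for tw in tweets:
        k = dedup_key(tw)
        if k not in seen:
            seen.add(k)
            kept.append(tw)
    return kept
```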
Evaluation
To the best of our knowledge, there are no evaluation datasets for the sentiment classification of health-related tweets. To compare the three previously discussed classifiers—Naïve Bayes, Two-Step Multinomial Naïve Bayes, and Two-Step Polynomial-Kernel Support Vector Machine—and to evaluate the usability of our approach, we created one group of test datasets using the clue-based method and a second group using human annotation. Weka’s implementations (Hall et al. 2009) of Naïve Bayes, Multinomial Naïve Bayes, and polynomial-kernel SVM with default parameter configurations were used for the experiments.
Clue-based annotation for test dataset
The clue-based annotation of the test dataset was done as follows. We first automatically extracted the Personal tweets and News tweets by the clue-based approach described in Sect. 4.2.1 and labeled them as Personal or News. Then we randomly divided the labeled dataset into three partitions and used two partitions for training the three different classifiers. Finally, we compared the different classifiers’ accuracies on the third partition of labeled data. For example, for Dataset 3 in Table 5, in the classification step, 2899 Personal tweets and 508 News tweets were automatically extracted using the MPQA corpus (Riloff and Wiebe 2003). We randomly divided these tweets into training and test datasets, resulting in 1933 Personal and 339 News tweets as training dataset, and the remaining 966 Personal tweets and 169 News tweets as test dataset. A similar emoticon-based approach was used to automatically generate a training dataset and a test dataset for Negative vs. Non-Negative classification.
Human annotation for test dataset
Because the clue-based annotation method is automatic, it is relatively easy to generate large samples. However, the drawback is that the training and testing datasets are extracted by the same clue-based annotation rule, thus the results might carry a certain bias. In order to more fairly evaluate the usability of our approach, we created a second test dataset by human annotation, which is described as follows.
We extracted three test data subsets by random sampling from all tweets collected in 2015 from the three domains epidemic, clinical science, and mental health. Each subset contains 200 tweets. Note that these test tweets are independent of the training tweets, which were collected in 2014. One professor and five graduate students annotated the tweets, with each tweet annotated by three people. The instructions for annotators are shown in the “Appendix”. Annotators were asked to assign a value of 1 if they considered a tweet to be Personal and a value of 0 if they considered it to be News, according to the instructions they were given. If an annotator labeled a tweet as Personal, s/he was asked to further label it as a Personal Negative or Personal Non-Negative tweet. We utilized Fleiss’ Kappa (Fleiss 1971) to measure the inter-rater agreement among the three annotators of each tweet. Table 6 presents the agreement between human annotators. For each tweet, if at least two of the three annotators agreed on a Label (Personal Negative, Personal Non-Negative, or News), we labeled the tweet with this Label. Table 7 shows the numbers of tweets with different labels. For example, the fraction 25/200 for Negative tweets in “epidemic” means that out of the 200 human-annotated epidemic tweets, 25 were labeled as Personal Negative. The totals in each dataset do not add up to 200, because in some cases each of the three annotators classified a tweet differently; tweets for which no majority existed were omitted from the analysis.
Table 6 Agreement between human annotators
Table 7 Statistics regarding human annotated dataset
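The majority-vote rule used above to derive the final labels can be sketched as:

```python
from collections import Counter

def majority_label(annotations):
    """Return the label that at least two of the three annotators
    agreed on, or None if no majority exists (such tweets are
    omitted from the analysis)."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else None
```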
Classification results
The results of the two-step classification approach are shown in this section. The performance was tested separately with the clue-based annotated test dataset and the human annotated test dataset.
Results with clue-based annotated test dataset
We compared the previously discussed classifiers: Two-Step Naïve Bayes, Two-Step Multinomial Naïve Bayes, and Two-Step Polynomial-Kernel Support Vector Machine. As previously discussed, the labeled dataset was randomly divided into three partitions and we used two partitions for training the three different classifiers. The detailed training and test dataset sizes are shown in Table 8. Note that the test datasets for each classifier in step 2 can be different. The reason is that different classifiers extract different numbers of Personal tweets in the first step, thus the test data in the second step, which is extracted from the previously extracted Personal tweets, can also be different for the three classifiers. The two-step sentiment classification accuracy on individual datasets (1–12) is shown in Table 9 and confusion matrices of the best classifiers in terms of accuracy are shown in Table 10; similarly, the classification accuracy and confusion matrices of the best classifiers for the three domains (epidemic, mental health, clinical science) are shown in Tables 11 and 12, respectively.
Table 8 Size of experimental training and test datasets for two-step classification (PN is Personal Negative and PNN is Personal Non-Negative)
Table 9 Results of S1A/S2A (S1A = step one accuracy and S2A = step two accuracy) on individual dataset (rounded to 2 decimal places)
Table 10 Confusion matrices of the best classifier on each dataset (Step 1 positive class is Personal and Negative class is News, Step 2 positive class is Personal Negative and Negative class is Personal Non-Negative)
Table 11 Results of S1A/S2A (S1A step one accuracy and S2A step two accuracy) on individual domain
Table 12 Confusion matrices of the best classifier on individual domain (Step 1 positive class is Personal and Negative class is News, Step 2 positive class is Personal Negative and Negative class is Personal Non-Negative)
On individual datasets, all three two-step methods show good performance. SVM is slightly better than the other two classifiers for most of the datasets. For the domain datasets, which combine individual datasets according to their domains, all three two-step methods also exhibit good performance. SVM again slightly outperforms the other two classifiers in all three domains.
Results with human annotated test dataset
In order to evaluate the usability of two-step classification, Personal vs. News classification and Negative vs. Non-Negative classification were also evaluated with human annotated datasets.
- Personal vs. News Classification: We compared our Personal vs. News classification method with three baseline methods: (1) a naïve algorithm that randomly picks a class; (2) the clue-based classification method described in Sect. 4.2.1 (recall that in the clue-based method, if a tweet contains more than a certain number of strongly subjective terms and a certain number of weakly subjective terms, it is regarded as a Personal tweet, otherwise as a News tweet); and (3) a URL-based method, in which a tweet containing a URL is classified as a News tweet and any other tweet as a Personal tweet. The classification accuracies of the different methods and the confusion matrices of the best classifiers are presented in Tables 13 and 14, respectively. The results show that 2S-MNB and 2S-NB outperform all three baselines in most cases. Surprisingly, 2S-SVM does not perform as well as on the clue-based annotated test dataset. It is possible that SVM overfitted the clue-based annotated dataset, since SVM is a relatively complex model that may infer too much from the training data. Overall, all methods perform better on the epidemic dataset than on the other two datasets. In addition, the ML-based approaches (2S-MNB, 2S-NB, 2S-SVM) outperform the clue-based approach in most cases. This means that although the ML-based approaches use simple clue-based rules to automatically label the training data, they also learn emotional patterns that cannot be distinguished by the MPQA corpus. Some unigrams learned by the ML-based methods are shown to be useful for the classification, as discussed later.
Table 13 Accuracy of Personal vs. News classification on human annotated datasets
Table 14 Confusion matrices of the best Personal vs. News classifier on human annotated datasets (positive class is Personal and Negative class is News)
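The URL-based baseline (3) is trivial to state in code; the substring test used to detect a URL is our assumption:

```python
def url_baseline(tweet: str) -> str:
    """URL-based baseline: a tweet containing a URL is classified
    as News, any other tweet as Personal."""
    has_url = "http://" in tweet or "https://" in tweet
    return "News" if has_url else "Personal"
```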
- Negative vs. Non-Negative Classification: The second step in the two-step classification algorithm separates Negative tweets from Non-Negative tweets. As discussed in Sect. 4.2, the training datasets are automatically labeled with emoticons and words from a profanity list, and the classifier is then trained with one of the three models, Multinomial Naïve Bayes (MNB), Naïve Bayes (NB), and Support Vector Machine (SVM). The accuracies of Negative vs. Non-Negative classification and the confusion matrices of the best classifiers on the human-annotated datasets are shown in Tables 15 and 16, respectively. 2S-MNB outperforms the other two algorithms on the epidemic dataset, and 2S-NB outperforms the other two on the mental health and clinical science datasets. All three classifiers perform better than the random-selection baseline, which yields an average accuracy of 50 %. Although the classifier is trained only with tweets containing profanity or emoticons, it still achieves an average accuracy of 70+% on the human-annotated test datasets. Overall, 2S-NB and 2S-MNB both achieved good Negative vs. Non-Negative classification performance in terms of accuracy and simplicity, followed by 2S-SVM.
Table 15 Negative vs. Non-Negative classification results on human annotated datasets
Table 16 Confusion matrices of the best Personal Negative vs. Personal Non-Negative classifier on human annotated datasets (Positive class is Personal Negative and Negative class is Personal Non-Negative)
Error analysis of sentiment classification output
We analyzed the output of sentiment classification. As discussed in Sect. 4.3.2, we manually annotated 600 tweets as Personal Negative, Personal Non-Negative, and News. We used 2S-MNB, which achieved the best accuracy in our experiments described in Sect. 4.3.3, to classify each of the 600 manually annotated tweets as Personal Negative, Personal Non-Negative, or News. Then we analyzed the tweets that were assigned different labels by 2S-MNB and by the human annotators.
For the Personal vs. News classification, we found two major types of errors.
1. The tweet is in fact a Personal tweet, but is classified as a News tweet. By manually checking the content, we found that these tweets are often users’ comments on News items (pointed to by a URL) or citations of News. 27 of the 140 errors belong to this type. One possible solution to reduce this type of error is to calculate what percentage of the tweet text appears in the web page pointed to by the URL. If this percentage is low, the tweet is probably Personal, since most of the tweet text is the user’s comment or discussion. If the percentage is near 100 %, the tweet is more likely News, since the title of a news article is often pasted into the tweet text.
2. The tweet is in fact a News item, but is classified as a Personal tweet. These misclassified tweets are News items with “personal” titles, mostly phrased as questions. 48 of the 140 errors belong to this type. One possible solution is to check the similarity between the tweet text and the title of the web page pointed to by the URL: if the two are highly similar, the tweet is more likely a News item. These two types of errors together cover 54 % (75/140) of the errors in Personal vs. News classification.
For Negative vs. Non-Negative classification, in 50 % (30/60) of all errors, the tweet is in fact Negative but is classified as Non-Negative. One possible improvement is to incorporate “Negative phrase identification” to complement the current ML paradigm: negative phrases such as “I feel bad”, “poor XX”, and “no more XX” are possible indicators of Negative tweets. Examples of misclassified tweets are as follows:
“This is the scariest chart I’ve made in awhile http://t.co/3MH5exZjSh http://t.co/oc9lyEO0XY” (Personal tweet classified as News tweet).
“My OCD has been solved! Get our newsletter here: http://t.co/fAxsHjaIn4 http://t.co/1Jhkbta2Px” (Personal tweet classified as News tweet).
“What is Generalized Anxiety Disorder? (GAD #1) http://t.co/y32GmkYhkh #Celebrity #Charity http://t.co/EYDupOLxY8” (News tweet classified as Personal tweet).
“Basal Cell Carcinoma is the most common form of skin cancer. Do you know what to look for? http://t.co/hmofWTApG9” (News tweet classified as Personal tweet).
“@Jonathan_harrod I know there is some research going on, but… Measles kills and us easily spread. @mercola” (Negative tweet classified as Non-Negative tweet).
“Having a boyfriend with diagnosed OCD is not easy task, let me tell ya” (Negative tweet classified as Non-Negative tweet).
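The proposed “Negative phrase identification” complement could be sketched as a simple phrase lookup; the phrase list here is illustrative only, drawn from the examples in the text:

```python
# Illustrative phrase list; a real system would use a curated lexicon.
NEGATIVE_PHRASES = ("i feel bad", "poor", "no more")

def contains_negative_phrase(tweet: str) -> bool:
    """Flag tweets containing a known negative phrase; this would
    complement, not replace, the ML-based Negative classifier."""
    text = tweet.lower()
    return any(p in text for p in NEGATIVE_PHRASES)
```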
Contribution of unigrams
To illustrate which unigrams are most useful for the classifiers’ predictions, ablation experiments were performed for Personal vs. News classification and Negative vs. Non-Negative classification on the three human-annotated test datasets. The classifier 2S-MNB was used, since it takes less time to train and has one of the best average accuracies on the human-annotated test datasets. 2S-MNB was trained with the automatically generated data from the Epidemic, Mental Health, and Clinical Science domains collected in 2014. The trained classifiers were then used to classify the sentiments of the human-annotated datasets collected in 2015, where unigrams were removed from the test dataset one at a time in order to study each removed unigram’s effect on accuracy. The change in classification accuracy was recorded each time; the unigram that leads to the largest decrease in accuracy when removed is the most useful one for predictions. Table 17 shows the results of the ablation experiments for Personal vs. News classification. For example, the unigrams “i”, “plz”, and “lol” are not in the MPQA corpus but were learned by the ML classifier 2S-MNB as the most important unigrams contributing to classification. Some words closely related to sentiment polarity also appear in the list: for example, “bitch”, “love”, and “risk” are strong indicators for Personal vs. News classification. We did not find any useful unigram for Negative vs. Non-Negative classification in this ablation experiment.
Table 17 Most important unigrams in Personal vs. News classification
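The ablation procedure can be sketched as follows; `classify_acc` is a hypothetical accuracy function over a (tokens, label) test set, standing in for the trained 2S-MNB classifier:

```python
def ablation_ranking(classify_acc, vocabulary, test_set):
    """Remove one unigram at a time from the test tweets and rank
    unigrams by the resulting drop in accuracy (largest drop =
    most useful unigram)."""
    base = classify_acc(test_set)
    drops = {}
    for u in vocabulary:
        ablated = [([t for t in toks if t != u], y) for toks, y in test_set]
        drops[u] = base - classify_acc(ablated)
    return sorted(drops, key=drops.get, reverse=True)
```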
Bias of Twitter data
Twitter may give a biased view, since people who tweet are not necessarily a representative sample of the population. As pointed out by Bruns and Stieglitz (2014), two questions need to be addressed when generalizing from collected Twitter data: (1) Does the collected Twitter data represent Twitter? (2) Does Twitter represent society? Regarding the first question, according to the documentation (Twitter 2014b), the Twitter Streaming API returns at most 1 % of all tweets produced on Twitter at any given time. Once the number of tweets matching the given parameters (keywords, geographical boundary, user ID) exceeds 1 % of all tweets, Twitter begins to sample the data it returns to the user. To mitigate this, we used highly specific keywords (e.g., h1n1, h5n1) for each tweet type (e.g., flu) to increase the coverage of the collected data (Morstatter et al. 2013). These keywords are listed in the “Appendix”. As for the second question, Mislove et al. (2011) found that Twitter users significantly over-represent the densely populated regions of the USA, are predominantly male, and represent a highly non-random sample of the race/ethnicity distribution. To reduce the bias of the collected Twitter data, we defined the MOC in relative terms in Sect. 3: it depends on the fraction of all tweets obtained during a day that have been classified as “Personal Negative”. The MOC analysis will be discussed in more detail in Sect. 5.