Assessing the retweet proneness of tweets: predictive models for retweeting
Abstract
The problem of assessing the mechanisms underlying the virality of social network posts is of great value for many activities, such as advertising and viral marketing, influencing and promoting, early monitoring and emergency response. Among the several social networks, Twitter.com is one of the most effective in propagating information in real time, and the propagation effectiveness of a post (i.e., a tweet) is related to the number of times the tweet has been retweeted. Different models have been proposed in the literature to understand the retweet proneness of a tweet (the tendency or inclination of a tweet to be retweeted). In this paper, a further step is presented: several features extracted from Twitter data have been analyzed to create predictive models, with the aim of predicting the degree of retweeting of tweets (i.e., the number of retweets a given tweet may get). The main goal is to obtain indications about the probable number of retweets a tweet may obtain from the social network. In the paper, the usage of classification trees with a recursive partitioning procedure for prediction is proposed, and the obtained results have been compared, in terms of accuracy and processing time, with those of other methods. The Twitter data employed for the proposed study have been collected over the last 18 months by using the Twitter Vigilance study and research platform of DISIT Lab. The work has been developed in the context of the European Commission RESOLUTE H2020 smart city project, in which the capacity of communicating information is fundamental for advertising, promoting civil protection alerts, etc.
Keywords
Social media · Twitter monitoring · Retweet proneness · Virality · Predictive models · Principal component analysis · Classification trees · Machine learning
1 Introduction
In recent years, social media have become an important communication tool and an instrument for monitoring the preferences of users, as well as for making predictions in a number of contexts. Many social media platforms allow rapid multimedia information diffusion, and thus they may be used as a source of information for viral advertising and marketing, early warning, emergency response and, more generally, for promoting and/or informing many users. Among the various platforms, Twitter.com has a very large user base, consisting of 1.3 billion accounts and hundreds of millions of active users per month. Twitter users can produce a post (i.e., a “tweet”) about any topic within the 140-character limit, and can follow other users in order to receive their tweets/posts on their own Twitter web page, as well as on the mobile app. Twitter plays an important role in spreading information, allowing people to communicate and share content quickly. The posts made by a user are displayed on his/her profile page, and they are also brought to the attention of all his/her followers. It is also possible to send direct private messages to other users without triggering any diffusion. Another way to enhance the diffusion and the echo of tweets is to include in a tweet a direct mention of a user; this can be done by using the “@” prefix, as in “@usernickname”. In this case, the @usernickname user is stimulated by receiving a notification. Therefore, the information conveyed in a tweet is diffused among the social network users through retweets of the original tweet, thus echoing the message to the followers and producing a chain of messages, since the retweets are echoed in turn.
A retweet is the echo of an original tweet made by one user that is automatically forwarded by Twitter.com to the followers of the retweeting user (apart from occasional promotions performed by Twitter.com to feature the most important tweets when they enter the list of the most appreciated). In the world of Twitter, the effectiveness of a tweet is frequently measured in terms of retweet count, which is the number of times the tweet has been retweeted [46]. It gives a measure of the size of the reached audience and/or of the appreciation.
There is a growing interest, both in research and commercial fields, in influential strategies and solutions for seeding and diffusing information. Twitter offers business users the possibility of integrating its analytics with audience measurement tools and services, such as Nielsen Digital Ad Ratings (DAR) and ComScore validated Campaign Essentials (vCE). Overviews of predictive methods exploiting tweets have been proposed in the works of Sikdar et al. [52], Madlberger and Almansour [37], and Zaman et al. [61]. In most cases, the predictive capabilities of Twitter data have been identified by using volume metrics on tweets (i.e., the total number of tweets and/or retweets associated with a Twitter user or presenting a certain hashtag). However, in specific cases, a deeper semantic understanding of tweets has been required to create useful predictive capabilities. Thus, algorithms for sentiment analysis have been proposed to consider the meaning of tweets by means of natural language processing. Moreover, the adoption of techniques for segmenting, filtering or clustering by context (e.g., using natural language processing to avoid the misclassification of tweets talking about flu), or by users’ profiles (e.g., age, location, language, and gender) may help to obtain more precise results in terms of predictability. On the other hand, the aim of this paper is to study the retweet proneness of a tweet, which we define and refer to in the rest of the paper as the capability of being retweeted, including a quantitative measure of the number of retweets a given tweet may get (which can be considered the potential degree of being retweeted).
This paper is focused on identifying and assessing the most representative metrics which can be used to predict the degree of retweeting of a tweet (i.e., the number of retweets a given tweet may get). According to the literature, the features can be related to the tweet, to the author of the tweet and, thus, to the network of relationships of the tweet’s author. The study is grounded on the analysis of tweet datasets collected in different areas over the last 18 months, for a total amount of about 100 million posts. Analyzing the datasets with the aim of identifying the best predictive model also allowed us to identify the main characteristics of tweets needed to predict the degree of retweeting. Please note that, according to the state of the art reviewed and presented in the following section, the identification of models for estimating the degree of retweeting of a tweet has been only partially addressed in the literature; the few existing efforts are mainly focused on identifying parameters to guess the probability of retweeting, and/or on studying the cascading effect through the network.
To our knowledge, the main original contributions brought by the work proposed in the present paper are the following: our work aims not simply at predicting the probability for a tweet to be retweeted, but goes a step further by predicting and estimating the degree of retweeting. Moreover, the proposed analysis identified additional relevant metrics/features, with respect to those proposed in the reviewed literature, such as the publication time of tweets and the number of users who added a given tweet’s author to a list, as discussed later in more detail. The motivation for establishing the prediction for a tweet is related to the value of the tweet itself and the value of the advertising service that may have produced it. The estimation of the probability of being retweeted is a measure of the effectiveness of a tweet, and it is somehow a more precise measure of the concept of tweet virality, which tends to assess only tweets and their context in terms of their capacity to create huge volumes of retweets.
The paper is organized as follows. In Section 2, a review of the state of the art and related works found in recent literature is presented. In Section 3, the general architecture of the Twitter Vigilance solution, adopted for collecting Twitter data and performing statistical analysis, is reported and discussed. Section 4 provides an overview of the methods and models adopted to explain the metrics that might affect the number of retweets of a tweet and the prediction of the degree of retweeting. In Section 5, the different classification models are first summarized; then the predictive method is presented together with an analysis of the features that determine the retweet proneness of tweets. Section 6 provides a comparison among the results that can be obtained by using different models. Conclusions are drawn in Section 7.
2 Related works
In this section, the predictive capability of Twitter data is reviewed with the aim of providing a better view of the context in which the research has been developed, and of the impact of the obtained results. In the work of Sinha et al. [53], a solution for predicting the results of football games has been proposed, taking into account the volume of tweets. Opinion polls and political election predictions have been correlated with the volume of tweets by using sentiment analysis techniques in O’Connor et al. [43]. Different models based on the volume of tweets and other means have also been used for predictive purposes: voting results in Bermingham and Smeaton [3] and in Tumasjan et al. [56], economics [4, 15], marketability of consumer goods [50], public health seasonal flu [1, 34, 51], box-office revenues for movies [2, 36, 38, 54], crimes [58], book sales [26], recommendations on places to be visited [14] and weather forecast information [24, 25]. Moreover, Twitter-based metrics have been used to predict and estimate the number of people in some location, such as airports, the so-called crowd size estimation, in the work of Botta et al. [5], as well as to predict the audience of scheduled television programmes where the audience is highly involved, such as reality shows (i.e., X Factor and Pechino Express, in Italy) [17]. Twitter data have also been adopted to perform risk analysis [29].
In general, a Twitter user could find a tweet worth sharing, and therefore he/she may retweet it to followers. There is no upper limit to the number of times a retweet (repost) operation can be performed. Hence, multiple levels of retweeting can be identified (considering the retweet of an original tweet as the first level). A user could actually retweet a formerly retweeted post to his/her followers, and his/her followers can do the same again and again. In this way, retweets have become a popular means of propagating information through the Twitter community, as they may produce viral propagation when the volumes of retweets become high. Most studies about the assessment of the retweeting capability of tweets (the proneness of a given tweet to be retweeted) try to analyze retweeting behaviors and, thus, to discover the features that may help Twitter users (i.e., the tweets’ authors) in creating tweets which are more effective in collecting retweets. In the literature, different models have been proposed to shed some light on what kinds of factors are likely to influence information propagation in Twitter.
Various motivations for retweeting behaviors have been explored in the paper of Golder [22], which found that the most influential users can retain significant influence over several different topics. In the works of Kwak et al. [33] and Cha et al. [13], the relationships between the number of followers of Twitter users and their influence, and lists of the most influential Twitter users compiled according to a variety of metrics (including retweet count), have been investigated. Kwak et al. have ranked users by the number of followers and by PageRank, and found the two rankings to be similar. They have analyzed the tweets of top trending topics and reported on the temporal behavior of trending topics and user participation. Cha et al. [13] have examined three types of influential users and their performance in propagating popular news topics. Hansen et al. investigated the features of tweets that garner large numbers of retweets, analyzing a dataset of 210,000 tweets about the 2009 United Nations Climate Change Conference, as well as a random sample of about 350,000 tweets from 2010 [27]. Hong et al. [28] studied the dynamics of user influence across topics and time, as well as the problem of predicting the popularity of messages as measured by the number of future retweets. The study was conducted by classifying tweets into four categories according to the number of retweets they received (0, < 100, [100, 9999], ≥ 10,000), formulating the prediction task as a classification problem. Moreover, they used a multiclass classifier, training it on one week and testing it on the next week to create a short-term prediction. Naveed et al. [41] used a similar technique to predict the probability that a tweet receives any retweets.
They proposed a predictive model to forecast the likelihood for a given tweet of being retweeted, based on its contents; furthermore, they deduced the most influential features contributing to the likelihood of a retweet on the basis of the parameters learned by the model. In the work of Suh et al. [55], a number of features that might affect the probability of tweets being retweeted (“retweetability”, i.e., the retweet proneness of a tweet) have been examined by using the principal component method and logistic regression models. The aim was the assessment of the probability of a tweet being retweeted, without assessing the degree of retweeting. Amongst the features that can be computed for each tweet, the presence of URLs and hashtags in the tweet body has been shown to have a strong relationship with retweetability. The experiment was computed on a small dataset of 10 K observations, and the achieved prediction accuracy is not reported. Pezzoni et al. [46] have defined the “influence” as the ability of a user to spread information in a network, assuming that the retweet count may measure the popularity of a message on Twitter. The influence of a user could also be estimated by the average number of retweets collected by all tweets of the user. In that paper, the authors demonstrated by simulation that the probability of being retweeted is modeled by a power law function and that the capacity of the most influential authors depends on their number of followers. Peng et al. [45] have proposed a model of retweet patterns (i.e., the retweet propagation trend). In that case, conditional random fields have been used, taking into account three types of features: tweet features, user features and relationship features (which incorporate the perspective of whether the tweet may be simultaneously retweetable for two users).
They have constructed the network relations for retweet prediction, and have demonstrated that conditional random fields can improve prediction effectiveness by incorporating social relationships, compared to baselines that do not take such features into account. Morchid et al. [39] have computed both Naïve Bayes and Support Vector Machine models considering two classes: tweets retweeted fewer than 30 times and tweets retweeted more than 100 times (massively retweeted tweets). The aim of their study was to detect those tweets that are massively retweeted in a short time, without, however, addressing the problem of predicting the potential number of retweets. They also used principal component analysis to evaluate relevant features that could have an impact in detecting some retweeting proneness, without proposing a model for assessing the degree of retweeting, thus presenting only an exploratory descriptive approach. Zaman et al. [60] have measured the popularity of a tweet through the time-series path of its retweets, by using a Bayesian probabilistic model. They have used the user IDs of the original tweet and retweet authors, the number of followers and the words contained in tweets to predict the future retweets. Uysal and Croft [57] proposed a predictive model for estimating the likelihood of retweeting for a given user and tweet by using a logistic regression model.
Yang and Counts [59] used a factor graph model to investigate the retweeting behavior, focusing on features related to the user profile and to the content of a tweet. Can et al. [10] focused their research on predicting the expected retweet count of a tweet by studying three types of features: content-based features (presence or absence of hashtags), structure-based features (such as followers count, friends count, statuses count), as well as multimedia and image-based features (the distribution of color intensities, perceptual dimensions, responses of individual object detectors). They have used the logarithm of the retweet count of a given tweet as the response variable, and three different types of regression: linear, SVM with a Gaussian kernel, and Random Forest. The experiments produced the best results with Random Forest, providing an RMSE score of 1.297 in log scale; very similar performances have been obtained with SVM. They identified the followers count as the most correlated feature. A common drawback found in the content-based predicting tools reviewed in the literature is the 140-character constraint imposed by Twitter, which makes it difficult to identify and extract content-based predictive features [10].
Pálovics et al. [44] have treated the retweet prediction as a binary classification problem. They have used a multiclass classification for ranges of cascade sizes, in order to directly predict the logarithm of the retweet volume. For each day in the testing period, they have trained a Random Forest classifier to predict the future volume of retweets for tweets appearing on that day. The experiments have been compared by using the AUC (area under the precision-recall curve), demonstrating the dependency of the model on the user features (e.g., followers counts), hashtag popularity, and user network features. Bunyamin and Tunys [9] have provided a comparison of the performance of different learning methods and features, in terms of retweet prediction accuracy and feature importance, to understand what kinds of tweets would be retweeted, by using as response variable a dummy variable representing the two states of being retweeted or not retweeted. They have found that the Random Forests method achieves the best performance. Moreover, they have found and included among the best features the following ones: the number of times the user is listed by other users, the number of followers, and the average number of tweets posted per day. Along the same lines, Jiang et al. [30] and Zhang et al. [62] have treated the retweeting behavior prediction as a binary classification problem, achieving an accuracy of 0.85 and 0.789, respectively. Liu et al. [35] have proposed a two-phase model to predict how many times a tweet can be retweeted in the Sina Weibo microblog. In the first step, they have built a multi-classification model, while in the second step a regression model on each class has been constructed. They have achieved a high Mean Absolute Error of 58.22%, using the combination of a Random Forest model and a Least Median Squared Linear Regression model. However, a discussion about the importance of each considered feature is not reported. Firdaus et al. [19] have tried to consider users’ different behaviors in different roles for the purpose of retweet prediction. They argue that the retweet prediction model might give better prediction accuracy when the difference between the behavior of the author and of the retweeters is considered, determining the topic of interest of a user based on his/her past tweets and retweets.
3 Twitter vigilance architecture
All these kinds of analysis are performed at both the Twitter Vigilance Channel level and at the single search level. Specifically, the following information and metrics can be retrieved: number of tweets and retweets; user citations (to detect potential influencers, pushers, emerging citations, etc.); hashtags (to understand which are the most used, emerging, evolving, etc.); keywords tagged with their part-of-speech (that is, their grammatical function), in terms of nouns, verbs, and adjectives; sentiment analysis; relationships among users; etc.
The derived metrics and information can be useful to understand which are the most widely used or emerging hashtags, as well as to detect which are the most influential in determining the positive/negative signature and polarity in the sentiment analysis, and thus for better tuning the tweet collection and for precomputing basic metrics that can be useful to researchers for further analysis in different domains and, generically, for communication and media and for predictive models [24, 25]. It can be a useful tool for identifying the reasons for positive/negative tweets, as well as the reaction of the community.
4 Assessment framework for retweet modeling by using Twitter vigilance outcomes
 I. Collection of data from Twitter.com, crawled by using the Twitter Vigilance platform and tools on the basis of searches and channels. The platform allows computing simple metrics counting tweets/retweets per search and channel, extracting relationships among users, etc.
 II. Selection of predictors/features from the collected data and metrics.
 III. Computation of potential predictors: a statistical criterion is applied to identify the statistically significant features. The use of an exploratory method is crucial not only for ranking the variables before the construction of a prediction model, but also for giving a first interpretation of the phenomenon and for understanding the underlying data structure.
 IV. Computation of a predictive model for the assessment of the binary probability of being retweeted or not.
 V. Computation of a model to predict the degree of retweeting. The results have been obtained by comparing several different computational alternatives and approaches, and by selecting the better ranked and most relevant metrics, as described in the following.
According to the previous statements, we have adopted Classification And Regression Tree (CART) models to understand the relevance of variables and to construct a model for predicting the probability to be retweeted and the degree of retweeting.
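As a minimal illustration of this modeling choice (a sketch on synthetic data, not the paper's actual pipeline or dataset), a CART-style tree can be fitted with scikit-learn's `DecisionTreeClassifier`, which performs the same recursive binary partitioning with the Gini splitting criterion; the feature columns and the binary target below are hypothetical stand-ins for the Table 1 metrics:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the Table 1 metrics (NOT the paper's data):
# one row per tweet, columns e.g. followers count, hashtags count, URLs count
X = rng.poisson(lam=[500, 1, 1], size=(1000, 3)).astype(float)

# Hypothetical binary target: 1 if the tweet got at least one retweet;
# here it is simulated as depending on the followers count alone
y = (X[:, 0] > 500).astype(int)

# CART: recursive binary partitioning with the Gini splitting criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.score(X, y))  # training accuracy of the fitted tree
```

The `max_depth` bound plays the role of the pruning discussed in Section 5, limiting the tree's growth to prevent overfitting.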
4.1 Collection of the datasets
4.2 Identification of potential features/metrics
Table 1 Considered features/metrics from the tweet information

Tweet metrics  Description
URLs count  # of URLs in the tweet
Mentions count  # of mentions/citations of Twitter users in the tweet
Hashtags count  # of hashtags included in the tweet
Favorites count  # of favorites obtained by the tweet
Publication time  Local hour (H24) of the day in which the tweet has been published, according to the author's local time
Author of tweet metrics  Description
Days count  # of days since the tweet's author created his/her Twitter account
Statuses count  # of tweets made by the tweet's author since the creation of his/her account
Author network metrics  Description
Followers count  # of followers of the tweet's author
Followees count  # of friends the tweet's author is following
Listed count  # of users who added the tweet's author to a list
In the proposed analysis, we have specifically addressed metrics such as Publication Time and Listed Count. The Publication Time metric captures the classical claim that a higher probability of retweeting can be achieved if the tweet is published when the audience is online. The Listed Count metric captures the reputation of the author, which is an additional level with respect to simply being followed by another user. In addition to the metrics reported in Table 1, we also collected the Retweet Count (i.e., the # of retweets obtained by the tweet), which in our case is the target of our prediction models rather than a real metric.
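As an illustration (a minimal sketch, not the DISIT Lab crawler itself), the Table 1 metrics can be derived from the JSON tweet objects returned by the Twitter REST API v1.1; the field names below (`entities`, `favorite_count`, `user.followers_count`, etc.) are the v1.1 ones, while the local-time adjustment for Publication Time is simplified here to the UTC hour:

```python
from datetime import datetime, timezone

# Timestamp format used by the Twitter REST API v1.1
TIME_FMT = "%a %b %d %H:%M:%S %z %Y"

def tweet_features(tweet: dict, now: datetime) -> dict:
    """Derive the Table 1 metrics from a v1.1-style tweet object."""
    user = tweet["user"]
    published = datetime.strptime(tweet["created_at"], TIME_FMT)
    account_created = datetime.strptime(user["created_at"], TIME_FMT)
    return {
        "urls_count": len(tweet["entities"]["urls"]),
        "mentions_count": len(tweet["entities"]["user_mentions"]),
        "hashtags_count": len(tweet["entities"]["hashtags"]),
        "favorites_count": tweet["favorite_count"],
        "publication_time": published.hour,   # simplified: UTC hour, not local
        "days_count": (now - account_created).days,
        "statuses_count": user["statuses_count"],
        "followers_count": user["followers_count"],
        "followees_count": user["friends_count"],
        "listed_count": user["listed_count"],
    }

# Minimal hypothetical tweet object (only the fields used above)
tweet = {
    "created_at": "Mon Mar 06 18:15:00 +0000 2017",
    "favorite_count": 3,
    "entities": {"urls": [{}], "user_mentions": [], "hashtags": [{}, {}]},
    "user": {
        "created_at": "Wed Aug 27 13:08:45 +0000 2008",
        "statuses_count": 10500, "followers_count": 420,
        "friends_count": 310, "listed_count": 12,
    },
}
print(tweet_features(tweet, datetime(2017, 3, 6, tzinfo=timezone.utc)))
```

Note that Followees Count maps to the API field `friends_count`, and that a production extractor would additionally apply the author's UTC offset to obtain the true local publication hour.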
4.3 Computation and understanding of potential predictors

Retain just enough components to explain some specified large percentage of the total variation of the original variables. Values between 70% and 90% are usually suggested, although smaller values might be appropriate as q or n (the sample size) increases [18].

The Kaiser criterion [32] recommends retaining only factors with eigenvalues greater than one.

The scree test of Cattell [11] recommends plotting the eigenvalues and finding a place where their smooth decrease appears to level off to the right of the plot. The number of components selected is the value corresponding to an “elbow” in the curve, i.e., a change of slope.
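The first two criteria can be applied directly to an eigenvalue spectrum. The sketch below (an illustration added here, not code from the paper) implements the cumulative-variance rule and the Kaiser criterion on the eigenvalues of Table 2; the scree test remains a visual judgment and is omitted:

```python
import numpy as np

def components_to_retain(eigenvalues, var_threshold=0.70):
    """Apply two retention criteria to a PCA eigenvalue spectrum."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    explained = ev / ev.sum()
    # Rule 1: smallest q whose cumulative explained variance >= threshold
    q_variance = int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)
    # Kaiser criterion: keep only components with eigenvalue > 1
    q_kaiser = int((ev > 1.0).sum())
    return q_variance, q_kaiser

# Eigenvalues reported in Table 2 of this paper
ev = [1.9545, 1.3748, 1.0777, 1.0335, 1.0248, 0.9623,
      0.9523, 0.9339, 0.7679, 0.5976, 0.3206]
print(components_to_retain(ev))  # (7, 5)
```

On this spectrum the two rules disagree (seven components at the 70% threshold versus five under the Kaiser criterion), which is consistent with the discussion of the first five factors below.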
Table 2 Importance of principal components
Factors  Eigenvalue  % variance  % Cumulative variance 

1  1.9545  17.7681  17.7681 
2  1.3748  12.4979  30.2659 
3  1.0777  9.7976  40.0636 
4  1.0335  9.3959  49.4594 
5  1.0248  9.3164  58.7758 
6  0.9623  8.7485  67.5243 
7  0.9523  8.6576  76.1819 
8  0.9339  8.4899  84.6717 
9  0.7679  6.9808  91.6526 
10  0.5976  5.4325  97.0851 
11  0.3206  2.9149  100 
Principal component loadings
Metrics  PC1  PC2  PC3  PC4  PC5 

Retweet count  −0.1623  0.4346  0.1635  −0.0026  −0.1009 
Favorites count  −0.6294  0.3908  0.1922  −0.1128  −0.1880 
Followers count  −0.7599  0.2736  0.0522  −0.0983  −0.0857 
Followees count  −0.1336  −0.0907  −0.4627  −0.2494  0.1182 
Listed count  −0.8431  −0.1549  −0.0498  0.1500  0.1871 
Statuses count  −0.4256  −0.5016  −0.3781  0.2795  0.2410 
Hashtags count  −0.1585  −0.5661  0.4377  −0.0517  0.0309 
Mentions count  0.0394  0.2194  0.0786  −0.1607  0.7697 
URLs count  −0.1288  −0.5483  0.2539  −0.3388  −0.3248 
Publication time  0.0076  −0.0728  0.3639  −0.5186  0.3707 
Days count  −0.0370  0.0070  −0.5072  −0.6604  −0.1691 
Factor 1 carries more than 17% of the total variability of the dataset (Table 2), and this variability is mainly explained by the covariates Favorites Count, Followers Count and Listed Count. This first factor is strongly different from the one identified in [32], since the Listed Count metric (which is dominant here) was not taken into account in that article. The variability of Factor 2 (12.5%) is carried by the negative correlation of Hashtags Count (−0.5661) and URLs Count (−0.5483), while Factor 3 explains about 9.7% of the total variability and is represented by the Followees Count feature. Component 4 explains almost 9.3% of the total variability, and it is negatively correlated with the Publication Time of a tweet and the age of the author's account (Days Count). Please note that the Publication Time was also not considered in [32]. The Mentions feature (0.7697) is mainly carried by Factor 5, which explains the same proportion of variability as Component 4. PCA allowed us to sort the features according to their impact on the total variability, as well as to understand the correlation between the metrics and the number of retweets.
According to the analysis results, the most relevant metrics are: Mentions Count (which dominates Factor 5 with a loading of 0.7697); Listed Count (which explains the main variability of Factor 1, sharing it with Followers Count and Favorites Count); Hashtags Count (which explains the main variability of Factor 2, sharing it with URLs Count, Statuses Count and Retweet Count); and Days Count (which explains the main variability of Factor 4, sharing it with Publication Time).
5 Predicting the probability to be retweeted and the degree of retweeting of a tweet
In this section, before presenting the analyses performed, the considered classification methods are introduced. Then, the different analyses are reported. In a first phase, as reported in Section 5.2, a binary classification has been performed to create a model identifying tweets that have a probability of being retweeted, and thus the most relevant features that determine the model. In a second phase, Section 5.3 presents the model for predicting the degree of retweeting of a tweet. Also in this case, the most relevant features for the prediction have been identified.
5.1 Analysis of the considered classification methods
Classification trees are machine-learning methods for constructing prediction models from data, and they have been widely used for data exploration, description and prediction purposes. Trees have many useful properties, including the ability to handle various types of response, such as numeric, categorical, censored, multivariate, and dissimilarity matrices; trees are invariant to monotonic transformations of the predictors; complex interactions are modeled in a simple way; moreover, missing values in the predictors are handled with minimal loss of information. Thanks to these properties, the use of classification and regression trees (i.e., a recursive partitioning method that is free from distributional assumptions) has potential advantages in constructing predictive models.
In this section, a short recall of the methods considered and compared for creating a suitable predicting model to estimate the degree of retweeting for single and/or groups of tweets is reported.
Recursive partitioning procedure models (RPART) are defined by recursively partitioning the data space and fitting a simple local prediction model in each resulting partition. This can be represented graphically as a decision tree, with one leaf per partition [6]. The model can be written in the following form (1):

\( f\left(\mathbf{x}\right)={\sum}_{m=1}^M{w}_m\mathbb{I}\left(\mathbf{x}\in {R}_m\right) \)     (1)

where \( {R}_m \) is the region of the predictor space associated with the m-th leaf and \( {w}_m \) is the local prediction (e.g., the mean response or the most probable class) in that region.
In the classification setting, a multinoulli model has to be fitted to the data in the leaf satisfying the test \( {X}_j<t \), by estimating the class-conditional probabilities \( {\widehat{\pi}}_c=\frac{1}{\left|\mathcal{D}\right|}{\sum}_{i\in \mathcal{D}}\mathbb{I}\left({y}_i=c\right) \), where \( \mathcal{D} \) is the data in the leaf. Given the class-conditional probabilities, we have used the Gini index [23] to evaluate the partition: \( {\sum}_{c=1}^C{\widehat{\pi}}_c\left(1-{\widehat{\pi}}_c\right)={\sum}_c{\widehat{\pi}}_c-{\sum}_c{{\widehat{\pi}}_c}^2=1-{\sum}_c{{\widehat{\pi}}_c}^2 \).
This index is the expected error rate: \( {\widehat{\pi}}_c \) is the probability that a random entry in the leaf belongs to class c, and \( \left(1-{\widehat{\pi}}_c\right) \) is the probability that it would be misclassified. To prevent overfitting, we have stopped the growth of the tree by pruning, using a scheme that prunes the branches giving the least increase in the error [6]. A problem introduced by the recursive partitioning procedure is that trees are unstable. One way to reduce the variance of an estimate is to average together many estimates using the bagging (bootstrap aggregating) technique.
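The Gini index of a single leaf is straightforward to compute from its class labels; a minimal sketch (with hypothetical labels, not the paper's data):

```python
import numpy as np

def gini_index(labels) -> float:
    """Gini impurity of a leaf: 1 - sum_c pi_c^2, i.e. the expected error
    rate when a random entry in the leaf is labelled by drawing a class
    from the leaf's empirical class distribution."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    pi = counts / counts.sum()
    return float(1.0 - np.sum(pi ** 2))

# A pure leaf has zero impurity ...
print(gini_index([1, 1, 1, 1]))   # 0.0
# ... while a 50/50 binary leaf has the maximum binary impurity
print(gini_index([0, 0, 1, 1]))   # 0.5
```

A split is chosen to minimize the impurity of the resulting leaves, weighted by leaf size; pruning then removes the branches whose removal increases this error the least.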
In the Random Forests approach [8], each tree is constructed using a different bootstrap sample of the original data. For each tree of the collection, a random subset of predictors is chosen to determine each split. In this way, the correlations between the predictions of the individual trees are reduced. In other words, Random Forests try to decorrelate the base learners (each tree has the same expectation) by learning trees based on a randomly chosen subset of input variables, as well as a randomly chosen subset of data cases. In general, the Random Forests procedure performs better than bagging.
Stochastic Gradient Boosting [21] is another way to reduce the variance. The algorithm for boosting trees evolved from the application of boosting methods. Boosting (Freund and Schapire [20]) fits many large or small trees to reweighted versions of the training data, and performs classification by weighted majority vote. In Stochastic Gradient Boosting, many small classification (or regression) trees are built sequentially from “pseudo”-residuals (the gradient of the loss function of the previous tree). At each iteration, a tree is built from a random subsample of the dataset (selected without replacement), producing an incremental improvement in the model. An advantage of Stochastic Gradient Boosting is that it is not necessary to preselect or transform predictor variables. It is also resistant to outliers. In general, the boosting procedure outperforms Random Forests.
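Both ensemble strategies are available off the shelf; the sketch below (synthetic data, not the paper's tweets) contrasts a Random Forest, which decorrelates trees via bootstrap samples and random feature subsets (`max_features="sqrt"`), with Stochastic Gradient Boosting, which fits small trees sequentially on random subsamples (`subsample=0.7`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for the tweet data
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=4, random_state=0)

models = {
    # Bagged, decorrelated trees: random feature subset at every split
    "random forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=0),
    # Small trees fitted sequentially to pseudo-residuals, each on a
    # random 70% subsample (the "stochastic" part of the algorithm)
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, subsample=0.7, random_state=0),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Which ensemble wins depends on the dataset; the comparison on the actual tweet data is reported in Section 6.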
In the multinomial approach, trees are formulated as statistical models, like generalized linear and additive models [16]. In this approach, splits are based on an explicit statistical model, whose deviance defines the dissimilarity measure. For classification trees, the use of a multinomial model is equivalent to the information index, with the deviance defined by the multinomial log-likelihood.
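For a single leaf with class counts n_c out of n entries, the multinomial deviance reduces to \( D = 2\sum_c n_c \log(n/n_c) \); a quick illustrative computation (the function name is ours):

```python
import math

def multinomial_deviance(counts):
    """Deviance of one leaf under the multinomial model:
    D = 2 * sum_c n_c * log(n / n_c), with empty classes skipped."""
    n = sum(counts)
    return 2 * sum(c * math.log(n / c) for c in counts if c > 0)

print(multinomial_deviance([10, 0]))  # 0.0 -- a pure leaf fits perfectly
print(multinomial_deviance([5, 5]))   # ~13.86 -- a maximally mixed leaf
```

Splits are then chosen to maximise the reduction in deviance, which is the information (entropy) criterion up to the constant factor 2n.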
5.2 The probability of being retweeted
By following the line of Suh et al. [55] and Naveed et al. [41], we have transformed the variable Retweet Count into a binary variable (0: no retweets, 1: one or more retweets). Suh et al. fitted a Generalized Linear Model (GLM) to a 10 K dataset and used the results in a logistic equation to predict the probability of a retweet. Naveed et al. trained a prediction model to forecast the likelihood, for a given tweet, of being retweeted on the basis of its contents; from the parameters learned by the model, they deduced which content features contribute most to the likelihood of a tweet being retweeted. Our aim is to evaluate the relevant metrics associated with the action of retweeting in a predictive perspective: we used a learning approach to predict the probability for a tweet to be retweeted. The binary classification model provides a general picture of the most important features (Table 1) related to retweeting. Given the finding that some features are strongly associated with the degree of retweeting, we have fitted the predictive models presented in Section 5 on the 500 K dataset.
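The GLM/logistic approach can be sketched with a minimal stdlib implementation; the two features and all data values below are made up for illustration and are not the actual Table 1 metrics:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Per-sample gradient descent on the logistic log-loss.
    X: list of feature vectors; y: 0/1 labels (1 = retweeted at least once)."""
    w = [0.0] * (len(X[0]) + 1)            # intercept + one weight per feature
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - yi          # gradient of the log-loss w.r.t. z
            w[0] -= lr * err
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * err * xj
    return w

def retweet_probability(w, xi):
    """Logistic equation: probability that the tweet gets retweeted."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

# Hypothetical, made-up training data: two scaled features per tweet
X = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [0.8, 1.0]]
y = [0, 0, 1, 1]
w = fit_logistic(X, y)
```

In practice one would fit on the real feature matrix; the point here is only the shape of the model: a linear score pushed through the logistic function yields the retweet probability.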
Comparison of retweet binary classification models on the 500 K dataset

| Classification method | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Recursive partitioning | 0.9071 | 0.9926 | 0.8157 | 0.8955 |
| Random Forests | 0.9150 | 0.9826 | 0.8407 | 0.9061 |
| Gradient Boosting | 0.9061 | 0.9936 | 0.8127 | 0.8941 |
| Multinomial/Logistic model | 0.9021 | 0.8115 | 0.9853 | 0.8899 |
5.3 Predicting the degree of retweeting of a tweet
For the analysis of collected tweets, we conducted a 10-fold cross-validation evaluation on the complete 100 million dataset and the features reported in Table 1. After the assessment of the above-mentioned approaches (as shown in the following), we have considered a CART model with recursive partitioning procedure (RPART model) as the best learning algorithm. In the next section, a comparison of the above-mentioned methods is provided. In the considered predictive models, the response variable Retweet Count has been transformed into a categorical variable, namely Retweet Class, with classes "0", "1–100", "101–1000", "1001–10,000", and "Over 10,000", thus classifying the degree of retweeting into 0 retweets, from 1 to 100 retweets, and so on. Please note that the chosen classes are different from those of Fig. 5: classes "1–10" and "11–100", as depicted in Fig. 5, have been merged into the single class "1–100". In addition, we have created two new classes, "1001–10,000" and "Over 10,000", with the aim of characterizing the degree of retweeting especially when the retweet count is high. As described in the following, merging classes "1–10" and "11–100" allowed us to obtain a higher accuracy (a better prediction model).
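The class recoding described above is a straightforward binning of Retweet Count; a sketch:

```python
def retweet_class(count):
    """Map a raw Retweet Count onto the ordinal classes used for prediction."""
    if count == 0:
        return "0"
    if count <= 100:
        return "1-100"
    if count <= 1000:
        return "101-1000"
    if count <= 10000:
        return "1001-10,000"
    return "Over 10,000"

print([retweet_class(c) for c in (0, 7, 450, 5000, 120000)])
```

Applying this function to every tweet turns the regression-style count into the categorical response the classification trees are trained on.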
Note that the training set has been extracted as 80% of the 100 million data, and the validation of the predictive capability has been performed on a test set consisting of the remaining 20% of the total observations.
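The 80/20 hold-out split can be sketched as follows (the seed is an arbitrary choice of ours, used only to make the shuffle reproducible):

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle and split: (1 - test_frac) for training, test_frac held out."""
    rows = rows[:]                          # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

train, test = train_test_split(list(range(100)))
```

Every observation lands in exactly one of the two partitions, so test-set metrics estimate performance on unseen tweets.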
Per-class assessment of the RPART procedure in predicting the degree of retweeting

| Assessment drivers | 0 | 1–100 | 101–1000 | 1001–10,000 | Over 10,000 |
|---|---|---|---|---|---|
| Sensitivity | 0.7737 | 0.8105 | 0.3142 | 0.0208 | 0.0136 |
| Specificity | 0.9132 | 0.6694 | 0.9199 | 0.9996 | 1.0000 |
| Positive predictive value | 0.8564 | 0.6256 | 0.3752 | 0.7345 | 0.8488 |
| Negative predictive value | 0.8579 | 0.8382 | 0.8975 | 0.9485 | 0.9915 |
| Prevalence | 0.4007 | 0.4053 | 0.1328 | 0.0526 | 0.0086 |
| Detection rate | 0.3100 | 0.3285 | 0.0417 | 0.0011 | 0.0001 |
| Detection prevalence | 0.3620 | 0.5251 | 0.1112 | 0.0015 | 0.0001 |
| Balanced accuracy | 0.8435 | 0.7399 | 0.6170 | 0.5102 | 0.5068 |
Overall statistics in predicting the class of degree of retweeting

| Assessment parameter | Value |
|---|---|
| Accuracy | 0.6815 |
| Accuracy 95% confidence interval (min, max) | (0.6813, 0.6817) |
| Recall | 0.7737 |
| Precision | 0.8564 |
| Kappa | 0.4922 |
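Given a confusion matrix expressed as class proportions, the overall statistics can be recomputed; the following is an illustrative sketch on a made-up 2-class matrix (Cohen's kappa and a normal-approximation interval for accuracy; the exact interval method used in the study is not specified here):

```python
import math

def kappa_and_acc_ci(cm, n, z=1.96):
    """Cohen's kappa plus a normal-approximation 95% CI for accuracy.
    cm: square confusion matrix of *proportions*, rows = predicted
    classes, columns = reference classes; n: number of test cases."""
    k = len(cm)
    acc = sum(cm[i][i] for i in range(k))                  # observed agreement
    # expected chance agreement: product of matching marginals
    pe = sum(sum(cm[i]) * sum(r[i] for r in cm) for i in range(k))
    kappa = (acc - pe) / (1 - pe)                          # chance-corrected
    half = z * math.sqrt(acc * (1 - acc) / n)
    return kappa, (acc - half, acc + half)

# Made-up 2-class example (not the paper's matrix)
kappa, ci = kappa_and_acc_ci([[0.40, 0.10], [0.10, 0.40]], n=10_000)
```

Kappa discounts the agreement expected by chance alone, which is why it is well below the raw accuracy when the class marginals are unbalanced.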
Confusion matrix of the RPART procedure (values in % of test observations; rows: predicted classes, columns: reference classes)

| Predicted class | 0 | 1–100 | 101–1000 | 1001–10,000 | Over 10,000 |
|---|---|---|---|---|---|
| 0 | 31.0009 | 4.7219 | 0.3055 | 0.1487 | 0.0232 |
| 1–100 | 7.3885 | 32.8530 | 8.7785 | 2.9702 | 0.5240 |
| 101–1000 | 1.6765 | 2.9545 | 4.1732 | 2.0247 | 0.2941 |
| 1001–10,000 | 0.0005 | 0.0055 | 0.0258 | 0.1092 | 0.0077 |
| Over 10,000 | 0.0000 | 0.0000 | 0.0000 | 0.0021 | 0.0117 |
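Per-class assessment drivers such as sensitivity and specificity can be derived directly from a confusion matrix; a sketch with a small made-up 2-class matrix:

```python
def per_class_stats(cm):
    """Sensitivity and specificity per class, from a confusion matrix
    cm[predicted][reference] (counts or proportions both work)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    stats = {}
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[r][c] for r in range(k)) - tp  # reference c, predicted other
        fp = sum(cm[c]) - tp                       # predicted c, reference other
        tn = total - tp - fn - fp
        stats[c] = {"sensitivity": tp / (tp + fn),
                    "specificity": tn / (tn + fp)}
    return stats

# Made-up 2-class matrix of counts (rows = predicted, cols = reference)
stats = per_class_stats([[50, 10], [5, 35]])
```

Each class is treated one-vs-rest: the diagonal cell is its true positives, the rest of its column the false negatives, the rest of its row the false positives.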
6 Comparison among different approaches
The choice of the RPART model is justified by the fact that the accuracy obtained was higher than that of other learning techniques such as Random Forests, Stochastic Gradient Boosting and Penalized Multinomial Regression. The comparisons have been performed on the datasets of 100 K and 500 K tweets, due to the computational costs of some of the compared algorithms. Moreover, the recursive partitioning procedure is also the result of a compromise between goodness in terms of accuracy, simplicity in terms of interpretation (each tree derives from a series of logical rules [47]) and the ability to take into account millions of data points within a reasonable timeframe.
Furthermore, RPART models can easily handle mixed discrete and continuous inputs, they are insensitive to monotone transformations of the inputs (because the split points are based on ranking the data points), they perform automatic variable selection, and they are relatively robust to outliers [40]. However, RPART trees can produce models with high variance in the estimators. Two ways to reduce the variance of predictions could be adopted, for instance a bagging approach [7] or a boosting technique [48]: models like Random Forests often provide very good predictive accuracy. Such an approach [8] aims at decorrelating the base learners by learning trees on the basis of a randomly chosen subset of input variables. However, the running time of the classical Random Forests technique is typically not viable for millions of observations; on the other hand, applying it to a 100 K tweet dataset did not provide relevant improvements in terms of accuracy with respect to the recursive partitioning procedure.
Model comparison on 100 K observations; the recursive partitioning procedure ranked best in terms of accuracy

| Classification method | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Recursive partitioning | 0.6827 | 0.8436 | 0.7806 | 0.8108 |
| Random Forests | 0.6812 | 0.8509 | 0.7761 | 0.8117 |
| Gradient Boosting | 0.6764 | 0.8547 | 0.7715 | 0.8110 |
| Multinomial model | 0.6480 | 0.8423 | 0.7275 | 0.7807 |
Retweet model comparison on 500 K data, including computation time for model estimation

| Classification method | Accuracy | Precision | Recall | F1-score | Processing time (s) |
|---|---|---|---|---|---|
| Recursive partitioning | 0.6807 | 0.8512 | 0.7767 | 0.8122 | 180 |
| Random Forests | 0.6884 | 0.8601 | 0.7866 | 0.8217 | 198,968 |
| Gradient Boosting | 0.6796 | 0.8534 | 0.7731 | 0.8113 | 64,448 |
| Multinomial model | 0.6411 | 0.8367 | 0.7245 | 0.7765 | 31,576 |
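Processing times such as those reported for model estimation can be collected with a simple wall-clock wrapper (an illustrative helper of ours, not the instrumentation used in the study):

```python
import time

def timed(fit_fn, *args, **kwargs):
    """Run a model-fitting callable and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fit_fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# e.g. timing a cheap stand-in call in place of a real model fit
result, seconds = timed(sorted, list(range(100_000)))
```

Using `time.perf_counter` rather than `time.time` avoids clock adjustments skewing the measurement.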
7 Conclusions and future perspectives
The work presented in this paper started with the aim of better understanding the correlation of the features associated with tweets with respect to the action of retweeting. Most papers in the literature propose analyses without deriving models for predicting the degree of retweeting, while others limit themselves to estimating the probability of a tweet being retweeted or not. The proposed analysis identified additional relevant metrics with respect to those proposed in the literature, namely Publication Time and Listed Count. This approach resulted in a more effective principal component analysis and a better coverage of the phenomenon. Therefore, on the basis of such an analysis, in this paper we proposed a method to predict the degree of retweeting through a classification tree model with recursive partitioning procedure, applied to a dataset of 100 million tweets. We have shown that the choice of the RPART model is justified by the fact that its accuracy is better than that of Random Forests, Stochastic Gradient Boosting and Penalized Multinomial techniques, compared on a viable sample of 100 K observations. The recursive partitioning procedure is the result of a compromise between goodness in terms of accuracy, simplicity in terms of interpretation, and the ability to take into account millions of observations within a reasonable timeframe. By analyzing the results obtained with the recursive partitioning procedure, Mentions Count is the metric most correlated with the degree of retweeting, and the accuracy of the predictive model is about 68%.
The model produced can be used for assessing the degree of retweeting of each single tweet produced by some author, or of tweets prepared for advertising and/or information campaigns. Potential application fields are many, including marketing and advertising, early monitoring, emergency response and, more generally, promoting and diffusing information, as well as the related ranking and pricing of the actions performed in advertising. The work has been developed in the context of smart city projects in which the capacity of communicating information is fundamental for diffusing information about changes in the city, and/or directives for civil protection alerts, such as weather forecasts, and in general for early warning. In fact, a tweet that is structurally more likely to be retweeted is more effective in propagating information.
As a perspective for future research, the analysis for predicting the degree of retweeting could be focused at a deeper and more specific level, for instance by considering narrower domains (e.g., selecting tweets on the basis of their topics or subjects in terms of hashtags, as well as considering specific Twitter Vigilance channels) such as politics, healthcare, weather, city services, emergencies, etc. This could be done in order to understand whether it is possible to identify more specific metrics and models, with respect to the ones analyzed in the present work, which could lead to higher prediction accuracy.
Acknowledgements
This work has been supported by the RESOLUTE project (www.RESOLUTE-eu.org) and has been funded within the European Commission H2020 Programme under contract number 653460. This paper expresses the opinions of the authors and not necessarily those of the European Commission. The European Commission is not liable for any use that may be made of the information contained in this paper.
References
1. Achrekar H, Gandhe A, Lazarus R, Yu S, Liu B (2012) Twitter improves seasonal influenza prediction. Healthinf 61–70
2. Asur S, Huberman BA (2010) Predicting the future with social media. CoRR abs/1003.5699
3. Bermingham A, Smeaton A (2011) On using twitter to monitor political sentiment and predict election results. Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011), Chiang Mai, Thailand, p 2–10
4. Bollen J, Mao H, Zeng XJ (2011) Twitter mood predicts the stock market. J Comput Sci 2(1)
5. Botta F, Moat HS, Preis T (2015) Quantifying crowd size with mobile phone and Twitter data. Roy Soc Open Sci 2:150–162
6. Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press
7. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
8. Breiman L (2001) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci 16(3):199–231
9. Bunyamin H, Tunys T (2016) A comparison of retweet prediction approaches: the superiority of Random Forest learning method. Telkomnika (Telecommun Comput Electron Control) 14(3):1052–1058
10. Can EF, Oktay H, Manmatha R (2013) Predicting retweet count using visual cues. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, California (USA), p 1481–1484
11. Cattell RB (1966) The scree test for the number of factors. Multivar Behav Res 1(2):245–276
12. Cenni D, Nesi P, Pantaleo G, Zaza I (2017) Twitter Vigilance: a multi-user platform for cross-domain Twitter data analytics, NLP and sentiment analysis. IEEE International Conference on Smart City and Innovation, San Francisco, California (USA)
13. Cha M, Haddadi H, Benevenuto F, Gummadi KP (2010) Measuring user influence in Twitter: the million follower fallacy. Proceedings of the International Conference on Weblogs and Social Media (ICWSM 10), Washington DC (USA), p 10–17
14. Chauhan A, Kummamuru K, Toshniwal D (2017) Prediction of places of visit using tweets. Knowl Inf Syst 50(1):145–166
15. Choi H, Varian H (2009) Predicting the present with Google Trends. Official Google Research Blog. Available at: http://bit.ly/h9RRdW
16. Clark LA, Pregibon D (1992) Tree-based models. In: Chambers JM, Hastie TJ (eds) Statistical models in S, Chapman & Hall/CRC, p 377–420
17. Crisci A, Grasso V, Nesi P, Pantaleo G, Paoli I, Zaza I (2017) Predicting TV programme audience by using Twitter based metrics. Multimed Tools Appl 1–30
18. Everitt B, Hothorn T (2011) An introduction to applied multivariate analysis with R. Springer Science & Business Media
19. Firdaus SN, Ding C, Sadeghian A (2016) Retweet prediction considering user's difference as an author and retweeter. Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), p 852–859
20. Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning (ICML'96), Bari (Italy), p 148–156
21. Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378
22. Golder S (2010) Tweet, tweet, retweet: conversational aspects of retweeting on Twitter. In: Proceedings of the 43rd International Conference on System Sciences (HICSS '10), Hawaii (USA), p 1–10
23. Gini C (1921) Measurement of inequality of incomes. Econ J 31(121):124–126
24. Grasso V, Zaza I, Zabini F, Pantaleo G, Nesi P, Crisci A (2016b) Weather events identification in social media streams: tools to detect their evidence in Twitter. PeerJ Preprints 4:e2241v1
25. Grasso V, Crisci A, Nesi P, Pantaleo G, Zaza I, Gozzini B (2016a) Public crowdsensing of heatwaves by social media data. In: Proceedings of the 16th EMS Annual Meeting & 11th European Conference on Applied Climatology (ECAC), Trieste, Italy
26. Gruhl D, Guha R, Kumar R, Novak J, Tomkins A (2005) The predictive power of online chatter. In: Proceedings of the 11th ACM International Conference on Knowledge Discovery in Data Mining (SIGKDD), Chicago, Illinois (USA), p 78–87
27. Hansen LK, Arvidsson A, Nielsen FA, Colleoni E, Etter M (2011) Good friends, bad news – affect and virality in Twitter. CoRR abs/1101.0510
28. Hong L, Dan O, Davison BD (2011) Predicting popular messages in Twitter. In: Proceedings of the 20th International Conference Companion on World Wide Web (WWW), Hyderabad (India), p 57–58
29. Jansen B, Zhang M, Sobel K, Chowdury A (2009) Twitter power: tweets as electronic word of mouth. J Am Soc Inf Sci Technol 60(1532):2169–2188
30. Jiang B, Liang J, Sha Y, Li R, Liu W, Ma H, Wang L (2016) Retweeting behavior prediction based on one-class collaborative filtering in social networks. In: Proceedings of the 39th ACM International Conference on Research and Development in Information Retrieval, Pisa (Italy), p 977–980
31. Jolliffe I (2002) Principal component analysis. John Wiley & Sons, Ltd
32. Kaiser HF (1960) The application of electronic computers to factor analysis. Educ Psychol Meas 20(1):141–151
33. Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, New York, NY (USA), p 591–600
34. Lampos V, Bie TD, Cristianini N (2010) Flu detector – tracking epidemics on Twitter. Mach Learn Knowl 6323:599–602
35. Liu G, Shi C, Chen Q, Wu B, Qi J (2014) A two-phase model for retweet number prediction. In: Proceedings of the International Conference on Web-Age Information Management. Springer, Cham, p 781–792
36. Lu Y, Kruger R, Thom D, Wang F, Koch S, Ertl T, Maciejewski R (2014) Integrating predictive analytics and social media. In: Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST), Paris (France), p 193–202
37. Madlberger L, Almansour A (2014) Predictions based on Twitter – a critical view on the research process. In: Proceedings of the International Conference on Data and Software Engineering (ICODSE), p 1–6
38. Mishne G, Glance N (2006) Predicting movie sales from blogger sentiment. In: Proceedings of the AAAI Spring Symposium on Computational Approaches to Analysing Weblogs (AAAI CAAW), p 155–158
39. Morchid M, Dufour R, Bousquet PM, Linarès G, Torres-Moreno JM (2014) Feature selection using principal component analysis for massive retweet detection. Pattern Recogn Lett 49:33–39
40. Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press
41. Naveed N, Gottron T, Kunegis J, Alhadi AC (2011) Bad news travel fast: a content-based analysis of interestingness on Twitter. In: Proceedings of the 3rd ACM International Web Science Conference (WebSci), Koblenz (Germany)
42. Nesi P, Pantaleo G, Sanesi GM (2015) A Hadoop based platform for natural language processing of web pages and documents. J Vis Lang Comput 31:130–138
43. O'Connor B, Balasubramanyan R, Routledge BR, Smith NA (2010) From tweets to polls: linking text sentiment to public opinion time series. In: Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM), Washington, DC (USA), p 122–129
44. Pálovics R, Daróczy B, Benczúr AA (2013) Temporal prediction of retweet count. In: Proceedings of the IEEE 4th International Conference on Cognitive Infocommunications (CogInfoCom), Budapest (Hungary), p 267–270
45. Peng HK, Zhu J, Piao D, Yan R, Zhang Y (2011) Retweet modeling using conditional random fields. In: Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), Vancouver, BC (Canada), p 336–343
46. Pezzoni F, An J, Passarella A, Crowcroft J, Conti M (2013) Why do I retweet it? An information propagation model for microblogs. In: Proceedings of the 5th International Conference on Social Informatics, Kyoto (Japan), 8238, p 360–369
47. Quinlan JR (1990) Learning logical definitions from relations. Mach Learn 5(3):239–266
48. Schapire RE, Freund Y (2012) Boosting: foundations and algorithms. MIT Press
49. Shih YS (1999) Families of splitting criteria for classification trees. Stat Comput 9(4):309–315
50. Shimshoni Y, Efron N, Matias Y (2009) On the predictability of search trends. Available at: http://doiop.com/googletrends
51. Signorini A, Segre AM, Polgreen PM (2011) The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS One 6(5):1–10
52. Sikdar S, Adali S, Amin M, Abdelzaher T, Chan KL, Cho JH, Kang B, O'Donovan J (2014) Finding true and credible information on Twitter. In: Proceedings of the 17th IEEE International Conference on Information Fusion (FUSION), Salamanca (Spain), p 1–8
53. Sinha S, Dyer C, Gimpel K, Smith NA (2013) Predicting the NFL using Twitter. arXiv:1310.6998v1
54. Sitaram A, Huberman BA (2010) Predicting the future with social media. Social Computing Lab, HP Labs, Palo Alto
55. Suh B, Hong L, Pirolli P, Chi EH (2010) Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network. In: Proceedings of the 2nd IEEE International Conference on Social Computing (SOCIALCOM), Washington, DC (USA), p 177–184
56. Tumasjan A, Sprenger TO, Sandner PG, Welpe IM (2010) Predicting elections with Twitter: what 140 characters reveal about political sentiment. In: Proceedings of the International Conference on Weblogs and Social Media (ICWSM 10), Washington DC (USA), p 178–185
57. Uysal I, Croft WB (2011) User oriented tweet ranking: a filtering approach to microblogs. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), Glasgow, Scotland (UK), p 2261–2264
58. Wang X, Gerber MS, Brown DE (2012) Automatic crime prediction using events extracted from Twitter posts. Social computing, behavioural-cultural modeling and prediction, p 231–238
59. Yang J, Counts S (2010) Predicting the speed, scale, and range of information diffusion in Twitter. In: Proceedings of the International Conference on Weblogs and Social Media (ICWSM 10), Washington DC (USA), p 355–358
60. Zaman TR, Herbrich R, Van Gael J, Stern D (2010) Predicting information spreading in Twitter. In: Proceedings of the Workshop on Computational Social Science and the Wisdom of Crowds, NIPS
61. Zaman T, Fox EB, Bradlow ET (2014) A Bayesian approach for predicting the popularity of tweets. Ann Appl Stat 8(3):1583–1611
62. Zhang Q, Gong Y, Wu J, Huang H, Huang X (2016) Retweet prediction with attention-based deep neural network. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM), Indianapolis, Indiana (USA), p 75–84
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.