Predicting TV programme audience by using twitter based metrics
Abstract
The predictive capabilities of metrics based on Twitter data have been demonstrated in different fields: business, health, marketing, politics, etc. In specific cases, a deeper analysis is required to create useful metrics and models with predictive capabilities. In this paper, a set of metrics based on Twitter data is identified and presented in order to predict the audience of scheduled television programmes in which the audience is highly involved, as occurs with reality shows (i.e., X Factor and Pechino Express, in Italy). The identified metrics are based on the volume of tweets, the distribution of linguistic elements, the number of distinct users involved in tweeting, and the sentiment analysis of tweets. On this ground, a number of predictive models have been built and compared, and the resulting method has been selected through validation and assessment on real data, with the aim of building a flexible framework able to exploit the predictive capabilities of social media data. Further details are reported about the method adopted to build the models, which focuses on the identification of predictors by their statistical significance. Experiments have been based on Twitter data collected by using the Twitter Vigilance platform, which is also presented in this paper.
Keywords
Twitter monitoring · Social media monitoring · Predicting audience · Twitter data analysis

1 Introduction
Social media analysis is becoming a very important instrument to monitor communities and users’ preferences, and to make predictions. Among social media platforms, Twitter is one of the most widespread microblogs, allowing users to have a personal news feed and followers attached to it. Followers receive notifications about the actions performed by the users they follow. Typical user actions are: posting a message (tweet), commenting, expressing a like/favourite, and retweeting (echoing some tweet to the followers of the retweeting user). Tweets and retweets are therefore shown (exposed) to other Twitter users, thus increasing the chance of provoking their interest and reactions: retweets, comments, likes, etc. Some of these mechanisms can trigger viral processes that may lead to massive propagation of tweets in the user community. Twitter users are formally identified by “@” preceding their nickname, as in “@paolonesi”, one of the paper authors. Any user may call the attention of other users by including the @Twitterusername in the tweet. For example, “Nice post @paolonesi! give me your opinion on XXX” is a citation of “@paolonesi”. In the tweet text, every user can draw attention to specific keywords, called hashtags, marked with “#” as first character. For example, the hashtag “#houseofcards” can be used to remark that the tweet is about the TV serial House of Cards (hashtags can be suggested to the audience by the TV producers, or spontaneously created by some users as well). Citations and hashtags are well indexed on Twitter.com and can be searched as the main vehicles of involvement and remark; thus they are used by Twitter.com to propagate information to the cited users and to the communities interested in the followed users or the hashtags, respectively.
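As a minimal illustration of the citation and hashtag conventions just described, the two markers can be pulled out of a tweet's text with a simple pattern match. The sketch below is simplified with respect to Twitter's actual rules for valid handles and hashtags (length limits, allowed characters, etc.):

```python
import re

# Simplified patterns: Twitter's actual rules for valid handles and
# hashtags are more restrictive; this sketch only illustrates the idea.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_entities(tweet_text):
    """Return the mentions (citations) and hashtags found in a tweet."""
    return {
        "mentions": MENTION_RE.findall(tweet_text),
        "hashtags": HASHTAG_RE.findall(tweet_text),
    }

print(extract_entities("Nice post @paolonesi! #houseofcards is back"))
# mentions: ['paolonesi'], hashtags: ['houseofcards']
```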
Thanks to the above described social engagement mechanisms, a lot of users join and use Twitter every day; not only single users, but also news agencies, public institutions, producers, VIPs, teams, schools, municipalities, governments, etc., with the aim of sharing, promoting and communicating. On such grounds, Twitter is used as a source of information to deliver news, events, and innovations, and thus, it can be exploited as a tool for the prediction of different kinds of events and occurrences.
As described in the following, the research reported in this paper concerns the usage of Twitter data to predict the audience of TV shows by (i) computing metrics based on Twitter data (the volume of messages/posts including keywords (citations, hashtags) and/or mentions; the volume of messages containing specific elements extracted via natural language processing (verbs, adjectives, nouns); and sentiment analysis, weighting each single text element on the basis of positive and/or negative mood), and (ii) setting up and putting in place predictive models, also addressing feature selection. Before describing the proposed solution, the following subsection presents the related work.
1.1 Related work
As previously stated, Twitter data have been used for setting up several kinds of predictive models in different domains according to the differences in the events and phenomena.
In [48], a solution to predict football game results by considering the volume of tweets has been proposed. In more detail, the adopted approach defined a function correlating the delta changes in the volume of tweets with a fixed number of categories, obtaining a prediction rate of about 68%. Opinion polls and predictions of political elections have been related to the volume of tweets by using sentiment analysis techniques in [37]. In this case, the sentiment analysis was performed by counting words and assigning them negative or positive weights according to the OpinionFinder lexicon, based on only 2800 words, obtaining a highest correlation of about 80% with measures of public opinion derived from polls in the case of the Obama election. Voting results have been correlated with tweets in the 2009 German elections [53], by counting the tweets citing the different parties without providing a predictive model; another example can be found in [4]. In [15], sentiment analysis and volume approaches have been used for electoral prediction in Senate races, which are 1:1 competitions, still obtaining correlations in the range of 40–60%.
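The word-counting sentiment approach of [37] can be illustrated with a toy example: each lexicon word carries a fixed positive or negative weight, and the score of a tweet is the sum over matched words. The lexicon below is invented for illustration only and is not the OpinionFinder lexicon:

```python
# Toy lexicon-based sentiment scoring: each word carries a fixed positive
# or negative weight; the tweet score is the sum over matched words.
# The lexicon below is invented for illustration only.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "awful": -1.0}

def lexicon_score(text):
    """Sum the lexicon weights of the words appearing in the text."""
    words = text.lower().split()
    return sum(LEXICON.get(w, 0.0) for w in words)

print(lexicon_score("great show but awful ending"))  # 0.0
```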
Different models, based on both the volume of tweets and other means, have also been used for other prediction purposes, such as the spread of contagious diseases [39], by observing the inception over time of terms that can be related to problems and symptoms connected to specific illnesses. Other cases in the health domain have studied the detection of the inception of seasonal flu [1, 7, 27, 46].
In economics, sentiment analysis has been adopted by employing a Self-Organizing Fuzzy Neural Network, since long time series are available, predicting the direction of the stock market with a highest accuracy of over 86% [5]. Other cases in the market and business domains are described in [9, 43], for the marketability of consumer goods in [45], and for book sales in [19].
With the aim of predicting the box office for movies, in [3] a model has been proposed adopting the average tweet rate, the presence of URLs in tweets, and the volume of retweets as features. Also in this case, the time series are long (several days), and the model obtained an adjusted R squared of 0.94 via a linear model addressing sentiment analysis. Other cases in the same domain are [28, 30, 32, 49], in which the combination of volume and sentiment analysis for long time series has been proposed in a tool without proposing specific models. For example, in [24] sentiment analysis is introduced by using the ratio of positive to negative score estimations of the tweets, obtaining an accuracy of 64%. In [13], Twitter data have been used for predicting the performance of movies at the box office. To this end, a fuzzy inference system has been set up exploiting metrics such as the count of tweets, followers, sentiment analysis metrics, and also additional information about the actors' rating according to the model proposed in [42]. The results presented on specific cases show large mean square errors, from 6% up to 27%.
Other applications highlighting the capabilities of Twitter data include: detecting crimes, with the capability of identifying the inception of certain critical cases (such as micro discussions on crashes, fires, etc.) [56], and suggesting places to be visited by observing the most frequently attended places in a given location [8]. In addition, Twitter data have been used for assessing weather forecast information in [17] and in [18].
Twitter-based metrics have been used to estimate the number of people in specific locations such as airports (so-called crowd size estimation) [6]. In this case, a simple linear model based on volume metrics (i.e., the number of tweets) has been proposed. In [16], the averaged value of past audiences and Twitter data (contributions per minute) have been used for predicting the audience (TV rating) of long series of political TV shows (from 14 to 280 shows), mainly using volume metrics during broadcast time and the rate of tweeting people, obtaining an adjusted R squared of 0.95. In this regard, Nielsen Media Research discussed the capability of Twitter data to explain two thirds of the variance in premiere audience sizes [36]. TV rating is usually estimated by sampling the audience with specific meters, such as those installed by Auditel, or with more precise measures, such as those of Sky via set-top boxes/decoders. In [22], a neural network approach has been used for predicting the audience on the basis of Facebook data, obtaining a prediction accuracy in terms of Mean Absolute Percentage Error (MAPE) from about 6% to 24% on different TV shows. In [33], a number of TV shows have been analysed and clustered by similarity, with the aim of identifying a predictive model for each cluster taking into account the Twitter data of the previous days. The proposed predictive model is based on a linear regression (using volume and sentiment analysis metrics) that produced an R squared in the range of 0.73–0.94 depending on the cluster. Typically, clusters with a smaller total number of tweets per series are better ranked. Cross validation was not performed to verify the robustness of the model. In those cases, very stable data and long series have been addressed.
These series have a very different behaviour with respect to “reality TV shows”, in which there is a strong involvement of the audience in many phases of the show, and thus the number of tweets is much higher in the days before the show and massive on the day of the show. In [50], the authors found relevant correlations between the number of tweets posted 30 min before and after the show and in successive episodes, without proposing a predictive model. In [55], a functional comparison of classical solutions for estimating TV show ratings with respect to TV data usage is proposed, together with an early solution for the estimation of TV rating based on textual, spatial, and temporal relevance, again without proposing a predictive model.
According to the state of the art analysis, the predictive capabilities of Twitter data have been explained by using volume metrics on tweets (i.e., the total number of tweets and/or retweets associated with a Twitter user or having a given hashtag). However, in some cases a deeper semantic understanding of tweets has been required to create useful predictive capabilities. For these reasons, algorithms for sentiment analysis have been proposed to take into account the meaning of tweets via natural language processing (e.g., [37]). The adoption of techniques for segmenting, filtering or clustering by context (e.g., using natural language processing so as to avoid the misclassification of tweets related to the flu), or by users' profiles (e.g., age, location, language, and gender), may help in getting more precise results in terms of predictability. Overviews of predictive methods exploiting tweets have been proposed in [47] and in [31]. Moreover, [31] have criticised the predictive capabilities of some proposed models based on Twitter data. In fact, some approaches proposed general models adopting specific filtering and/or classifications based on human assessors, thus reducing the replicability of the solution. Twitter data also present some problems due to the way they are ingested and collected. In particular, access to the Twitter API has some limitations, such as: the maximum number of request calls in a period, the huge amount of tweets that can be produced in certain cases, the complexity of social relationships among users, the limited size of tweets (140 characters), and the fact that historical Twitter data are not accessible via the Twitter API. These facts force developers to set up specific architectures for collecting tweets, while attempting to get them with sufficient reliability [38].
In [26], the trend of information dissemination via Twitter has been analysed, observing the issues regarding the retweet cascade effect and the show count. Please note that the number of times a tweet is shown is not easily accessible from Twitter data, but it is a well-known observable metric exposed by Twitter's internal analytics. The paper demonstrated that the count of retweets and the number of shows do not have a strong correlation. With the aim of predicting the number of shows, a number of predictive metrics have been proposed, in particular: number of followers, friends, and favourites; number of times the user has been listed; number of posts; number of active days, etc.
1.2 Article overview
The paper focuses on presenting how Twitter data and derived metrics can be used for predicting the audience of reality TV shows. These are very attractive and addictive shows that create relevant retweeting effects. For reality shows, the prediction of event attendance (TV rating) can be very useful for service tuning (e.g., catering, cleaning, security) and for selling advertising. The prediction of the audience of TV programmes is mainly relevant to adapt the value of advertising and to attract more advertisers. In such cases, Twitter and the related collected metrics have been used to study and define a model able to predict the audience of TV shows. The proposed prediction model is based on the data collected during the days before the events. Such data have demonstrated predictive capabilities thanks to the identification of a relevant number of features/metrics, including: volume (counts of tweets, retweets, etc.), natural language processing (counts of nouns, adjectives, etc.), network (e.g., number of unique users), and sentiment analysis (assessing the positive and negative orientation of tweets). The identified metrics have been used to derive a model with high significance and predictive capability on the basis of a comparison among four methods: multilinear regression, ridge, lasso and elastic net, as described in the paper. The approach and results have been validated, thus demonstrating that it is possible to obtain specific metrics with excellent predictive skills from Twitter data also in these cases. The model validation has been performed by using data related to X Factor season 9 (XF9), X Factor season 10 (XF10), and Pechino Express 2015, which are reality shows broadcast in Italy in the 2015 and 2016 summer-fall periods. They are reality shows where people are highly involved through participative support of the media actors by using Twitter posts.
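The four regression methods compared in the paper are standard. As a sketch of the first two, ordinary least squares and ridge regression can be written in closed form; the synthetic data and the regularization weight below are illustrative only (lasso and elastic net have no closed form and are usually fitted by coordinate descent, e.g. with scikit-learn):

```python
import numpy as np

# Closed-form ordinary least squares and ridge regression, sketching two
# of the four compared methods. The data below are synthetic and for
# illustration only: 40 "shows" described by 3 Twitter-based metrics.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                       # feature matrix
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=40)    # audience-like target

def ols(X, y):
    """Ordinary least squares via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge regression: OLS with an L2 penalty of weight lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ols = ols(X, y)
w_ridge = ridge(X, y, lam=1.0)   # coefficients slightly shrunk toward zero
print(w_ols, w_ridge)
```

The L2 penalty makes ridge coefficients more stable when predictors are correlated, which is common among Twitter volume metrics; lasso and elastic net additionally drive some coefficients to exactly zero, performing feature selection.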
The above described predictive models have been applied by exploiting Twitter data and metrics collected and computed by using the Twitter Vigilance solution, a tool for multi-user collection of tweets for research and analysis (http://www.disit.org/tv). Twitter Vigilance has been partially developed in the context of the RESOLUTE H2020 project (http://www.resoluteeu.org) and used in the REPLICATE H2020 project and in the Sii-Mobility national smart city project (http://www.siimobility.org). Presently, Twitter Vigilance is adopted by a number of institutions to collect and exploit Twitter data for research and analysis purposes.
The paper is organized as follows. Section 2 describes the general architecture of the Twitter Vigilance solution (http://www.disit.org/tv) adopted to collect Twitter data and compute a number of metrics. Section 3 describes the methods adopted to identify and validate the predictive models and framework. In the same section, the adopted metrics are explained and formalized. They are related to: the volume of tweets and retweets; natural language processing, counting nouns, adjectives, and other elements; the assessment of the network of unique users tweeting; and the sentiment analysis in terms of positive and negative orientation of tweets. Section 4 reports the usage of Twitter data for the analysis and prediction of the audience of a number of reality show TV programmes (in particular XF9, XF10 and Pechino Express 2015). The section reports not only the results but also a comparison among a number of methods, arriving at the identification of the best approach. Conclusions are drawn in Section 5.
2 Twitter Vigilance architecture
Twitter provides different modalities to access Twitter data: Search API and Streaming API calls. Since version 1.1 of the Twitter API, it is necessary to log into Twitter by using the OAuth protocol for all requests. Both Twitter API types return data in JSON format. The Search API allows a limited number of requests every 15 min. The Streaming APIs give developers low latency access to Twitter's global stream, but limited access to the whole set of tweets. Twitter offers different streaming endpoints customized by use type: public, user and site. Both the Search and Streaming APIs present some limitations in terms of the maximum number of tweets per hour, and neither of them guarantees that all the tweets which are on Twitter.com can be obtained for the analysis.
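Any crawler built on the Search API must budget its requests against the 15-minute window mentioned above. A minimal, library-agnostic sketch of such budgeting follows; the per-window cap and the fetch_page callable are placeholders for illustration, not the actual Twitter Vigilance crawler or the exact Twitter limits:

```python
import time

# Sketch of rate-limit budgeting for an API capped per 15-minute window.
# fetch_page stands in for an actual (authenticated) API call; the cap
# value is an assumption for illustration.
WINDOW_SECONDS = 15 * 60
MAX_REQUESTS_PER_WINDOW = 180

def crawl(fetch_page, n_pages):
    """Call fetch_page n_pages times, sleeping when the window budget is spent."""
    results, window_start, used = [], time.monotonic(), 0
    for _ in range(n_pages):
        if used >= MAX_REQUESTS_PER_WINDOW:
            # Sleep until the current window expires, then reset the budget.
            time.sleep(max(0.0, WINDOW_SECONDS - (time.monotonic() - window_start)))
            window_start, used = time.monotonic(), 0
        results.append(fetch_page())
        used += 1
    return results
```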
Twitter Vigilance is multi-user: each user may define his/her own set of searches and aggregated views.
The Twitter Vigilance approach is based on the concept of the “TwitterVigilanceChannel”, which consists of a set of simple and complex search queries performed on the Twitter platform by the Crawler engine.
The configuration and statistical results of the Crawler are accessible from the front-end interface. The simplest TwitterVigilanceChannel to be monitored may collect and analyse tweets referring to a single Twitter user, user citation, hashtag, or keyword. Complex TwitterVigilanceChannels may consist of tens of queries/searches, according to the search query syntax of the Twitter APIs, combining keywords, user IDs, hashtags, citations, etc., with some operators (e.g., and, or, from). Twitter Vigilance is able to monitor, follow and analyse slow and fast events on Twitter. A fast event occurs when several hundreds, thousands or millions of related tweets are produced in a short time. Slow events may have very few tweets per day or week, or even none for a long period of time. Twitter Vigilance collects Twitter data and makes them accessible to the back office processes of statistical analysis, natural language processing (NLP) and sentiment analysis (SA), and for general data indexing, based on NLP on Hadoop [35].

(Tweet score pos) = Sentiment Analysis score for the positive mood of tweets;
(Tweet score neg) = Sentiment Analysis score for the negative mood of tweets;
(reTweet score pos) = Sentiment Analysis score for the positive mood of retweets;
(reTweet score neg) = Sentiment Analysis score for the negative mood of retweets;
(Tweet Score) = (Tweet score pos) + (Tweet score neg);
(ReTweet Score) = (reTweet score pos) + (reTweet score neg);
(T + RT score pos) = (Tweet score pos) + (reTweet score pos);
(T + RT score neg) = (Tweet score neg) + (reTweet score neg);
(T + RT Score) = (T + RT score pos) + (T + RT score neg).
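The compositions above are simple sums over the four base scores. A direct transcription follows, assuming the four base scores have already been computed by the sentiment analysis stage, with negative scores carried as signed (negative) values so that the sums act as net balances:

```python
# Direct transcription of the score compositions listed above. The four
# inputs are the per-period positive/negative scores of tweets and
# retweets; negative scores are assumed to be signed (<= 0).
def compose_scores(tw_pos, tw_neg, rtw_pos, rtw_neg):
    return {
        "TweetScore": tw_pos + tw_neg,
        "ReTweetScore": rtw_pos + rtw_neg,
        "T+RT score pos": tw_pos + rtw_pos,
        "T+RT score neg": tw_neg + rtw_neg,
        "T+RT Score": (tw_pos + rtw_pos) + (tw_neg + rtw_neg),
    }

print(compose_scores(3.0, -1.0, 2.0, -2.0))
```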
The computation of the above presented sentiment analysis metrics is useful to detect the inception and position in time of relevant events as peaks. Once a peak is detected, the user can download the data table to estimate more complex, higher level metrics (grounded on the above mentioned ones) which are more suitable for predicting the TV rating, as described in Section 3. In Fig. 3b, the trends of the above listed sentiment analysis metrics, computed on the basis of the adjectives extracted from the tweets, are depicted. This view may help analysts identify the most influential tweets and the corresponding adjectives which have provoked a significant positive/negative tendency. To this end, the operator may click on the graph to get back the list of the adjectives with their scores, from which some example tweets can also be shown. Similar graphs can be accessed for nouns and verbs.
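Peak detection over such daily metric series can be done in many ways; a minimal sketch is to flag days whose value exceeds the mean of the preceding window by k standard deviations. The window length and threshold below are illustrative choices, not the thresholds used by Twitter Vigilance:

```python
# Minimal peak detection over a daily metric series: flag days whose
# value exceeds the mean of the preceding window by k standard
# deviations. Window and threshold are illustrative choices.
def detect_peaks(series, window=7, k=2.0):
    peaks = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mean = sum(past) / window
        var = sum((x - mean) ** 2 for x in past) / window
        if series[i] > mean + k * var ** 0.5:
            peaks.append(i)
    return peaks

print(detect_peaks([10] * 10 + [100]))  # flags the last day (index 10)
```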
Recently, Twitter Vigilance has also been made accessible for real-time computation of statistical and sentiment analysis for specific dedicated analyses. An example of the channels under observation in real time can be checked at http://www.disit.org/rttv.
3 Framework for quantitative prediction by using TwitterVigilance outcomes
As shown in the state of the art and related work discussed in the introduction, Twitter data have a relevant and flexible predictive power, and generally lead to quantitative statistical predictive capabilities for several social targets of interest. The relations between social media data and predictive variables are a priori unknown. Analyses of Twitter data related to media show audiences have been proposed in the literature. In [16], the averaged value of the audience in past events and Twitter data (contributions per minute) have been used for predicting the audience of successive political TV shows having long series of events, thus demonstrating a correlation between the volume of tweets and the audience. In [22], a neural network approach has been used for predicting the audience on the basis of Facebook data; in particular, the number of posts, the number of shares, the number of comments, etc., without entering into the content of the posts, thus demonstrating the possibility of predicting the rating/share by using a neural network approach. In [29], a very high level analysis of the Twitter data related to a TV programme has been proposed, showing that the degree of interaction on Twitter was correlated with the X Factor programme and its evolution. The approach of using Twitter for TV programme analysis is also used by Nielsen for analysing whether Twitter is helping the audience or vice versa, deducing that the facts are related: “the volume of tweets caused significant changes in live TV ratings among 29 percent of the episodes” [51].
This paper describes the results of a research work aimed at identifying suitable models to predict media show audience (the number of people following the programme) by exploiting social media information for reality shows. The research also aimed to verify their validity in terms of prediction performance. The prediction of the number of attendees of a TV programme is a more precise measure with respect to the estimation of the rating, as in [22], since the rating can be affected by the presence of other competing TV programmes in the same time slots. In addition, the prediction of the audience of short term TV shows such as reality shows is very relevant for present-day television.
The framework proposed in this paper aims at defining a reliable statistical methodology to exploit Twitter data. Predictions with social data are generally based on conversational flow metrics concerning the volume of tweets, tweet content/text in terms of keywords, hashtags and mentions, and/or users' activity. Thus, the identified Twitter-based metric predictors can be classified into a number of main classes, and estimated for each single TwitterVigilanceChannel and/or for each single search, per day, per hour, or in total per event. In particular:
(1) volume/number of tweets (TW) and retweets (RTW) versus time;
(2) volume/number of tweets or retweets containing a certain keyword, verb, adjective, hashtag, citation, etc., versus time;
(3) total sentiment analysis scores, taking into account positive and/or negative scores for elements in the tweets and/or retweets, versus time;
(4) linear compositions of the above tweet volume statistics versus time (e.g., the ratio of the number of retweets to the number of the corresponding tweets);
(5) calendar variables calculated from the time at which tweets and/or retweets have been released;
(6) volume of unique users tweeting and/or retweeting versus time.
Please note that the metrics based on retweets have to be counted considering only the number of retweets available at that time and not those arriving later (for example, up to the day before the day to be predicted). Moreover, it should be noted that the lifecycle of a retweet is limited in time: according to the literature, almost all retweets appear within a few minutes, and sometimes a few hours, after the tweet, so the number of those arriving after days can be neglected [57].
The number of predictors that can be extracted depends on the Twitter data of the considered channel. TwitterVigilanceChannels created for large events, with many searches on popular keywords, are very rich in information and complex to analyse. Many predictive models could be built; however, not all of them may have predictive capability, or the same effectiveness in predicting events, visitors and/or audience. The selection of predictors is crucial to build a reliable predictive model; on such grounds, it is mandatory to identify predictors having a significant connection with the event for which a prediction is needed, within a reasonable temporal horizon.
In order to build a reliable predictive model, the temporal dynamics explaining the predictive capability have to be identified. Predictive models and metrics show different behaviours when periodic or continuous events are considered. For example, the number of visitors during an event could show a relevant or null relationship with calendar variables (such as month, week day, year, etc.); these variables are instead very important when the same attendee prediction is performed over uninterrupted, time-bound events of long duration, such as a carnival or an exposition.
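The calendar variables mentioned above can be derived directly from the event date; a minimal sketch (the feature names are illustrative):

```python
from datetime import date

# Illustrative extraction of the calendar variables mentioned above
# (year, month, week day) from an event date.
def calendar_features(d):
    return {"year": d.year, "month": d.month, "weekday": d.weekday()}

print(calendar_features(date(2015, 10, 22)))  # weekday 3 = Thursday
```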
3.1 Metrics definition and computation
Definitions of metrics for assessing the stream of tweets per search and channel
Metric name  Kind  Description  Metric definition 

TWWeek_z  volume  total number of tweets of the main hashtag collected over the 5 days preceding the event.  \( \mathrm{TWWeek}\_\mathrm{z}=\sum_{d= D5}^{D1}{TW}_z^d \) where \( {TW}_z^d \)is the number of tweets collected at day d, varying from D5 to D1, being D the day of the event. 
TWRTWWeek_z  volume  total number of tweets plus retweets of the main hashtag over the 5 days preceding the event.  \( \mathrm{TWRTWeek}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}{\mathrm{TW}}_{\mathrm{z}}^{\mathrm{d}}{+\mathrm{RTW}}_{\mathrm{z}}^{\mathrm{d}} \) where \( {TW}_z^d \)is the number of tweets and \( {RTW}_z^d \) the number of retweets collected at day d, varying from D5 to D1, being D the day of the event. 
RTWWeekRatio_z  High level metric, volume  ratio from the number of retweets and tweets collected over the 5 days preceding the event, is a sort of measure of the reactivity of the audience of visitors with respect to the conversation based on single tweet inside a TwitterVigilanceChannel.  \( \mathrm{RTWWeekRatio}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}\frac{{\mathrm{TW}}_{\mathrm{z}}^{\mathrm{d}}+{\mathrm{RTW}}_{\mathrm{z}}^{\mathrm{d}}}{{\mathrm{TW}}_{\mathrm{z}}^{\mathrm{d}}} \) where \( {RTW}_z^d \) is the number of retweets and \( {TW}_z^d \) the number of tweets collected at day d, varying from D5 to D1, being D the day of the event. 
UnqUserRTW_z  network  measures the number of unique users who retweeted in the 5 days preceding the event.  \( \mathrm{UnqUserRTW}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}{\mathrm{Uu}}_{\mathrm{RTW}}^{\mathrm{d}} \) where \( {Uu}_{RTW}^d \) is the number of unique users involved in retweeting estimated at day d, varying from D5 to D1, being D the day of the event. 
UnqUserTW_z  network  measures the number of unique users who tweeted in the 5 days preceding the event.  \( \mathrm{UnqUserTW}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}{\mathrm{Uu}}_{\mathrm{TW}}^{\mathrm{d}} \) where \( {Uu}_{RTW}^d \) is the number of unique users involved in tweeting estimated at day d, varying from D5 to D1, being D the day of the event. 
FUnqUsers_z  network  the whole set of unique users involved in tweeting and/or retweeting in the 5 days preceding the event.  \( \mathrm{FUnqUsers}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}{\mathrm{Uu}}^{\mathrm{d}} \) where Uu ^{ d } is the number of unique users involved in tweeting and/or retweeting estimated at day d, varying from D5 to D1, being D the day of the event. 
NLPTWWeek_z  NLP volume  score taking into account tweets in the 5 days preceding the event, counting the occurrence of distinct nouns, adjectives and verbs.  \( \mathrm{NLPTWWeek}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}\left({\sum}_{\mathrm{n}=1}^{\mathrm{N}\mathrm{nns}}\mathrm{TW}\_{\mathrm{n}\mathrm{ns}}_{\mathrm{z}}^{\mathrm{d},\mathrm{n}}+{\sum}_{\mathrm{a}=1}^{{\mathrm{N}}_{\mathrm{a}\mathrm{dj}}}\mathrm{TW}\_{\mathrm{a}\mathrm{dj}}_{\mathrm{z}}^{\mathrm{d},\mathrm{a}}+{\sum}_{\mathrm{v}=1}^{{\mathrm{N}}_{\mathrm{v}\mathrm{er}}}\mathrm{TW}\_{\mathrm{v}\mathrm{er}}_{\mathrm{z}}^{\mathrm{d},\mathrm{v}}\right) \) where \( {TW\_ nns}_z^{d, n} \), \( {TW\_ adj}_z^{d, a} \) and \( {TW\_ ver}_z^{d, v} \) are the total occurrence counts of, respectively, a generic noun n, a generic adjective a and a generic verb v extracted from collected tweets at day d, varying from D5 to D1, being D the day of the event. N _{ nns }, N _{ adj } and N _{ ver } are the total number of distinct nouns, adjectives and verbs, respectively, extracted in tweets collected in the same temporal window. 
NLPRTWWeek_z  NLP volume  score taking into account retweets in the 5 days preceding the event, counting the occurrence of distinct nouns, adjectives and verbs.  \( \mathrm{NLPRTWWeek}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}\left({\sum}_{\mathrm{n}=1}^{{\mathrm{N}}_{\mathrm{n}\mathrm{ns}}}\mathrm{RTW}\_{\mathrm{n}\mathrm{ns}}_{\mathrm{z}}^{\mathrm{d},\mathrm{n}}+{\sum}_{\mathrm{a}=1}^{{\mathrm{N}}_{\mathrm{a}\mathrm{dj}}}\mathrm{RTW}\_{\mathrm{a}\mathrm{dj}}_{\mathrm{z}}^{\mathrm{d},\mathrm{a}}\mathrm{RTW}\_{\mathrm{ver}}_{\mathrm{z}}^{\mathrm{d},\mathrm{v}}\right) \) where \( {RTW\_ nns}_z^{d, n} \), \( {RTW\_ adj}_z^{d, a} \) and \( {RTW\_ ver}_z^{d, v} \) are the total occurrence counts of, respectively, a generic noun n, a generic adjective a and a generic verb v extracted from collected retweets at day d, varying from D5 to D1, being D the day of the event. N _{ nns }, N _{ adj } and N _{ ver } are the total number of distinct nouns, adjectives and verbs, respectively, extracted in retweets collected in the same temporal window. 
SATWPosWeek_z  Sentiment analysis  Sentiment score taking into account all tweets in the 5 days preceding the event, adding the nouns, adjectives and verbs, each one weighted by its corresponding positive SA score.  \( \mathrm{SATWPosWeek}\_\mathrm{z}=\sum_{\mathrm{d}=\mathrm{D}\hbox{} 5}^{\mathrm{D}\hbox{} 1}\left({\sum}_{\mathrm{n}=1}^{\mathrm{N}\mathrm{nns}}\mathrm{TW}\_{\mathrm{n}\mathrm{ns}}_{\mathrm{z}}^{\mathrm{d},\mathrm{n}}{\mathrm{gss}}_{\mathrm{pos}}^{\mathrm{n}}+{\sum}_{\mathrm{a}=1}^{{\mathrm{N}}_{\mathrm{a}\mathrm{dj}}}\mathrm{TW}\_{\mathrm{a}\mathrm{dj}}_{\mathrm{z}}^{\mathrm{d},\mathrm{a}}{\mathrm{gss}}_{\mathrm{pos}}^{\mathrm{a}}+{\sum}_{\mathrm{v}=1}^{{\mathrm{N}}_{\mathrm{v}\mathrm{er}}}\mathrm{TW}\_{\mathrm{v}\mathrm{er}}_{\mathrm{z}}^{\mathrm{d},\mathrm{v}}{\mathrm{gss}}_{\mathrm{pos}}^{\mathrm{v}}\right) \) where \( {TW\_ nns}_z^{d, n} \) is the occurrence of a generic noun n with positive sentiment score \( {ss}_{pos}^n \) at day d; \( {TW\_ adj}_z^{d, a} \) is the occurrence of a generic adjective a with positive sentiment score \( {ss}_{pos}^a \) at day d and \( {TW\_ ver}_z^{d, v} \) is the occurrence of a generic verb v with positive sentiment score \( {ss}_{pos}^v \) at day d; these three metrics are computed for all the tweets collected in the 5 days preceding the event; N _{ nns }, N _{ adj }and N _{ ver } are the total number of distinct nouns, adjectives and verbs, respectively, retrieved in tweets collected in the same temporal window. 
SATWNegWeek_z  Sentiment analysis  Sentiment score taking into account all tweets in the 5 days preceding the event, adding the nouns, adjectives and verbs, each one weighted by its corresponding negative SA score.  \( \mathrm{SATWNegWeek\_z}=\sum_{d=D-5}^{D-1}\left(\sum_{n=1}^{N_{nns}}{TW\_nns}_z^{d,n}\cdot ss_{neg}^{n}+\sum_{a=1}^{N_{adj}}{TW\_adj}_z^{d,a}\cdot ss_{neg}^{a}+\sum_{v=1}^{N_{ver}}{TW\_ver}_z^{d,v}\cdot ss_{neg}^{v}\right) \) where \( {TW\_nns}_z^{d,n} \) is the occurrence of a generic noun n with negative sentiment score \( ss_{neg}^{n} \) at day d; \( {TW\_adj}_z^{d,a} \) is the occurrence of a generic adjective a with negative sentiment score \( ss_{neg}^{a} \) at day d; and \( {TW\_ver}_z^{d,v} \) is the occurrence of a generic verb v with negative sentiment score \( ss_{neg}^{v} \) at day d; these three metrics are computed for the tweets collected in the 5 days preceding the event. N_{nns}, N_{adj} and N_{ver} are the total numbers of distinct nouns, adjectives and verbs, respectively, retrieved in the tweets collected in the same temporal window.
SARTWPosWeek_z  Sentiment analysis  Sentiment score taking into account all retweets in the 5 days preceding the event, adding the nouns, adjectives and verbs, each one weighted by its corresponding positive SA score.  \( \mathrm{SARTWPosWeek\_z}=\sum_{d=D-5}^{D-1}\left(\sum_{n=1}^{N_{nns}}{RTW\_nns}_z^{d,n}\cdot ss_{pos}^{n}+\sum_{a=1}^{N_{adj}}{RTW\_adj}_z^{d,a}\cdot ss_{pos}^{a}+\sum_{v=1}^{N_{ver}}{RTW\_ver}_z^{d,v}\cdot ss_{pos}^{v}\right) \) where \( {RTW\_nns}_z^{d,n} \) is the occurrence of a generic noun n with positive sentiment score \( ss_{pos}^{n} \) at day d; \( {RTW\_adj}_z^{d,a} \) is the occurrence of a generic adjective a with positive sentiment score \( ss_{pos}^{a} \) at day d; and \( {RTW\_ver}_z^{d,v} \) is the occurrence of a generic verb v with positive sentiment score \( ss_{pos}^{v} \) at day d; these three metrics are computed for the retweets collected in the 5 days preceding the event. N_{nns}, N_{adj} and N_{ver} are the total numbers of distinct nouns, adjectives and verbs, respectively, retrieved in the retweets collected in the same temporal window.
SARTWNegWeek_z  Sentiment analysis  Sentiment score taking into account all retweets in the 5 days preceding the event, adding the nouns, adjectives and verbs, each one weighted by its corresponding negative SA score.  \( \mathrm{SARTWNegWeek\_z}=\sum_{d=D-5}^{D-1}\left(\sum_{n=1}^{N_{nns}}{RTW\_nns}_z^{d,n}\cdot ss_{neg}^{n}+\sum_{a=1}^{N_{adj}}{RTW\_adj}_z^{d,a}\cdot ss_{neg}^{a}+\sum_{v=1}^{N_{ver}}{RTW\_ver}_z^{d,v}\cdot ss_{neg}^{v}\right) \) where \( {RTW\_nns}_z^{d,n} \) is the occurrence of a generic noun n with negative sentiment score \( ss_{neg}^{n} \) at day d; \( {RTW\_adj}_z^{d,a} \) is the occurrence of a generic adjective a with negative sentiment score \( ss_{neg}^{a} \) at day d; and \( {RTW\_ver}_z^{d,v} \) is the occurrence of a generic verb v with negative sentiment score \( ss_{neg}^{v} \) at day d; these three metrics are computed for the retweets collected in the 5 days preceding the event. N_{nns}, N_{adj} and N_{ver} are the total numbers of distinct nouns, adjectives and verbs, respectively, retrieved in the retweets collected in the same temporal window.
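As a minimal illustration of how an NLP volume metric of this kind could be computed, the following Python sketch sums the occurrences of nouns, adjectives and verbs over the 5-day window preceding the event. The token lists and POS labels are hypothetical, since the paper does not specify the tagger used.

```python
from collections import Counter

def nlp_rtw_week(daily_tagged_tokens):
    """Sum the occurrences of nouns, adjectives and verbs in retweets
    over the 5 days preceding the event (days D-5 .. D-1).

    daily_tagged_tokens: list of 5 lists of (token, pos) pairs, one per
    day; 'NOUN', 'ADJ' and 'VERB' are illustrative POS labels.
    """
    total = 0
    for day_tokens in daily_tagged_tokens:
        counts = Counter(pos for _, pos in day_tokens)
        total += counts["NOUN"] + counts["ADJ"] + counts["VERB"]
    return total

# Five hypothetical days of tagged retweet tokens:
days = [
    [("talent", "NOUN"), ("amazing", "ADJ"), ("sing", "VERB")],
    [("show", "NOUN"), ("vote", "VERB")],
    [], [],
    [("final", "ADJ")],
]
print(nlp_rtw_week(days))  # 6
```

The same structure, with each occurrence multiplied by a per-word sentiment score, yields the SA-weighted variants above.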
Taking into account the ratio of RTW to TW does not mean that, for very high numbers of tweets, the amount of retweets actually diminishes the crowd/audience size: as can be observed on the Twitter Vigilance platform, the numbers of tweets and retweets are typically balanced in the absence of large viral events and audiences. Furthermore, in the presence of an audience, the RTW/TW ratio is a measure of reactivity, whereas for measuring the volume, metrics based on the total volume are more relevant. The RTW/TW ratio may take very large values if the monitored event becomes strongly viral, for example millions of retweets generated by only a few tweets; however, this was not the case in any of the three data sets tested, nor is it typical of this kind of event.
The computation of the Sentiment Analysis metrics has been performed by exploiting SentiWordNet [10], a semantic knowledge base specifically designed for Sentiment Analysis. SentiWordNet assigns sentiment scores to each extracted keyword in order to estimate the general sentiment polarity of the collected tweets. SentiWordNet is a sentiment-enriched extension of WordNet [12], a widely used lexical database of English nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms (synsets). In SentiWordNet, independent positive, negative and neutral sentiment values (i.e., real numbers in the interval from −1 to 1) are associated with about 117,000 synsets. In order to carry out the analysis in both English and Italian, the SentiWordNet lexicon (originally designed for English) has been automatically ported to an Italian version on the basis of MultiWordNet [41], a resource which aligns English WordNet synsets with Italian ones and can therefore be used to transfer the sentiment polarity associated with English words to the corresponding Italian ones. For each single tweet/retweet, the overall polarity score is given by the sum of the sentiment-weighted keywords extracted from it.
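The per-tweet polarity computation can be sketched as follows. The lexicon below is a toy stand-in for SentiWordNet with made-up scores, and combining the positive and negative scores as a difference is one plausible reading of "sum of the sentiment-weighted keywords"; the paper's positive and negative metrics also keep the two polarities separate.

```python
# Toy lexicon standing in for SentiWordNet (scores are hypothetical;
# real values come from the SentiWordNet/MultiWordNet resources).
POS_SCORE = {"love": 0.75, "nice": 0.5, "boring": 0.0}
NEG_SCORE = {"love": 0.0, "nice": 0.0, "boring": 0.625}

def tweet_polarity(keywords):
    """Overall polarity of a tweet: sum over extracted keywords of the
    (positive - negative) sentiment scores; unknown words count as 0."""
    return sum(POS_SCORE.get(w, 0.0) - NEG_SCORE.get(w, 0.0) for w in keywords)

print(tweet_polarity(["love", "boring"]))  # 0.125
```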
Most of the above-mentioned metrics can be estimated every 5 min, every hour, every day, or over multiple days, according to the objective of the assessment (see Fig. 3 for an example). The Twitter Vigilance platform estimates a number of them daily and others hourly. In any case, the user may recompute them with a different granularity from a specific interface by requesting an ad-hoc task. In the next section, an overview of the whole process is presented.
3.2 The overall process for model definition
 (I)
Set up a TwitterVigilanceChannel semantically linked to the event in order to perform Twitter data harvesting. The creation of the channel is grounded on the official hashtags, Twitter user IDs, and relevant keywords. Other search queries to collect tweets can be added on the basis of an early analysis of the Twitter data, thus enlarging the set of queries searched on Twitter. This step is strongly dependent on the cases under analysis, as described in Section 4.
 (II)
Identify a first large set of possible metrics from the early collected data, using a temporal basis of aggregation coherent with the real data values to be predicted (for example, volume of single channel queries over time, unique users over time, calendar variables, natural language processing features, sentiment analysis features). In any case, the searches of the TwitterVigilanceChannel which collect a large number of tweets and retweets are typically significant and thus good potential predictors. The time series of the metrics then have to be merged to define the channel's "guess metric matrix";
 (III)
Select metrics: when the metrics extracted from the channel are too many, a statistical criterion may be applied to select the statistically significant ones. For example, principal component analysis (PCA) may give an indication of the variance coverage and of the complexity of the data in terms of the number of principal components to be considered. In addition, some early experiments adopting a multilinear regression schema may help, with the support of the Akaike Information Criterion (AIC) [2], in selecting or discarding the most/least significant metrics as predictors. The selection may be carried out by using a stepwise process to build a sharper model, both discarding unreliable variables (by minimizing the AIC) and retaining the ones with a stronger linkage with the variable to be predicted [54]. The statistically reliable predictors are defined as the ones having a significant Student's t-test outcome (p-value < 0.05). Alternatively, machine learning approaches can be adopted; in any case, the predictive capability, the adjusted R-squared and the AIC may help in deciding among the different methods. In most cases, the predictive model is produced by using 70%–80% of the data (e.g., estimating coefficient parameters, or learning parameters). The learned model is then used to predict the remaining 30%–20%, on which the MAPE (Mean Absolute Percentage Error) and/or APE (Absolute Percentage Error) are estimated to validate the predictive model against the actual values recorded by the auditing agencies.
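The AIC-driven stepwise selection of step (III) can be sketched as a backward elimination that drops one predictor at a time while the AIC keeps decreasing. This is a minimal numpy-only sketch assuming the standard Gaussian OLS form of the AIC (up to an additive constant); the metric names and data are illustrative, not the paper's.

```python
import numpy as np

def ols_aic(X, y):
    """AIC (up to an additive constant) of an OLS fit with intercept:
    n*log(SSE/n) + 2k, assuming Gaussian errors."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = float(np.sum((y - Xd @ beta) ** 2))
    n, k = len(y), Xd.shape[1]
    return n * np.log(sse / n) + 2 * k

def backward_stepwise(X, y, names):
    """Greedily drop the predictor whose removal most lowers the AIC,
    stopping when no removal improves it."""
    keep = list(range(X.shape[1]))
    best = ols_aic(X[:, keep], y)
    while len(keep) > 1:
        scores = [(ols_aic(X[:, [i for i in keep if i != j]], y), j) for j in keep]
        aic, j = min(scores)
        if aic >= best:
            break
        best = aic
        keep.remove(j)
    return [names[i] for i in keep]

# Synthetic example: two informative metrics plus one pure-noise column.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=60)
selected = backward_stepwise(X, y, ["TWWeek_z", "UnqUserTW_z", "noise"])
print(selected)
```

The two informative predictors always survive; the noise column is usually (though not deterministically) dropped.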
3.3 Predictive models
Reality TV programmes are, in some sense, short-lived events occurring with weekly periodicity for a limited number of weeks, thus concentrating the audience in a few hours per week. Good examples of this kind of event are the so-called reality shows, such as XF9, XF10 and Pechino Express, which are typically broadcast live once per week (for a few hours), for a few weeks condensed in a specific part of the year.
The aim is to invert model (1) by estimating β_1, β_2, β_3, …, β_k and η, which represent the coefficients and the intercept of the best fitting line, respectively, obtained by a least squares model. In this process, the estimated model can be more or less significant, and statistical significance can be estimated for each coefficient and for the whole fitting. Weights are estimated by means of a learning period, thus allowing targeting of the model construction. Basically, several different models have been tested by estimating weights and assessing predictive capabilities. In order to set up a predictive model, the value of x_t is estimated on the basis of the explanatory variables/metrics (z_1, z_2, …, z_k) computed at t−1 or before.
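The least-squares estimation of the coefficients and intercept can be sketched as follows, with synthetic weekly metrics in place of the real Twitter data (the paper's computations were carried out in R; this numpy version only illustrates the fitting step).

```python
import numpy as np

# Fit x_t = beta_1*z_1 + ... + beta_k*z_k + eta by ordinary least squares
# on synthetic data generated with known weights (illustrative values).
rng = np.random.default_rng(1)
Z = rng.normal(size=(10, 2))                 # metrics z_1, z_2 over 10 training weeks
x = 3.0 * Z[:, 0] - 2.0 * Z[:, 1] + 5.0      # "audience" built from known coefficients

Zd = np.column_stack([Z, np.ones(len(x))])   # column of ones carries the intercept
coef, *_ = np.linalg.lstsq(Zd, x, rcond=None)
beta, eta = coef[:-1], coef[-1]
print(np.round(beta, 3), round(float(eta), 3))   # recovers [3., -2.] and 5.0
```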
With many predictors and few observations in the dataset, fitting the full model without penalization could result in large prediction intervals, and sometimes the model can overfit the data: when there are collinearity issues, the linear regression parameter estimates may become inflated. One consequence of large correlations among the predictors is that the variance of the estimates can become very large. For this reason, a shrinkage/regularization model (i.e., ridge regression) has been tested [21], which adds a penalty on the sum of the squared regression parameters. The effect of the penalty is that the estimated parameters are allowed to become large only if there is a proportional reduction in the sum of squared errors (SSE). Thus, by adding the penalty, we make a trade-off between model variance and bias: by sacrificing some bias, we can often reduce the variance enough to make the overall MSE (Mean Squared Error) lower than that of unbiased models. In the selection of the best predictive model, other techniques have also been tested, such as the lasso [52] and the Elastic Net [20].
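The shrinkage effect can be shown with the closed-form ridge solution on two nearly collinear predictors. This is a sketch of the general technique, not the paper's R implementation; the data and the penalty value lam are illustrative.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression: minimizes SSE + lam * sum(beta^2).
    Predictors and response are centred so the intercept is not penalized."""
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc, yc = Z - z_mean, y - y_mean
    beta = np.linalg.solve(Zc.T @ Zc + lam * np.eye(Z.shape[1]), Zc.T @ yc)
    eta = y_mean - z_mean @ beta          # intercept recovered after centring
    return beta, eta

# Two nearly collinear predictors, the situation described above:
rng = np.random.default_rng(2)
z1 = rng.normal(size=30)
z2 = z1 + rng.normal(scale=0.01, size=30)    # almost identical to z1
Z = np.column_stack([z1, z2])
y = z1 + z2 + rng.normal(scale=0.1, size=30)

b_ols, _ = ridge_fit(Z, y, lam=0.0)          # plain least squares: unstable estimates
b_ridge, _ = ridge_fit(Z, y, lam=1.0)        # penalty shrinks the coefficient norm
print(np.round(b_ols, 2), np.round(b_ridge, 2))
```

With lam > 0 the two coefficients are pulled toward each other while their sum stays close to the true combined effect, which is exactly the variance/bias trade-off described above.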
The following section refers to the prediction of the audience of the TV programmes X Factor 9, X Factor 10 and Pechino Express. For these, a suitable prediction model has been obtained by exploiting data from previous days using multiregressive and ridge models. According to the above considerations, the reliable covariates have been identified on the basis of their statistical relevance with respect to the variable to be predicted and by using a minimal AIC criterion [2]. The quality of the models in terms of predictive capability has been assessed over the validation period on the basis of the root mean square error (RMSE) and mean absolute error (MAE) metrics, applied to the predicted values and to the corresponding ones observed during the validation/test period. The metric selection process has been carried out by assessing the incidence of each metric in explaining the variable to be predicted in the multilinear regression model.
4 Predicting TV audience via Twitter data

XF9 description and actual audience data are accessible on: https://it.wikipedia.org/wiki/X_Factor_%28nona_edizione%29, while TwitterVigilance data can be accessed from: http://www.disit.org/tv/index.php?p=chart_singlechannel&canale=Xfactor9

XF10 description and actual audience data are accessible on: https://it.wikipedia.org/wiki/X_Factor_(decima_edizione), while TwitterVigilance data from: http://www.disit.org/tv/index.php?p=chart_singlechannel&canale=xf10

Pechino Express description and actual audience data are accessible on: https://it.wikipedia.org/wiki/Pechino_Express_%28quarta_edizione%29, while TwitterVigilance data from: http://www.disit.org/tv/index.php?p=chart_singlechannel&canale=ads
In more detail, X Factor is a television music competition format born in the UK and then exported abroad, becoming the biggest television talent competition in Europe. In Italy, the 9th season (identified as XF9) was televised from September to December 2015, and the 10th season in 2016, with the first episodes devoted to auditions and singers' selections. The initial transmissions were followed by six weeks of weekly live shows in which the less appreciated singers were progressively eliminated, so that the best four talents could reach the final event, where the winner was voted by the public. XF9 and XF10 have been broadcast by the pay-TV channel Sky1, while the first phases and the final ones have also been transmitted on free-of-charge channels, i.e., national public television. The show began at prime time and closed after midnight with a shorter transmission called "Xtra Factor" to discuss the main show, while still attracting the same audience. The audience of XF9 is typically made up of young people, who are also engaged in voting for singers and groups, so as to eliminate them or push them ahead in the competition. As occurs for every talent competition, the participation of the public is critical for the success of the show; social media play a relevant role in promoting singers, stimulating discussions and comments, and pushing the audience to follow the show, vote for their favourite singers, and so on.
Votes from the audience during the final broadcast of XF9 reached 7 million, and the official hashtag #xf9 was the most widely used of the day (10th December, the date of the final show), both in Italy and among worldwide trending topics on Twitter. The competition led to four finalists in December 2015: Giosada, Urban Strangers, Davide Sciortino and Enrica Tara, with Giosada the winner. A similar analysis could be performed for XF10.
Knowledge of the audience volume, and thus its prediction, can be very important when it comes to the sale of ads delivered in the context of television programmes. Today, the ads value is only guessed, since the measure of the audience is obtained the day after, by the Smart Panel Sky and/or, in some cases, Auditel (Auditel, the national metering service for TV audience, could not provide measures of XF9 for over 15 days in the period, and on that basis it was not used as a reference value).
4.1 Descriptive statistics
Importance of components for XF9 data
Factors  Eigenvalue  % Variance  % Cumulative Variance 

1  2.63  53.26  53.26 
2  1.92  28.44  81.71 
3  1.15  10.15  91.85 
4  0.86  5.72  97.57 
5  0.46  1.61  99.18 
6  0.23  0.41  99.59 
7  0.18  0.25  99.83 
8  0.12  0.11  99.94 
9  0.07  0.04  99.98 
10  0.05  0.02  100.00 
11  0.01  0.00  100.00 
12  0.01  0.00  100.00 
Principal Component loadings for XF9 data with respect to identified metrics
metrics and data  PC1  PC2  PC3 

Sky Audience  −0.1913  −0.4001  0.7099 
TWRTWWeek_z  −0.8745  0.4396  −0.1848 
TWWeek_z  −0.8572  0.4846  −0.1485 
RTWWeekRatio_z  −0.3462  0.8058  0.3857 
UnqUserTW_z  −0.9241  0.2170  −0.1115 
UnqUserRTW_z  −0.7276  0.6518  0.0693 
FUnqUsers_z  −0.7607  0.6225  0.0707 
SATWPosWeek_z  −0.8562  −0.3978  0.1180 
SATWNegWeek_z  −0.8439  −0.4174  0.2269 
SARTWPosWeek_z  −0.6261  −0.6526  0.2860 
SARTWNegWeek_z  −0.5478  −0.5900  0.5665 
NLPTWWeek_z  −0.8680  −0.4149  0.1310 
NLPRTWWeek_z  −0.6449  −0.5671  0.3206 
Factor 1 carries more than 53% of the total variability of the dataset (see Table 2), and this variability is mainly explained by the majority of the covariates. The variability of Factor 2 (28.4%) is carried by the positive correlation of RTWWeekRatio_z (0.8058) and the negative correlation of SARTWPosWeek_z (−0.6526), while Factor 3 explains about 10.15% of the total variability. PCA allowed sorting the features according to their impact on the total variability and understanding the correlations among the metrics and the XF9 Sky audience.
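An eigenvalue table like Table 2 can be reproduced from any metric matrix via the eigendecomposition of its correlation matrix. The sketch below uses synthetic data standing in for the real metrics (three correlated columns plus one independent one); the resulting percentages are illustrative only.

```python
import numpy as np

# Synthetic metric matrix: three strongly correlated metrics plus one
# independent metric, standing in for the real Twitter-based metrics.
rng = np.random.default_rng(3)
base = rng.normal(size=(40, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(40, 1)) for _ in range(3)]
              + [rng.normal(size=(40, 1))])

# Eigenvalues of the correlation matrix give each factor's share of the
# total variance (the "% Variance" and "% Cumulative Variance" columns).
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]     # sorted in descending order
explained = 100 * eigvals / eigvals.sum()
print(np.round(explained, 1))
print(np.round(np.cumsum(explained), 1))
```

With three of the four columns nearly collinear, the first factor carries the bulk of the variance, mirroring the dominance of Factor 1 in the XF9 data.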
4.2 Validation models
Parameters of the validation models using only volume and network based metrics estimated for XF9 and XF10 with a multilinear regression approach
Metrics  XF9a Validation Model  XF10a Validation Model  

Coeff  Std Err  tval  pval  Coeff  Std Err  tval  pval  
TWRTWWeek_z  β _{1}  161.2  144.1  1.119  0.314  999.6  788.1  1.268  0.260 
TWWeek_z  β _{2}  −220.4  240.1  −0.918  0.401  −1489  1412  −1.054  0.340 
RTWWeekRatio_z  β _{3}  −2,190,936  1,308,957  −1.674  0.155  −11,342,148  4,477,279  −2.533  0.052 
UnqUserTW_z  β _{4}  −327.8  490.8  −0.668  0.534  −6414  2761  −2.323  0.068 
UnqUserRTW_z  β _{5}  −99.16  670.1  −0.148  0.888  −6655  2821  −2.359  0.065 
FUnqUsers_z  β _{6}  −5.461  617.1  −0.009  0.993  6208  2726  2.277  0.072 
Intercept  η  5,387,852  2,306,725  2.336  0.067  21,546,552  8,072,832  2.669  0.044 
R squared  0.867  0.781  
Adjusted R squared  0.707  0.517  
AIC  306  310  
RMSE  42,159  50,800  
MAE  34,244  42,288  
Weeks  12  12  
millions of tweets + retweets on Twitter Vigilance  1.625  1.383 
Parameters of the validation models according to the Eq. (1) using only volume and network based metrics for XF9 with a multilinear regression approach
Metrics and parameters  XF9b Validation Model  

Coeff  Std Err  tval  pval  
TWRTWWeek_z  β _{1}  15.19  5.551  2.736  0.0256 
UnqUserTW_z  β _{2}  −346.2  81.7  −4.237  0.0028 
RTWWeekRatio_z  β _{3}  −1,505,184  382,610  −3.934  0.0043 
Intercept  η  4,092,413  612,821  6.678  0.00015 
R squared  0.832  
Adjusted R squared  0.768  
AIC  302  
RMSE  47,408  
MAE  40,745  
Weeks  12  
millions of tweets + retweets on Twitter Vigilance  1.625 
Parameters of the validation models according to the Eq. (1) using only volume and network based metrics for Pechino Express with a multilinear regression approach
Metrics and parameters  PEb Validation Model  

Coeff  Std Err  tval  pval  
TWWeek_z  β _{1}  −136.5  53.07  −2.573  0.062 
UnqUserRTW_z  β _{2}  3175  1491  2.130  0.100 
FUnqUsers_z  β _{3}  −1392  1082  −1.286  0.268 
Intercept  η  2,235,653  112,963  19.790  3.85E-05 
R squared  0.877  
Adjusted R squared  0.785  
AIC  203  
RMSE  42,747  
MAE  36,453  
Weeks  8  
millions of tweets + retweets on Twitter Vigilance  0.455 
Parameters of the validation models using ridge approach with mixed metrics (volume, NLP and SA) estimated for XF9 and XF10
Metrics and parameters  XF9c mixed Validation Model  XF10c mixed Validation Model  

Coeff  Std Err  tval  pval  Coeff  Std Err  tval  pval  
RTWWeekRatio_z  β _{1}  −969,524  354,103  −2.738  0.041  −2,288,390  899,333  −2.545  0.051 
SATWNegWeek_z  β _{2}  253.4  327.8  0.773  0.474  2495  809.8  3.081  0.027 
SARTWPosWeek_z  β _{3}  7.541  2.563  2.943  0.032  −125.2  73.66  −1.699  0.150 
SARTWNegWeek_z  β _{4}  −4.489  7.064  −0.635  0.553  310.6  98.05  3.168  0.025 
NLPTWWeek_z  β _{5}  −13.73  10.62  −1.293  0.252  −73.77  19.37  −3.809  0.012 
NLPRTWWeek_z  β _{6}  0.03587  0.2756  0.130  0.901  3.97  2.378  1.669  0.156 
Intercept  η  3,193,367  647,930  4.929  0.004  5,377,506  1,646,706  3.266  0.022 
R squared  0.859  0.861  
Adjusted R squared  0.690  0.695  
AIC  306  305  
RMSE  43,370  40,358  
MAE  33,374  31,982  
Weeks  12  12  
millions of tweets on Twitter Vigilance  1.625  1.383 
The model has been produced after testing several combinations of the metrics, according to systematic approaches which allowed us to derive the best model in terms of AIC, exploiting volume, NLP and sentiment analysis metrics (using both multilinear and ridge approaches). Also in this case, according to the p-values, we could identify some less satisfactory metrics for the XF9 data that may be good for XF10; thus, a compromise model fitting satisfactorily in both cases has been reported. The final model has been obtained with the ridge approach; the obtained adjusted R-squared is 0.69 and the R-squared about 0.86, with a suitable AIC of about 305 in both cases. Please note that, comparing Tables 4, 5 and 7, both the multilinear and ridge approaches produced similar results. In some cases, the model based on volume metrics may be better ranked with respect to the mixed models in terms of adjusted R-squared, and worse in terms of RMSE.
In the next section, a wider comparison with other approaches is reported in the context of predictive models. For Pechino Express, an identical mixed model is not viable, since the number of metrics (and thus the number of coefficients β_i to be estimated) is too high with respect to the number of samples, thus producing an unstable model.
4.3 Predictive models
Comparison among the predictive models considered in the case of XF9 data; APE and MAPE are estimated on the test/prediction period on the basis of the model defined on the training data set
Prediction Errors and parameters  XF9 comparison of different pred. Models  

Lasso  Elastic net  Ridge reg.  LM  
APEweek 11/6  0.2425  0.1173  0.0853  0.3456 
APEweek 12/7  0.0907  0.1044  0.0429  0.1234 
APEweek 13/8  0.3879  0.1837  0.2457  0.4257 
MAPE  0.2403  0.1352  0.1246  0.2983 
Training set  Weeks 1–10  
Test/prediction  Weeks 11–13 
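The APE and MAPE figures reported in these tables follow the standard definitions, which can be sketched as follows. The audience numbers below are hypothetical, chosen only to show the computation.

```python
import numpy as np

def ape(actual, predicted):
    """Absolute Percentage Error for a single week's audience."""
    return abs(actual - predicted) / abs(actual)

def mape(actual, predicted):
    """Mean Absolute Percentage Error over the test/prediction weeks."""
    return float(np.mean([ape(a, p) for a, p in zip(actual, predicted)]))

# Hypothetical audience figures for three held-out weeks:
actual = [1_200_000, 1_500_000, 2_400_000]
predicted = [1_150_000, 1_450_000, 1_900_000]
print([round(ape(a, p), 4) for a, p in zip(actual, predicted)])  # [0.0417, 0.0333, 0.2083]
print(round(mape(actual, predicted), 4))                         # 0.0944
```

An APE of 0.0417 corresponds to roughly 96% accuracy for that week, which is how the accuracy ranges quoted below relate to the tabulated errors.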
According to these results, the ridge regression approach has proved to be the most accurate in prediction with respect to the above-mentioned approaches. Therefore, the models XF9c, XF10c and PEb (Pechino Express model b), produced by using the ridge approach, have been adopted as predictive models, estimating the coefficients on the basis of the initial weeks' data with the aim of predicting the audience of the last 3 weeks' major events in advance.
Summary results for the prediction of the audience of the TV programmes XF9, XF10 and Pechino Express on the basis of the predictive models XF9c, XF10c and PEb, using both multilinear regression and ridge regression
Prediction Errors and parameters  XF9  XF10  Pechino Express  

Ridge Reg.  LM  Ridge Reg.  LM  Ridge Reg.  LM  
APEweek 11/6  0.0853  0.3456  0.0511  0.0323  0.0670  0.0696 
APEweek 12/7  0.0429  0.1234  0.0896  0.1327  0.0341  0.0998 
APEweek 13/8  0.2457  0.4257  0.4479  0.4580  0.0412  0.0093 
MAPE (11–13)/(6–8)  0.1246  0.2983  0.1962  0.2077  0.0474  0.0596 
Training set  weeks 1–10  Weeks 1–5  
Test/prediction  weeks 11–13  Weeks 6–8 
It should be noted that the precision in predicting the audience of the next and successive prime-time events on the basis of the model computed on the data of weeks 1–10 is very high: in all cases, in the range of 92%–95% accuracy. On the other hand, the model is not capable of performing highly reliable predictions for the last event of the season, in which a strong non-linearity occurs. The general precision is in the range of 80%–94%. In the case of XF9 and XF10, the prediction for the last major event is less accurate than for Pechino Express, since the last live shows of XF9 and XF10 presented a quite explosive final event in terms of TV audience with respect to Pechino Express. In fact, for PE, the prediction of the 3rd week is still in the range of 95%, since the last event is not as massive as in X Factor. As a general consideration, the prediction models identified are suitable for predicting reality show audiences in most cases, and thus the identified limitations of state-of-the-art algorithms and solutions have been overcome.
Most of the computations were conducted in the R Statistical Environment (https://www.R-project.org/) by using different R libraries: "forecast" [23] for predictive modeling, "MASS" for the model selection previously cited, "xts" [44] to manage time series, "lubridate" [14] for time variables, "gvlma" [40] to carry out regression model checking, and "Metrics" to perform results validation. The data related to XF9, XF10 and Pechino Express, and the corresponding R code, are available from DISIT lab at http://www.disit.org/7002.
5 Conclusions
The paper proposed an approach for creating Twitter-based models and metrics in order to predict the expected audience of television programmes. The proposed solution has been tuned by using reality shows, which are specific kinds of TV shows not addressed in the literature, and which present a high volume of Twitter data due to the high involvement of the audience in the trend of the programme through voting and interacting. The metrics identified have been: volume of tweets and retweets over time; the ratio between the number of retweets and the number of corresponding tweets; the number of users involved in tweeting; natural language processing features extracted from Twitter data; and the sentiment analysis assessment of tweets. These metrics have been computed on the basis of data collected in the previous days and weeks, and they can help predict the TV rating of the prime-time show on the basis of the previously described predictive model. The paper reported full details about the method adopted to achieve the identification of the models and framework, and their validation by using real data. The produced predictive models have been validated and assessed in terms of quality, while highlighting the predictive capabilities for the analysed cases, namely X Factor 9, X Factor 10 and Pechino Express. In all such cases, the predictive capability of the produced models according to the identified metrics has been proved. Moreover, a comparison among four different approaches has been presented: multilinear regression, ridge regression, lasso and elastic net. The ridge approach has been demonstrated to be the better ranked approach. In almost all predictive models, metrics defined as the ratio between the number of retweets and tweets collected for the major hashtags of the events have demonstrated high predictive capability in explaining visitor/audience volumes.
The volume of tweets and the sum of tweets and retweets have also confirmed their predictive capabilities. Other interesting predictors are the number of unique users involved, as well as opinion mining features, such as the natural language processing and sentiment analysis related metrics described earlier. As a result, the resulting models are based on ridge and/or multiregressive approaches for short-term prediction; other models and approaches have been tested without success, as reported in the paper. Most of the metrics based on Twitter data have been computed by the Twitter Vigilance tool and provided directly to the users, while the high-level metrics have been computed for the model. Future work on this topic concerns the identification of other predictive and/or early detection models for different kinds of events, with the aim of producing better results than those proposed in the literature. The specific topics would be: predicting political election results, city comparison for tourism attraction, early detection of disasters, and early detection of new drugs and/or critical situations in the city. On the tool development side, we are working on improving the usability and the flexibility of computing metrics directly in the tool.
Acknowledgements
The Twitter Vigilance service is adopted in smart city projects (such as SiiMobility SCN www.siimobility.org, and the RESOLUTE EC H2020 project http://www.resoluteeu.org), and by institutions such as LAMMA, CNR IBIMET and ARPAT for several different purposes. The authors would like to thank all of them for the great spur received in the context of using and improving the solution and its derived metrics. The authors would also like to thank Simone Menabeni and Alice Cavaliere for their contributions to the project. The authors appreciated the reviewers' comments, which really stimulated the authors to produce and provide a more effective and clear set of results.
References
 1.Achrekar H, Gandhe A, Lazarus R, Yu SH, Liu B (2012) Twitter improves seasonal influenza prediction. HEALTHINF, In, pp 61–70Google Scholar
 2.Akaike H (1987) Factor analysis and AIC. Psychometrika 52(3):317–332MathSciNetCrossRefzbMATHGoogle Scholar
 3.Asur S, Huberman BA (2010) Predicting the future with social media. CoRR abs/1003.5699. http://arxiv.org/abs/1003.5699
 4.Bermingham A, Smeaton A (2011) On using twitter to monitor political sentiment and predict election results. In: Proceedings of the workshop on sentiment analysis where AI meets psychology (SAAIP 2011). Asian Federation of Natural Language Processing, Chiang Mai, pp 2–10Google Scholar
 5.Bollen J, Mao H, Zeng XJ (2011) Twitter mood predicts the stock market. Journal of computational Science 2(1) CoRR abs/1010.3003. http://arxiv.org/abs/1010.3003
 6.Botta F, Moat HS, Preis T (2015) Quantifying crowd size with mobile phone and twitter data. R Soc open sci 2:150162. doi: 10.1098/rsos.150162 MathSciNetCrossRefGoogle Scholar
 7.Broniatowski DA, Dredze M, Paul MJ, Dugas A (2015) Using social media to perform local influenza surveillance in an InnerCity hospital: a retrospective observational study. JMIR Public Health and Surveillance 1(1):e5Google Scholar
 8.Chauhan A, Kummamuru K, Toshniwal D (2016) Prediction of places of visit using tweets. Knowl Inf Syst :1–22Google Scholar
 9.Choi H, Varian H (2009) Predicting the present with google trends. Official Google Research Blog. http://bit.ly/h9RRdW Google Scholar
 10.Esuli A, Sebastiani F (2006) Sentiwordnet: a publicly available lexical resource for opinion mining. In Proc. of the 5th conference on Language Resources and Evaluation (LREC’06). Genova, p 417–422Google Scholar
 11.Everitt B, Hothorn T (2011) An introduction to applied multivariate analysis with R. Springer Science & Business MediaCrossRefzbMATHGoogle Scholar
 12.Fellbaum C (1998) WordNet: an electronic lexical database. MIT Press, CambridgezbMATHGoogle Scholar
 13.Gaikar DD, Marakarkandy B, Dasgupta C (2015) Using twitter data to predict the performance of Bollywood movies. Ind Manag Data Syst 115(9):1604–1621CrossRefGoogle Scholar
 14.Garrett G, Hadley W (2011) Dates and times made easy with lubridate. J Stat Softw 40(3):1–25. URL http://www.jstatsoft.org/v40/i03/
 15.GayoAvello D (2013) A metaanalysis of stateoftheart electoral prediction from twitter data. Soc Sci Comput rev :0894439313493979Google Scholar
 16. Giglietto F (2013) Exploring correlations between TV viewership and Twitter conversations in Italian political talk shows. Available at SSRN 2306512
 17. Grasso V, Zaza I, Zabini F, Pantaleo G, Nesi P, Crisci A (2016) Weather events identification in social media streams: tools to detect their evidence in Twitter. PeerJ Preprints 4:e2241v1. doi: 10.7287/peerj.preprints.2241v1
 18. Grasso V, Crisci A, Nesi P, Pantaleo G, Zaza I, Gozzini B (2016) Public crowdsensing of heatwaves by social media data. In: 16th EMS Annual Meeting & 11th European Conference on Applied Climatology (ECAC), 12–16 September 2016, Trieste, Italy, CE2/AM3, Delivery and communication of impact-based forecasts and risk-based warnings
 19. Gruhl D, Guha R, Kumar R, Novak J, Tomkins A (2005) The predictive power of online chatter. ACM, New York, pp 78–87
 20. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2):301–320
 21. Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1):55–67
 22. Hsieh WT et al (2013) Predicting TV audience rating with social media. In: Sixth International Joint Conference on Natural Language Processing
 23. Hyndman RJ, Khandakar Y (2008) Automatic time series forecasting: the forecast package for R. J Stat Softw 26(3):1–22
 24. Jain V (2013) Prediction of movie success using sentiment analysis of tweets. The International Journal of Soft Computing and Software Engineering 3(3):308–313
 25. Kaiser HF (1960) The application of electronic computers to factor analysis. Educ Psychol Meas
 26. Kupavskii A, Umnov A, Gusev G, Serdyukov P (2013) Predicting the audience size of a tweet. In: ICWSM
 27. Lampos V, Bie TD, Cristianini N (2010) Flu detector – tracking epidemics on Twitter. Machine Learning and Knowledge Discovery in Databases 6323:599–602
 28. Leskovec J (2011) Social media analytics: tracking, modeling and predicting the flow of information through networks. In: Proceedings of the 20th International Conference Companion on World Wide Web. ACM
 29. Lochrie M, Coulton P (2012) Tweeting with the telly on! In: 2012 IEEE Consumer Communications and Networking Conference (CCNC), p 729–731
 30. Lu Y, Kruger R, Thom D, Wang F, Koch S, Ertl T, Maciejewski R (2014) Integrating predictive analytics and social media. In: Visual Analytics Science and Technology (VAST), 2014 IEEE Conference on. IEEE, p 193–202
 31. Madlberger L, Almansour A (2014) Predictions based on Twitter: a critical view on the research process. In: Data and Software Engineering (ICODSE), 2014 International Conference on. IEEE, p 1–6
 32. Mishne G, Glance N (2006) Predicting movie sales from blogger sentiment. In: AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs
 33. Molteni L, Ponce De Leon J (2016) Forecasting with Twitter data: an application to USA TV series audience. International Journal of Design & Nature and Ecodynamics 11(3):220–229
 34. Moreno JJM, Pol AP, Abad AS, Blasco BC (2013) Using the R-MAPE index as a resistant measure of forecast accuracy. Psicothema 25(4):500–506. doi: 10.7334/psicothema2013.23
 35. Nesi P, Pantaleo G, Sanesi G (2015) A Hadoop based platform for natural language processing of web pages and documents. Journal of Visual Languages and Computing (JVLC), Elsevier, 11 Nov 2015. doi: 10.1016/j.jvlc.2015.10.017
 36. Nielsen Media Research (2015) Must see TV: how Twitter activity ahead of fall season premieres could indicate success. Available at http://www.nielsen.com/us/en/insights/news/2015/must-see-tv-how-twitter-activity-ahead-of-fall-season-premieres-could-indicate-success.html
 37. O’Connor B, Balasubramanyan R, Routledge BR, Smith NA (2010) From tweets to polls: linking text sentiment to public opinion time series. In: Proc. of 4th ICWSM. AAAI Press, p 122–129
 38. Oussalah M, Bhat F, Challis K, Schnier T (2013) A software architecture for Twitter collection, search and geolocation services. Knowledge-Based Systems 37:105–120
 39. Paul MJ, Dredze M (2011) You are what you tweet: analysing Twitter for public health. In: Proc. of ICWSM
 40. Pena EA, Slate EH (2014) gvlma: global validation of linear models assumptions. R package version 1.0.0.2. http://CRAN.R-project.org/package=gvlma
 41. Pianta E, Bentivogli L, Girardi C (2002) MultiWordNet: developing an aligned multilingual database. In: Proc. of the First Int. Conf. on Global WordNet, Mysore, India
 42. Reddy ASS, Kasat P, Jain A (2012) Box-office opening prediction of movies based on hype analysis through data mining. Int J Comput Appl 56(1)
 43. Ritterman J, Osborne M, Klein E (2009) Using prediction markets and Twitter to predict a swine flu pandemic. In: 1st International Workshop on Mining Social Media, vol 9. ac.uk/miles/papers/swine09.pdf. Accessed 26 August 2015
 44. Ryan JA, Ulrich JM (2014) xts: eXtensible time series. R package version 0.9-7. http://CRAN.R-project.org/package=xts
 45. Shimshoni Y, Efron N, Matias Y (2009) On the predictability of search trends. http://doiop.com/googletrends
 46. Signorini A, Segre AM, Polgreen PM (2011) The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE 6(5)
 47. Sikdar S, Adali S, Amin M, Abdelzaher T, Chan KL, Cho JH, Kang B, O'Donovan J (2014) Finding true and credible information on Twitter. In: Information Fusion (FUSION), 2014 17th International Conference on. IEEE, p 1–8
 48. Sinha S, Dyer C, Gimpel K, Smith NA (2013) Predicting the NFL using Twitter. arXiv:1310.6998v1 [cs.SI]
 49. Asur S, Huberman BA (2010) Predicting the future with social media. Social Computing Lab, HP Labs, Palo Alto
 50. Sommerdijk B, Sanders E, van den Bosch A. Can tweets predict TV ratings? In: Proc. of the International Conference on Language Resources and Evaluation (LREC)
 51. The Follow-Back: understanding the two-way causal influence between Twitter activity and TV viewership. http://www.nielsen.com/us/en/insights/news/2013/the-follow-back-understanding-the-two-way-causal-influence-betw.html
 52. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol 58(1):267–288
 53. Tumasjan A, Sprenger T, Sandner PG, Welpe IM (2010) Predicting elections with Twitter: what 140 characters reveal about political sentiment. In: Proc. of 4th ICWSM. AAAI Press, p 178–185
 54. Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York. ISBN 0-387-95457-0
 55. Wakamiya S, Lee R, Sumiya K (2011) Towards better TV viewing rates: exploiting crowd's media life logs over Twitter for TV rating. In: Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication. ACM
 56. Wang X, Gerber MS, Brown DE (2012) Automatic crime prediction using events extracted from Twitter posts. In: Social Computing, Behavioural-Cultural Modeling and Prediction. Springer, Berlin Heidelberg, pp 231–238
 57. Zaman T, Fox EB, Bradlow ET (2014) A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics 8(3):1583–1611
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.