Information flow size
Results of the zero-truncated negative binomial model (Table 5) indicated several statistically significant associations between the dependent variable (size) and our predictive factors, with the strength of each association, as indicated by the incidence rate ratio (IRR), varying across factors. The IRR is derived by exponentiating the negative binomial regression coefficients, which allows the results to be interpreted as retweet incidence rates (as opposed to logs of expected retweet counts). We can therefore use the IRR to report the strength of causal associations between certain factors and the information flow size, enabling us to identify quantitatively which factors are more important than others.
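To make the relationship between coefficients and IRRs concrete, the following is a minimal sketch and not the analysis code used for Table 5. It takes a few of the IRRs reported in this section, recovers the implied coefficients by taking logs, and converts each IRR into a percentage change in the expected retweet count per one-unit increase in the predictor. The function names `irr_from_coef` and `percent_change` are illustrative assumptions.

```python
import math


def irr_from_coef(coef: float) -> float:
    """Exponentiate a negative binomial coefficient to obtain the IRR."""
    return math.exp(coef)


def percent_change(irr: float) -> float:
    """Percentage change in the expected count per one-unit increase."""
    return (irr - 1.0) * 100.0


# IRRs as reported in the text; coefficients are recovered by taking logs.
reported_irr = {
    "URL frequency": 0.65,
    "hashtag frequency": 0.85,
    "URL and hashtag co-occurrence": 1.78,
    "positive sentiment": 1.11,
}

for name, irr in reported_irr.items():
    coef = math.log(irr)
    print(
        f"{name}: coef = {coef:+.3f}, IRR = {irr_from_coef(coef):.2f}, "
        f"change = {percent_change(irr):+.0f}%"
    )
```

Read this way, an IRR of 1.78 corresponds to an expected retweet count roughly 78% higher, and an IRR of 0.65 to one roughly 35% lower, holding the other factors constant.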
Holding all other factors constant, number of followers and number of previous tweets were statistically significant. In relation to the former, an increase in number of followers was positively associated with size of information flow (IRR = 1.00, z = 4.72, p < 0.00). Conversely, an increase in the number of previous tweets made by the tweeter was negatively associated with information flow size (IRR = 0.99, z = −3.73, p < 0.00).
Of the variables of interest, the by-the-second time lag association with size varied between positive and negative. Only at time lag 5 did the associations begin to emerge as significant. By retweet 5 it was evident that the time lag was highly significant (IRR = 0.99, z = −5.29, p < 0.00) and negatively associated with size. Of the control variables, day of week and time of day emerged as statistically significant, confirming previous work (Zarrella 2009).
Of the content measures, all but tension emerged as statistically significant. Most significant were the URL and hashtag measures, which confirm the results of related work (Suh et al. 2011). We conducted several analyses using various measures of URL and hashtag presence, including binary (absent or present). We investigated both presence and frequency of URLs and hashtags, and it was evident that the measure of frequency, as opposed to presence, explained more of the variance in the dependent variable and resulted in a better overall model fit. The presence of a URL was negatively associated with size (z = −4.92, p < 0.01), whereas the presence of a hashtag was positively associated (z = 3.25, p < 0.01). Both URL and hashtag frequency were negatively associated with size (URL frequency IRR = 0.65, z = −7.03, p < 0.00; hashtag frequency IRR = 0.85, z = −4.90, p < 0.00). Of novel importance here is the strong positive statistical association between the co-occurrence of URLs and hashtags in a tweet and the information flow size (IRR = 1.78, z = 7.84, p < 0.00). Of all the predictors in the model, this variable explains the most variance in the dependent variable.
Of particular interest to this study, the emotional aspect of sentiment also emerged as statistically significant, with tweets containing positive sentiment more likely to become large information flows (IRR = 1.11, z = 2.99, p < 0.01). The issue attention cycle also plays a role as the number of offline press reports relating to the event published on the day the tweet was made was positively associated with size of information flow (IRR = 1.00, z = 1.79, p < 0.05).
The diagnostics for the full model indicated a good fit to the data (−2 Log-likelihood = −9,434.86, BIC = 19,079.14). Based on the components derived from the PCA, we specified sub-models to determine which set of factors explained most of the variance in the dependent measure of size (Table 6). Social factors explained the largest amount of variance (−2 Log-likelihood = −9,513.774, BIC = 19,151.65), followed by content factors (−2 Log-likelihood = −9,732.987, BIC = 19,605.59) and temporal factors (−2 Log-likelihood = −9,769.379, BIC = 19,670.61).
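The ordering of the sub-models can be reproduced directly from the reported BIC values, since a lower BIC indicates a better trade-off between fit and model complexity. The snippet below is only an illustrative sketch: the dictionary keys are labels chosen here, and the values are copied from the text above.

```python
# BIC values as reported for the size models (labels are illustrative).
bic = {
    "full": 19079.14,
    "social": 19151.65,
    "content": 19605.59,
    "temporal": 19670.61,
}

# Rank models from best (lowest BIC) to worst.
ranked = sorted(bic, key=bic.get)

# Difference in BIC between each model and the best-ranked model.
delta_bic = {name: bic[name] - bic[ranked[0]] for name in ranked}

for name in ranked:
    print(f"{name:<9} BIC = {bic[name]:>9.2f}  delta = {delta_bic[name]:>6.2f}")
```

The gap between the social and content sub-models is much larger than the gap between the content and temporal sub-models, which matches the ordering described in the text.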
Information flow survival
Results of the Cox proportional hazards model (Table 7) also indicated several statistically significant associations between the dependent variable (survival) and our predictive factors. Because the model is focused on explaining proportional hazards, a positive estimate is interpreted as increasing hazards to survival, and therefore as reducing the survival of the information flow. We can therefore use the estimate \(\beta\) to report the strength of causal associations between certain factors and the information flow survival, enabling us to identify quantitatively which factors are more important than others.
Holding all other factors constant, only the number of tweets previously posted by the author of a new tweet was statistically significant in predicting hazards to survival. Tweet count was found to be positively associated with hazards to survival (\(\beta\) = 0.00, Wald = 33.89, p < 0.01), indicating that more previous tweets reduce the survival of the information flow.
Of the variables of interest, the by-the-second time lag association with survival follows an inverse pattern to that exhibited in the size model. The earlier stage time lags are far less significant (and in most cases not significant) than time lag 5, which emerges as highly significant and negatively associated with hazards (\(\beta\) = −0.00, Wald = 15.181, p < 0.01). Thus, the results suggest information flows following this event will survive longer where the number of seconds between the original tweet and the 5th retweet is higher, which is intuitive in a time-series model. Of the control variables, time of day emerged as statistically significant for predicting survival; the model suggests there is a higher likelihood of hazards to survival at times of the day when tweet traffic is known to be highest (Zarrella 2009).
As with the size model, we conducted analyses with both presence and frequency of URLs and hashtags, the latter being reported here because it explained more variance in the dependent variable. Both measures were negatively associated with hazards. URL frequency emerged as negatively associated with hazards and highly significant (\(\beta\) = −0.22, Wald = 12.76, p < 0.01), suggesting that including more URLs in a tweet leads to longer survival. The co-occurrence of a URL and a hashtag in a tweet was also negatively associated with hazards and statistically significant (\(\beta\) = −0.15, Wald = 4.46, p < 0.05).
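Because a Cox coefficient is a log hazard ratio, exponentiating \(\beta\) gives the multiplicative change in the hazard per one-unit increase in the predictor. The sketch below applies this to two estimates reported above. It is an illustration of the interpretation only, and the function names `hazard_ratio` and `hazard_change_percent` are assumptions introduced here.

```python
import math


def hazard_ratio(beta: float) -> float:
    """Convert a Cox regression coefficient into a hazard ratio."""
    return math.exp(beta)


def hazard_change_percent(beta: float) -> float:
    """Percentage change in the hazard per one-unit increase."""
    return (hazard_ratio(beta) - 1.0) * 100.0


# Coefficients as reported in the text.
reported_beta = {
    "URL frequency": -0.22,
    "URL and hashtag co-occurrence": -0.15,
}

for name, beta in reported_beta.items():
    print(
        f"{name}: beta = {beta:+.2f}, HR = {hazard_ratio(beta):.2f}, "
        f"hazard change = {hazard_change_percent(beta):+.1f}%"
    )
```

A negative \(\beta\) therefore yields a hazard ratio below 1, that is, a lower hazard and longer expected survival of the information flow.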
Both emotive content features, sentiment and tension, were highly statistically significant. Positive sentiment within a tweet was negatively associated with hazards (\(\beta\) = −0.08, Wald = 6.11, p < 0.01) and high levels of tension were positively associated (\(\beta\) = −0.10, Wald = 5.44, p < 0.01). This suggests that tweets containing positive sentiment and a lack of tension in relation to the event produce longer-lived information flows. Unlike in the size model, hashtag frequency and offline press reports were not statistically significant. Using Kaplan–Meier estimation, Figs. 3, 4, and 5 illustrate the comparative survival rates of tweets for all significant content measures. Figure 3 shows that tweets containing negative and neutral sentiment have low survival rates compared to those containing positive sentiment. Similarly, Fig. 4 shows that tweets containing high levels of racist tension die out more quickly than tweets with low levels of tension. Finally, Fig. 5 shows that the co-occurrence of URLs with hashtags (i.e. features that link the text to other text) also extends the life of tweets following such events.
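For readers unfamiliar with Kaplan–Meier estimation, the following is a minimal pure-Python sketch of the product-limit estimator that underlies survival curves such as those in Figs. 3, 4, and 5. It is not the code used to produce the figures, and the durations and censoring flags are hypothetical toy values rather than data from this study.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one event.

    durations: time until the flow ended or was last seen (any unit).
    observed:  1 if the end of the flow was observed, 0 if censored.
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # events and censorings leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the share surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: five information flows, two of them censored.
toy_durations = [2, 3, 3, 5, 8]
toy_observed = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(toy_durations, toy_observed):
    print(f"t = {t}: S(t) = {s:.2f}")
```

Comparing groups, as in the figures, amounts to running the estimator separately on each group (for example, positive versus negative or neutral sentiment) and plotting the resulting step functions together.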
The diagnostics for the full model indicated a good fit to the data (LRT = 225.4, \(R^2\) = 0.92). We repeated the sub-model analysis (Table 8) and determined that content factors explained the largest amount of variance (LRT = 122.3, \(R^2\) = 0.051), followed by temporal factors (LRT = 103.9, \(R^2\) = 0.044) and social factors (LRT = 86.36, \(R^2\) = 0.036).