The role of hidden influentials in the diffusion of online information cascades
- Cite this article as:
- Baños, R.A., Borge-Holthoefer, J. & Moreno, Y. EPJ Data Sci. (2013) 2: 6. doi:10.1140/epjds18
In a diversified context with multiple social networking sites, heterogeneous activity patterns and different user-user relations, the concept of ‘information cascade’ is anything but univocal. Although such information cascades can be defined in different ways, it is important to check whether some of the observed patterns are common to the diverse contagion processes that take place on modern social media. Here, we explore one type of information cascade, namely, those that are time-constrained, related to two kinds of socially-rooted topics on Twitter. Specifically, we show that in both cases cascade sizes follow a fat-tailed distribution and that whether or not a cascade reaches system-wide proportions is mainly determined by the presence of so-called hidden influentials. These latter nodes are not the hubs, which, on the contrary, often act as firewalls for information spreading. Our results contribute to a better understanding of the dynamics of complex contagion and, from a practical side, to the identification of efficient spreaders in viral phenomena.
Population-wide information cascades are rare events, initially triggered by a single seed or a small number of initiators, in which rumors, fads or political positions are adopted by a large fraction of an informed community. In recent years, some theoretical approaches have explored the topological conditions under which system-wide avalanches are possible [1, 2, 3, 4, 5], whereas others have proposed threshold, rumor- or epidemic-like dynamics to model such phenomena. Beyond these efforts, digitally-mediated communication in the era of the Web 2.0 has enabled researchers to peek into actual information cascades arising in a variety of platforms - blogs and Online Social Networks (OSNs) mainly, but not exclusively [9, 10].
Notably, these latter empirical works deal with a wide variety of situations. First, the online platforms under analysis are not the same. Indeed, we find research focused on distinct social networks such as Facebook, Twitter [12, 13], Flickr, Digg or the blogosphere [8, 16, 17] - which build in several types of user-user interactions to satisfy the need for different levels of engagement between users. As a consequence, although scholars make use of a mostly common terminology (‘seed’, ‘diffusion tree’, ‘adopter’, etc.) and most analyses are based on similar descriptors (size distributions, identification of influential nodes, etc.), their operationalization of a cascade - i.e., how a cascade is defined - varies widely. This is perfectly coherent, because how information flows differs from one context to another. Furthermore, even within the same OSN different definitions may be found (compare for instance  and ). Such a myriad of possibilities is not necessarily controversial: it merely reflects a rich, complex phenomenology. And yet it places weighty constraints when it comes to generalizing certain results. The study of information cascades easily evokes that of influence diffusion patterns, which in turn has obvious practical relevance in terms of enhancing the reach of a message (i.e. marketing) or for prevention and preparedness. In these applications a unique definition would be highly desirable, as proposed in classical communication theory. On the other hand, the profusion of descriptions and the plurality of collective attention patterns hinder further work aimed at confirming, extending and seeking commonalities among previous findings.
In this work we capitalize on a type of cascade definition which pivots on time constraints rather than ‘content chains’. Despite the aforementioned heterogeneity, all but one of the empirical works on cascades revolve exclusively around information forwarding: the basic criterion to include a node i in a diffusion tree is to guarantee that (a) the node i sends out a piece of information at time $t_2$; (b) such piece of information was received from a friend j who had previously sent it out, at time $t_1$; and finally (c) i and j became friends at $t_0$, before i received the piece of information (the notion of ‘friend’ changes from OSN to OSN, and must be understood broadly here). Note that no strict time restriction exists besides the fact that $t_0 < t_1 < t_2$; the emphasis is placed on whether the same content is flowing. This work instead turns to topic-specific data in which it is safely assumed that content is similar, and the inclusion in a cascade depends not on the retransmission of a message but rather on the engagement in a ‘conversation’ about a matter.
Beyond our conceptualization of a cascade, this work seeks first to test the robustness of previous findings in different social contexts [20, 21], and then moves on towards a better understanding of how deep and how fast cascades grow. The former implies reproducing some general outcomes regarding cascade size distributions, and how such cascades scale as a function of the initial node’s position in the network. The latter aims at digging into cascades, to obtain information about their hidden temporal and topological patterns. This effort includes questions such as the duration and depth of cascades, or the relation between community structure and a cascade’s outreach. Our methodology allows us to show the existence of an elusive class of reputed nodes, which we identify as ‘hidden influentials’ following earlier work, and which have a major role when it comes to spawning system-wide phenomena.
Filtered hashtags and keywords
We present the results for two of these subsets. One sample consists of 1,188,946 tweets and is related to the Spanish grassroots movement popularly known as ‘15M’, after the events of the 15th of May, 2011. This movement has, however, endured over time, and in this work we will refer to it as grassroots. Messages were generated by 115,459 unique users. It is worth stressing that some hashtags that might appear to be disconnected from the 15M movement were included either for technical or for sociological reasons. For instance, ‘anonymous’ spontaneously arises from a previous ‘15M’ dataset, which comprised messages exchanged from the 25th of April to the 26th of May, 2011. During the gathering of the data used in this work, this hashtag appeared with a relatively high frequency (313 filtered tweets during the period under consideration) and therefore it was included in the filtering of messages. As far as ‘occupy’ is concerned, the movement at the origin of the hashtag (the Occupy Wall Street Movement) began long after the 15M grassroots movement appeared. However, one can find a clear correlation between both movements, suggesting that 15M users were also involved in ‘occupy’. Indeed, it is well documented that the original call for mobilizations around Occupy Wall Street was inspired by both the Egyptian uprising and the Spanish ‘indignados’ [23, 24].
The second dataset includes 606,645 filtered tweets that refer to the topic ‘Spanish elections’, which were held in the third week of November, 2011. This sample was generated by 84,386 unique users.
Using the Twitter API we queried for the list of followers of each user, discarding those who did not show outgoing activity during the period under consideration. In this way, for each dataset, we obtain an unweighted directed network in which each node represents an active user (regarding a particular topic). A link from user i to user j is established if j follows i. Therefore, the out-degree represents the number of followers a node has, whereas the in-degree stands for its number of friends, i.e., the number of users it follows. The link direction reflects the fact that a tweet posted by i is (instantaneously) received by j, indicating the direction in which information flows. Although the set of links may vary on the scale of months, we take the network structure as completely static, considering the topology at the moment of the scrape.
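As an illustration only, the link convention described above (a link from i to j whenever j follows i) can be sketched in a few lines of Python. The function name `build_follower_network`, the input format and the toy user names are assumptions made for this example; they are not part of the original data pipeline.

```python
from collections import defaultdict

def build_follower_network(follow_pairs, active_users):
    """Build the directed graph in which a link i -> j means 'j follows i'.

    follow_pairs : iterable of (follower, followed) tuples (hypothetical format).
    active_users : users who posted during the period; all others are discarded.
    Returns (followers, out_degree, in_degree).
    """
    active = set(active_users)
    followers = defaultdict(set)  # out-neighbours: users who receive i's tweets
    friends = defaultdict(set)    # in-neighbours: users that i follows
    for follower, followed in follow_pairs:
        # Keep a link only if both endpoints are active users.
        if follower in active and followed in active and follower != followed:
            followers[followed].add(follower)
            friends[follower].add(followed)
    out_degree = {u: len(followers[u]) for u in active}  # number of followers
    in_degree = {u: len(friends[u]) for u in active}     # number of friends
    return dict(followers), out_degree, in_degree

# Toy usage: 'dave' never posted, so he is dropped from the network.
pairs = [("bob", "alice"), ("carol", "alice"), ("alice", "bob"), ("dave", "alice")]
followers, out_degree, in_degree = build_follower_network(
    pairs, {"alice", "bob", "carol"}
)
```

Storing the follower sets as out-neighbours mirrors the direction in which a tweet travels, which is convenient when diffusion trees are grown from a seed.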
3.1 Time-constrained information cascades
Network properties summary
We apply the latter definition [20, 21] to explore the occurrence of listener cascades in the ‘grassroots’ and ‘elections’ data. In practice, we take a seed message posted by s at a given time and include all of s’s followers in the diffusion tree hanging from s. We then check whether any of these listeners showed some activity within the time window that follows, which increases the depth of the tree. This is done recursively; the tree’s growth ends when no other follower shows activity. Passive listeners constitute the set of leaves in the tree. In our scheme, a node can only belong to one cascade (but could participate in it multiple times); this restriction may introduce measurement biases. Namely, two nodes sharing a follower may show simultaneous activity, but their follower can only be counted in one or the other cascade (with possible consequences regarding cascade size distributions or depth in the diffusion tree). To minimize this degeneracy, we perform calculations for many possible cascade configurations, randomizing the order in which we process the data.
In the next sections we report some results for the aforementioned data subsets (‘grassroots’, ‘elections’) considering their whole time span (over one year). Our results have been obtained for a time window of 24 hours. Previous works [20, 21] showed the robustness of cascade statistics across a range of window widths; also, a 24-hour window may be regarded as an inclusive bound of the popularity of a piece of information over time on different OSNs, including Twitter [8, 11, 14, 15]. Finally, the chosen window excludes possible correlations due to the effect of circadian activity in human online behavior or the time differences due to individuals belonging to different geographical areas.
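The recursive construction of a listener cascade can be sketched as a breadth-first traversal. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function name `build_cascade`, the data layout (timestamps in hours) and the rule that a listener becomes a spreader if it posts at any moment inside the window following the message it received are all choices made for this example.

```python
from collections import deque

def build_cascade(seed, t0, followers, activity, window=24.0, taken=None):
    """Grow one time-constrained cascade from a seed message.

    seed, t0  : user who posts the seed message and its timestamp (hours).
    followers : dict user -> set of followers.
    activity  : dict user -> list of posting times (hours).
    window    : width of the time window (24 hours in the text).
    taken     : set of users already assigned to some cascade; it is updated
                in place so that a node belongs to one cascade only.
    Returns a dict user -> depth in the diffusion tree.
    """
    taken = set() if taken is None else taken
    depth = {seed: 0}
    taken.add(seed)
    queue = deque([(seed, t0)])
    while queue:
        user, t = queue.popleft()
        for listener in followers.get(user, ()):
            if listener in taken:
                continue  # already counted in this or in another cascade
            taken.add(listener)
            depth[listener] = depth[user] + 1
            # Assumption: posting inside (t, t + window] turns a listener
            # into a spreader; otherwise it stays a passive leaf.
            times = [s for s in activity.get(listener, ()) if t < s <= t + window]
            if times:
                queue.append((listener, min(times)))
    return depth

# Toy usage: 'a' replies within the window, 'b' does not.
followers = {"s": {"a", "b"}, "a": {"c"}, "b": {"d"}}
activity = {"s": [0.0], "a": [5.0], "b": [30.0], "c": [100.0]}
tree = build_cascade("s", 0.0, followers, activity)
```

Passing the same `taken` set to successive calls, with the seed messages processed in a shuffled order, is one way to emulate the randomization over cascade configurations mentioned above.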
3.2 Community analyses
The identification of modules in complex networks has attracted much attention from the scientific community in recent years, and social networks constitute a prominent example. A modular view of a network offers a coarse-grained perspective in which nodes are classified in subsets on the basis of their topological position and, in particular, the density of connections between and within groups. In OSNs, this classification usually overlaps with node attribute data, like gender, geographical proximity or ideology [31, 32].
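The quality of a given partition is commonly quantified through its modularity,

$$Q = \frac{1}{2L} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2L} \right) \delta(C_i, C_j),$$

where $L$ is the number of links in the network; $A_{ij}$ is 1 if there is a link from node i to j and 0 otherwise; $k_i$ is the connectivity (degree) of node i; and finally the Kronecker delta function $\delta(C_i, C_j)$ equals 1 if nodes i and j are classified in the same community and 0 otherwise. Summarizing, Q quantifies how far a certain partition is from a random counterpart (null model).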
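For a small example, Q can be evaluated directly. The sketch below assumes the standard Newman-Girvan modularity of an undirected, unweighted graph without self-loops, regrouped by community as the fraction of internal links minus the squared fraction of link ends in each community. The function name and the toy graph are illustrative assumptions.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of an undirected simple graph.

    edges     : list of (u, v) pairs, each undirected link listed once.
    community : dict node -> community label.
    """
    n_links = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    internal = {}    # links with both ends inside community c
    degree_sum = {}  # sum of the degrees of the nodes in community c
    for u, v in edges:
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    for node, k in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    return sum(
        internal.get(c, 0) / n_links - (degree_sum[c] / (2 * n_links)) ** 2
        for c in degree_sum
    )

# Two triangles joined by a single bridge, split into their natural modules.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(edges, partition)  # equals 5/14 for this toy graph
```

Placing all six nodes in a single community gives Q = 0, which is the sense in which Q measures the distance from the null model.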
From the definition of Q, algorithms and heuristics to optimize modularity have appeared ever faster and with an increased degree of accuracy. All these efforts have led to considerable success regarding the quality of detected community structure in networks, and thus a more complete topological knowledge at this level has been attained. In this work we present results for communities detected using the Walktrap method, in which a fair balance between accuracy and efficiency is sought. The algorithm exploits random walk dynamics. The basic idea is that a random walker tends to get trapped in densely connected parts of the graph, which correspond to communities. Pons and Latapy’s proposal is particularly efficient because, as Q is increasingly optimized, vertices are merged into a coarse-grained structure, reducing the computational cost of the dynamics. The resulting clusters at each stage of the algorithm are aggregated, and the process is repeated iteratively. Although results in the following section refer to a partition extracted through Walktrap, other methods (Louvain and Infomap) have been tested with similar results.
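The intuition that short random walks get trapped inside communities can be checked on a toy graph. The sketch below computes the node-to-node distance on which Walktrap is based, $r_{ij} = \sqrt{\sum_k (P^t_{ik} - P^t_{jk})^2 / d(k)}$, where $P^t_{ik}$ is the probability of reaching k from i in t steps and d(k) is the degree of k. It illustrates the distance only, not the full agglomerative algorithm; the function names, the walk length t = 3 and the toy graph are assumptions of this example.

```python
import math

def walk_distribution(adj, start, t):
    """Probability of finding a random walker at each node after t steps."""
    prob = {node: 0.0 for node in adj}
    prob[start] = 1.0
    for _ in range(t):
        nxt = {node: 0.0 for node in adj}
        for node, p in prob.items():
            if p:
                share = p / len(adj[node])  # uniform choice among neighbours
                for nb in adj[node]:
                    nxt[nb] += share
        prob = nxt
    return prob

def walktrap_distance(adj, i, j, t=3):
    """Degree-weighted distance between the t-step walks started at i and j.

    adj must describe an undirected graph without isolated nodes.
    """
    pi = walk_distribution(adj, i, t)
    pj = walk_distribution(adj, j, t)
    return math.sqrt(sum((pi[k] - pj[k]) ** 2 / len(adj[k]) for k in adj))

# Two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
same_module = walktrap_distance(adj, 0, 1)
other_module = walktrap_distance(adj, 0, 5)
```

Nodes 0 and 1 share a triangle and end up much closer than nodes 0 and 5, which sit in different triangles; merging the closest groups first is what drives the agglomeration.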
A community analysis is useful because it provides a deeper understanding of the position of a node at an intermediate (i.e., mesoscale) topological level. In terms of information diffusion - and much like in previous work - we explore whether community structure (and in particular, the relation of a seed node with the module it belongs to) has an impact on the success of a cascade. To do so we adopt the node descriptors proposed by Guimerà et al.: the z-score of the internal degree of each node in its module, and the participation coefficient of a node i, which describes how the node is positioned in its own module and with respect to other modules.
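If $\kappa_i$ is the number of links of node i to other nodes in its own module $C_i$, $\bar{\kappa}_{C_i}$ is the average of $\kappa$ over all the nodes in $C_i$, and $\sigma_{\kappa_{C_i}}$ is the corresponding standard deviation, then

$$z_i = \frac{\kappa_i - \bar{\kappa}_{C_i}}{\sigma_{\kappa_{C_i}}}$$

is the so-called z-score. The participation coefficient of node i is in turn

$$P_i = 1 - \sum_{C=1}^{N_M} \left( \frac{\kappa_{iC}}{k_i} \right)^2,$$

where $\kappa_{iC}$ is the number of links of node i to nodes in module C, and $k_i$ is the total degree of node i. Note that the participation coefficient has a maximum at $P_i = 1 - 1/N_M$, when the links of i are uniformly distributed among all the modules ($N_M$), while it is 0 if all its links belong to its own module. Nodes that deviate largely from the average internal connectivity (large $z_i$) are local hubs, whereas large values of $P_i$ stand for connector nodes bridging different modules together.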
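Both descriptors are straightforward to compute once a partition is available. The following sketch assumes the usual definitions: the z-score standardizes a node's internal degree within its own module, and the participation coefficient is one minus the sum, over modules, of the squared fraction of the node's links that fall in each module. The use of the population standard deviation, the convention z = 0 when a module has no spread, the undirected toy graph and the function name are assumptions of this example.

```python
from statistics import mean, pstdev

def module_roles(adj, community):
    """Return {node: (z_score, participation)} for an undirected graph.

    adj       : dict node -> set of neighbours.
    community : dict node -> module label.
    """
    # links[i][c]: number of links of node i to nodes in module c
    links = {i: {} for i in adj}
    for i, neighbours in adj.items():
        for j in neighbours:
            c = community[j]
            links[i][c] = links[i].get(c, 0) + 1
    # Internal degrees grouped by module, for the mean and the deviation.
    internal = {}
    for i in adj:
        internal.setdefault(community[i], []).append(links[i].get(community[i], 0))
    stats = {c: (mean(v), pstdev(v)) for c, v in internal.items()}
    roles = {}
    for i in adj:
        mu, sigma = stats[community[i]]
        k_internal = links[i].get(community[i], 0)
        z = (k_internal - mu) / sigma if sigma > 0 else 0.0
        k = len(adj[i])
        p = 1.0 - sum((n / k) ** 2 for n in links[i].values()) if k else 0.0
        roles[i] = (z, p)
    return roles

# A star (module A) attached through node 3 to a pair of nodes (module B).
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3, 5}, 5: {4}}
community = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
roles = module_roles(adj, community)
```

In this toy graph node 0 is a local hub (high z, zero participation), whereas node 3 splits its links evenly between the two modules and reaches the maximum participation for two modules, 0.5.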
4.1 Cascade size distributions
4.2 Cascades’ temporal and topological penetration
Temporal patterns, as given by the lifetime Δt of a cascade, follow a similar trend: most cascades die out after 24 hours, which closely resembles previously reported results. However, in Figure 3 (upper panels) we observe a richer distribution (compared to topological penetration Δr) such that cascades may last over 100 days, suggesting that the survival of a conversation does not exhibit an obvious pattern. Again, this result confirms - from a different point of view - empirical results published elsewhere [11, 19]. Finally, the duration of cascades takes into account the fact that a node may participate multiple times in a single cascade - although it is counted just once. This is implicit in the definition of a time-constrained cascade, which encompasses self-sustained activity. In any case, Figure 3 (lower panels) illustrates that survival cannot guarantee system-wide cascades, although an increasing pattern is observed as survival time grows.
4.3 Identification and role of hidden influentials
Up to now we have related a cascade’s size to certain features of the seed node. Although we observe a clear positively correlated pattern (the larger the seed’s descriptor, the larger the resulting cascade), one might fairly argue that in a wide range of values below the maximum, a similar outcome is obtained. So, for instance, seeds in the range (Figure 2) can sometimes trigger large cascades; the same can be said for . This finding prompts us to hypothesize that the success of an activity cascade might greatly depend on the characteristics of intermediate spreaders, and not only on the properties of the seed nodes. That being so, a large seed (in terms of its follower set) may be a sufficient but not a necessary condition for the generation of large-scale cascades. In this section we explore how some topological features of the train of spreaders involved in a cascade affect its final size.
To quantify this effect, we introduce the multiplicative number of a given node i, Δl (in analogy with the basic reproductive number in disease spreading), which is the quotient of the number of listeners reached one time step after i showed activity and the number of i’s nearest listeners, i.e., those who instantaneously received its message (which is given by the number of followers of i that are involved in the cascade). Thus, the ratio Δl measures the multiplicative capacity of a node: a value greater than 1 indicates that a user has been able to increase the number of listeners who received the message beyond its immediate followers.
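A possible way to evaluate this ratio for a single node is sketched below. The text leaves some details open, so the example makes explicit assumptions: the listeners "reached one time step after" are taken to be the followers of those nearest listeners who were themselves active in the next time step, and only users not already counted among the nearest listeners are included. The function name and arguments are hypothetical.

```python
def multiplicative_number(i, followers, cascade, active_next):
    """Ratio of second-step listeners to nearest listeners for node i.

    followers   : dict user -> set of followers.
    cascade     : set of users involved in the cascade.
    active_next : users who showed activity one time step after i's message.
    Returns None when i has no followers inside the cascade.
    """
    nearest = followers.get(i, set()) & cascade
    if not nearest:
        return None
    reached = set()
    for j in nearest & active_next:
        reached |= followers.get(j, set()) & cascade
    # Assumption: count only users newly reached at the second step.
    reached -= nearest | {i}
    return len(reached) / len(nearest)

# Toy usage: 'a' relays the conversation to three new listeners.
followers = {"i": {"a", "b"}, "a": {"b", "c", "d", "e"}, "b": {"f"}}
cascade = {"i", "a", "b", "c", "d", "e", "f"}
ratio = multiplicative_number("i", followers, cascade, active_next={"a"})
```

Here the two nearest listeners give way to three new ones, so the ratio is 1.5 and node i would count as a multiplier under these assumptions.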
4.4 The role of community structure in information diffusion
It is generally accepted that cohesive sub-structures play an important role in the functioning of complex systems, because topologically dense clusters impose restrictions on dynamical processes running on top of the structure [45, 46]. For example, in the context of OSNs, detected communities in @mention Twitter networks were found to encode both geographical and political information, suggesting that a large fraction of interactions take place locally, but many of them also correspond to global modules - for instance, users rely on mass media accounts to amplify their opinion. Focusing on information diffusion, inter- and intra-modular connections in OSNs have already been explored regarding the nature of user-user ties. We instead investigate other questions, such as: (i) are modules actual bottlenecks for information diffusion? (ii) is the spreading of information more successful for ‘kinless’ nodes (those who have links in many communities besides their own)? or (iii) do local hubs - those with larger-than-expected intra-modular connectivity - have higher chances of triggering system-wide cascades?
In just one decade social networking sites have revolutionized human communication routines, laying solid foundations for the advent of the Web 2.0. Academia has not ignored this eruption: some researchers foresaw a myriad of applications ranging from e-commerce to cooperative platforms, while others soon realized that OSNs could represent a unique opportunity to bring large-scale empirical evidence to open sociological problems. Information cascades fall somewhere in between, attracting the interest of both viral marketing experts - who worry about optimal outreach and costs - and collective social action and political scientists - concerned about grassroots movements, opinion contagion, etc.
However, the diversity of OSNs - which constrains the format and the way information flows between users - and the complexity of human communication patterns - heterogeneous activity, different classes of collective attention - have resulted in a multiplicity of empirical approaches to cascading phenomena - let alone theoretical works. While all of them highlight different interesting aspects of information dissemination, little has been done to confirm results by testing their robustness across different social platforms and social contexts.
In this regard, the present work capitalizes on previous research to collect, in new large datasets, the statistics of time-constrained information cascades. Message chains are reconstructed assuming that conversation-like activity is contagious if it takes place in relatively short time windows. The main preceding observed trends are here reproduced successfully. Furthermore, we extend the study to uncover other internal facets of these cascades. First, we have discussed how long in time and how deep in the topology cascades go, to realize that, as in neuronal activity, time-constrained cascades can exhibit self-sustained activity. We have then paid attention not only to the nodes that trigger a cascade, but also to those that actively participate in and sustain a cascade beyond its onset.
Our main results point at two counterintuitive facts, by which hubs can short-circuit information pathways and close-to-average users - hidden influentials - fuel system-wide events. We have found that for a cascade to be successful in terms of the number of users involved in it, key nodes should be engaged. These nodes are not the hubs, which more often than not behave as firewalls, but belong to a middle class that either has a high multiplicative capacity or bridges the modules that make up the system. Presumably, modular topologies - abundant in the real world - entail the presence of information bottlenecks (poor inter-modular connectivity) which place constraints on efficient diffusion dynamics. Indeed, we find that medium-sized and small cascades (the most frequent ones) happen mainly within the community where a cascade originated. Furthermore, those seed nodes which happen to be poorly classified (they participate in many modules besides their own) are more successful at triggering large cascades.
A better understanding of time-constrained cascading behavior in complex systems leads to new questions. First, it seems clear that the bulk of theoretical work devoted to information spreading is not meant to model these conversation-type dynamics - it is rather focused on rumor and epidemic models. Other approaches need to be sought to fill this gap. Also, time-constrained cascades have so far been studied only in the context of political discussion and mobilization. As such, this is a fairly limited view of what happens in a service with (as of late 2012) over 200 million active users. Results like the ones obtained here will nonetheless provide new hints for a better understanding of social phenomena that are mediated by new communication platforms and for the development of novel manmade algorithms for effective and costless dissemination (viral) dynamics.
We thank A Rivero for helping us to collect and process the data used in this paper. We are also indebted to S González-Bailón and JP Gleeson for their useful comments on the manuscript. This work has been partially supported by MINECO through Grant FIS2011-25167; Comunidad de Aragón (Spain) through a grant to the group FENOL and by the EC FET-Proactive Project PLEXMATH (grant 317614). RAB acknowledges support from the FPI program of the Government of Aragón, Spain.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.