The emergence of roles in large-scale networks of communication
- First Online:
- Cite this article as:
- González-Bailón, S., Wang, N. & Borge-Holthoefer, J. EPJ Data Sci. (2014) 3: 32. doi:10.1140/epjds/s13688-014-0032-y
- 3.2k Downloads
Communication through social media mediates coordination and information diffusion across a range of social settings. However, online networks are large and complex, and their analysis requires new methods to summarize their structure and identify nodes holding relevant positions. We propose a method that generalizes the sociological theory of brokerage, originally devised on the basis of local transitivity and paths of length two, to make it applicable to larger, more complex structures. Our method makes use of the modular structure of networks to define brokerage at the local and global levels. We test the method with two different data sets. The findings show that our approach is better at capturing role differences than alternative approaches that only consider local or global network features.
Keywordsmodularity bridges structural holes structural similarity online networks
Networks of inter-personal communication have grown larger and more complex with the emergence of digital technologies . Methods that summarize this intricacy help identify the building blocks that explain network dynamics like diffusion. Much research has been conducted in the last two decades to assess how the structure of networks correlates with their dynamics. Several methods have been developed to reduce the complexity of networks through the identification of their characteristic features, which involves determining the differences between observed and expected patterns under a null model of random connections. Reduction techniques like community detection or backbone extraction belong to this tradition; both techniques use the distribution of edges as the basis from where to identify the inner structure of a network , .
Blockmodels and structural equivalence offer an alternative approach, shifting attention from the edges to the nodes and to how similar they are as assessed by their connections to other nodes in the network –. The notion of structural equivalence has a long history in the analysis of social networks and the definition of roles, but it has recently been extended to make it less strict in its technical definition, more scalable, and applicable to other complex (non-social) networks . Devising methods for role identification is important for network science because roles offer a scheme for network reduction and the construction of simplified maps of the original structure; but also, and significantly, because they offer a criterion to group nodes in categories that might be associated with similar behavior. We follow this prior research and propose a new method for the identification of roles in large networks.
The main assumption of our work is that roles respond to a division of labor that reflects different functions, or behavior, within the network. Detecting structurally similar positions depends on the network features emphasized by the method. Our goal is to compare different methods and assess which structural features are more successful at identifying relevant nodes. We use two different empirical data sets for that endeavor: the first, collected from Twitter, tracks patterns of communication around a political protest organized in May of 2012. The second, smaller and collected manually, is the classic (and renowned) Zachary’s Karate club network , which we use mostly as a robustness test for our method. We analyze these data with two aims: to determine whether users that shared similar network positions (according to alternative methods) behaved similarly in the exchange of information; and to identify the roles that were more significant in that exchange.
On a theoretical level, we want to add a mesoscopic dimension to classic theories of brokerage in social networks, which have traditionally focused on local information flows –. As the proponents of this previous work, we sustain that roles are theoretically important because they are independent of the specific network under analysis: roles help us draw a typology of actors that transcends the composition of a given network. Unlike that previous research, however, we also contend that large networks require identifying roles on a coarse-grained level of analysis that takes the entire network structure into account; the relevant source of actor heterogeneity is otherwise lost in the detail of local positions. This is particularly important in the study of collective phenomena like protest coordination via social media: these communication networks tend to be large and complex, with distinct local affiliations, a small core of highly active actors, and a large periphery of lowly committed users –. The ability to broker information in such networks does not arise from local connectivity alone.
The rest of the paper proceeds as follows. First, we revise previous work on roles in social networks and defend the need to follow a more mesoscopic approach to role identification. Then we discuss our data and methods, emphasizing that we can generalize our analytical strategy to other data sets and case studies. The presentation of findings considers whether actors that are classified as structurally similar exhibit also similar behavior. We compare three approaches to the identification of roles, and we evaluate their performance when it comes to capturing differences in communication activity. The findings show that our method, which defines structural similarity in terms of local and global brokerage, is better able to capture differences in communication dynamics. A summary and discussion of the main findings closes the paper.
2 Roles in social networks
Sociologists have long characterized individual actors by the roles they play in social systems . Roles refer to a set of expectations and patterns of behavior associated to positions in a social structure. The network translation of this idea usually relies on the notion of structural equivalence and its looser version, structural similarity: actors playing similar roles will have similar patterns of connections with other actors in a network , –. Two structurally similar actors have comparable ties to each other and to the other actors in the network. Roles offer an abstraction of these similarities: they provide categories that reduce networks to comparable building blocks.
Traditionally, research on social networks has assumed that network structure mirrors preexisting roles: being a professor at a university or a director in a company translates into having a set of similar ties with the other actors in the network (i.e. students and faculty, or other directors). Networks reflect, in this sense, formal roles that are defined within the parameters of shared institutions and mutual expectations. But roles can also emerge spontaneously from interactions in networks: some actors are able to play a differential role because of their structural position, for instance as when brokers emerge in informal networks of communication , . According to this idea, ties do not reflect preexisting roles; instead, they allow actors to play certain functions - for instance, broker information. Actors playing a brokering role share a common trait: they span structural holes in a network by building bridges across sub-graphs that would be unconnected in their absence. This position gives them the possibility to control information flows, a role they share with other brokers in the network.
Although the idea of structural similarity requires taking into account connections at the level of the overall network, most network measures of roles (and brokerage in particular) still rely on local patterns. The identification of brokers in social networks is based either on structural constraint , which roughly measures transitivity in personal networks; or on the betweenness of actors in paths of length two , that is, on being the vertex in an open triad through which the two unconnected nodes interact. In addition to using only local information, these definitions of social roles are purely structural - that is, they capture the potential for brokerage given the absence of alternative routes through which information can flow, but they do not assess the extent to which information actually flows , . In other words, roles are defined in terms of positions in a structure, and they are rarely validated in terms of behavior or action.
In larger networks, using paths of length two as the main criterion to identify structurally similar positions does not offer much space for discrimination - these local configurations abound and most nodes will be part of many of them. This is one of the reasons why a more scalable criterion to identify roles in complex networks was proposed recently by Guimerà and Amaral (GA from now on, ). As with the GF typology, the GA method requires a partition of the network in groups, often drawn from the topology itself using community detection methods . Once this partition is generated, nodes are classified in a two-dimensional space that measures their intra-module centrality (i.e. the number of connections with other nodes classified in the same group) and their participation coefficient (i.e. the propensity to connect to nodes that are classified in other groups). This two-dimensional space is then fragmented in regions according to some heuristic derived from how nodes distribute in the space. In general, though, there are four main types of nodes, summarized in Figure 1, panel B: (a) provincial hubs, or nodes that are highly connected with other nodes in their module, but poorly connected to other modules; (b) connector hubs, or nodes that are highly connected to other nodes in their module and are also well connected to other groups; (c) non-hub connectors, or nodes that are poorly connected to their module but well connected to other modules; and (d) peripheral nodes, or nodes that are poorly connected to both their module and other groups.
The two role schemes summarized in Figure 1 overlap to some extent: brokers classified as coordinators are more likely to appear in categories (a) or (d), whereas brokers playing a liaison role are more likely to appear in categories (b) and (c). There are two important differences between the two schemes, however. The first is that the GA approach allows us to make a global definition of what counts as a structural hole based on the relative absence of ties across groups (as identified by community detection methods); this is more discriminatory than the local definition of holes based on paths of length two. In doing so, though, the GA method disregards the importance of local connections: it is personal networks that help us identify the role that actors play in their immediate social environments; the ability to broker connections on that level of analysis can also be consequential in emergent group dynamics , . The second difference is that the GA method disregards the directionality of connections, but directionality is crucial in communication networks - and well captured by the GF schema. In the following section, we propose a third hybrid method that integrates the two schemes (HM from here on), and we assess its relative performance for the identification of similarly behaving nodes.
3 Data and methods
We use two data sets to compare the three role detection methods. The first was collected from Twitter for the period April 30 to May 30 2012 using the platform’s search API. We applied filtering parameters to select messages that contained hashtags related to the Occupy and ‘indignados’ movements, retrieving about 445,000 messages. The filtering parameters and observation window were selected with the goal of analyzing communication dynamics around an international call for action, a political protest that took place in May 12. Using the unique author identifiers contained in the messages, we run further queries to reconstruct the network of following/followers. This one-step snowball crawl returned a network of more than 38 million users; of these, we only retained users that had sent at least one message during the observation period, as well as their connections to other users that were also involved in protest-related communication. In addition, we parsed the messages to identify re-tweets (RTs) and mentions (@), which allowed us to reconstruct direct interactions and explicit channels of information flow amongst users. More information about the data can be found in , and a discussion of the sampling procedure in .
Sampling choices and defining the boundaries for data collection are always important parameters in any research design. On the basis of prior work assessing the bias in samples of Twitter networks  we believe that we are probably underestimating the number of users in the sparser parts of the network (i.e. the periphery) and, as a consequence, also the centrality of the most central users: they get many of their connections from peripheral users (whose individual contributions are small, but statistically large when aggregated). An important aspect of the data for our purposes is that it captures a topology of connections (the follower structure) and a more dynamic layer of actual communication and information exchange (via mentions and RTs). This information is difficult to obtain from other sources of data - including our second data set.
The second network we analyze was collected in the context of anthropological research on conflict dynamics in small groups . This network, which tracks communication among the members of a Karate club, is a well-known benchmark in community detection research because it can be characterized by the existence of two large groups or communities - the two factions that ended up splitting and resulted in the foundation of a new club. This network is small () and because of that - and the anthropological research providing contextual information - it offers a more intuitive validity test for role detection methods. We use this data to examine the extent to which our method can be generalized across data sources and social settings.
where the graph is represented as a matrix with cells , which are valued 1 if a link exists between i and j, 0 otherwise; the second term corresponds to the configuration model, that is, the expected counts under the assumption of random connections between the nodes; finally, Kronecker delta has a value 1 if nodes i and j belong to the same community C, and 0 otherwise.
The fast greedy approach to graph optimal partitioning relies on a recursive agglomerative scheme. As such, all nodes belong to their own community at start time (); then the algorithm attempts to merge communities into larger ones. If the resulting partition improves the previous Q value, the merge is accepted and the algorithm advances towards further attempted mergers. Compared to other algorithms to optimize Q (and detect some sort of modular structure), the greedy approach is efficient and reliable: it can handle very large networks and it typically yields a number of modules , which fits well the nature of our data.
We are aware that the fast greedy approach and the partition it yields may not be the most resolved or the most stable in terms of different criteria, for instance persistence ,  or stability , . Determining the stability or persistence of the method is beyond the scope of the current work. Instead, we settle for a partition that we know yields meaningful results and is coherent with a substantive interpretation of the data. The fast greedy approach is also just one of the many methods we could have employed to identify communities in the networks; however, its validity is supported by the fact that the resulting partitions are aligned with contextual information about the data: for the Zachary network, the communities identified are consistent with the ethnographic narrative of the original study ; for the Twitter data, the outputs reflect the divide between ‘indignados’ and Occupy users, and we know that different community detection methods yield qualitatively similar results . Because of this, we are confident that the communities identified by the fast greedy approach have empirical validity.
Finally, we consciously chose a method for the identification of non-overlapping communities because the three role schemes we test required us to do so - different theoretical goals or empirical settings might benefit from a role scheme that allows overlapping membership. However, prior research has established that disjoint communities apply important constraints to dynamics on networks, specifically to dynamics of information diffusion (see, for instance ). It remains an empirical question whether allowing for overlapping membership (where it makes substantive sense) can improve the modeling and explanation of those dynamics.
Once obtained, we used the community partition to determine vertex roles as defined in both the GF and the GA schemes. For the Twitter data, we first compiled the contribution matrices and of N nodes to M modules, where the rows of in- and out-C correspond to nodes and the columns correspond to modules. The elements of are the number of links that node i dedicates to (or receives from) module α, and can be easily obtained by multiplying the adjacency matrix of the network and the partition matrix S. For the Zachary data, we adapted these computations to take into account that ties are undirected. These matrices summarize the tie contributions necessary to apply the GF and GA classification schemes.
where is the number of links of node i to other nodes in the same module , is the average of κ within module , and is the standard deviation of κ in module .
In other words, this score measures how many standard deviations away nodes are from the mean of their communities - the extent to which they are outliers in their own groups.
where is the number of links of node i to nodes in modules s, and is the total degree of node i. This coefficient is closer to 1 if the links are uniformly distributed among all modules and 0 if all links are within the same module.
Although the GA method has proved fruitful in the study of metabolic networks  and in other areas , , it is far from ideal when analyzing directed networks, where a node might have a prominent role because of its out-degree but a modest one when it comes to in-degree. In communication networks this difference matters: it amounts to differentiating sources of information from spreaders. In the context of online networks, and Twitter in particular, this difference is especially relevant since a user can follow a potentially unlimited amount of accounts, thus critically biasing the validity of the measures on which the GA roles are defined. Throughout this work we report on GA roles as obtained from the contribution matrix. We chose to report the results for out-going links following the intuition that the role as an information producer is more relevant than the role as a receiver - admittedly an arbitrary decision, which ought to be interpreted as an additional reason for us to develop a method that integrates in- and out-connectivity in a more meaningful way.
We define the horizontal axis of the two-dimensional space with a module-dependent measure that captures brokerage on a mesoscale. Our approach to this relies, again, on the GF scheme (Figure 1, panel A) but we depart from the original proposal in that we obtain the group partition using the network structure itself (as opposed to some exogenous node attribute). Using the in- and out-contribution matrices and , we measure the frequency with which a node is involved in each of the 5 role categories as defined by GF. We keep these counts in a vector g.
The idea behind this operationalization is that higher ranked brokers play a more important role in building global bridges, given the modular structure of the network and our choice to use modules as the group partition. A node playing the liaison role, for instance, bridges more structural holes at the mesolevel (i.e. it creates ties across communities) than a node playing a coordinating role (i.e. where ties span holes that are internal to the same community). Gatekeepers are lower ranked than representatives because we assume them to have a more passive role in information exchange - although arguably this depends on the actual dynamics of communication. Since we want s to lie in the interval , we normalize the sum by the maximum possible count of the liaison role l in vector g, i.e. the score that an ideal super-connector node would display in the network under study, i.e. .
There are two main questions we want to answer with the methods introduced above: Do nodes that share similar network positions behave similarly in the exchange of information? And which nodes are more significant in allowing information to diffuse? To answer these questions, we first apply and compare the three methods, and then assess their definition of structural similarity through the lens of actual node behavior.
4.1 Distribution of roles
These alternative definitions of roles respond to different theoretical intuitions of why connectivity in networks matters. The first (GF) assumes that nodes that directly mediate communication between pairs that would be disconnected otherwise can benefit from that close arbitration; however, this definition does not tell us much about the larger network structure, and how central or significant brokers are in the overall picture. The second method (GA) presumes that relative centrality and the diversity of connections determine the relevance of a node: it relies on identifying the most prominent nodes within and across groups. Finally, the third method (HM) undermines the relevance of centrality and highlights, instead, the importance of mediation both at the local and global levels. These operationalizations result in substantially different classification of nodes - in the context of our data, for instance, the GA schema turns out to be very restrictive in the definition of region 2 (‘connector hubs’), where no nodes can be found in either of the two datasets.
Most nodes in the Twitter network play (predominantly) a representative role (GF scheme); most are peripheral: they have a low within-module degree and low participation coefficient (GA scheme); and most have low brokerage scores, both at the global and local levels (HM scheme). In the Zachary network, the HM scheme sets two nodes clearly apart from the rest - these nodes are the leaders of the two factions identified in the original ethnographic study , and the centers of the two largest communities identified in Figure 2, panel B. The GF method fails to identify the relevant leaders because all actors are predominantly coordinators, and it only identified a small elite of 7 actors of which the two leaders form part. The GA method, on the other hand, gives more prominence to actors that are central in the smaller communities and hides the instigators of the split. Only the HM role classification clearly sets these two pivotal actors apart.
Overall, these roles are defined on the basis of network topology only: they tell us that there are little opportunities for communication across communities, and that the vast majority of users don’t control many direct diffusion channels - either because they are not well connected, or because their connections are redundant. The following section considers if this distribution of roles is associated with actual communication dynamics - which is what Zachary couldn’t test with his data since his definition of ties implied face-to-face communication.
4.2 Association of roles with activity
The relevance of role schemes relies on their ability to sort out similar behavior. In the context of Twitter data, the expectation is that users with structurally similar positions will behave similarly in the exchange of information. We make use of RTs and mentions as a way to assess communication dynamics. We separate RTs from mentions because they allow information to flow in opposite directions and respond to different logics: RTs are about diffusing information; mentions are about raising awareness or engaging in conversation.
To have a benchmark for comparison, we reshuffled the roles variable for each of the schemes (1,000 permutations), and calculated the mean number of RTs and mentions for the randomized data. To further assess the significance of the comparisons, we also calculated the confidence intervals for the mean values associated to each role (95% level based on bootstrapped percentiles). In panels A and D, for instance, we see that the users who predominantly engage in coordination or liaison roles are less central and visible than users acting as gatekeepers, itinerants, and - most significantly - representatives. Panels B and E suggest, on the other hand, that ‘provincial hubs’ (nodes with high within-module degree but low participation coefficient) are both more central and salient, although these differences are not significant.
The HM scheme uncovers differences that the other two methods fail to capture: the most re-tweeted users are those playing the role of local and global brokers; but the most salient users are brokers only on the local level (panels C and F). This suggests two things: first, that users with role 2 are essential conduits for global flows of information: they not only have more brokering ties across communities, they are also re-tweeted more often, which is a precondition for information cascades; and two, the findings also reveal that there is an organic division of labor in the network: users acting as authorities are different (and have different network positions) than users that become the target of conversation more often. The HM scheme identifies this division of labor more clearly than the other two methods.
Role schemes aim to identify network features that make nodes structurally similar. The assumption is that nodes that are similar in their connections will exhibit a similar behavior or fulfill a comparable function within the system the network maps. The three methods we compare here use the modular structure of the network to define roles, but they differ on how they capture the heterogeneity of individual positions: the GF approach makes use of local features (i.e. mediation in paths on length two); the GA approach focuses on the distribution of degree centrality, and the number of ties that link to different communities, measures both that require global knowledge of the network; and the HM approach, which combines the previous two by assessing brokerage at the local (i.e. personal network) and global (i.e. inter-community ties) levels.
This focus on different network features results in alternative role classifications that, as the findings reported show, capture with more or less success differences in behavior. On the basis of our data, we can conclude that the hybrid model combining local and global measures of brokerage is able to better characterize significant behavior. In the Zachary network, it helps identify the leaders of the two opposing factions; in the context of the Twitter network, it helps identify the division of labor that emerges in the orchestration of political action: on the one hand, users playing role 2 (local and global brokers) act as sources of information, but the most visible users play role 1 (they are the local brokers); on the other hand, most of the messages sending those RTs and mentions are generated by users playing role 4. This allocation of roles not only helps uncover an organic division of labor (organic for unplanned); it also allows inferring the logic that lies behind communication dynamics. If RTs are used to disseminate information from a source to an audience, mentions are used to try to gain the attention of other users or engage in direct conversation. It is not coincidental that users receiving many RTs have more global networks than users receiving many mentions.
There are many other alternative schemes that could possibly shed light on communication behavior; and many of the parameters we fixed in our operationalizations could also be modified. For instance, applying an alternative community detection method would have yielded a different partition; or we could have employed a partition completely independent of network topology, based on some node attribute like membership or affiliation to specific organizations. We think, however, that the modular structure of networks contains significant information about how social systems self-organize, and so we advocate for the topological approach to the classification of nodes (unlike the GF scheme as originally formulated). The two datasets we use exhibit a clear modular structure that responds to identifiable social factors (i.e. conflict dynamics, membership to social movements). This makes the use of our partition substantively meaningful - it reflects well the groups that operate within the network - but the method might yield uninterpretable results if the forces driving the formation of modules in networks are unknown.
Thus, before generalizing our approach and making it applicable to other settings, careful consideration should be given to how the partition is generated and whether it makes sense for the empirical data under analysis. A relevant aspect to take into account in this respect is that in some scenarios overlapping communities might be more meaningful. The three schemes we consider here are based on the assumption that communities are exclusive, but future research should address the effects that multiple memberships would have on the definition of roles.
The second parameter that could be modified is the threshold we use to delineate role regions. We used the 95% percentile, which is arguably an ad hoc heuristic. This threshold was chosen to give prominence to the tail of distributions that are skewed, but future work should also assess changes in the performance of role schemes with shifts in this threshold. The sensitivity of role classification to different heuristics is, ultimately, an empirical question: for some data structures, some divisions will yield more informative classifications than others.
On a theoretical level, we argue that a role scheme based on brokerage is more discriminating than a scheme based on degree. The findings show that roles based on brokerage result in a reduction of the network that is more informative about the functions that nodes play. The reason, we sustain, is that the analysis of brokerage allows integrating more levels of analysis (from local bridges in open triads to ties linking modules), and it thus offers a more informative definition of structural similarity.
This paper extends previous sociological research on the importance of brokerage to the analysis of large networks of communication. Our goal was twofold: to determine whether users that share similar network positions behave similarly; and to identify the roles that are more significant for information diffusion. Our findings confirm the expectation that nodes creating bridges within and across communities (role 2 in the HM scheme) are more likely to be used as sources of information: they are the leaders in the Zachary network and the users receiving a higher number of RTs in the Twitter network. This suggests that these users are crucial in the global exchange of information. For the Twitter data, we also found that roles reveal a division of labor in how communication takes place: users that act as authorities are not the most visible; and the most central or visible are not the most active. Overall, the relative performance of different role schemes depends on the empirical data; but the methods to define structural similarity can be generalized beyond the specificities of particular data sets. Improving our definitions of structural similarity is important because they offer a method to model and simplify complex structures - an exercise that is increasingly necessary to make sense of large-scale digital networks and the communication they facilitate.
This article is published under license to BioMed Central Ltd.Open Access This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.