Effects of time window size and placement on the structure of an aggregated communication network
Complex networks are often constructed by aggregating empirical data over time, such that a link represents the existence of interactions between the endpoint nodes and the link weight represents the intensity of such interactions within the aggregation time window. The resulting networks are then often considered static. More often than not, the aggregation time window is dictated by the availability of data, and the effects of its length on the resulting networks are rarely considered. Here, we address this question by studying the structural features of networks emerging from aggregating empirical data over different time intervals, focussing on networks derived from time-stamped, anonymized mobile telephone call records. Our results show that short aggregation intervals yield networks where strong links associated with dense clusters dominate; the seeds of such clusters or communities are already visible for intervals of around one week. The degree and weight distributions are seen to become stationary after a few days and a few weeks, respectively. An aggregation interval of around 30 days yields the most stable networks, in the sense that networks from consecutive windows are most similar. For longer intervals, the effects of weak or random links become increasingly strong, and the average degree of the network keeps growing even for intervals up to 180 days. The placement of the time window is also seen to affect the outcome: for short windows, different behavioural patterns play a role during weekends and weekdays, and for longer windows it is seen that networks aggregated during holiday periods are significantly different.
Complex networks have become a standard tool for representing the interaction structure of complex systems [1, 2]. The strength of the network approach comes from its ability to cast the essential features of increasingly complex systems into a manageable form - in the simplest representation, interacting elements are mapped to nodes that are connected by links if they are known to interact. While this coarse-grained view has given a lot of insight into the key characteristics of such systems, it is evident that it entails several approximations and underlying assumptions. The first is the criterion for the existence of links - if the interactions are not binary (on/off) by nature, when is an interaction strong enough to be represented as a link? A common way of taking such strengths into account is to assign weights to the links of the network. The second approximation is related to the time domain. Standard network theory deals with networks that are either static or only slowly changing in time. However, in reality, there are typically dynamical changes in the network structure on multiple time scales. Consequently, representing an empirical system as a static network involves aggregating or integrating over the network dynamics over some time interval. In addition, in many cases, the interactions of the system are not continuously active. While the microdynamics of link activations may be taken into account with the temporal network framework, for the aggregated network approach, the interaction frequencies are often used to define the edge weights. It is evident that when aggregating interactions over time, the choice of the aggregation window and its length have consequences for the characteristics of the resulting networks.
However, this issue has often been neglected in the literature; often, the aggregation interval has been dictated by the availability of data, while it would be beneficial to ensure that the network properties that one is interested in are captured by the aggregated networks.
In this paper, we address this question by monitoring and analyzing the features of network structure emerging from aggregation over different time intervals for an empirical data set of human communication. We present a detailed study of the effects of the aggregation window on the structural features of human communication networks, which are known to display dynamics on multiple overlapping time scales. The data come in the form of a time-stamped sequence of mobile telephone calls between anonymized customers of a Belgian mobile operator for a period of 6 months. This sequence is then aggregated over time to form links between customers, and key features of the resulting networks are studied. Although we only study a single set of data, we expect that our conclusions generally hold for similar data sets, as the mechanisms behind network formation are expected to be rather general for such communication networks (see Discussion).
There is an increasing number of studies of human social networks derived from telecommunication records. However, the networks analyzed in the literature have been constructed using very different time windows - a day, a week, one month, and several months (e.g. [9, 10]) - and therefore it is crucial to understand what features of the underlying system are captured by different aggregation intervals. For such social communication networks, there are several mechanisms that are expected to affect the resulting network structure. First, the distribution of link weights, i.e. call frequencies, is broad [9, 11]. Thus there are high-weight links that should on average be observed earlier on in the aggregation process, and many links of low weight that take a long time to be observed. Second, link weights are correlated with network topology, such that high-weight links are associated with denser network neighbourhoods. Third, for links of any weight, it is known that the distributions of inter-call times are also broad, i.e. call sequences are bursty [12, 13], giving rise to longer-than-Poissonian waiting times between calls. Fourth, there are circadian patterns, where the overall level of call activity varies by hour, as well as weekly patterns where call behaviour depends on the day of the week. Fifth, there are changes in the network itself too - relationships grow and wane in strength, new links appear, and old ones are terminated. The aggregated network structure then reflects the joint effect of the above mechanisms that are associated with different time scales. Thus, one cannot expect that there is a proper aggregation interval that represents the true network; rather, different structural features emerge with different aggregation times. In order to understand what the network structure represents, it is important to understand this process.
This paper is structured as follows: first, we discuss the structural and temporal inhomogeneities that are expected to affect the features of aggregated networks. Then, we characterize the dependence of fundamental scalar measures of network structure on the aggregation interval, and address the properties of links added at different times during the aggregation procedure. We find that clustering of the network peaks at 9 days, as the strongest links associated with dense clusters are observed early on in the process. Another time scale is related to the stability of the aggregated networks - networks aggregated for around 30 days display the largest similarity between consecutive windows. Moving from scalar measures to distributions, we find that the degree and weight distributions become surprisingly stationary in 1-2 weeks of aggregation time. Finally, we investigate in detail the effects of different aggregation window placements, and show that the underlying behavioural patterns affect the aggregated networks: on short time scales, weekends differ from weekdays, and on longer scales, holiday periods give rise to anomalies in the aggregated network structure.
Our data consist of the anonymized mobile telephone call records of the customers of a Belgian mobile operator from October 1, 2006 to March 31, 2007. Each customer is uniquely identified, and each call is associated with a time stamp and a duration. This data set has already been studied from a static perspective in several papers [10, 15, 16]. As our focus is on link dynamics, we filter out all customers who have modified their subscription plan during the data collection period. This removes new customers, and customers who have cancelled their subscription during the period. We also only concentrate on the customers of this specific operator (market share in Belgium ∼30%), and discard all calls to/from customers of outside operators. The above filtering yields a network that has 2.1 million customers, making over 170 million calls during the collection period.
For reference, we also construct two randomized ensembles, based on two ways of randomizing the time stamps. In both cases, the resulting randomized reference sequences contain the same number of calls between the same individuals as the original data. In the first ensemble, the time stamps of all calls are generated uniformly at random over the complete time range, in order to remove the system-level call frequency pattern (daily and weekly pattern). In the second ensemble, the time stamps of all calls are randomly reshuffled, which retains the daily and weekly patterns, but removes other temporal correlations between the timings of calls on links. When aggregating over the entire observation period, the call sequences from both reference models produce networks that are equal to the network obtained by aggregating the original data. In the remainder of the paper, we will refer to these references as the “uniform” and the “shuffled” references, respectively.
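As an illustration, the two reference models can be sketched as follows. This is a minimal sketch and not the code used for the analysis: the function names and the representation of calls as (time, caller, callee) tuples are assumptions made for the example.

```python
import random

def uniform_reference(calls, t_start, t_end, seed=0):
    """Redraw every time stamp uniformly at random between t_start and t_end.

    The caller/callee pairs are untouched, so the number of calls per pair
    is preserved while the daily and weekly activity pattern is removed.
    """
    rng = random.Random(seed)
    return [(rng.uniform(t_start, t_end), caller, callee)
            for _, caller, callee in calls]

def shuffled_reference(calls, seed=0):
    """Randomly permute the original time stamps among the calls.

    The multiset of time stamps is preserved, so the system-level daily and
    weekly pattern is retained, but correlations between the timings of
    calls on the same link are destroyed.
    """
    rng = random.Random(seed)
    times = [t for t, _, _ in calls]
    rng.shuffle(times)
    return [(t, caller, callee)
            for t, (_, caller, callee) in zip(times, calls)]
```

In both functions the sequence of (caller, callee) pairs is returned unchanged, which is why aggregating either reference over the full observation period reproduces the network of the original data.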
Structural and temporal inhomogeneities
Evolution of network structure
In contrast, the growth in the number of edges is much more gradual, as seen in Figure 3(b). Here, an aggregation time of days is required for catching 90% of the edges of the final 6-month aggregated network. In addition, unlike for the number of nodes, for long aggregation times, the number of edges keeps on growing steadily and no saturation in growth is observed. This is also reflected in the growth of the average degree (Figure 3(c)). Hence, even though the number of nodes becomes fairly stable in an aggregation period of 6 months, one cannot claim to have captured all the edges of the underlying network, and for longer windows, the average degree would still increase. This reflects the joint effect of several factors: first, as the edge weight distribution is broad, there are large numbers of edges with very low call frequencies, and observing those evidently takes a long time; there may be many edges where calls take place less frequently than once in six months. In addition, the ubiquitous burstiness that results in longer waiting times between calls slows down the growth in the number of links especially for the low-weight links - this effect is visible in Figure 3(b), although it is not very strong. Second, for such long observation periods, one can argue that the changes in the network structure should already have a visible effect: new social ties are formed while older ties wane in strength and may even cease to exist. Third, as the data contains all the calls made by the subscribers, many of the calls may be random in the sense that they do not reflect the structure of the underlying social network – as there is no background information on the nature of the calls, a random call to one’s dentist or a call in response to an advertisement on used car sales are counted as links, just as calls to one’s friends or relatives. This third mechanism would naturally result in an ever-growing number of links. 
The average link weights (Figure 3(d)) must necessarily keep on growing, since all new calls on existing edges are added to their link weight. This growth slows down towards the end of the observation period but does not become as linear-looking as the average degree growth; note that the new links giving rise to growing degrees also affect average weights. Comparison with the uniformly random times reference reveals the effect of burstiness - weights grow faster in the original data because of burstiness, where rapid sequences of calls following one another quickly increase link weights.
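The aggregation procedure behind these quantities can be sketched in a few lines. The sketch below is illustrative only and rests on assumptions not stated in the text: calls are (time, caller, callee) tuples, links are treated as undirected, the link weight is the number of calls in the window, and averages are taken over nodes and links that appear in the window. The function names are hypothetical.

```python
from collections import Counter

def aggregate(calls, t_start, t_end):
    """Return {edge: weight} for calls with t_start <= t < t_end.

    Edges are undirected (assumption) and the weight of an edge is the
    number of calls observed on it within the window.
    """
    weights = Counter()
    for t, caller, callee in calls:
        if t_start <= t < t_end and caller != callee:
            weights[frozenset((caller, callee))] += 1
    return weights

def summary(weights):
    """Number of nodes and edges, average degree and average link weight."""
    nodes = set()
    for edge in weights:
        nodes.update(edge)
    n, m = len(nodes), len(weights)
    return {
        "nodes": n,
        "edges": m,
        "avg_degree": 2 * m / n if n else 0.0,
        "avg_weight": sum(weights.values()) / m if m else 0.0,
    }
```

Calling `summary(aggregate(calls, t0, t0 + T))` for increasing window lengths `T` gives growth curves of the kind discussed above.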
where n_ij is the number of common neighbours of i and j, and k_i and k_j are their degrees. Thus the overlap measures the fraction of common neighbours out of all neighbours of the two connected nodes. Figure 5(b) displays the average final 6-month overlap of the added links as a function of aggregation time. Here we have calculated the overlap of each link in the final 6-month aggregated network, and averaged over these values for links that are added to the network at time t. It is seen that the links that are added early on in the aggregation process have on average a higher overlap than those added later; the final overlap is a decreasing function of the time at which a link is added. Hence, even when the aggregation times are short, the networks capture features of the community structure of the final aggregated networks. Interestingly, the overlap also shows a strong circadian and weekly pattern - its highest peaks correspond to the early morning when the overall call rate is very low. Thus, if calls are made during these hours, they are likely to be targeted towards people in the strongest clusters of friends and family.
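For concreteness, the overlap of a link can be computed as in the sketch below. The defining formula is not reproduced in this text, so the sketch assumes the commonly used form O_ij = n_ij / ((k_i - 1) + (k_j - 1) - n_ij), which matches the verbal description above (common neighbours divided by all neighbours of the two endpoints, excluding the endpoints themselves). The function names are hypothetical.

```python
def neighbourhoods(edges):
    """Build an adjacency dict {node: set of neighbours} from undirected edges."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    return adj

def overlap(adj, i, j):
    """Neighbourhood overlap of the link (i, j), assumed to exist in adj.

    n_ij is the number of common neighbours; the denominator counts all
    neighbours of i and j other than i and j themselves. Links whose
    endpoints have no other neighbours are assigned overlap 0 (assumption).
    """
    n_ij = len(adj[i] & adj[j])
    denom = (len(adj[i]) - 1) + (len(adj[j]) - 1) - n_ij
    return n_ij / denom if denom > 0 else 0.0
```

For a triangle a-b-c with an extra link a-d, the link (a, b) has one common neighbour out of two neighbours in total, while the dangling link (a, d) has none.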
Behaviour of statistical distributions
On the effects of aggregation window placement
In all analysis so far, we have assumed that the exact placement of the aggregation window, i.e. the time point of its beginning, plays no role in the results. However, as the characteristic daily and weekly patterns of Figure 2(b) indicate, the overall level of call activity in the network displays large variations by hour and day, and this is expected to have an effect on the aggregated networks, at least on shorter time scales. In addition, there may be less trivial effects if the actual behavioural patterns of individuals - affecting who they call - are also time-dependent. In this final section, we will address these issues.
In many cases, complex networks studied in the literature are constructed by aggregating links or sequences of interactions between the constituent nodes over some period of time, often limited by the availability of data, and their static structural features are then studied. The effects of the aggregation interval length and placement have been discussed only rarely [5, 21]. In order to shed light on this problem, we have investigated the structural features of mobile telephone call networks aggregated over intervals of increasing length. To ensure that the results are not affected by churn, i.e. customers leaving and subscribing to the operator, we only considered customers whose subscriptions did not change over the entire data interval from October 1, 2006 to March 31, 2007.
Evidently, there are several dynamical mechanisms and inhomogeneities that affect the features of networks aggregated over different time intervals, from broad distributions of numbers of calls on links to burstiness-related long inter-call times and dynamical changes in the network itself, and disentangling the effects of such features is not possible on the basis of time-stamped data alone. Thus the resulting networks display properties that arise from the interplay of such features associated with multiple time scales, and the question of a “correct” or proper aggregation interval length is ill-posed. However, on the basis of our analysis, some statements about the general emergence of network features can be made. First, because of the broad link weight distribution and Granovetterian weight-topology correlations, where strong links are associated with dense neighbourhoods, the seeds of the underlying community structure are visible in aggregated networks already for rather short aggregation intervals of ∼1 week: the clustering coefficient of the network peaks at around 9 days, and the earlier a link is observed, the more likely it is to participate in a dense neighbourhood in the final network aggregated over the available data period, as seen by monitoring the neighbourhood overlap of the links. However, at the same time, although the growth of the number of nodes saturates fairly early, the number of links and the average degree of nodes keep on growing even for long aggregation intervals. This suggests that for short windows, the cluster and community structures dominate, whereas for longer windows, the contribution of both “weak” links and links that are practically random, i.e. arise from one-off calls, increases.
When networks from consecutive windows of different lengths are compared, they are seen to be maximally similar at a length of ∼30 days; this can be considered as the time scale of the recurrent, stable links, beyond which the weaker links start to have a considerable effect on network structure. The scaled degree and weight distributions become stationary already for short time intervals of a few days or weeks, respectively.
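A comparison of consecutive windows can be sketched as follows. The similarity measure used in the analysis is not specified in this text, so the example assumes a simple stand-in, the Jaccard index of the two edge sets; calls are again assumed to be (time, caller, callee) tuples, links are treated as undirected, and the function names are hypothetical.

```python
def edge_set(calls, t_start, t_end):
    """Undirected edges observed in the window t_start <= t < t_end."""
    return {frozenset((a, b)) for t, a, b in calls
            if t_start <= t < t_end and a != b}

def consecutive_similarity(calls, t0, length):
    """Jaccard index between the edge sets of two consecutive windows.

    The windows are [t0, t0 + length) and [t0 + length, t0 + 2 * length).
    The Jaccard index is an assumed stand-in for the measure used in the
    paper, which is not given in this text.
    """
    first = edge_set(calls, t0, t0 + length)
    second = edge_set(calls, t0 + length, t0 + 2 * length)
    union = first | second
    return len(first & second) / len(union) if union else 0.0
```

Evaluating `consecutive_similarity` for a range of window lengths and locating its maximum is one way to look for a characteristic time scale of the kind reported above.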
As the above results are from one dataset only, it is worth considering how general they are. As there are common underlying features of social networks - broad tie strength distributions and the Granovetterian relationship between tie strengths and topology - we believe that the fast emergence of clusters of strong links followed by increasing numbers of weaker links not associated with triangles is a general feature that holds across different communication networks. Likewise, one may assume that the time scale for obtaining stablest networks (∼30 days in our case) should remain roughly similar. However, in both cases, the exact numbers for the characteristic time scales might differ as they may also be affected by the overall call activity level. We also believe that the collapse of scaled distributions, indicating stationarity in the underlying processes, should be observable in other data sets too.
In addition to the effects of the aggregation window length, we have shown that comparing networks aggregated over windows of different placement can yield insight into the dynamic features of the behavioural patterns of individuals. The differences in the growth of the largest connected component point towards different behavioural modes in the weekends and during weekdays, where weekend calls are more frequently related to high-overlap links and dense clusters, and thus build the largest connected component more slowly; weekday calls play the role of “topological shortcuts” in the aggregation process and more rapidly give rise to overall network connectivity. Additionally, we have observed very different calling patterns during holiday periods, giving rise to aggregated networks that significantly differ from the networks constructed from data outside the holiday periods. Thus, the aggregation interval placement matters, and care should be taken when interpreting the structural features of networks constructed from data that involves holidays or other special periods.
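To illustrate the weekday/weekend comparison, the sketch below splits calls by day type and computes the size of the largest connected component of each aggregated network with a union-find structure. It is a minimal example under stated assumptions: calls are (datetime, caller, callee) tuples, Saturday and Sunday count as the weekend, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

def split_by_day_type(calls):
    """Split calls into weekday and weekend edge lists.

    datetime.weekday() returns 0 for Monday and 6 for Sunday, so values
    of 5 or more are treated as the weekend (assumption).
    """
    weekday, weekend = [], []
    for when, caller, callee in calls:
        (weekend if when.weekday() >= 5 else weekday).append((caller, callee))
    return weekday, weekend

def largest_component_size(edges):
    """Number of nodes in the largest connected component (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    sizes = Counter(find(x) for x in list(parent))
    return max(sizes.values(), default=0)
```

Tracking `largest_component_size` separately for the weekday and weekend edge lists as the window grows gives the kind of comparison described above.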
GK, MK, SB, VB and JS designed the research and analysis, GK prepared the data, GK and SB performed the analysis, GK, MK and JS wrote the paper.
a. Because social networks are geospatially embedded, zip-code-based geospatial sampling yields networks that preserve rather well the typical characteristics of social networks. On the contrary, in snowball sampling where all nodes at a chosen graph distance from the focal node are included in the sample, the majority of nodes are at the “surface” of the snowball, as the number of nodes grows exponentially with graph distance. This artifact results in low clustering and makes observing community structure difficult.
MK and JS acknowledge financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number: 238597 (project ICTeCollective). GK acknowledges support from the Concerted Research Action (ARC) “Large Graphs and Networks” from the “Direction de la recherche scientifique - Communauté française de Belgique”. The scientific responsibility rests with its authors.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.