Data preparation—creating Email-based social network
The network chosen for the experiments was extracted from the email logs of the Wroclaw University of Technology (WrUT). The experimental data were collected over a period of 21 months (February 2006–October 2007). The network was built after cleansing the data, which included removing fake and external email addresses. The employees of WrUT are the nodes of the network, whereas the email messages exchanged between them were used to infer their relationships (edges in the network). Although every email message provides information about the sender’s activity, it can be sent to many recipients simultaneously. An email sent to only one person reflects strong attention of the sender directed to this recipient, while the same email sent to 20 people does not. For that reason, the intensity of email communication I(x, y) between email users x and y has been defined as
$$ I(x,y)=\displaystyle{\sum_{i=1}^{card({\rm EM}(x,y))}\frac{1}{n_i(x,y)}} $$
(5)
where EM(x, y) is the set of all email messages sent between x and y and \(n_i(x, y)\) denotes the number of all recipients of the ith email sent between x and y (Kazienko et al. 2009).
In consequence, every email with more than one recipient is treated as 1/n of a regular one (n is the number of its recipients). Although ‘to-list’ recipients are likely to be of much greater importance than ‘cc-list’ recipients, both groups are treated in the same way, i.e. the total number of recipients of an email is always taken into account. This approach results from the fact that the obtained data do not indicate whether a recipient of an email is on the ‘to-list’ or the ‘cc-list’.
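The computation in Eq. 5 can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the log layout (a list of `(sender, recipients)` pairs), the function name `email_intensity`, and the decision to ignore duplicate and self-addressed recipients are assumptions made for the example.

```python
from collections import defaultdict


def email_intensity(messages):
    """Return {frozenset({x, y}): I(x, y)} following Eq. 5.

    `messages` is an iterable of (sender, recipients) pairs; this log
    layout is an assumption made for illustration.
    """
    intensity = defaultdict(float)
    for sender, recipients in messages:
        # Assumption: duplicates and self-addressed copies are not counted.
        unique = set(recipients) - {sender}
        if not unique:
            continue
        share = 1.0 / len(unique)  # an email to n recipients counts as 1/n
        for recipient in unique:
            # Undirected pair: messages in both directions accumulate.
            intensity[frozenset((sender, recipient))] += share
    return dict(intensity)


log = [
    ("a", ["b"]),                 # adds 1 to I(a, b)
    ("b", ["a", "c"]),            # adds 1/2 to I(a, b) and to I(b, c)
    ("c", ["a", "b", "d", "e"]),  # adds 1/4 to every pair containing c
]
intensity = email_intensity(log)
```

Keying the result by an unordered pair mirrors the fact that EM(x, y) collects the messages sent between x and y in either direction.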
The resulting social network \(SN = \langle N, I\rangle\) is defined as a tuple consisting of a set of network nodes N and a set of relationships that are described by their mean intensity \({{I : N\times N \rightarrow\ \mathbb{R}^+ \cup \{0\},}}\) given by Eq. 5. Note that the resulting structure is a non-directed graph with intensity I as a label assigned to the relationships.
It should be emphasized that the social network derived from the email logs does not have a static structure. The existence of any link (i.e. relationship) in such a graph is the result of a series of discrete events (email messages) which occur at certain time instants and usually with changing frequency. We may also think of the computed relationship intensity as an inverse measure of the social distance between network members (nodes): a greater I reflects a smaller distance in the social space. In order to track changes in relationship strength we have used a sliding window approach.
For the experiments the data from a period of 84 days were selected and divided into frames covering 7 days each. This made it possible to create 12 social network graphs \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1), \ldots, {\hbox{SN}}(t_n)\) where \(t_0, t_1,\ldots, t_n\) are discrete instants of time. Each network is created according to the procedure defined above on the basis of a 7-day period starting at \(t_0, t_1, \ldots, t_n.\) The networks \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1), \ldots, {\hbox{SN}}(t_n)\) are temporal images of an evolving social structure built on the basis of email communication. In addition, only users who were active in all time windows were taken into account, as they constitute the core of the network.
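The windowing step can be illustrated with a short sketch. It assumes consecutive, non-overlapping 7-day frames (consistent with 84 days yielding 12 graphs) and treats a user as active in a window if they sent or received at least one email in it; the record layout, the start date, and the names `split_into_windows` and `core_users` are hypothetical.

```python
from datetime import datetime, timedelta


def split_into_windows(messages, start, n_windows=12, days=7):
    """Assign (timestamp, sender, recipients) records to consecutive frames."""
    windows = [[] for _ in range(n_windows)]
    width = timedelta(days=days)
    for timestamp, sender, recipients in messages:
        index = (timestamp - start) // width  # integer frame number
        if 0 <= index < n_windows:
            windows[index].append((sender, recipients))
    return windows


def core_users(windows):
    """Users who sent or received at least one email in every window."""
    active_sets = []
    for window in windows:
        active = set()
        for sender, recipients in window:
            active.add(sender)
            active.update(recipients)
        active_sets.append(active)
    return set.intersection(*active_sets) if active_sets else set()


start = datetime(2007, 1, 1)  # hypothetical start date
messages = [
    (start + timedelta(days=1), "a", ["b"]),
    (start + timedelta(days=8), "b", ["a", "c"]),
    (start + timedelta(days=20), "c", ["a"]),  # falls outside two windows
]
windows = split_into_windows(messages, start, n_windows=2)
core = core_users(windows)
```

Each list in `windows` can then be turned into one snapshot graph with the intensity of Eq. 5, restricted to the users in `core`.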
Distance matrix creation
The distance between two nodes should reflect their proximity. The most obvious choice, graph distance expressed as the length of the shortest path between the nodes, does not really fit the problem of modelling the dynamics of an email network, especially if the graph is weighted. For example, suppose that the shortest path between nodes x and y has a total weight of 0.7, but it passes through two intermediate nodes. At the same time the shortest path between v and w has a total weight of 0.3, but there are no intermediate nodes at all. In the context of an email network it means that v and w communicate directly, but not very often. On the other hand, x and y do not communicate directly with each other, but their nearest neighbours do so frequently, and x and y communicate with those neighbours frequently too. In practice it means that x and y may not even know each other, while v and w certainly do. Hence in this case the standard graph distance is misleading, and for our experiments we propose an alternative definition of social distance. Denoting by \(D_{\hbox{EC}}(x \leftrightarrow y)\) the number of edges in the shortest undirected path between nodes x and y and by \(D_{\hbox{EW}}(x \leftrightarrow y)\) the sum of weights along the same path, normalized to the (0,1) range, the total distance between nodes x and y is given by the following formula:
$$D(x \leftrightarrow y) = \left\{ \begin{array}{ll}D_{\hbox{EC}} (x \leftrightarrow y) + D_{\hbox{EW}} (x \leftrightarrow y) & \hbox{if}\; \exists (x \leftrightarrow y)\\ \max (D_{\hbox{EC}})+\max (D_{\hbox{EW}})+1 & \hbox{if}\; \nexists (x \leftrightarrow y)\end{array} \right. $$
(6)
As a result the distance will always fall into the (1,2) interval for directly connected nodes, (2,3) if there is one intermediate node, etc. Note that in this setting the number of edges in the shortest path contributes the most, while the additional information given by the edge weights is also used. Equation 6 also assigns a finite distance value to all pairs of nodes not connected by any path, because the embedding algorithm we have used requires the distance to be defined for every pair of nodes.
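A possible implementation of Eq. 6 is sketched below. Several details are assumptions, because the text does not fix them: the edge weights are taken as given positive numbers, ties between equally short paths are broken by taking the first path found by breadth-first search, and the weight sums are scaled linearly by a value slightly larger than the largest observed sum so that they stay strictly inside (0,1). The function name `social_distances` and the `margin` parameter are hypothetical.

```python
from collections import deque


def social_distances(nodes, weights, margin=1e-6):
    """Return {frozenset({x, y}): D(x, y)} for all node pairs (Eq. 6).

    `nodes` is a list; `weights` maps frozenset({x, y}) to a positive weight.
    """
    adjacency = {node: [] for node in nodes}
    for edge in weights:
        x, y = tuple(edge)
        adjacency[x].append(y)
        adjacency[y].append(x)

    hops, path_weight = {}, {}
    for source in nodes:
        # BFS yields the minimum edge count D_EC; the weight is summed along
        # the first shortest path found (tie-breaking is an assumption).
        seen = {source: (0, 0.0)}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            count, total = seen[current]
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    w = weights[frozenset((current, neighbour))]
                    seen[neighbour] = (count + 1, total + w)
                    queue.append(neighbour)
        for target, (count, total) in seen.items():
            if target != source:
                pair = frozenset((source, target))
                hops.setdefault(pair, count)
                path_weight.setdefault(pair, total)

    # Assumed normalisation: linear scaling that keeps D_EW inside (0, 1).
    max_total = max(path_weight.values(), default=0.0)
    scale = max_total * (1.0 + margin) if max_total > 0 else 1.0
    d_ew = {pair: total / scale for pair, total in path_weight.items()}
    # Distance assigned to pairs with no connecting path.
    fallback = max(hops.values(), default=0) + max(d_ew.values(), default=0.0) + 1

    distances = {}
    for i, x in enumerate(nodes):
        for y in nodes[i + 1:]:
            pair = frozenset((x, y))
            distances[pair] = hops[pair] + d_ew[pair] if pair in hops else fallback
    return distances


nodes = ["a", "b", "c", "d"]  # "d" is isolated
weights = {frozenset(("a", "b")): 0.2, frozenset(("b", "c")): 0.4}
D = social_distances(nodes, weights)
```

In this toy graph the pair (a, b) lands in (1,2), the pair (a, c) in (2,3), and every pair involving the isolated node d receives the same finite fallback value, which is larger than any distance between connected nodes.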
Embedding networks in the Euclidean space
With the distance matrices in place the graphs \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1),\ldots,{\hbox{SN}}(t_n)\) can be embedded in the Euclidean space (two or more dimensional), where each node is represented by a point with given coordinates. The resulting sets of points \({\hbox{SN}}_0, {\hbox{SN}}_1,\ldots,{\hbox{SN}}_n\) represent the temporal network images.
An important issue, which should be discussed here, is the dimensionality of the embedding space. Most embedding algorithms have been designed for the purpose of graph visualization. This naturally implies a two- or three-dimensional embedding. However, the higher the dimensionality of the embedding, the more accurately the social distances are mapped into the Euclidean space. Figure 3 depicts the average distance distortion (footnote 1) of the embedding as a function of dimensionality for the WrUT email network and for (a) BBS, (b) CMDS methods. As expected, in both cases the accuracy of the embedding grows with dimensionality. Please note that the scales on the vertical axes in Fig. 3 are different. It should be emphasized that the pace of accuracy growth with increasing dimensionality is much faster in the case of BBS than CMDS. Intuitively we should choose the number of dimensions to be as high as possible. There is a limit, however, which results from the so-called ‘curse of dimensionality’ (Bishop et al. 1995), and especially the ‘distance concentration’ phenomenon, which, as demonstrated in (Budka et al. 2011), is particularly relevant in the context of dynamic molecular simulation of potential fields in the Euclidean space.
It has been observed that as the number of dimensions grows, the Euclidean distance loses its discriminative power, regardless of the characteristics of the dataset (Aggarwal et al. 1973; Francois et al. 2005). The reason for this is that under a broad set of conditions the mean value of the \(L_2\)-norm distribution grows with data dimensionality while the variance remains approximately constant (Fig. 4) (Francois et al. 2005). As a result, the nearest and farthest neighbours of any point appear to be at approximately the same distance, so the ratio of the distances to the nearest and farthest neighbour tends to 1. As argued in (Beyer et al. 1999), this can occur even for sets with as few as ten dimensions, and the decrease in the ratio between the farthest and nearest neighbour distance is steepest in the first 20 dimensions. The effect is additionally magnified by the limited precision of calculations a computer can handle and often leads to the molecular simulation failing to converge (Budka et al. 2011). Hence in practice the embedding dimensionality needs to be a compromise between the distance distortion and the negative effects of high dimensionality. For this reason we have decided to embed each graph into \(2,3,\ldots,20\) dimensions to investigate the mapping between graph distances and distances in the embedded graph.
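The distance concentration effect is easy to reproduce numerically. The sketch below is an illustration on synthetic data only (points drawn uniformly from the unit hypercube, which is an assumption and unrelated to the WrUT network): for a fixed query point it computes the ratio of the distance to the nearest neighbour to the distance to the farthest neighbour, which moves towards 1 as the dimensionality grows.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest neighbour distance for one query point."""
    rng = random.Random(seed)
    # Synthetic data: uniform points in the unit hypercube (an assumption).
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    distances = [math.dist(query, p) for p in points[1:]]
    return min(distances) / max(distances)


ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 12, 20, 200)}
for dim, ratio in ratios.items():
    print(f"{dim:>3} dimensions: nearest/farthest = {ratio:.3f}")
```

A ratio close to 1 means that the nearest and the farthest neighbour are almost equally far away, which is the loss of discriminative power described above.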
The embedding algorithm has to ensure that the Euclidean distances between points (nodes) fit, as closely as possible, the distances in the social space (relation strengths in the original graphs). As a result one obtains a representation of the social system in which the network is seen as an assembly of N particles, representing the nodes of the social network.
After reviewing several embedding methods, it was decided that two sets of experiments would be performed: (1) with the Big-Bang Simulation (BBS) and (2) with CMDS, as these methods make it possible to embed a graph into an arbitrary number of dimensions. Additionally, BBS models the network nodes as a set of particles, which is consistent with the next part of the experiments, where a molecular modelling approach is used to determine the dynamics of a social network.
Embedding was performed on the 12 previously extracted social networks. Each of the networks was embedded into \(2,3,\ldots,20\) dimensions using BBS. CMDS inherently selects the best number of dimensions (in excess of 400 in our case), so in this case the parameter was not set during the experiments; instead, only the first \(2,3,\ldots,20\) dimensions produced by CMDS have been used in our simulations.
For each of the dimensionalities given above we have analysed how well the distances between particles in the social network (graph) are reflected in the embedded space. To avoid the negative effects of high dimensionality we decided to select the lowest number of dimensions for which the mean values of the distances after embedding, grouped by the corresponding graph distance ranges [1, 2), [2, 3), [3, 4) etc., were well separated. This has been achieved for 12 dimensions, where for both BBS (Fig. 5) and CMDS (Fig. 6) the distributions of distances in the embedding space are approximately unimodal and their expected values are in the required range.
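The selection rule can be approximated programmatically. The sketch below is a simplification under stated assumptions: it only checks that the mean embedded distance of each graph-distance range [k, k+1) falls inside that range and does not test unimodality, which in the paper was assessed from the distributions shown in the figures. The function names and the toy coordinates are hypothetical.

```python
import math
from collections import defaultdict


def means_in_range(graph_distances, coordinates):
    """True if, for every bin [k, k+1) of graph distances, the mean
    embedded Euclidean distance of the pairs in that bin lies in the bin."""
    bins = defaultdict(list)
    for pair, d_graph in graph_distances.items():
        x, y = tuple(pair)
        embedded = math.dist(coordinates[x], coordinates[y])
        bins[math.floor(d_graph)].append(embedded)
    return all(k <= sum(v) / len(v) < k + 1 for k, v in bins.items())


def lowest_adequate_dimension(graph_distances, embeddings):
    """Smallest dimensionality whose embedding passes the check, else None.

    `embeddings` maps a dimensionality to {node: coordinates}.
    """
    for dim in sorted(embeddings):
        if means_in_range(graph_distances, embeddings[dim]):
            return dim
    return None


# Toy example: three nodes, graph distances in the ranges [1, 2) and [2, 3).
graph_distances = {
    frozenset(("a", "b")): 1.5,
    frozenset(("b", "c")): 1.5,
    frozenset(("a", "c")): 2.5,
}
embeddings = {
    1: {"a": [0.0], "b": [1.5], "c": [3.0]},  # a-c ends up at distance 3.0
    2: {"a": [0.0, 0.0], "b": [1.25, 0.83], "c": [2.5, 0.0]},
}
best = lowest_adequate_dimension(graph_distances, embeddings)
```

In the toy data the one-dimensional layout stretches the a-c pair out of its range, while the two-dimensional layout keeps every mean inside its bin, so the second embedding is the first one accepted.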
For the reasons given above, the actual molecular simulation has been performed in 12-dimensional space. However, for visualisation purposes and to present the general idea, the figures show, where appropriate, the two-dimensional embedding and molecular simulation.
After selecting the number of dimensions, the next stage of the experiments was to embed the created social network snapshots into the Euclidean space. As discussed in Sect. 3, the embedded graphs serve as input to the molecular simulation process.
Setting up the dynamic molecular model
Because the sets of network nodes in \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1),\ldots,{\hbox{SN}}(t_n)\) are equal, each point (node) is represented in every one of the sets \({\hbox{SN}}_0, {\hbox{SN}}_1,\ldots,{\hbox{SN}}_n\) and is active in each of the windows. We may think of these points as particles moving as a result of interactions (email communication) between them. At this point we use the formalism of molecular dynamics to associate a potential U with every particle (network node). The actual characteristic of this potential depends on the behaviour of the particles changing their positions at time instants \(t_0, t_1,\ldots, t_n.\)
The first experiments were performed using the standard Lennard–Jones potential function (Juszczyszyn et al. 2009; Musial et al. 2010). The analysis of server logs has revealed some features of the dynamics of email communication: a growing intensity of communication is always followed by periods of less frequent email activity. This resembles the repelling force emerging between particles when their distance becomes less than some minimum. We noticed that intense email communication (which results in very small distances in the social graph) is never sustained for a longer period of time. On the other hand, fading communication is (in most cases) followed by frequent message exchanges.
It should be stressed that the Lennard–Jones potential was used only for the first experiments and did not accurately fit the underlying data. For the experiments presented in this paper, a potential function specific to the social network was developed on the basis of the available data. To do that, we first calculated the distance transition probability, defined as the probability that a given distance in one window will change into another given distance in the next time frame. For the 12 time windows, 11 transition probability matrices were obtained. The matrices were then averaged. The resulting final matrix is presented in Fig. 7a. The force that governs the changes of the location of the particles in the Euclidean space is proportional to the distance change. Hence the third-degree polynomial presented in Fig. 7b, inferred from the distance transition probability, describes the force used in the molecular simulation. Please note that the force will be different for different datasets.
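The construction of the averaged transition probability matrix can be sketched as follows. The binning of distances into intervals, the handling of empty rows, and all names are assumptions made for illustration; the paper does not specify the discretisation that was used. Fitting the third-degree polynomial to the result is not shown.

```python
def transition_matrix(before, after, edges):
    """Row-normalised counts of distance-bin transitions between two windows.

    `before` and `after` map node pairs to distances; `edges` are the bin
    boundaries (an assumed discretisation).
    """
    n_bins = len(edges) - 1

    def bin_index(value):
        for i in range(n_bins):
            if edges[i] <= value < edges[i + 1]:
                return i
        return None  # value outside all bins

    counts = [[0] * n_bins for _ in range(n_bins)]
    for pair, d0 in before.items():
        i, j = bin_index(d0), bin_index(after[pair])
        if i is not None and j is not None:
            counts[i][j] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observations are left as zeros (an assumption).
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_transition(snapshots, edges):
    """Average the matrices of all consecutive window pairs.

    With 12 snapshots this averages 11 matrices, as in the text.
    """
    matrices = [transition_matrix(a, b, edges)
                for a, b in zip(snapshots, snapshots[1:])]
    n, size = len(matrices), len(edges) - 1
    return [[sum(m[i][j] for m in matrices) / n for j in range(size)]
            for i in range(size)]


# Toy example: two node pairs observed in three windows, two distance bins.
snapshots = [
    {"ab": 1.5, "ac": 2.5},
    {"ab": 2.5, "ac": 2.5},
    {"ab": 1.5, "ac": 2.5},
]
mean_matrix = mean_transition(snapshots, edges=[1, 2, 3])
```

Entry (i, j) of `mean_matrix` estimates the probability that a distance in bin i in one window moves to bin j in the next. Because empty rows are kept as zeros, a row of the averaged matrix may sum to less than 1 when a bin is unobserved in some windows.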
The presented force makes it possible to simulate the changes between communication patterns in consecutive time instants. The potentials associated with the nodes reflect their ability and tendency to establish future connections with their neighbours, i.e. the nodes which are close in terms of social space. Establishing such connections changes the distances in the social space, which is analogous to the behaviour of particles moving under the influence of electrical or gravitational forces.