# Molecular model of dynamic social network based on e-mail communication


## Abstract

In this work we consider an application of a physically inspired sociodynamical model to modelling the evolution of an email-based social network. Contrary to the standard approach of sociodynamics, which expresses system dynamics through heuristically defined simple rules, we postulate inferring these rules from real data and applying them within a dynamic molecular model. We show how to embed the *n*-dimensional social space in a Euclidean one. Then, inspired by the Lennard-Jones potential, we define a data-driven social potential function and apply the resultant force to a real e-mail communication network in the course of a molecular simulation, with network nodes taking on the role of interacting particles. We discuss all steps of the modelling process, from data preparation, through embedding and the molecular simulation itself, to the transformation from the embedding space back to a graph structure. The conclusions, drawn from examining the resultant networks in stable, minimum-energy states, emphasize the role of the embedding process that projects the non-metric social graph into Euclidean space, the significance of the unavoidable loss of information connected with this procedure, and the resultant preservation of global rather than local properties of the initial network. We also argue for the applicability of our method to some classes of problems, while signalling the areas which require further research in order to expand this applicability domain.

### Keywords

Complex systems, Link prediction, Molecular modelling, Social networks

## 1 Introduction

The emergence of complex behaviour in a system composed of many interacting elements is one of the most fascinating phenomena and, recently, also a prominent area of research. There are many types of complex networked systems, which can be classified in many different ways. One approach distinguishes infrastructural systems (Internet, WWW, energy and transportation networks) from natural complex systems (biological networks, social systems and ecosystems) (Barrat et al. 2008). Another classification divides complex networks into technological, social, biological and information networks (Kolaczyk 2009). There is no commonly accepted definition of a complex networked system, but there is agreement that such a structure consists of multiple interacting components whose global behaviour cannot be simply inferred from the behaviour of the individual components (Holland 1996; Barrat et al. 2008). The elements of the network are not independent but are connected via relationships and, in consequence, influence each other. The number of nodes in these networks can range from hundreds to millions (Watts and Strogatz 1998). One of the challenges is to identify which component influences the behaviour of other components, which is directly connected with the dynamics of such structures.

The complex systems investigated in this paper are social networks, where nodes represent people (but can also be other social entities such as departments or even whole organisations) connected by different types of social relationships (e.g. friendship, co-working, family) (Garton et al. 1997; Wasserman et al. 1994). Although the general concept of a social network seems simple, the fact that the underlying structure is a network implies a set of characteristics typical of all complex systems, i.e. the sum of the interactions between the users does not allow conclusions to be drawn about the behaviour of the social system as a whole. Consequently, tracking changes in social networks is a very challenging and resource-consuming task, especially since the graphs representing the social networks that are nowadays at our disposal can have millions of edges.

Due to the scale and complexity of such systems, computer simulations have become an increasingly popular tool for investigating the dynamics of complex systems, including social networks. Simulations supplement the traditional approaches (formal theories and empirical studies) and serve as analytical models that enable certain predictions about the future behaviour of complex systems. In this research, we focus on predicting the changes in the network structure. This is especially important as the network structure affects the functions of the network (Strogatz 2001). We also face a typical trade-off between simulations that take into account the detailed, microscopic description of the system (an approach which in theory assures the most accurate predictions, often with an unacceptable computational overhead) and the minimal set of rules that allows the evolution of the system to be modelled (Schweitzer et al. 2003).

It should be emphasized that many properties of complex systems are hardly definable in terms of any analytical model. Therefore, computer simulations seem to be the only way to gain insight into global system dynamics (Schweitzer et al. 2003; Boccaletti et al. 2006). So far, physics has provided several methodological approaches to tackle this issue. We hence argue that the spatial mobility and concentration of interacting particles can be modelled by employing the molecular dynamics paradigm, leading to many interesting extensions of standard approaches, based on the reinterpretation of potentials and distance in a given space (Weidlich 1991). One such modification is described in detail in this work. Another family of approaches, successfully applied in physics, biology, evolutionary biology and the social sciences, is cellular automata (Wolfram 1986), starting from Conway's famous *Game of Life* artificial life model. One of the first researchers to apply the particle-based approach to social dynamics was Dirk Helbing, who in (Helbing 2010) proposed a fundamental dynamic model which includes many established models as special cases (e.g. the logistic equation, the gravity model, some diffusion models, evolutionary game theory and social field theory) and also implies numerous new results.

However, in this work we argue that the rapid development of social portals and social media gives us a unique opportunity to investigate social systems on the basis of real data. When social relations are inferred from the records gathered by systems providing communication and recommendation services, the relations can be quantified and directly measured. On the other hand, the standard approach of sociodynamics assumes a global (and relatively simple) definition of the social potential (which reflects the character of the “social force” driving the changes in the relations between the system components), which is used to simulate and analyse the collective behaviour of system components (Epstein 2008). This approach has proved useful for many classes of social systems and for the modelling of opinion dynamics (Malarz et al. 2011).

Taking the above into account, we propose to infer the character of the social potential from real-life social system data (using an email-based social network as an example) and to verify the possibility of using it to determine the evolution of the system. This requires embedding the *n*-dimensional social space in Euclidean space in order to apply the physically inspired methods. To the best of our knowledge, no computational models for assessing the evolutionary schemes of real-world internet-based social structures, in which edges can not only be formed but can also fade, have been developed so far. Hence, in this paper, we propose the application of molecular dynamics to modelling the evolution of an email-based social network. We focus on the equilibrium state of a network, i.e. the state after the molecular simulation has converged, and discuss various issues and challenges encountered during this research. Moreover, we argue that, when data from a real system are available, such a model should be verified by checking whether the social network can be recreated from the simulation results and by comparing it with the real network structures which have evolved in the period of time covered by the simulation.

The rest of the paper is structured as follows: in Sect. 2 related work in the fields of social network dynamics, graph embedding and dynamic molecular modelling is presented. In Sect. 3 the methodology followed in this paper is outlined, and Sect. 4 explains the experimental set-up. Section 5 is devoted to the molecular simulation and its outcomes. Section 6 presents the concept of social network recreation from the simulation results and Sect. 7 includes the analysis of the retrieved social networks. Finally, in Sect. 8 the results of the conducted research are summed up and future work is presented.

## 2 Related work

### 2.1 Dynamics of social networks

In the past few years the problem of predicting future interactions between users in social networks has become an important research challenge. Owing to the availability of datasets of online activities and communication between people, scientists try to describe both the structure and the evolution of such networks. Most approaches addressing the growth of complex networks take into consideration a limited set of global network characteristics and develop models that reproduce only these characteristics, e.g. node degree distribution (Barabasi 2003), clustering coefficient (Watts 2002) or network diameter (Bollobas 1985).

Some approaches aim at developing specific models for online social networks and take into consideration information characteristic of such networks (Kumar et al. 2006; Lescovec et al. 2008; Bringmann et al. 2010; Braha and Bar-Yam 2006; Liben-Nowell and Kleinberg 2007; Davis et al. 2012; Kashoob et al. 2012). Different models propose different methods of network growth. In (Kumar et al. 2006), on the basis of the analysis of real-world networks such as *Flickr* and *Yahoo 360!*, the users have been divided into three types: passive users, linkers and inviters. The members of the first group (passive users) join the network out of curiosity or because they were invited by a friend. These users, as their name suggests, never engage in any significant activities within the network and do not interact with other users. Inviters, on the other hand, are interested in migrating the group that they have in the real world into a virtual world; thus their actions focus on inviting their friends to participate in an online social network. Linkers actively connect themselves to other members within the online social network. Based on the analysis of the datasets, the authors define a rule-based system that governs the evolution of the social network. The network growth can be described as a sequence of steps: (1) at each time step, a node arrives, and one of the statuses (passive, linker or inviter) is randomly assigned to it; (2) during the same time step, *x* edges arrive and the source of each edge is chosen at random from the existing inviters and linkers in the network using preferential attachment. Depending on the type of the chosen source node (inviter or linker), different actions are performed. If the source is an inviter, then it invites a non-member to join the network, and so the destination is a new node. If the source is a linker, then the destination is chosen from among the existing linkers and inviters, again using preferential attachment (Kumar et al. 2006). This model represents the growth of a network, i.e. it takes into account the addition of new nodes and edges. However, the problem of link prediction covers not only the creation of new links but also the fading of existing relations.
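The growth rules summarised above can be sketched in a few lines of Python. This is only an illustrative toy, not the model of Kumar et al. as published: the function name `grow_network`, the status probabilities, the `degree + 1` attachment weights and the choice to give invited newcomers the passive status are all assumptions made for the example.

```python
import random

def grow_network(steps, edges_per_step, p_passive=0.5, p_linker=0.25, seed=0):
    """Toy passive/linker/inviter growth process (illustrative assumptions only)."""
    rng = random.Random(seed)
    status = {}   # node id -> "passive" | "linker" | "inviter"
    degree = {}   # node id -> current degree
    edges = []    # list of (source, destination) pairs

    def add_node(kind):
        node = len(status)
        status[node] = kind
        degree[node] = 0
        return node

    def pick_preferentially(candidates):
        # Assumption: weight by degree + 1 so zero-degree nodes can be chosen.
        weights = [degree[c] + 1 for c in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]

    for _ in range(steps):
        # (1) A node arrives and receives a randomly assigned status.
        u = rng.random()
        if u < p_passive:
            add_node("passive")
        elif u < p_passive + p_linker:
            add_node("linker")
        else:
            add_node("inviter")
        # (2) Edges arrive; each source is an existing linker or inviter.
        for _ in range(edges_per_step):
            active = [n for n, s in status.items() if s != "passive"]
            if not active:
                break
            src = pick_preferentially(active)
            if status[src] == "inviter":
                # The destination is a new node (assumed passive here).
                dst = add_node("passive")
            else:
                others = [n for n in active if n != src]
                if not others:
                    continue
                dst = pick_preferentially(others)
            edges.append((src, dst))
            degree[src] += 1
            degree[dst] += 1
    return status, edges
```

Note that the sketch only ever adds nodes and edges, which is exactly the limitation pointed out in the text: fading relations cannot be expressed in this kind of model.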

In (Bringmann et al. 2010) the authors have presented another approach that defines a set of rules describing how the network evolves. They focus on discovering patterns of interactions between users and their evolution over time. The authors propose to create a single graph that represents the social network, supplemented with additional information: a time-stamp added to each relation when it appears in the network for the first time. The experiments were performed on the DBLP database (Bringmann et al. 2010). Like the previously presented study, this one also assumes that users and the relations between them can only be added to the system and will never be deleted. Moreover, both approaches presented so far make it possible to investigate the creation of new edges but not to follow the dynamics of the relationship strengths between users, which is one of their limitations.

Yet another framework for network growth was developed in (Lescovec et al. 2008), where the authors studied four online social networks: *Flickr*, *Delicious*, *Answers* and *LinkedIn*. They proposed to apply the maximum-likelihood estimation principle to compare a family of parameterised models in terms of their probability of generating the observed data and, as a result, to select the model that reflects the data in the best possible way. The task in this framework was to predict which nodes a new edge will connect. For every edge arriving in the network, the likelihood that it will connect two given nodes under some model is assessed. The product of these values over all edges gives the likelihood of the model, and the model with the highest likelihood is chosen. Like the previously presented methods, this one considers neither the strength of the relation nor the fact that an edge can disappear from the network.

A survey of other link prediction methods can be found in (Liben-Nowell and Kleinberg 2007), where approaches such as common neighbours, Jaccard’s coefficient, the Adamic/Adar method, preferential attachment, the Katz method, PageRank and its variants, low-rank approximation, unseen bigrams and clustering are discussed.
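To make the neighbourhood-based scores from this survey concrete, the sketch below computes four of them for a pair of nodes. The function name `neighbour_scores` and the adjacency representation (a dictionary mapping each node to the set of its neighbours) are assumptions chosen for the example.

```python
import math

def neighbour_scores(adj, x, y):
    """Neighbourhood-based link prediction scores for distinct nodes x and y.

    adj maps each node to the set of its neighbours (undirected simple graph).
    """
    nbrs_x, nbrs_y = adj[x], adj[y]
    common = nbrs_x & nbrs_y
    union = nbrs_x | nbrs_y
    return {
        "common_neighbours": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Adamic/Adar: rarer shared neighbours count more. A common neighbour
        # of two distinct nodes has degree >= 2, so the logarithm is positive.
        "adamic_adar": sum(1.0 / math.log(len(adj[z])) for z in common),
        "preferential_attachment": len(nbrs_x) * len(nbrs_y),
    }
```

All four scores only rank candidate pairs for link creation; none of them models the weakening or disappearance of an existing link.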

A set of approaches that take into consideration the fact that links can disappear from the network has been proposed in (Hill et al. 2010; Braha and Bar-Yam 2006), where the authors detected a dramatic time dependence in network centrality and the role of nodes, something not apparent from static analysis of node connectivity and network topology. Their experiments studied large-scale email networks consisting of 57,000 users, based on data gathered over a period of 113 days. They found that although the daily networks were scale-free, the well-connected nodes in these networks changed from day to day.

A recent method that also accounts for disappearing links has been proposed and investigated in (Juszczyszyn et al. 2011a, b, c), where, based on the changes in the local structure, a first-order probabilistic model of transitions between various triad types has been derived. The model results from the observation that there exist distinctive patterns which drive the evolution of connections between nodes. Node disappearance has also been addressed in (Sarr et al. 2012), but in the somewhat different context of disruption of the information flow.

Our approach differs from those presented above in that we do not propose a model for network growth per se, but investigate the limitations of a sociodynamic model verified on data coming from a real system. Our proposition takes into account both the creation and the vanishing of relationships. Additionally, the network investigated in this work is a structure in which the strength of the relationships changes over time, which is an important factor in social networks due to the cognitive limitations of people (Hill and Dunbar 2002). In our approach, we do not assign roles to users, as this may be misleading. Rather, we assess, based on the current interactions between users, what the relation strengths and the structure of the network may look like in the future.

### 2.2 Distance preserving graph embedding

Following the in-depth discussion presented in (Watts 2002), we cannot expect the social space to be metric, i.e. the triangle inequality between any three nodes need not hold. On the other hand, as mentioned above, molecular modelling assumes interaction between particles embedded in Euclidean space. For this reason, to apply molecular modelling we must first embed the social network graph in a metric, Euclidean space. Numerous embedding methods exist; an overview is presented below.

The Big Bang embedding algorithm (BBE) presented in (Shavitt and Tankel 2004) simulates an explosion of particles, which represent network users, under a force field derived from the embedding error. Each particle is the geometric image of a vertex. The force field reduces the potential energy of the particles, which is related to the total embedding error of all particle pairs. In the Big Bang Simulation (BBS) all particles are initially placed at the same location in space. The process is iterative, and each iteration moves the particles over a discrete time interval. Every iteration begins with the calculation of the field force acting on each particle at its current position (for the first iteration the forces are chosen randomly). As mentioned above, the forces are derived from the potential energy. In the next step, the positions and velocities at the next time step are calculated. The final step of each iteration is to evaluate the new potential energy. This method allows the network to be embedded into a freely selected number of dimensions.
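The idea of letting an embedding error act as a potential can be illustrated with a deliberately simplified sketch. It is not the published Big Bang Simulation: instead of integrating positions and velocities under a force field, it performs plain gradient descent on the sum of squared pairwise distance errors, starting from particles packed near the origin. The function names, the learning rate and the iteration count are assumptions for the example.

```python
import math
import random

def embed(dist, dim=2, iters=2000, lr=0.01, seed=0):
    """Embed n objects in `dim` dimensions from a symmetric n x n distance table.

    Simplified stand-in for a force-based embedding: gradient descent on the
    squared embedding error, not the actual BBS procedure.
    """
    rng = random.Random(seed)
    n = len(dist)
    # Start with all particles packed near the origin (tiny random offsets
    # replace the random initial forces mentioned in the text).
    pos = [[rng.uniform(-1e-3, 1e-3) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                diff = [a - b for a, b in zip(pos[i], pos[j])]
                d = math.sqrt(sum(c * c for c in diff)) or 1e-12
                # Gradient of (d - target)^2 with respect to pos[i].
                coef = 2.0 * (d - dist[i][j]) / d
                for k in range(dim):
                    grad[i][k] += coef * diff[k]
                    grad[j][k] -= coef * diff[k]
        for i in range(n):
            for k in range(dim):
                pos[i][k] -= lr * grad[i][k]
    return pos

def embedding_error(pos, dist):
    """Sum of squared differences between embedded and target distances."""
    n = len(pos)
    return sum((math.dist(pos[i], pos[j]) - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))
```

For a non-metric distance table the error cannot in general be driven to zero, which is the loss of information that any such embedding has to accept.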

Another method that can be used to embed a graph in Euclidean space is Multidimensional Scaling (MDS) (Torgerson 1965). MDS denotes a suite of methods often used in information visualization and in the exploration of similarities or dissimilarities in data. There are two variations of MDS, i.e. the classical multidimensional scaling (CMDS) algorithm and standard MDS (Bronstein and Kimmel 2006; Kruskal et al. 1978). Classical metric MDS develops the metric as a symmetric bilinear form and calculates the leading *d* eigenvalues of the corresponding matrix (Torgerson 1965). An MDS algorithm starts with a matrix of similarities between objects (the similarity relation does not have to be symmetrical) and then assigns a location to each item in a low-dimensional space. It hence estimates the coordinates of a set of objects in a space of specified dimensionality on the basis of the measured distances (which, however, do not have to be metric) between pairs of objects. A variety of models can be used, with different ways of computing distances and various functions relating the distances to the actual data. Both methods allow a graph to be embedded into different numbers of dimensions. However, the problem that we faced during our experiments with MDS was that the computational overhead was very high and we were not able to obtain results within a reasonable time.

In High-Dimensional Embedding (Harel and Koren 2004), which is a fast method for creating 2D representations of large graphs, the graph is first embedded into a very high dimensional space (usually associated with the number of nodes) and then projected onto a 2D plane using Principal Components Analysis. This method is used for undirected graphs. It is not useful from the perspective of our experiments, as one of the goals of this study is to embed the graph into different numbers of dimensions and verify which number of dimensions gives the best results from the link prediction perspective.

Minimum Volume Embedding (MVE), presented in (Shaw and Jebara 2007, 2009), is an algorithm for non-linear dimensionality reduction that uses semi-definite programming (SDP) and matrix factorization to find a low-dimensional embedding that preserves local distances between points while representing the dataset in fewer dimensions. The authors of MVE emphasise that in all cases MVE, in comparison with Semi-definite Embedding and Kernel Principal Component Analysis, is able to capture more of the variance of the data in the first two eigenvectors, providing a more accurate 2-dimensional embedding (Shaw and Jebara 2007, 2009). The main features of the minimum volume embedding approach are: (1) for a given dataset MVE always returns the same set of coordinates, (2) isolated nodes are neglected in the embedding process and (3) MVE is stable, i.e. adding one node with very weak connections does not significantly influence the positions of the remaining nodes. These characteristics of MVE mean that the graph can be embedded only into 2D space, which is not enough from the perspective of the experiments proposed in this paper, as one of the goals is to find the best number of dimensions into which the graph should be embedded. Moreover, the algorithm does not work for disconnected graphs.

### 2.3 Dynamic molecular modelling and simulation

In molecular dynamics the force acting on a particle is derived from the interaction potential:

\[\mathbf{F} = -\nabla U, \tag{1}\]

where *U* is the interaction potential. The motion of each particle is governed by Newton's second law:

\[\mathbf{F} = \frac{d\mathbf{p}}{dt}, \tag{2}\]

where *p* denotes the momentum of the particle and *F* is the vector sum of forces from all other particles in the system. For objects with constant mass this formula takes the following familiar form:

\[\mathbf{F} = m\mathbf{a} = m\frac{d^2\mathbf{s}}{dt^2}, \tag{3}\]

with the acceleration *a* defined as the second derivative of the position vector *s*. The above equation must be solved for every particle in every simulation step. In order to start the algorithm, the initial positions of all particles, the formula for the force, which is identical for all particle pairs, and the interaction potential need to be specified. One of the standard potential functions used to describe many-particle problems is the Lennard–Jones potential, which is given by

\[U(r) = 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right], \tag{4}\]

where *r* denotes the distance between particles. The Lennard–Jones potential, depicted in Fig. 1, is fully defined by two parameters: the depth of the potential *ε*, responsible for the strength of interactions between particles, and *σ*, related to the minimum distance between two particles. The potential has a global minimum equal to \(-\varepsilon\) at \(r_{\min} = 2^{1/6}\sigma \approx 1.12\sigma\). An important characteristic of this potential is that the nature of the interaction between two particles depends on their distance. Namely, for distances larger than \(r_{\min}\) the particles attract each other, while for smaller distances the character of the interaction changes to strongly repulsive.

Having the analytical formula for the interaction potential one can easily obtain the formula for the force by simple differentiation, which should be performed analytically to avoid accumulation of numerical approximations. As a first approximation all particles are assumed to have unit mass. Knowing the forces, the Verlet algorithm may be used to obtain the position and the velocity of each particle in consecutive time steps (Juszczyszyn et al. 2009).
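As an illustration of this procedure, the sketch below differentiates the standard 12-6 Lennard–Jones potential analytically and advances unit-mass particles with the velocity form of the Verlet scheme. It is a minimal example under stated assumptions rather than the simulation code used in this study: the function names, the default parameters and the choice of the velocity Verlet variant are ours.

```python
import math

def lj_potential(r, eps=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones potential U(r)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def lj_force(r, eps=1.0, sigma=1.0):
    """Radial force -dU/dr, obtained analytically; positive means repulsion."""
    sr6 = (sigma / r) ** 6
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r

def total_forces(pos, eps=1.0, sigma=1.0):
    """Vector sum of pairwise forces acting on every particle."""
    n, dim = len(pos), len(pos[0])
    forces = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta = [a - b for a, b in zip(pos[i], pos[j])]  # points from j to i
            r = math.sqrt(sum(d * d for d in delta))
            f = lj_force(r, eps, sigma)
            for k in range(dim):
                forces[i][k] += f * delta[k] / r
                forces[j][k] -= f * delta[k] / r
    return forces

def verlet_step(pos, vel, forces, dt, eps=1.0, sigma=1.0):
    """One velocity Verlet step for unit-mass particles."""
    new_pos = [[x + v * dt + 0.5 * f * dt * dt for x, v, f in zip(p, vv, ff)]
               for p, vv, ff in zip(pos, vel, forces)]
    new_forces = total_forces(new_pos, eps, sigma)
    new_vel = [[v + 0.5 * (f0 + f1) * dt for v, f0, f1 in zip(vv, ff0, ff1)]
               for vv, ff0, ff1 in zip(vel, forces, new_forces)]
    return new_pos, new_vel, new_forces
```

Two particles placed at rest at a separation of \(2^{1/6}\sigma\) feel no net force in this sketch and therefore stay where they are, which is a convenient sanity check for the analytic derivative.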

The concept of molecular modelling and simulation is used in this study to model the dynamics of a social network. The users who are the nodes of the network become particles in Euclidean space and the distance between particles will be determined based on the relationship strength between the users. The changes in the distances between particles over time will be the basis for inferring the potential in a purely data-driven way and in consequence for determining the force between particles.

## 3 Methodology

The approach proposed in this paper is to reformulate the problem of time evolution in social networks and interpersonal relations into the language of multiple-particle interacting system. This is achieved by assigning the position of the node in the social network graph to the position of the particle in a metric space so that the inter-particle distance reflects the strength of the relation.

In our previous attempts the form of the interaction potential was assumed to reflect the tendency of two particles (nodes) to change their distance in social space. The experimental data were used to extract some characteristic features of the interaction, and a modified Lennard–Jones potential was used to reproduce the time evolution at the macroscale. Another way to gain insight into the character of social interaction is to examine in detail the distance between each pair of nodes in consecutive time windows. The experimental data, i.e. the positions of each particle (node) in each time window, are sufficient to obtain the dependence of the change in the distance between two particles on their distance (see Sect. 4.2, where the distance matrix is created from the adjacency matrix).

Using only this dependence it is possible to simulate the behaviour of an ensemble of interacting particles in the following way. Knowing the initial distances between all pairs of particles, the change of each two-particle distance can be read from the experimental curve. The change of the distance between two particles can easily be transformed into a displacement vector. The displacement vector starts at the initial position of the particle and ends at its final position (it is defined as the difference between the final and initial position vectors). Its direction coincides with the line connecting the two particles under consideration: it points towards their centre of mass when the distance between the two particles is decreasing, and in the opposite direction when the distance is increasing. Because both interacting particles have equal masses, each of them changes its position by half of the calculated distance change between them. This procedure defines the displacement vector for a considered particle with respect to one other particle of the ensemble. The operation should be repeated for all other particles to obtain all displacement vectors for the given particle. Since the displacement vectors calculated in this way represent the forces exerted by other particles, their superposition determines the direction in which the particle under consideration should be moved. We ignore the magnitude of the total force, as moving any particle by this value, which is the length of the negative gradient of the field potential, would most likely result in overshooting and lack of convergence. Instead, we optimize our system iteratively, shifting all particles by a small, fixed step at a time until the system reaches a steady state. In this way we are able to simulate the time dependence of the position of each particle knowing only the initial positions of all particles and the experimentally obtained relation between the change of the distance between two particles and their distance.
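The iterative procedure described above can be sketched as follows. The callable `delta_of_distance` is a hypothetical stand-in for the experimental curve (the expected change of a pairwise distance as a function of its current value); the function name `relax`, the step size and the stopping rule are likewise assumptions made for the illustration.

```python
import math

def relax(pos, delta_of_distance, step=0.01, max_iter=10_000, tol=1e-9):
    """Shift particles by a fixed step along the summed pairwise displacements.

    delta_of_distance(d) is a hypothetical stand-in for the experimental
    curve: the expected change of a pairwise distance currently equal to d.
    """
    pos = [list(p) for p in pos]
    n, dim = len(pos), len(pos[0])
    for _ in range(max_iter):
        total = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                diff = [b - a for a, b in zip(pos[i], pos[j])]  # from i to j
                d = math.sqrt(sum(c * c for c in diff))
                if d == 0.0:
                    continue
                # Equal masses: each particle takes half of the distance change.
                half = 0.5 * delta_of_distance(d)
                for k in range(dim):
                    unit = diff[k] / d
                    total[i][k] -= half * unit  # moves away from j if half > 0
                    total[j][k] += half * unit  # moves away from i if half > 0
        moved = False
        for i in range(n):
            norm = math.sqrt(sum(c * c for c in total[i]))
            if norm > tol:
                # Only the direction of the superposition is used.
                for k in range(dim):
                    pos[i][k] += step * total[i][k] / norm
                moved = True
        if not moved:
            break
    return pos
```

Because every move has the same length, the particles in this sketch end up oscillating within roughly one step of the steady state rather than stopping exactly at it; in practice the step size bounds the attainable precision.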

*Data preparation* As a real-world evolving network is investigated in this paper, the first step is to prepare the data so that they can be used in the further parts of the experiments. This includes extracting the interactions and the time stamps of their occurrence from the email log dataset and dividing this set into time windows of a given size. From each time frame a single social network is created. Note that in the case of email communication the underlying social network is directed and weighted. However, the adjacency matrix fed to the embedding process has to be symmetric. Thus the directed social network is transformed into an undirected one by aggregating the communication between every two nodes.

*Distance matrix creation* The next step is to create a distance matrix for each social network snapshot, in which the distances between nodes reflect the intensity of communication between them. The distance needs to be calculated for all pairs of nodes, including pairs which are not connected directly or at all. In our approach the distance between two particles reflects the length of the shortest weighted path linking the two nodes in question.

*Embedding distance matrices in Euclidean space* The goal of this step is to project the created distance matrices into Euclidean space in such a way that the distances between nodes are reflected as faithfully as possible. After a review of existing embedding methods, Big Bang Simulation and Classical Multidimensional Scaling were chosen. These two methods facilitate the embedding of non-metric spaces into an almost arbitrary number of dimensions, limited by the characteristics of the network under consideration, with moderate computational requirements.

*Definition of molecular model* In this step the potential field describing the evolution of interactions between nodes is defined. On this basis the whole molecular model of email-based social network is created. The potential function is determined from the changes in distances between two consecutive network snapshots. As the shape of the potential function depends on past data, the force governing the molecular simulation is different for each dataset.

*Molecular simulation* This part of the experiments utilizes the outputs of the previous steps: the embedded social network windows and the potential force which is used to move particles in Euclidean space. The simulation terminates when the set of particles reaches a stable state.

*Recreation of network snapshots and analysis of the results* The study aims at assessing the characteristics of a network in a stable state, which is the outcome of the molecular simulation. In order to do that, the reverse of the embedding process has to be performed. For each time window the results of the simulation, i.e. the set of node positions in Euclidean space, are taken and the network graph is created based on the distances between nodes (particles) after the simulation. This process is straightforward: if the distance between two particles is lower than a given threshold value, a link between these particles is created in the social network. The experiments were performed for different values of the distance threshold. Finally, the properties of the retrieved networks are investigated. Two main properties were taken into account: node degree distribution and clustering coefficient.
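The recreation step and the two network properties examined afterwards can be sketched as follows. The function names and the adjacency representation (node index mapped to a set of neighbour indices) are assumptions chosen for the illustration.

```python
import math
from itertools import combinations

def recreate_graph(pos, threshold):
    """Link two nodes whenever their Euclidean distance is below the threshold."""
    adj = {i: set() for i in range(len(pos))}
    for i, j in combinations(range(len(pos)), 2):
        if math.dist(pos[i], pos[j]) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def degree_distribution(adj):
    """Map each degree value to the number of nodes having that degree."""
    counts = {}
    for nbrs in adj.values():
        counts[len(nbrs)] = counts.get(len(nbrs), 0) + 1
    return counts

def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))
```

Repeating `recreate_graph` for a range of thresholds shows how strongly the recovered degree distribution and clustering depend on this single parameter.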

## 4 Experiment setup

### 4.1 Data preparation: creating an email-based social network

The intensity of the relationship \(I(x, y)\) between email users *x* and *y* has been defined as

\[I(x, y) = \sum_{i \in E(x, y)} \frac{1}{n_i(x, y)}, \tag{5}\]

where \(E(x, y)\) is the set of all email messages sent between *x* and *y* and \(n_i(x, y)\) denotes the number of all recipients of the *i*th email sent between *x* and *y* (Kazienko et al. 2009).

In consequence, every email with more than one recipient is treated as 1/*n* of a regular one (*n* is the number of its recipients). Although ‘to-list’ recipients are likely to be of much greater message-network importance than the ‘cc-list’ recipients, both groups are treated in the same way, i.e. the total number of the recipients of an email is always taken into account. Such approach results from the fact that the obtained data do not include information if the recipient of the email is on the ‘to-list’ or ‘cc-list’.
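The 1/*n* weighting described above can be illustrated with a short sketch. The input format (an iterable of sender and recipient-list pairs) and the function name `relationship_intensity` are assumptions for the example; messages in both directions between two users are aggregated into one undirected value.

```python
from collections import defaultdict

def relationship_intensity(emails):
    """Aggregate 1/n contributions per unordered pair of users.

    emails: iterable of (sender, recipients) pairs, recipients being a list.
    Returns a dict mapping frozenset({x, y}) to the accumulated intensity.
    """
    intensity = defaultdict(float)
    for sender, recipients in emails:
        n = len(recipients)  # to-list and cc-list are not distinguished
        for r in recipients:
            if r != sender:
                intensity[frozenset((sender, r))] += 1.0 / n
    return dict(intensity)
```

For example, one direct message from *a* to *b*, one message from *b* to two recipients including *a*, and one message from *a* to four recipients including *b* contribute 1 + 1/2 + 1/4 to the pair (*a*, *b*).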

The resulting social network \(SN = \langle N, I\rangle\) is defined as a tuple consisting of a set of network nodes *N* and a set of relationships that are described by their mean intensity \(I : N\times N \rightarrow \mathbb{R}^+ \cup \{0\}\), given by Eq. 5. Note that the resulting structure is a non-directed graph with intensity *I* as a label assigned to the relationships.

It should be emphasized that the social network derived from the email logs does not have a static structure. The existence of any link (i.e. relationship) in such a graph is the result of a series of discrete events (email messages) which occur at certain time instants and usually with changing frequency. The computed relationship intensity can also be interpreted in terms of the social distance between network members (nodes): a greater *I* reflects a smaller distance in the social space. In order to track changes in relationship strength we have used a sliding window approach.

For the experiments, data from a period of 84 days were selected and divided into frames covering 7 days each. This made it possible to create 12 social network graphs \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1), \ldots {\hbox{SN}}(t_n)\), where \(t_0, t_1,\ldots, t_n\) are discrete instants of time. Each network is created according to the procedure defined above on the basis of a 7-day period starting at \(t_0, t_1, \ldots, t_n\), respectively. The networks \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1), \ldots {\hbox{SN}}(t_n)\) are temporal images of the evolving social structure built on the basis of email communication. In addition, only users who were active in all time windows were taken into account, as they constitute the core of the network.
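A minimal sketch of this windowing step is given below. The event format (timestamp, sender, recipient list), the function names, and the reading of "active" as appearing as either sender or recipient are assumptions made for the example, not details confirmed by the dataset description.

```python
from datetime import datetime, timedelta

def split_into_windows(events, start, n_windows=12, days=7):
    """Assign events to consecutive, non-overlapping windows of `days` days.

    events: iterable of (timestamp, sender, recipients) tuples.
    Events outside the covered period are dropped.
    """
    width = timedelta(days=days)
    windows = [[] for _ in range(n_windows)]
    for ts, sender, recipients in events:
        idx = (ts - start) // width  # integer index of the window
        if 0 <= idx < n_windows:
            windows[idx].append((ts, sender, recipients))
    return windows

def core_users(windows):
    """Users who appear (as sender or recipient) in every window."""
    per_window = [
        {u for _, sender, recipients in w for u in (sender, *recipients)}
        for w in windows
    ]
    return set.intersection(*per_window) if per_window else set()
```

With the default arguments the sketch covers 12 × 7 = 84 days, matching the period used in the experiments.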

### 4.2 Distance matrix creation

The shortest path between *x* and *y* has a total weight of 0.7, but it passes through two intermediate nodes. At the same time the shortest path between *v* and *w* has a total weight of 0.3, but there are no intermediate nodes at all. In the context of an email network it means that *v* and *w* communicate directly, but not very often. On the other hand, *x* and *y* do not communicate directly with each other, but their nearest neighbours do it frequently, and *x* and *y* communicate with the neighbours frequently too. In practice it means that *x* and *y* may not even know each other, while *v* and *w* certainly do. Hence in this case the standard graph distance is misleading and for our experiments we propose an alternative definition of social distance. Denoting by \(D_{\hbox{EC}}(x \leftrightarrow y)\) the number of edges in the shortest undirected path between nodes *x* and *y* and by \(D_{\hbox{EW}}(x \leftrightarrow y)\) the sum of weights along the same path, normalized to the (0,1) range, the total distance between nodes *x* and *y* is given by the following formula:

As a result, the distance always falls into the (1,2) interval for directly connected nodes, into (2,3) if there is one intermediate node, etc. Note that in this setting the number of edges in the shortest path contributes the most, while the additional information given by the edge weights is also exploited. Equation 6 also assigns a finite distance value to all pairs of nodes not connected by any path, because the embedding algorithm we have used requires the distance to be defined for every pair of nodes.
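
The intervals quoted above are consistent with a distance of the form \(D_{\hbox{EC}} + D_{\hbox{EW}}\) for connected pairs, and the sketch below assumes that form. Several details are illustrative assumptions rather than part of Eq. 6: the path is chosen by breadth-first search (fewest edges), the weight sum is normalized by dividing by a user-supplied `weight_scale` that must exceed every path weight, and disconnected pairs receive an arbitrary finite constant `unreachable`.

```python
from collections import deque


def shortest_hop_path(adj, src, dst):
    """Breadth-first search; returns a fewest-edges path as a node list, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def social_distance(adj, x, y, weight_scale, unreachable):
    """Edge count plus normalized weight sum along a fewest-edges path.

    adj: dict of dicts, adj[u][v] = edge weight (undirected, stored both ways).
    weight_scale and unreachable are hypothetical parameters (see lead-in).
    """
    path = shortest_hop_path(adj, x, y)
    if path is None:
        return unreachable
    d_ec = len(path) - 1                                  # number of edges
    weight_sum = sum(adj[a][b] for a, b in zip(path, path[1:]))
    d_ew = weight_sum / weight_scale                      # assumed to lie in (0, 1)
    return d_ec + d_ew


# Toy graph mirroring the example in the text: x..y via two intermediate
# nodes with total weight 0.7, v-w directly connected with weight 0.3.
adj = {
    "x": {"a": 0.2},
    "a": {"x": 0.2, "b": 0.3},
    "b": {"a": 0.3, "y": 0.2},
    "y": {"b": 0.2},
    "v": {"w": 0.3},
    "w": {"v": 0.3},
}
d_xy = social_distance(adj, "x", "y", weight_scale=1.0, unreachable=10.0)
d_vw = social_distance(adj, "v", "w", weight_scale=1.0, unreachable=10.0)
```

With these toy values the directly connected pair *v*, *w* ends up closer than *x*, *y*, even though its raw path weight would be compared the other way round under a purely weight-based graph distance.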

### 4.3 Embedding networks in the Euclidean space

With the distance matrices in place the graphs \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1),\ldots,{\hbox{SN}}(t_n)\) can be embedded in the Euclidean space (two or more dimensional), where each node is represented by a point with given coordinates. The resulting sets of points \({\hbox{SN}}_0, {\hbox{SN}}_1,\ldots,{\hbox{SN}}_n\) represent the temporal network images.

Figure 3 shows the distance distortion¹ of the embedding as a function of dimensionality for the WrUT email network, for the (a) BBS and (b) CMDS methods. As expected, in both cases the accuracy of the embedding grows with dimensionality. Note that the scales on the vertical axes in Fig. 3 are different. The accuracy grows much faster with increasing dimensionality for BBS than for CMDS. Intuitively, the number of dimensions should be chosen to be as high as possible. There is a limit, however, which results from the so-called ‘curse of dimensionality’ (Bishop 1995), and especially from the ‘distance concentration’ phenomenon, which, as demonstrated in (Budka and Gabrys 2011), is particularly relevant in the context of dynamic molecular simulation of potential fields in the Euclidean space.

The mean of the \(L_2\)-norm distribution grows with data dimensionality while the variance remains approximately constant (Fig. 4) (Francois et al. 2005). As a result, the nearest and farthest neighbours of any molecule appear to be at approximately the same distance, so the ratio of the distances to the nearest and farthest neighbour tends to 1. As argued in (Beyer et al. 1999), this can occur even for sets with as few as ten dimensions, and the decrease in the ratio between the farthest and nearest neighbour distance is steepest in the first 20 dimensions. The effect is additionally magnified by the limited precision of computer arithmetic and often leads to the molecular simulation failing to converge (Budka and Gabrys 2011). Hence in practice the embedding dimensionality needs to be a compromise between the distance distortion and the negative effects of high dimensionality. For this reason we have decided to embed each graph into \(2,3,\ldots,20\) dimensions to investigate the mapping between graph distances and distances in the embedded graph.
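
The concentration effect is easy to reproduce on synthetic data. The sketch below is an illustration only: it draws points uniformly from the unit hypercube (an assumption, unrelated to the email data) and reports the ratio of the nearest to the farthest Euclidean distance from a random query point.

```python
import math
import random


def nearest_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)


# The ratio moves towards 1 as the dimensionality grows.
ratios = {dim: nearest_farthest_ratio(dim) for dim in (2, 20, 200)}
```

A ratio close to 1 means that the nearest and the farthest neighbour are almost equally distant, which is the situation in which distance-based forces lose their discriminative power.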

The embedding algorithm has to ensure that the Euclidean distances between points (nodes) fit, as closely as possible, the distances in the social space (relation strengths in the original graphs). As a result one obtains a representation of the social system in which the network is seen as an assembly of *N* particles representing the nodes of the social network.
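
The quality of the fit can be measured with the distance distortion defined in footnote 1. The sketch below is a minimal illustration under stated assumptions: all pairwise distances are strictly positive, and the per-pair values are summarized by their mean and maximum (the aggregation is an assumption, since the text defines the distortion per pair only). The function names are hypothetical.

```python
import math


def pair_distortion(original, embedded):
    """Maximum of the ratio original/embedded and its inverse (footnote 1)."""
    return max(original / embedded, embedded / original)


def embedding_distortion(dist_matrix, coords):
    """Mean and maximum pairwise distortion of an embedding.

    dist_matrix[i][j]: original (social) distance between nodes i and j.
    coords[i]: Euclidean coordinates assigned to node i.
    """
    values = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            embedded = math.dist(coords[i], coords[j])
            values.append(pair_distortion(dist_matrix[i][j], embedded))
    return sum(values) / len(values), max(values)


# Three mutually equidistant nodes cannot be placed on a line without error.
distances = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
line = [[0.0], [1.0], [2.0]]
mean_distortion, max_distortion = embedding_distortion(distances, line)
```

A distortion of 1 corresponds to a perfectly preserved distance; in the toy case the pair placed at the two ends of the line is stretched by a factor of 2.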

After reviewing several embedding methods, we decided to perform two sets of experiments, using (1) the Big-Bang Simulation (BBS) and (2) CMDS, as both methods make it possible to embed a graph into an arbitrary number of dimensions. Additionally, BBS models the network nodes as a set of particles, which is consistent with the next part of the experiments, where the molecular modelling approach is used to determine the dynamics of a social network.

Embedding was performed on the 12 previously extracted social networks. Each of the networks was embedded into \(2,3,\ldots,20\) dimensions using BBS. CMDS inherently selects the best number of dimensions (in excess of 400 in our case), so the dimensionality was not set as a parameter in this case; instead, only the first \(2,3,\ldots,20\) dimensions produced by CMDS were used in our simulations.

For the reasons discussed above, the actual molecular simulation has been performed in 12-dimensional space. However, for visualisation purposes and to present the general idea, the figures show, where appropriate, the two-dimensional embedding and molecular simulation.

After selecting the number of dimensions, the next stage of the experiments was to embed the created social network snapshots into the Euclidean space. As discussed in Sect. 3, the embedded graphs serve as an input to the molecular simulation process.

### 4.4 Setting up the dynamic molecular model

Because the sets of network nodes in \({\hbox{SN}}(t_0), {\hbox{SN}}(t_1),\ldots,{\hbox{SN}}(t_n)\) are equal, each point (node) is represented in every one of the sets \({\hbox{SN}}_0, {\hbox{SN}}_1,\ldots,{\hbox{SN}}_n\) and is active in each of the windows. We may think of these points as particles moving as a result of the interactions (email communication) between them. At this point we use the formalism of molecular dynamics to associate a potential *U* with every particle (network node). The actual characteristic of this potential depends on the behaviour of the particles changing their positions at time instants \(t_0, t_1,\ldots, t_n.\)

The first experiments were performed using the standard Lennard–Jones potential function (Juszczyszyn et al. 2009; Musial et al. 2010). The analysis of server logs revealed some features of the dynamics of email communication: growing intensity of communication is always followed by periods of less frequent email activity. This resembles the repelling force emerging between particles when their distance falls below some minimum. We noticed that intense email communication (which results in very small distances in the social graph) is never sustained for a longer period of time. On the other hand, fading communication is (in most cases) followed by frequent message exchanges.
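
For reference, the standard Lennard–Jones potential is \(U(r) = 4\varepsilon\left[(\sigma/r)^{12} - (\sigma/r)^{6}\right]\), and the corresponding radial force is \(F(r) = -dU/dr = \frac{24\varepsilon}{r}\left[2(\sigma/r)^{12} - (\sigma/r)^{6}\right]\). The sketch below implements this textbook form only; it is not the data-driven social potential, and the parameter values \(\varepsilon = \sigma = 1\) are placeholders.

```python
def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Standard Lennard-Jones potential U(r)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Radial force -dU/dr: positive values repel, negative values attract."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


# Equilibrium separation: the force vanishes and U reaches its minimum -epsilon.
r_min = 2.0 ** (1.0 / 6.0)
```

Below `r_min` the force is repulsive and above it the force is attractive, which is the qualitative pattern described above: very intense communication (small distance) is pushed apart, while fading communication (large distance) is pulled back together.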

The presented force makes it possible to simulate the changes in communication patterns between consecutive time instants. The potentials associated with the nodes reflect their ability and tendency to establish future connections with their neighbours, i.e. the nodes which are close in the social space. These connections in turn change the distances in the social space, which is analogous to the behaviour of particles moving under the influence of electrical or gravitational forces.

## 5 Molecular simulation

Figures 9 and 10 present the results of the molecular simulation in 2D space for windows 2 and 4, respectively. For each of these windows the BBS and CMDS embeddings as well as the result of the simulation are presented. Although the embedded graphs look different for CMDS and BBS, the final outcomes of the simulations are similar. These two windows were chosen to illustrate how embedded graphs of different shapes behave during the simulation process.
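
The idea of running a simulation until a stable, minimum-energy state is reached can be illustrated with a very small relaxation loop. This is a sketch under strong assumptions and not the simulator used in the experiments: it uses the standard Lennard–Jones force with \(\varepsilon = \sigma = 1\) instead of the data-driven social force, an overdamped update (positions move along the net force with a fixed step), and hypothetical values for `step`, `tol` and `max_iter`.

```python
import math


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Radial Lennard-Jones force: positive values repel, negative attract."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def relax(positions, step=0.005, tol=1e-8, max_iter=100_000):
    """Move particles along the net pairwise force until the forces vanish."""
    pos = [list(p) for p in positions]
    dim = len(pos[0])
    for _ in range(max_iter):
        forces = [[0.0] * dim for _ in pos]
        for i in range(len(pos)):
            for j in range(i + 1, len(pos)):
                r = math.dist(pos[i], pos[j])
                f = lj_force(r)
                for d in range(dim):
                    unit = (pos[i][d] - pos[j][d]) / r   # points from j to i
                    forces[i][d] += f * unit
                    forces[j][d] -= f * unit
        if max(abs(c) for f in forces for c in f) < tol:
            break                                        # stable state reached
        for p, f in zip(pos, forces):
            for d in range(dim):
                p[d] += step * f[d]
    return pos


# Two particles that start too far apart settle at the equilibrium distance.
final = relax([[0.0, 0.0], [1.5, 0.0]])
separation = math.dist(final[0], final[1])
```

For two particles the stable state is simply the separation at which the force vanishes, \(2^{1/6}\sigma\); with many particles the same loop settles into a configuration in which all net forces are below the tolerance.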

## 6 Reconstruction of social networks

The result of the molecular simulation is a set of particle collections in their stable states. In order to investigate the characteristics of the obtained structures, a reverse-embedding process needs to be performed. During this phase a non-metric social graph is created from the particles embedded in the Euclidean space, whose positions were determined during the molecular simulation. The graph is recreated using the pairwise Euclidean distances between the particles.
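
One simple way to turn pairwise distances back into a graph is to connect every pair of particles that lie within a distance threshold and to select the threshold that yields a desired average node degree. The sketch below assumes exactly this rule; the rule itself, the candidate thresholds and all function names are illustrative assumptions.

```python
import math


def threshold_graph(coords, threshold):
    """Connect every pair of points whose Euclidean distance is <= threshold."""
    n = len(coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def average_degree(adj):
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)


def pick_threshold(coords, candidates, target_degree):
    """Candidate threshold whose graph has average degree closest to the target."""
    return min(
        candidates,
        key=lambda t: abs(average_degree(threshold_graph(coords, t)) - target_degree),
    )


# Four points on a line; a threshold of 1.0 links only adjacent points.
points = [[0.0], [1.0], [2.0], [3.0]]
best = pick_threshold(points, [0.5, 1.0, 2.0, 3.0], target_degree=1.5)
graph = threshold_graph(points, best)
```

Larger thresholds produce denser graphs, so the average degree grows monotonically with the threshold and the search over candidates is well defined.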

## 7 Analysis of the experimental results

Table 1 Distance thresholds for each time window at which the average node degree is closest to 150

Window no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
BBS threshold no. | 51 | 45 | 47 | 52 | 49 | 51 | 52 | 51 | 53 | 54 | 52 |
CMDS threshold no. | 17 | 51 | 50 | 56 | 53 | 55 | 56 | 53 | 55 | 16 | 54 |

Suppose that a vertex *v* has neighbours \(\mathcal{N}(v),\) with \(|\mathcal{N}(v)|=k_v.\) At most \(k_v(k_v-1)/2\) edges can exist between them (this occurs when the neighbours of *v* form a \(k_v\)-clique). The clustering coefficient of a vertex, \(CC_v\), is defined as the fraction of these edges that actually exist. The clustering coefficient of the graph is defined as the average clustering coefficient of all the vertices in the graph. The distributions of the clustering coefficient for the selected networks are presented in Fig. 16 (BBS) and Fig. 17 (CMDS). Similarly to the node degree distributions, this one also follows a Poisson distribution, i.e. most of the users have a similar clustering coefficient (about 0.35 for both BBS and CMDS, see Table 2; the exceptions are CMDS windows 1 and 10, where the average reaches 0.70). Moreover, the standard deviation of the clustering coefficient is low: 0.01 for BBS in all windows, while for CMDS it mostly varies between 0.01 and 0.08 and reaches its maximum of 0.16 in windows 1 and 10. A clustering coefficient at this level is characteristic of real-world social networks. Comparing these results with random and ordered networks of the same size (Table 4) shows that the recreated networks share features of both types: their clustering coefficient is larger than that of the random network (0.18) and smaller than that of the ordered network (0.74). We can conclude that the analysed networks follow the small-world network model in terms of clustering coefficient.
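
The definition above translates directly into code. The following sketch computes \(CC_v\) and the graph average for an undirected graph stored as a dictionary of neighbour sets; assigning a coefficient of 0 to vertices with fewer than two neighbours is a common convention and an assumption here.

```python
def local_clustering(adj, v):
    """Fraction of possible edges among the neighbours of v that exist."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: undefined case treated as 0
    links = sum(
        1 for a in neighbours for b in neighbours if a < b and b in adj[a]
    )
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Triangle 0-1-2 with an extra pendant vertex 3 attached to vertex 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
cc_values = {v: local_clustering(toy, v) for v in toy}
acc = average_clustering(toy)
```

In the toy graph, vertex 0 has three neighbours but only one of the three possible edges among them exists, so its coefficient is 1/3, while vertices 1 and 2 sit in a closed triangle and score 1.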

Table 2 Average clustering coefficient (ACC) and its standard deviation for the reconstructed networks

Window no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
ACC (BBS) | 0.35 | 0.36 | 0.35 | 0.35 | 0.35 | 0.35 | 0.35 | 0.36 | 0.38 | 0.35 | 0.35 |
Standard deviation (BBS) | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
ACC (CMDS) | 0.70 | 0.35 | 0.34 | 0.35 | 0.34 | 0.34 | 0.35 | 0.35 | 0.35 | 0.70 | 0.35 |
Standard deviation (CMDS) | 0.16 | 0.08 | 0.07 | 0.02 | 0.07 | 0.07 | 0.07 | 0.01 | 0.01 | 0.16 | 0.06 |

Table 3 Average path length (APL) and its standard deviation for the reconstructed networks

Window no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
APL (BBS) | 1.82 | 1.82 | 1.84 | 1.83 | 1.83 | 1.84 | 1.83 | 1.82 | 1.82 | 1.84 | 1.83 |
Standard deviation (BBS) | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 |
APL (CMDS) | 2.06 | 1.81 | 1.82 | 1.82 | 1.82 | 1.81 | 1.87 | 1.83 | 1.82 | 2.04 | 1.87 |
Standard deviation (CMDS) | 0.71 | 0.42 | 0.41 | 0.41 | 0.41 | 0.41 | 0.48 | 0.41 | 0.41 | 0.73 | 0.48 |

Table 4 Characteristics of network models; *N* = 825 is the number of nodes and *k* = 150 is the average node degree

Feature | Random network | Small-world network | Ordered network |
---|---|---|---|
Clustering coefficient | Low: \(\frac{k}{N}=0.18\) | Between order and randomness; BBS: [0.35; 0.38], CMDS: [0.34; 0.70] | High: \(\frac{3(k-2)}{4(k-1)}=0.74\) |
Average path length | Short: \(\frac{\ln N}{\ln k}=1.34\) | Between order and randomness; BBS: [1.82; 1.84], CMDS: [1.81; 2.06] | Long: \(\frac{N}{2k}=2.75\) |
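
The reference values for the random and ordered networks follow directly from the formulas in the table and can be checked with a few lines of arithmetic. The variable names below are chosen for illustration.

```python
import math

N = 825  # number of nodes
k = 150  # average node degree

random_cc = k / N                            # clustering, random network
ordered_cc = 3 * (k - 2) / (4 * (k - 1))     # clustering, ordered network
random_apl = math.log(N) / math.log(k)       # path length, random network
ordered_apl = N / (2 * k)                    # path length, ordered network

for name, value in [
    ("random CC", random_cc),
    ("ordered CC", ordered_cc),
    ("random APL", random_apl),
    ("ordered APL", ordered_apl),
]:
    print(f"{name}: {value:.2f}")
```

The measured ranges for the reconstructed networks (clustering 0.34 to 0.70, path length 1.81 to 2.06) lie between these two pairs of reference values.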

The analyses revealed that the networks recreated after the molecular simulation are small-world networks: they follow a Poisson node degree distribution and have a large clustering coefficient and a small average path length. The molecular simulation terminates when the system reaches a stable state. We showed that the networks reconstructed after the simulations feature the three characteristics of social networks enumerated above. The fact that the final networks are small-world ones and exhibit typical characteristics of real-world social networks indicates that molecular simulation may be a new way of generating this type of network and may be effectively applied in sociodynamical analysis.

## 8 Conclusions

We have proposed to model the dynamics of a complex social system using molecular simulation, where the interactions between the individuals are determined from the data in the form of a social force, which corresponds to the particle interaction force used in the simulation. In our case the social relation was defined on the basis of communication events (message exchange) recorded in a computer system (email server). This allowed us to define a social distance as inversely proportional to the number of messages exchanged between users and to estimate the character of the social force determining the changes of social distance. It was also shown that the global dynamics of such a system may be modelled by treating the users as interacting particles embedded in a Euclidean social space. The movement of the particles is governed by the social force, and their trajectories depend on their initial positions, which are derived from the email server logs used to create the social network.

To the best of our knowledge this is the first attempt to apply a molecular modelling approach to the problem of social network dynamics. It has hence required careful verification, especially with respect to the representation of the network evolutionary processes and the chosen structural properties commonly used in network analysis. The experiments have shown that the proposed approach makes it possible to reason about the structural properties of an evolving social network, while benefitting from the algorithmic simplicity of molecular modelling.

The embedding procedure projecting the non-metric social graph into the Euclidean space should be chosen with care, taking into account the inherent trade-off between preserving the distances from social graph with the required accuracy and limiting the dimensionality of the Euclidean space. This has proven to be especially difficult for network hubs, regardless of the embedding method used.

The character of social force leading to changes in social distances can be generalized; however, this process is inherently connected with the loss of information in the case of individuals who behave statistically differently from the mean pattern (typical behaviour) derived from the whole network data.

The molecular model of social dynamics makes it possible to reconstruct the social network from the positions of the users (moving particles) in a Euclidean social space. While the reconstructed network preserves some of the global characteristics, local properties at the level of individual nodes usually cannot be recovered.

The reconstructed social network follows the small-world network model, with a large clustering coefficient and a small average path length.

## Footnotes

1. The distance distortion is defined for each pair of nodes as the maximum of the ratio between the original and the Euclidean distance and its inverse.

## Notes

### Acknowledgments

The research presented in this paper has been partially supported by the European Union within the European Regional Development Fund Program No. POIG.01.03.01-00-008/08. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 251617.

### References

- Aggarwal C, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Bussche J, Vianu V (eds) Database Theory – ICDT 2001, Lecture Notes in Computer Science, vol 1973, Springer, Berlin, pp 420–435
- Barabasi AL (2003) Linked: how everything is connected to everything else and what it means. Plume, New York
- Barrat A, Barthelemy M, Vespignani A (2008) Dynamical processes on complex networks. Cambridge University Press, Cambridge
- Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful. In: Beeri C, Buneman P (eds) Databases Theory – ICDT 1999, Lecture Notes in Computer Science, vol 1540, Springer, Berlin, pp 217–235
- Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, New York
- Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU (2006) Complex networks: structure and dynamics. Phys Rep 424(4–5):175–308
- Bollobas B (1995) Random graphs. Academic, London
- Braha D, Bar-Yam Y (2006) From centrality to temporary fame: dynamic centrality in complex networks. Complexity 12:59–63
- Bringmann B, Berlingero M, Bonch F, Gionis A (2010) Learning and predicting the evolution of social networks. IEEE Intell Syst 25(4):26–35
- Bronstein M, Kimmel R (2006) Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc Natl Acad Sci 103(5):1168–1172
- Budka M, Gabrys B (2011) Electrostatic field framework for supervised and semi-supervised learning from incomplete data. Natural Comput 10:921–945. doi:10.1007/s11047-010-9182-4
- Davis D, Lichtenwalter R, Chawla N (2012) Supervised methods for multi-relational link prediction. Social Netw Anal Min 1–15. doi:10.1007/s13278-012-0068-6
- Epstein J (2008) Why model? J Artif Soc Soc Simul 11(4). http://jasss.soc.surrey.ac.uk/11/4/12.html
- Francois D, Wertz V, Verleysen M (2005) Non–Euclidean metrics for similarity search in noisy datasets. In: Proceedings of the European symposium on artificial neural networks, d–side publications, pp 339–334
- Garton L, Haythornthwaite C, Wellman B (1997) Studying online social networks. J Comput Mediat Commun 3(1). http://jcmc.indiana.edu/vol3/issue1/garton.html
- Harel D, Koren Y (2004) Graph drawing by high-dimensional embedding. J Graph Algorithms Appl 8(2):195–214
- Helbing D (2010) Quantitative sociodynamics: stochastic methods and models of social interaction processes. Springer, Berlin
- Hill RA, Dunbar RIM (2002) Social network size in humans. Human Nat 14(1):53–72
- Hill S, Braha D (2010) Dynamic model of time-dependent complex networks. Phys Rev E 82 (arXiv:0901.4407v2)
- Holland J (1996) Hidden order: how adaptation builds complexity. Basic Books, New York
- Juszczyszyn K, Musial A, Musial K, Brodka P (2009) Molecular dynamics modelling of the temporal changes in complex networks. In: IEEE Congress on Evolutionary Computing, Trondheim, Norway. IEEE Computer Society Press, New York, pp 553–559
- Juszczyszyn K, Budka M, Musial K (2011a) The dynamic structural patterns of social networks based on triad transitions. In: 2011 International Conference on Advances in social networks analysis and mining (ASONAM), pp 581–586. doi:10.1109/ASONAM.2011.50. http://dl.acm.org/citation.cfm?id=2055729
- Juszczyszyn K, Musial K, Budka M (2011b) Link prediction based on subgraph evolution in dynamic social networks. In: The Third IEEE international conference on social computing (SocialCom 2011), pp 27–34. doi:10.1109/PASSAT/SocialCom.2011.15. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=06113091
- Juszczyszyn K, Musial K, Budka M (2011c) On analysis of complex network dynamics changes in local topology. In: The fifth workshop on social network mining and analysis co-located with the 17th ACM SIGKDD international conference on knowledge discovery and data mining (SNA-KDD)
- Kashoob S, Caverlee J (2012) Temporal dynamics of communities in social bookmarking systems. Social Netw Anal Min 2:387–404. doi:10.1007/s13278-012-0054-z
- Kazienko P, Musial K, Zgrzywa A (2009) Evaluation of node position based on email communication. Control Cybern 38(1):67–86
- Kolaczyk E (2009) Statistical analysis of network data. Springer, Berlin
- Kruskal JB, Wish M (1978) Multidimensional scaling, Sage University Paper series on Quantitative Application in the Social Sciences. Sage Publications, Thousand Oaks
- Kumar R, Novak J, Tomkins A (2006) Microscopic evolution of social network. In: The 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York
- Lescovec J, Backstrom L, Kumar R, Tomkins A (2008) Microscopic evolution of social networks. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
- Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Info Sci Technol 58(7):1019–1031
- Malarz K, Gronek P, Kulakowski K (2011) Zaller-deffuant model of mass opinion. J Artif Soc Soc Simul 14(1):1–20
- Musial A, Juszczyszyn K, Musial K, Brodka P (2010) Utilizing dynamic molecular modelling technique for predicting changes in complex social networks. In: IEEE/WIC/ACM Joint International Conference on Web Intelligence and Intelligent Agent Technology. IEEE Press, New York, pp 1–4
- Sarr I, Missaoui R (2012) Managing node disappearance based on information flow in social networks. Soc Netw Anal Min 1–13. doi:10.1007/s13278-012-0071-y
- Schweitzer F (2003) Brownian agents and active particles – collective dynamics in the natural and social sciences. Springer Series in Synergetics. Springer, Berlin
- Shavitt Y, Tankel T (2004) Big-bang simulation for embedding network distances in euclidean space. IEEE/ACM Trans Netw 12(6):993–1006
- Shaw B, Jebara T (2007) Minimum volume embedding. In: Proceedings of the eleventh international conference on artificial intelligence and statistics
- Shaw B, Jebara T (2009) Structure preserving embedding. In: Proceedings of the 26th international conference on machine learning
- Strogatz SH (2001) Exploring complex networks. Nature 410(6825):268–276
- Torgerson W (1965) Multidimensional scaling of similarity. Psychometrika 30(4):379–393
- Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press, New York
- Watts D (2002) Small worlds: dynamic of networks between order and randomness. Princeton University Press, Princeton
- Watts D, Strogatz S (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–444
- Weidlich W (1991) Physics and social science – the approach of synergetics. Phys Rep 1(204):1–163
- Wolfram S (1986) Theory and applications of cellular automata. World Scientific, Singapore

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.