The dynamic random subgraph model for the clustering of evolving networks
Abstract
In recent years, many clustering methods have been proposed to extract information from networks. The principle is to look for groups of vertices with homogeneous connection profiles. Most of these techniques are suitable for static networks, that is to say, they do not take the temporal dimension into account. This work is motivated by the need to analyze evolving networks for which a decomposition of the networks into subgraphs is given. Therefore, in this paper, we consider the random subgraph model (RSM) which was proposed recently to model networks through latent clusters built within known partitions. Using a state space model to characterize the cluster proportions, RSM is then extended in order to deal with dynamic networks. We call the latter the dynamic random subgraph model (dRSM). A variational expectation maximization (VEM) algorithm is proposed to perform inference. We show that the variational approximations lead to an update step which involves a new state space model from which the parameters along with the hidden states can be estimated using the standard Kalman filter and Rauch–Tung–Striebel smoother. Simulated data sets are considered to assess the proposed methodology. Finally, dRSM along with the corresponding VEM algorithm are applied to an original maritime network built from printed Lloyd’s voyage records.
Keywords
State space model, Variational inference, Variational expectation maximization, Maritime data
1 Introduction
Since the original work of Moreno (1934), network analysis has become a mature discipline which is no longer limited to sociology and is now applied in many areas such as biology (Albert and Barabási 2002; Barabási and Oltvai 2004; Palla et al. 2005), geography (Ducruet 2013) or history (Rossi et al. 2014). The growing interest in network analysis is explained partly by the strong presence of this type of data in the digital world, and by recent advances in the modeling and the processing of these data. In particular, clustering methods make it possible to uncover clusters of vertices sharing homogeneous connection profiles. Most methods look for specific structures, so-called communities, which exhibit a transitivity property such that nodes of the same community are more likely to be connected (Hofman and Wiggins 2008). A popular approach for community discovery, though asymptotically biased (Bickel and Chen 2009), is based on the modularity score given by Girvan and Newman (2002). Alternative methods usually rely on the latent position cluster model (LPCM) of Handcock et al. (2007) which assumes that the links between the vertices depend on their positions in a social latent space.
The stochastic block model (SBM; Wang and Wong 1987; Nowicki and Snijders 2001) is a flexible random graph model which can characterize communities, but is not limited to them. It is based on a probabilistic generalization of the method applied by White et al. (1976) on Sampson’s famous monastery (Fienberg and Wasserman 1981). The SBM model assumes that each vertex belongs to a latent group, and that the probability of connection between a pair of vertices depends exclusively on their groups. Because no assumption is made on the connection probabilities, various types of structures of vertices can be taken into account. While SBM was originally developed to analyze mainly binary networks, many extensions have since been proposed to deal for instance with valued edges (Mariadassou et al. 2010) or to take into account prior information (Zanghi et al. 2010; Matias and Robin 2014). In particular, the random subgraph model (RSM) of Jernite et al. (2014) aims at modeling categorical edges using prior knowledge of a partition of the network into subgraphs. These known subgraphs are assumed to be made of latent clusters which have to be inferred. The vertices are then connected with a probability depending only on the subgraphs whereas the edge type is assumed to be sampled conditionally on the latent groups. This model was applied in the original paper to analyze a historical network in Merovingian Gaul. Note that other extensions of SBM have focused on looking for overlapping clusters (Airoldi et al. 2008; Latouche et al. 2011). The inference of SBM-like models is usually done using variational expectation maximization (VEM; Daudin et al. 2008), variational Bayes EM (VBEM; Latouche et al. 2012), or Gibbs sampling (Nowicki and Snijders 2001). Moreover, we emphasize that various strategies have been derived to estimate the number of corresponding clusters using model selection criteria (Daudin et al. 2008; Latouche et al. 2012), an allocation sampler (Mc Daid et al. 2013), greedy search (Côme and Latouche 2015), or non parametric schemes (Kemp et al. 2006).
Recently, a few attempts have been made to extend the models mentioned previously in order to deal with dynamic networks. The main idea consists in introducing temporal processes in order to characterize the temporal evolution of nodes and edges through time. Thus, Yang et al. (2011) proposed a dynamic version of SBM allowing a node to switch its class at time \(t+1\) depending on its state at time t. The switching probabilities are all characterized by a transition matrix. The alternative extension for SBM of Xu and Hero III (2013) focuses on modeling the temporal changes through a state space model and relies on the Kalman filter for inference. Contrary to Yang et al. (2011), Xu and Hero III (2013) treated the edge probabilities as time varying parameters. In parallel, the mixed membership SBM (MMSBM) of Airoldi et al. (2008), capable of characterizing overlapping clusters, was adapted to deal with dynamic networks by Xing et al. (2010), Ho et al. (2011) and Kim and Leskovec (2013). Moreover, Sarkar and Moore (2005) derived a dynamic version of the LPCM model of Handcock et al. (2007) keeping the transitivity property that nodes which are close in a social latent space should be more likely to connect. Finally, we would like to highlight the work of Dubois et al. (2013) and Heaukulani and Ghahramani (2013). In Dubois et al. (2013) a non homogeneous Poisson process is considered. Thus, contrary to most clustering models for dynamic networks, a continuous time period is taken into account and events, i.e. the creation or removal of an edge, occur one at a time. While models usually focus on modeling the dynamic of networks through the evolution of their latent structures, Heaukulani and Ghahramani (2013) extended the dynamic latent feature model of Foulds et al. (2011) to define how observed social interactions can affect future unobserved latent structures. In the same vein, a dynamic model inspired by SBM was proposed recently by Xu (2015).
In this paper, we aim at modeling dynamic networks with binary or more generally typed edges, for which a partition of the nodes is given. As an example, we will consider an original network, built from printed Lloyd’s voyage records and describing maritime flows between ports where the geographical positions of the ports play an important role. The partition was obtained by associating each port to a region according to its geographical position. Figure 1 presents the evolution of network navigations, for 23 years between October 1985 and October 2008. A (given) partition of the nodes is seen here as a decomposition of the network into known subgraphs that we propose to model using unobserved clusters that have to be inferred from the data in practice. Thus, considering a slightly different version of the original RSM model of Jernite et al. (2014) and relying on a state space model as in Xing et al. (2010), we propose a new random graph model for evolving networks that we call the dynamic RSM (dRSM) model. The model focuses on describing the network dynamic by characterizing the evolution of the cluster proportions within the known subgraphs. A logistic transformation is used to link the hidden states and the cluster proportions, as in Blei and Lafferty (2007a) and Ahmed and Xing (2007). The inference of the model is done using a VEM algorithm.
2 The dynamic random subgraph model
This section presents the context of the work and introduces the dRSM model along with the modeling of its dynamic. The joint distribution associated with the model is also detailed.
2.1 Context and notations
We consider a set of T networks \(\lbrace {\mathcal {G}}^{(t)}\rbrace _{t=1}^{T}\), where \({\mathcal {G}}^{(t)}\) is a directed graph observed at time t. Each \({\mathcal {G}}^{(t)}\) is represented by its \(N \times N\) adjacency matrix \(X^{(t)}\), where N denotes the number of nodes. The edge \(X_{ij}^{(t)}\), describing the relationship between nodes i and j, is assumed to take its values in \(\lbrace 0, \ldots , C\rbrace \) such that \(X_{ij}^{(t)} = c\) means that nodes i and j are linked by a relationship of type c at time t and \(X_{ij}^{(t)} = 0\) indicates the absence of relationship between the two nodes at time t. Note that no self loops, i.e. connections of a node to itself, are considered, thus \(X_{ii}^{(t)}=0,\,\forall \, i, t\).
Moreover, a partition \({\mathcal {P}}\) of the network into S classes of vertices is assumed to be given. We emphasize that the observed partition induces a decomposition of the graph into subgraphs where each class of vertices corresponds to a specific subgraph. To describe the subgraph membership of each vertex, the variable s is introduced. The variable takes its values in \(\lbrace 1,\ldots , S\rbrace \) and is such that \(s_{i}\) indicates the subgraph of vertex i. In some cases, and in order to clarify the equations, we will also consider the indicator variables \(y_{is}\) such that \(y_{is}=1\) if node i is in subgraph s, 0 otherwise. Finally, because the vertex i can only belong to a single subgraph, we have \(\sum _{s=1}^{S}y_{is}=1\).
Our goal is to cluster at each time t the N nodes into K latent groups with homogeneous connection profiles, i.e. find an estimate of the set Z of latent variables \(Z_{ik}^{(t)}\) such that \(Z_{ik}^{(t)} = 1\) if at time t the node i belongs to the class k, and 0 otherwise. Please note that N, C, \({\mathcal {P}}\), S and K are all assumed to be constant over time.
2.2 The model at each time t
As in the original RSM model, the (known) subgraphs are assumed to be built from K unobserved clusters of vertices, with varying proportions. Thus, each subgraph s has its own mixing proportion vector \(\alpha _{s}^{(t)}=(\alpha _{s1}^{(t)},\ldots ,\alpha _{sK}^{(t)})\) where \(\alpha _{sk}^{(t)}\) is the proportion of cluster k in subgraph s at time t and \(\,\sum _{k=1}^{K}\alpha _{sk}^{(t)}=1, \, \forall \, s,t\). The network is then assumed to be generated at each time t as follows.
Figure 2 presents an example of a dRSM network, observed at time t, made of 9 nodes belonging to 2 subgraphs (denoted through the form of nodes) and split into 3 clusters (indicated by the colors).
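To make the structure of one snapshot concrete, the following Python sketch draws a small network of the kind illustrated in Figure 2. It is only an illustration under stated assumptions: it assumes that each vertex draws its cluster from the proportions \(\alpha _{s_i}^{(t)}\) of its subgraph and that each ordered pair of distinct vertices draws an edge type (0 meaning no edge) from \(\Pi _{kl}\), as suggested by the notations of Table 1. The function name `sample_snapshot` and all numerical values are hypothetical.

```python
import numpy as np

def sample_snapshot(subgraph, alpha, Pi, rng):
    """Draw cluster labels and a typed adjacency matrix for one time point.

    subgraph : (N,) ints in 0..S-1, the known subgraph of each vertex
    alpha    : (S, K) cluster proportions per subgraph, rows sum to 1
    Pi       : (C+1, K, K), Pi[c, k, l] is the probability of edge type c
               between clusters k and l (c = 0 means no edge)
    """
    N = subgraph.shape[0]
    K = alpha.shape[1]
    n_types = Pi.shape[0]
    # Assumed step 1: each vertex draws its cluster from its subgraph's proportions.
    z = np.array([rng.choice(K, p=alpha[s]) for s in subgraph])
    # Assumed step 2: each ordered pair (i, j), i != j, draws an edge type.
    X = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(N):
            if i != j:
                X[i, j] = rng.choice(n_types, p=Pi[:, z[i], z[j]])
    return z, X

rng = np.random.default_rng(0)
subgraph = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])  # 9 nodes, 2 subgraphs
alpha = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.3, 0.6]])               # 3 clusters (illustrative values)
Pi0 = np.full((3, 3), 0.9)                        # probability of no edge between clusters
np.fill_diagonal(Pi0, 0.3)                        # probability of no edge within clusters
Pi = np.stack([Pi0, 1.0 - Pi0])                   # binary edges, C = 1
z, X = sample_snapshot(subgraph, alpha, Pi, rng)
```

The double loop keeps the correspondence with the pairwise description explicit; for larger networks a vectorized draw would be preferable.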
2.3 Modeling the evolution of random subgraphs
Table 1 Summary of the notations used in the paper
Notations | Description |
---|---|
X | Adjacency matrix \(X_{ij}^{(t)} \in \{0,\ldots ,C\}\) at each t |
Z | Binary matrix. \(Z_{ik}^{(t)}=1\) indicates that i belongs to cluster k at t |
N | Number of vertices in the network |
K | Number of latent clusters |
S | Number of subgraphs |
C | Number of edge types |
\(\Pi \) | \(\Pi ^c_{kl}\) is the probability of having an edge of type c between vertices of clusters k and l |
\(\alpha \) | \(\alpha _{sk}^{(t)}=f_k(\gamma _s^{(t)})\) is the proportion of cluster k in the subgraph s at t |
Notice that the state space model for linear dynamic systems may suffer from model identifiability issues and constraints have to be introduced (see for instance Harvey 1989). In the following, we derive the inference procedure in a general context since different constraints can be considered. In practice, in all the experiments that we carried out, we fixed A, B, and \(V_0\) to be equal to the identity matrix \(I_{K-1}\) and all components of \(\mu _{0}\) to zero.
The model described here has three sets of latent variables (\(\nu =(\nu ^{(t)})_t,\gamma =(\gamma _s^{(t)})_{st},Z=(Z_{ik}^{(t)})_{ikt}\)) and is parameterized by \(\theta =(\mu _{0},A,B,\Phi ,V_{0},\Sigma ,\Pi )\). Note that all parameters in \( \theta \) depend neither on time nor subgraphs. This model is called the dynamic random subgraph model (dRSM) in the rest of the document. Figure 3 gives the graphical model for dRSM and Table 1 summarizes the notations used in the model.
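As an illustration of how cluster proportions can evolve under such a model, the sketch below simulates them in Python. It rests on assumptions that are not spelled out in the text above: a linear-Gaussian form in which the hidden state \(\nu ^{(t)}\) evolves with transition matrix A and noise covariance \(\Phi \), each \(\gamma _s^{(t)}\) is drawn around \(B\nu ^{(t)}\) with covariance \(\Sigma \), and \(f_k\) is the usual logistic transformation with the K-th cluster as reference. The function names are hypothetical.

```python
import numpy as np

def logistic_map(gamma):
    """Map a (K-1)-vector to K proportions; the K-th cluster is the reference."""
    e = np.exp(gamma)
    return np.append(e, 1.0) / (1.0 + e.sum())

def simulate_proportions(T, S, mu0, V0, A, B, Phi, Sigma, rng):
    """Simulate alpha[t, s, :] for T time points and S subgraphs."""
    d = mu0.shape[0]                                   # d = K - 1
    alpha = np.zeros((T, S, d + 1))
    nu = rng.multivariate_normal(mu0, V0)              # assumed initial state
    for t in range(T):
        if t > 0:
            nu = rng.multivariate_normal(A @ nu, Phi)  # assumed state transition
        for s in range(S):
            # Assumed emission: one gamma per subgraph around B @ nu.
            gamma = rng.multivariate_normal(B @ nu, Sigma)
            alpha[t, s] = logistic_map(gamma)
    return alpha

K, T, S = 4, 10, 2
I = np.eye(K - 1)
# A, B and V0 set to the identity and mu0 to zero, as in the experiments.
alpha = simulate_proportions(T, S, np.zeros(K - 1), I, I, I,
                             0.01 * I, 0.1 * K * I, np.random.default_rng(1))
```

Each row `alpha[t, s]` is a valid proportion vector, so it can be used directly to draw cluster memberships at time t within subgraph s.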
At this point, it is possible to see some links and differences between dRSM and dM3SBM (Ho et al. 2011), which is the closest model in the literature. On the one hand, dRSM and dM3SBM share a common way to model the latent clusters and the temporal dynamic through a state space model. On the other hand, dRSM is able to handle categorical edges, which is a useful feature when working on real-world networks, whereas dM3SBM cannot. In addition, dRSM requires the knowledge of the subgraphs whereas dM3SBM proposes to estimate them. Furthermore, dM3SBM allows the nodes to belong to different clusters. However, estimating both the subgraphs and multiple group memberships may make dM3SBM too flexible a model, causing it to fail in recovering the network structure. Indeed, providing the subgraphs to dRSM spares it from looking for obvious structures, so that it can focus on the search for hidden patterns. The comparisons presented in Sect. 4 seem to confirm this hypothesis.
2.4 Joint distribution of dRSM
3 Estimation
This section focuses on the inference of the model proposed above. A variational EM algorithm is considered and a model selection criterion is derived.
3.1 A variational framework
Following the work of Lafferty and Blei (2006) on correlated topic models, we propose a new bound of \({\mathcal {L}}(q,\theta )\) based on a variational lower bound of \(p(Z|\gamma )\), as in Jordan et al. (1999).
Proposition 3.1
Note that the variational parameters \(\xi _{s}^{(t)}\) can be optimized to obtain tight bounds (see the end of Sect. 3.2). Moreover, we emphasize that a variational parameter \(\xi _{s}^{(t)}\) is considered for each subgraph s and each time t for more flexibility and to improve the inference procedure. We point out that the quality of the variational approximation we propose cannot be tested analytically since \(\tilde{{\mathcal {L}}}(q,\theta ,\xi )\) and the Kullback-Leibler divergence in (6) are not tractable. Nevertheless, we rely on them for inference purposes. Note that similar approximation schemes have been used for instance by Bishop and Svensén (2003) and Latouche et al. (2014), in the context of model selection.
3.2 A VEM algorithm for the dRSM model
In this section, we first assume that the variational terms \(\xi \), which were introduced for approximation purposes, are given. This allows the use of a VEM algorithm (Jordan et al. 1999) to maximize the lower bound \(\tilde{{\mathcal {L}}}(q,\theta ,\xi )\) with respect to \(q(Z,\gamma ,\nu )\) and the model parameters \(\theta \). Such an optimization procedure is iterative and involves a series of successive updates. In the E step, the model parameters are fixed and the lower bound is optimized with respect to \(q(Z,\gamma ,\nu )\). Conversely, during the M step, the variational distribution is held fixed while \(\tilde{{\mathcal {L}}}(q,\theta ,\xi )\) is maximized with respect to \(\theta \). In standard VEM algorithms, a unique set of latent variables is usually considered. In our case, there are three sets \((Z, \gamma , \nu )\) of latent variables and therefore the E step itself involves iterative updates (as in Latouche et al. 2014, for instance). All distributions in \(q(Z, \gamma , \nu )\) are held fixed, except one, which is optimized. This procedure is repeated for all distributions in turn.
In the following, we give the update formulae for the E and M steps. The details of the calculations along with the derivation of the lower bound are given in Appendices 1, 2 and 3.
Proposition 3.2
Note that \(\tau _{ik}^{(t)}\) is the approximate posterior probability that node i belongs to cluster k at time t.
Proposition 3.3
Proposition 3.4
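The abstract states that the variational update involves a state space model whose hidden states can be estimated with the standard Kalman filter and the Rauch–Tung–Striebel smoother. The following Python sketch shows these two generic recursions for a linear-Gaussian model with one observation vector per time point. It is not the specific update derived in this paper: the function names, the noise covariances Q and R, and the toy data are illustrative assumptions.

```python
import numpy as np

def kalman_filter(y, A, B, Q, R, mu0, V0):
    """Forward pass for x_t = A x_{t-1} + w_t, y_t = B x_t + v_t,
    with w_t ~ N(0, Q), v_t ~ N(0, R) and x_1 ~ N(mu0, V0)."""
    T, d = y.shape[0], mu0.shape[0]
    m_pred, P_pred = np.zeros((T, d)), np.zeros((T, d, d))
    m_filt, P_filt = np.zeros((T, d)), np.zeros((T, d, d))
    for t in range(T):
        if t == 0:
            m_pred[t], P_pred[t] = mu0, V0
        else:
            m_pred[t] = A @ m_filt[t - 1]
            P_pred[t] = A @ P_filt[t - 1] @ A.T + Q
        S = B @ P_pred[t] @ B.T + R                # innovation covariance
        gain = P_pred[t] @ B.T @ np.linalg.inv(S)  # Kalman gain
        m_filt[t] = m_pred[t] + gain @ (y[t] - B @ m_pred[t])
        P_filt[t] = P_pred[t] - gain @ B @ P_pred[t]
    return m_pred, P_pred, m_filt, P_filt

def rts_smoother(A, m_pred, P_pred, m_filt, P_filt):
    """Backward pass refining the filtered estimates with future observations."""
    T = m_filt.shape[0]
    m_smooth, P_smooth = m_filt.copy(), P_filt.copy()
    for t in range(T - 2, -1, -1):
        J = P_filt[t] @ A.T @ np.linalg.inv(P_pred[t + 1])
        m_smooth[t] = m_filt[t] + J @ (m_smooth[t + 1] - m_pred[t + 1])
        P_smooth[t] = P_filt[t] + J @ (P_smooth[t + 1] - P_pred[t + 1]) @ J.T
    return m_smooth, P_smooth

# Toy data: a slow random walk observed with noise (illustrative values only).
rng = np.random.default_rng(2)
d, T = 3, 10
I = np.eye(d)
x = np.cumsum(rng.normal(scale=0.1, size=(T, d)), axis=0)
y = x + rng.normal(scale=0.5, size=(T, d))
m_pred, P_pred, m_filt, P_filt = kalman_filter(y, I, I, 0.01 * I, 0.25 * I,
                                               np.zeros(d), I)
m_smooth, P_smooth = rts_smoother(I, m_pred, P_pred, m_filt, P_filt)
```

At the last time point the smoothed and filtered estimates coincide, and at earlier time points the smoothed covariance is never larger than the filtered one, since the smoother also uses later observations.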
3.3 Optimization of \(\xi \)
3.4 Model selection: choice of the number K of latent groups
4 Numerical experiments and comparisons
This section aims at proving on synthetic data the validity of the inference algorithm presented in Sect. 3. An introductory example is first considered to highlight the main features of the proposed approach. Model selection is then considered to validate the criterion choice. Extensive comparisons with state-of-the-art methods conclude this section.
4.1 Experimental setup
In order to validate our approach, we use in this section artificial data generated according to a common experimental setup. To simplify the characterization and facilitate the reproducibility of the experiments, we designed five different scenarios. The generation setup for each scenario is summarized in Table 2. Data from scenario 0 are drawn using SBM at each time t and without an explicit temporal dependence. The data sets for all other scenarios (scenarios 1–4) are drawn according to the dRSM model. Therefore, the temporal dependence is generated through a state space model. All generated networks are made of \(N=300\) nodes, distributed into \(K=4\) latent groups and have \(T=10\) time points. Depending on the scenario, the networks have \(S=1\) or 2 subgraphs, with binary (\(C=1\)) or categorical (\(C=2\)) edges. When \(S>1\), the nodes are randomly assigned uniformly to the subgraphs. Notice that scenario 2 has a parameter \(\Pi _{kl,k \ne l}^0\) equal to 0.8 which leads to less heterogeneous latent groups.
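Table 2 Parameter values for the five types of graphs used in the experiments
Parameters | Scenario 0 | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 |
---|---|---|---|---|---|
N | 300 | 300 | 300 | 300 | 300 |
K | 4 | 4 | 4 | 4 | 4 |
T | 10 (indep.) | 10 (SSM) | 10 (SSM) | 10 (SSM) | 10 (SSM) |
S | 1 | 1 | 1 | 2 | 2 |
C | 1 | 1 | 1 | 1 | 2 |
\((\Pi _{ll}^{0})_{l=1,\ldots ,K}\) | (0.1,0.4,0.5,0.6) | (0.1,0.4,0.5,0.6) | (0.1,0.4,0.5,0.6) | (0.1,0.4,0.5,0.6) | (0.1,0.4,0.5,0.6) |
\(\Pi _{kl,k\ne l}^{0}\) | 0.99 | 0.99 | 0.8 | 0.99 | 0.99 |
\(\Pi _{kl}^{c\ne 0}\) | \((1-\Pi _{kl}^{0})/C\) | \((1-\Pi _{kl}^{0})/C\) | \((1-\Pi _{kl}^{0})/C\) | \((1-\Pi _{kl}^{0})/C\) | \((1-\Pi _{kl}^{0})/C\) |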
The model parameters used for the simulation are as follows. For the simulation of \(\gamma \), it is assumed that the matrices A, B and \(V_0\) are set to \(I_{K-1}\), and that \(\Sigma = 0.1 \times K \times I_{K-1}\) and \(\Phi = 0.01 \times I_{K-1}\). Finally, the tensor matrix \(\Pi \), which defines the connection probabilities between clusters for the C different types, is set up such that, within the clusters, the probability \(1-\Pi _{ll}^{0}\) of having an edge of any type is larger than the corresponding connection probabilities between clusters \(1-\Pi ^0_{k l,k\ne l}\) (see Table 2). Notice that such a choice of parameters induces networks made of communities. Then, in case of a connection between two nodes, the edge type is sampled uniformly, i.e. \(\Pi ^{c \ne 0}_{k l} = (1 - \Pi ^0_{k l}) / C,~\forall k,l\).
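The construction of \(\Pi \) described above can be sketched in Python as follows. The helper name `build_pi` and the array layout (edge type on the first axis, with index 0 for the absence of an edge) are illustrative choices, not part of the original setup; the numerical values are those listed in Table 2 for a scenario with categorical edges.

```python
import numpy as np

def build_pi(diag_pi0, off_pi0, C):
    """Return Pi of shape (C+1, K, K) with Pi[0] = Pi^0 and
    Pi[c] = (1 - Pi^0) / C for every edge type c = 1..C."""
    K = len(diag_pi0)
    pi0 = np.full((K, K), off_pi0, dtype=float)  # no-edge probability between clusters
    np.fill_diagonal(pi0, diag_pi0)              # no-edge probability within clusters
    Pi = np.empty((C + 1, K, K))
    Pi[0] = pi0
    Pi[1:] = (1.0 - pi0) / C                     # edge type sampled uniformly
    return Pi

Pi = build_pi([0.1, 0.4, 0.5, 0.6], 0.99, C=2)
```

By construction, the probabilities over the C + 1 possible values of an edge sum to one for every pair of clusters.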
4.2 An introductory example
We first focus on an introductory example to illustrate the global behavior of the proposed methodology. To this end, and to make the results easier to understand, we simulated a single network according to scenario 2. Recall that in this setup the number K of latent groups is fixed to 4 and that \(C=1\). Therefore, the network is binary and \(\Pi _{kl}^{1}\) indicates the occurrence probability of an edge. We ran the VEM algorithm on it for a number K of groups ranging from 3 to 6. We then selected the most appropriate number of groups using the BIC criterion.
Figure 4 shows the BIC values associated with the results provided by our VEM algorithm for the different values of K. One can observe that the criterion peaks at \(K=4\), which is the actual simulated value for K. Figure 5 presents the evolution of the bound \(\tilde{{\mathcal {L}}}\) for this specific value of K along the 10 iterations of the VEM algorithm. A clear plateau of the bound is visible on the figure, which indicates the convergence of the algorithm.
To quickly assess the estimation quality, Table 3 compares the actual (left panel) and estimated (right panel) values of the terms \(\Pi _{kl}^1\) in the tensor matrix \(\Pi \), which define the connection probabilities between the latent clusters. On this single example, the estimated values \(\Pi _{kl}^1\) turn out to be extremely close to the true ones. Similarly, Fig. 6 compares the actual (dashed red lines) and estimated (solid black lines) values of the group proportions \(\alpha \) for the simulated example. Once again, the estimation of \(\alpha \) appears to be very close to the true proportions.
4.3 Choice of K
Table 3 Actual (left) and estimated (right) values for the terms \(\Pi _{kl}^1\) of the tensor matrix \(\Pi \)
Cluster | 1 | 2 | 3 | 4 | Cluster | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|---|---|---|
Actual values | | | | | Estimated values | | | | |
1 | 0.90 | 0.01 | 0.01 | 0.01 | 1 | 0.89 | 0.01 | 0.01 | 0.01 |
2 | 0.01 | 0.60 | 0.01 | 0.01 | 2 | 0.01 | 0.59 | 0.01 | 0.01 |
3 | 0.01 | 0.01 | 0.50 | 0.01 | 3 | 0.01 | 0.01 | 0.48 | 0.01 |
4 | 0.01 | 0.01 | 0.01 | 0.40 | 4 | 0.01 | 0.01 | 0.01 | 0.39 |
To validate the combination of our VEM algorithm with the BIC criterion, the analysis was repeated for 50 different data sets, generated according to scenario 2, for a number K of latent groups ranging from 3 to 6. This allows us both to verify the consistency of the BIC criterion and to study the clustering ability of our approach. Figure 7 shows the distribution of the criterion values (left panel) as well as the associated ARI values (right panel). These results first confirm that BIC is a valid criterion for selecting the number of groups in this context. Indeed, the value \(K=4\) is the one most frequently associated with the highest value of BIC. Recall that \(K=4\) is the actual number of latent groups. One can also observe that the partition resulting from our VEM algorithm is associated, for this value of K, with an ARI value extremely close to 1, which denotes a good match with the actual partition of the data.
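For readers who wish to reproduce this kind of evaluation, the sketch below computes the agreement between two partitions, assuming that ARI refers to the adjusted Rand index in its usual pair-counting form. The function name is hypothetical and the implementation relies only on the Python standard library.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items
    (assumes at least two items)."""
    n = len(labels_a)
    # Pairs of items grouped together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items grouped together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # expected agreement under chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:               # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

same_up_to_relabeling = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index equals 1 when the two partitions coincide up to a relabeling of the groups, which is why it suits clustering results whose group labels are arbitrary.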
4.4 Comparison with the other stochastic models
Our third set of experiments now aims at comparing the performance of our approach to that of state-of-the-art methods. We are here interested in the comparison of dRSM with the following methods: SBM (Nowicki and Snijders 2001), RSM (Jernite et al. 2014) and dM3SBM (Ho et al. 2011). Once again, the evaluation of the results is done using the ARI criterion. In order to fit an SBM on a dynamic network, we ran the mixer package (Ambroise et al. 2010) for the R software at each time t and the ARI is then computed on the concatenation of all group labels. However, let us notice that SBM was not able to handle networks with categorical edges (scenario 4). For RSM, we used the Rambo package (Bouveyron et al. 2013) for R, on an aggregated version of the whole network. Contrary to SBM, RSM is only able to deal with categorical networks and, consequently, it works only in scenario 4. Finally, we used the Matlab toolbox dM3SBM, kindly provided by the authors, to fit the dM3SBM on the dynamic networks. However, dM3SBM is also not able to handle networks with categorical edges (scenario 4).
In order to consider a wide range of networks, we compare here the methods over the five simulation scenarios. Recall that Table 2 summarizes the main features of each scenario. This comparison has been conducted in two different situations: with and without the knowledge of the actual number of clusters. Table 4 presents the clustering results for the four studied methods in the case where the actual number K = 4 of groups has been provided to each method. Conversely, Table 5 presents the clustering results when the methods have to look for the value of K. Reported values are averaged ARI values (with standard deviations) on 20 networks for each scenario. The average selected number K of latent groups is also provided for Table 5.
Table 4 Clustering results for the four studied methods on networks simulated according to the five scenarios, with the actual number of groups (K = 4) provided to each method
Method | Scenario 0 | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 |
---|---|---|---|---|---|
SBM | 0.10 \(\pm \) 0.04 | 0.12 \(\pm \) 0.05 | 0.18 \(\pm \) 0.07 | 0.14 \(\pm \) 0.09 | – |
RSM | – | – | – | – | 0.01 \(\pm \) 0.01 |
dM3SBM | 0.36 \(\pm \) 0.09 | 0.30 \(\pm \) 0.16 | 0.25 \(\pm \) 0.16 | 0.32 \(\pm \) 0.20 | – |
dRSM | 1.00 \(\pm \) 0.00 | 0.98 \(\pm \) 0.04 | 0.90 \(\pm \) 0.20 | 0.97 \(\pm \) 0.07 | 0.75 \(\pm \) 0.24 |
Table 5 Clustering results for the four studied methods on networks simulated according to the five scenarios, when each method has to select the number K of groups
Method | Scenario 0 ARI | Scenario 0 K | Scenario 1 ARI | Scenario 1 K | Scenario 2 ARI | Scenario 2 K | Scenario 3 ARI | Scenario 3 K | Scenario 4 ARI | Scenario 4 K |
---|---|---|---|---|---|---|---|---|---|---|
SBM | 0.01 \(\pm \) 0.04 | 4.00 \(\pm \) 0.00 | 0.18 \(\pm \) 0.13 | 3.94 \(\pm \) 0.71 | 0.21 \(\pm \) 0.11 | 3.97 \(\pm \) 0.46 | 0.13 \(\pm \) 0.05 | 4.16 \(\pm \) 0.79 | – | – |
RSM | – | – | – | – | – | – | – | – | 0.01 \(\pm \) 0.01 | 2.00 \(\pm \) 0.00 |
dM3SBM | 0.01 \(\pm \) 0.01 | 5.55 \(\pm \) 1.39 | 0.35 \(\pm \) 0.21 | 5.95 \(\pm \) 1.15 | 0.30 \(\pm \) 0.21 | 4.35 \(\pm \) 1.63 | 0.32 \(\pm \) 0.19 | 5.15 \(\pm \) 1.17 | – | – |
dRSM | 1.00 \(\pm \) 0.00 | 4.00 \(\pm \) 0.00 | 0.87 \(\pm \) 0.17 | 4.01 \(\pm \) 0.65 | 0.89 \(\pm \) 0.21 | 4.10 \(\pm \) 0.30 | 0.85 \(\pm \) 0.22 | 4.10 \(\pm \) 0.45 | 0.68 \(\pm \) 0.30 | 4.05 \(\pm \) 0.51 |
In scenario 3, the simulated dynamic networks are now made of two subgraphs (\(S=2\)), still with binary edges (\(C=1\)). Naturally, SBM does not perform well in this situation either. The dM3SBM provides clustering results similar to the ones of previous scenarios: it globally succeeds in recovering the dynamic but fails in recognizing the clustering pattern. On the other hand, dRSM again provides accurate clustering results associated with good estimations of K, meaning that it succeeds in identifying both the dynamic and clustering patterns.
Finally, scenario 4 considers the case of dynamic networks with two subgraphs (\(S=2\)) and categorical edges (\(C=2\)). Only RSM and dRSM are able to deal with this kind of network. Similarly to SBM in previous scenarios, RSM does not succeed in recovering the dynamic and provides very unsatisfactory clustering results. Conversely, dRSM gives very good clustering results given the difficulty of the situation. It is worth noticing the sharp estimation made by dRSM of the number K of groups in this case too. This confirms the efficiency of both our inference algorithm and our model selection criterion.
We also used scenario 4 to highlight that providing the methodology with the right subgraph structure helps in clustering the vertices. Thus, with the knowledge of the actual number of clusters, we ran dRSM with the wrong subgraph structure (\(S=1\)), and we obtained an average ARI of \(0.54\pm 0.2\). This result is to be compared to the ARI performances for scenario 4, as presented in Table 4.
5 Maritime network
This section presents an application of the proposed methodology for the analysis of a network of maritime flows in which a temporal dynamic is present. The dynamic network was provided by Dr. César Ducruet, from the Géographie-Cités laboratory, who is interested in studying the evolution of maritime flows over time. The data was extracted from the well-known Lloyd’s list which has recorded almost all ship movements worldwide since 1890.
5.1 Data and study protocol
Table 6 The time points considered in the maritime network
Time point | Date |
---|---|
\(t_{1}\) | October 1890 |
\(t_{2} \dots t_{4}\) | October 1925 to October 1940, every 5 years |
\(t_{5}\) | October 1946 |
\(t_{6}\) | October 1951 |
\(t_{7} \) | October 1960 |
\(t_{8} \dots t_{16}\) | October 1965 to October 2000, every 5 years |
\(t_{17}\) | October 2008 |
Data was obtained from the printed Lloyd’s voyage record published every October between 1890 and 2008. The list details, for each merchant vessel, its successive movements from one port to another. From the raw database of vessel flows, we extracted a dynamic network with 17 time points. The first observation is October 1890 and the network ends in October 2008. Table 6 provides the correspondence between the 17 time points and the actual dates.
At each time point, the adjacency matrix between ports was constructed as follows. First, for every pair of ports, we calculated the total number of ship movements between those ports. Then, we set the associated entry in the adjacency matrix to 1 if the number of ship movements between the two ports is greater than or equal to 1, and to 0 otherwise. The original network contained 4472 ports worldwide. However, we had to reduce the network size to only 286 ports since most of the ports were not active throughout the whole period of the study.
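The thresholding step described above can be sketched in Python as follows. The function name `binarize_flows` and the toy counts are hypothetical; the sketch assumes that the counts are stored as a square matrix indexed by origin and destination port, and it sets the diagonal to zero since self loops are not considered in the model.

```python
import numpy as np

def binarize_flows(counts):
    """Turn a matrix of ship-movement counts into a binary adjacency matrix."""
    counts = np.asarray(counts)
    X = (counts >= 1).astype(int)  # 1 if at least one movement was recorded
    np.fill_diagonal(X, 0)         # no self loops
    return X

# Toy example with three ports (illustrative counts only).
counts = [[0, 3, 0],
          [1, 0, 0],
          [0, 0, 5]]
X = binarize_flows(counts)
```

Repeating this for the count matrix of each time point yields the sequence of adjacency matrices \(X^{(t)}\) used by the model.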
We finally applied dRSM to a maritime network which describes the navigation of ships among 286 ports in the world at 17 time points. Let us highlight that the study period includes many major historical or economical events (the two world wars, the oil crisis, the economic crisis in Europe, \(\ldots \)), which could directly affect the navigation movements at a global scale and could also change the port behaviors.
The partition of the network into subgraphs is here provided by the port memberships to the four main maritime basins: Asia–Pacific, Europe–Atlantic, Mediterranean–Black Seas, and Middle East–Indian Ocean. Figure 8 presents this partition of the ports where the colors indicate the different subgraphs.
5.2 Results
Regarding cluster 7, one can see in Fig. 12 that its proportions in the subgraphs are higher than those of cluster 6. The ports of cluster 7 can be qualified as second-class hubs which are subordinated to the main hubs of cluster 6. Most of them are marked by a colonial logic, such as Marseille, Kolkata or Cape Town. The evolution of this cluster until the recent period shows a persisting North–South link (e.g. Le Havre–Casablanca) or East–West link (e.g. Spain–Brazil–Canaries).
Cluster 5 is mainly made of ports from the Asia–Pacific and Middle East–India basins except during major crises, such as World War II and the oil crisis. During those crises, the cluster mainly contains European ports. The rapid modification of this cluster appears clearly in Fig. 12 around 1946, 1980 and 2008. This cluster can be interpreted as made of active ports from the developing world which move to cluster 2 during the crises. This may highlight the disintegration of long distance links during such crises. Conversely, cluster 2 turns out to be mostly made, except during crises, of European ports of average size, mainly on the Atlantic coast. Those ports are rather a reflection of a past glory and most of them have declined over the century. This may be due to a failed industrialization or a significant distance from the major trade routes.
Finally, clusters 3 and 4 are made of very small ports with low activity. Those ports are usually not connected together and communicate with the rest of the network only through ports of clusters 2 and 5. The connection with clusters 2 and 5 explains the abrupt changes in the proportions of clusters 3 and 4 that one can also observe.
6 Conclusion
This work has considered the problem of analyzing dynamic networks with categorical edges and for which a subgraph partition is known. This kind of network is frequent in a wide range of scientific fields, geography in particular. For this purpose, we proposed an extension of the RSM model to the dynamic setting. The new model, called dRSM, uses a state space model to describe the evolution of the latent group proportions over time. A variational expectation maximization (VEM) algorithm is proposed to perform inference. We have shown in particular that the variational approximations lead to a new state space model from which the parameters can be estimated using the standard Kalman filter and the Rauch–Tung–Striebel (RTS) smoother. Model selection is also considered through an approximate BIC criterion.
Numerical experiments have highlighted the main features of the dRSM model and have demonstrated the efficiency of both the VEM algorithm and the model selection criterion. A numerical comparison has also shown that existing methods, dynamic or not, are less flexible and efficient than dRSM when applied to dynamic networks. Finally, dRSM has been applied to a dynamic maritime flow network, built from the famous Lloyd’s list, and has made it possible to characterize interesting dynamic phenomena.
Acknowledgments
The authors would like to warmly thank César Ducruet, from the Géographie-Cités laboratory, Paris, France, for providing the maritime network and for his painstaking analysis of the results. The data were collected in the context of the ERC Grant No. 313847 “World Seastems” (http://www.world-seastems.cnrs.fr). The authors would also like to thank Catherine Matias and Stéphane Robin for their useful remarks and comments on this work.
References
- Ahmed A, Xing EP (2007) On tight approximate inference of logistic-normal admixture model. In: Proceedings of the international conference on artificial intelligence and statistics, pp 1–8Google Scholar
- Airoldi E, Blei D, Fienberg S, Xing E (2008) Mixed membership stochastic blockmodels. J Mach Learn Res 9:1981–2014MATHGoogle Scholar
- Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723MathSciNetCrossRefMATHGoogle Scholar
- Albert R, Barabási A (2002) Statistical mechanics of complex networks. Mod Phys 74:47–97MathSciNetCrossRefMATHGoogle Scholar
- Ambroise C, Grasseau G, Hoebeke M, Latouche P, Miele V, Picard F (2010) The mixer R package (version 1.8). http://cran.r-project.org/web/packages/mixer/
- Barabási A, Oltvai Z (2004) Network biology: understanding the cell’s functional organization. Nat Rev Genet 5:101–113CrossRefGoogle Scholar
- Bickel P, Chen A (2009) A nonparametric view of network models and Newman–Girvan and other modularities. Proc Natl Acad Sci 106(50):21068–21073CrossRefMATHGoogle Scholar
- Bishop C, Svensén M (2003) Bayesian hierarchical mixtures of experts. In: Kjaerulff U, Meek C (eds) Proceedings of the 19th conference on uncertainty in artificial intelligence, pp 57–64Google Scholar
- Blei D, Lafferty J (2007a) A correlated topic model of science. Ann Appl Stat 1:17–35Google Scholar
- Blei D, Lafferty J (2007b) A correlated topic model of science. Ann Appl Stat 1(1):17–35MathSciNetCrossRefMATHGoogle Scholar
- Bouveyron C, Jernite Y, Latouche P, Nouedoui L (2013) The rambo R package (version 1.1). http://cran.r-project.org/web/packages/Rambo/
- Côme E, Latouche P (2015) Model selection and clustering in stochastic block models with the exact integrated complete data likelihood. Stat Model. doi:10.1177/1471082X15577017 MathSciNetGoogle Scholar
- Daudin J-J, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18(2):173–183MathSciNetCrossRefGoogle Scholar
- Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39:1–38Google Scholar
- Dubois C, Butts C, Smyth P (2013) Stochastic blockmodelling of relational event dynamics. In: International conference on artificial intelligence and statistics, vol 31 of the J Mach Learn Res Proc, pp 238–246Google Scholar
- Ducruet C (2013) Network diversity and maritime flows. J Transp Geogr 30:77–88CrossRefGoogle Scholar
- Fienberg S, Wasserman S (1981) Categorical data analysis of single sociometric relations. Sociol Methodol 12:156–192CrossRefGoogle Scholar
- Foulds JR, DuBois C, Asuncion AU, Butts CT, Smyth P (2011) A dynamic relational infinite feature model for longitudinal social networks. In: International conference on artificial intelligence and statistics, pp 287–295Google Scholar
- Girvan M, Newman M (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821MathSciNetCrossRefMATHGoogle Scholar
- Handcock M, Raftery A, Tantrum J (2007) Model-based clustering for social networks. J R Stat Soc Ser A (Stat Soc) 170(2):301–354MathSciNetCrossRefGoogle Scholar
- Harvey A (1989) Forecasting, structural time series models and the Kalman filter. Cambridge University Press, CambridgeGoogle Scholar
- Hathaway RJ (1986) Another interpretation of the EM algorithm for mixture distributions. Stat Probab Lett 4(2):53–56MathSciNetCrossRefMATHGoogle Scholar
- Heaukulani C, Ghahramani Z (2013) Dynamic probabilistic models for latent feature propagation in social networks. In: Proceedings of the 30th international conference on machine learning (ICML-13), pp 275–283Google Scholar
- Ho Q, Song L, Xing EP (2011) Evolving cluster mixed-membership blockmodel for time-evolving networks. In: International conference on artificial intelligence and statistics, pp 342–350Google Scholar
- Hofman J, Wiggins C (2008) Bayesian approach to network modularity. Phys Rev Lett 100(25):258701CrossRefGoogle Scholar
- Jernite Y, Latouche P, Bouveyron C, Rivera P, Jegou L, Lamassé S (2014) The random subgraph model for the analysis of an acclesiastical network in Merovingian Gaul. Ann Appl Stat 8(1):55–74CrossRefMATHGoogle Scholar
- Jordan M, Ghahramani Z, Jaakkola T, Saul LK (1999) An introduction to variational methods for graphical models. Mach Learn 37(2):183–233CrossRefMATHGoogle Scholar
- Kemp C, Tenenbaum J, Griffiths T, Yamada T, Ueda N (2006) Learning systems of concepts with an infinite relational model. In: Proceedings of the national conference on artificial intelligence, vol 21, pp 381–391Google Scholar
- Kim M, Leskovec J (2013) Nonparametric multi-group membership model for dynamic networks. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 25. MIT Press, Cambridge, pp 1385–1393Google Scholar
- Krishnan T, McLachlan G (1997) The EM algorithm and extensions. Wiley, New YorkMATHGoogle Scholar
- Lafferty JD, Blei DM (2006) Correlated topic models. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 18. MIT Press, Cambridge, pp 147–154Google Scholar
- Latouche P, Birmelé E, Ambroise C (2011) Overlapping stochastic block models with application to the french political blogosphere. Ann Appl Stat 5(1):309–336MathSciNetCrossRefMATHGoogle Scholar
- Latouche P, Birmelé E, Ambroise C (2012) Variational bayesian inference and complexity control for stochastic block models. Stat Model 12(1):93–115MathSciNetCrossRefGoogle Scholar
- Latouche P, Birmelé E, Ambroise C (2014) Model selection in overlapping stochastic block models. Electron J Stat 8(1):762–794MathSciNetCrossRefMATHGoogle Scholar
- Leroux B (1992) Consistent estimation of amixing distribution. Ann Stat 20:1350–1360CrossRefMATHGoogle Scholar
- Mariadassou M, Robin S, Vacher C (2010) Uncovering latent structure in valued graphs: a variational approach. Ann Appl Stat 4(2):715–742MathSciNetCrossRefMATHGoogle Scholar
- Matias C, Robin S (2014) Modeling heterogeneity in random graphs through latent space models: a selective review. ESAIM Proc Surv 47:55–74MathSciNetCrossRefMATHGoogle Scholar
- Mc Daid A, Murphy T, Friel N, Hurley N (2013) Improved bayesian inference for the stochastic block model with application to large networks. Comput Stat Data Anal 60:12–31MathSciNetCrossRefGoogle Scholar
- Minka T (1998) From hidden markov models to linear dynamical systems. Technical report, MITGoogle Scholar
- Moreno J (1934) Who shall survive?: A new approach to the problem of human interrelations. Nervous and Mental Disease Publishing CoGoogle Scholar
- Nowicki K, Snijders T (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96(455):1077–1087MathSciNetCrossRefMATHGoogle Scholar
- Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435:814–818CrossRefGoogle Scholar
- Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850Google Scholar
- Rauch H, Tung F, Striebel T (1965) Maximum likelihood estimates of linear dynamic systems. AIASS J 3(8):1445–1450MathSciNetGoogle Scholar
- Rossi F, Villa-Vialaneix N, Hautefeuille F (2014) Exploration of a large database of French notarial acts with social network methods. Digit Mediev 9:1–20Google Scholar
- Sarkar P, Moore AW (2005) Dynamic social network analysis using latent space models. ACM SIGKDD Explor Newsl 7(2):31–40CrossRefGoogle Scholar
- Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464MathSciNetCrossRefMATHGoogle Scholar
- Svensén M, Bishop C (2004) Robust bayesian mixture modelling. Neurocomputing 64:235–252CrossRefGoogle Scholar
- Wang Y, Wong G (1987) Stochastic blockmodels for directed graphs. J Am Stat Assoc 82:8–19MathSciNetCrossRefMATHGoogle Scholar
- White H, Boorman S, Breiger R (1976) Social structure from multiple networks. I. Blockmodels of roles and positions. Am J Sociol 81:730–780Google Scholar
- Xing E, Fu W, Song L (2010) A state-space mixed membership blockmodel for dynamic network tomography. Ann Appl Stat 4(2):535–566MathSciNetCrossRefMATHGoogle Scholar
- Xu KS (2015) Stochastic block transition models for dynamic networks. In: International conference on artificial intelligence and statistics, pp 1079–1087Google Scholar
- Xu KS, Hero III AO (2013) Dynamic stochastic blockmodels: statistical models for time-evolving networks. In: Greenberg AM, Kennedy WG, Bos ND (eds) Social computing, behavioral-cultural modeling and prediction. Springer, Berlin, Heidelberg, pp 201–210Google Scholar
- Yang T, Chi Y, Zhu S, Gong Y, Jin R (2011) Detecting communities and their evolutions in dynamic social networks a Bayesian approach. Mach Learn 82(2):157–189MathSciNetCrossRefMATHGoogle Scholar
- Zanghi H, Volant S, Ambroise C (2010) Clustering based on random graph model embedding vertex features. Pattern Recognit Lett 31(9):830–836CrossRefGoogle Scholar