The stochastic topic block model for the clustering of vertices in networks with textual edges
Abstract
Due to the significant increase in communications between individuals via social media (Facebook, Twitter, LinkedIn) or electronic formats (email, web, e-publication) in the past two decades, network analysis has become an unavoidable discipline. Many random graph models have been proposed to extract information from networks based on person-to-person links only, without taking into account information on the contents. This paper introduces the stochastic topic block model, a probabilistic model for networks with textual edges. We address here the problem of discovering meaningful clusters of vertices that are coherent from both the network interactions and the text contents. A classification variational expectation-maximization algorithm is proposed to perform inference. Simulated datasets are considered in order to assess the proposed approach and to highlight its main features. Finally, we demonstrate the effectiveness of our methodology on two real-world datasets: a directed communication network and an undirected co-authorship network.
Keywords
Random graph models · Topic modeling · Textual edges · Clustering · Variational inference
Mathematics Subject Classification
62F15 · 62F86
1 Introduction
1.1 Statistical models for network analysis
On the one hand, there is a long history of research in the statistical analysis of networks, which has received strong interest in the last decade. In particular, statistical methods have established themselves as efficient and flexible techniques for network clustering. Most of those methods look for specific structures, the so-called communities, which exhibit a transitivity property such that nodes of the same community are more likely to be connected (Hofman and Wiggins 2008). Popular approaches for community discovery, though asymptotically biased (Bickel and Chen 2009), are based on the modularity score of Girvan and Newman (2002). Alternative clustering methods usually rely on the latent position cluster model (LPCM) of Handcock et al. (2007), or the stochastic block model (SBM) (Wang and Wong 1987; Nowicki and Snijders 2001). The LPCM model, which extends the work of Hoff et al. (2002), assumes that the links between the vertices depend on their positions in a social latent space and allows the simultaneous visualization and clustering of a network.
The SBM model is a flexible random graph model which is based on a probabilistic generalization of the method applied by White et al. (1976) on Sampson’s famous monastery (Fienberg and Wasserman 1981). It assumes that each vertex belongs to a latent group, and that the probability of connection between a pair of vertices depends exclusively on their group. Because no specific assumption is made on the connection probabilities, various types of structures of vertices can be taken into account. At this point, it is important to notice that, in network clustering, two types of clusters are usually considered: communities (vertices within a community are more likely to connect than vertices of different communities) and stars or disassortative clusters (the vertices of a cluster highly connect to vertices of another). In this context, SBM is particularly useful in practice since it has the ability to characterize both types of clusters.
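To make this generative mechanism concrete, here is a minimal sketch of SBM sampling in Python (the parameter values are purely illustrative): each vertex first draws a latent group, then each ordered pair of vertices is connected with a probability that depends only on the two groups.

```python
import random

def sample_sbm(M, rho, pi, seed=0):
    """Sample a directed SBM graph: each vertex draws a latent group
    from the proportions rho, then each ordered pair (i, j), i != j,
    is connected with probability pi[q][r] given their groups (q, r)."""
    rng = random.Random(seed)
    groups = rng.choices(range(len(rho)), weights=rho, k=M)
    A = [[0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            if i != j and rng.random() < pi[groups[i]][groups[j]]:
                A[i][j] = 1
    return groups, A

# Two communities: dense within, sparse between (illustrative values).
groups, A = sample_sbm(M=50, rho=[0.5, 0.5],
                       pi=[[0.25, 0.01], [0.01, 0.25]])
```

Because no constraint is placed on the matrix `pi`, the same code generates communities (large diagonal), disassortative structures (large off-diagonal), or stars.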
While SBM was originally developed to analyze mainly binary networks, many extensions have been proposed since to deal for instance with valued edges (Mariadassou et al. 2010), categorical edges (Jernite et al. 2014), or to take into account prior information (Zanghi et al. 2010; Matias and Robin 2014). Note that other extensions of SBM have focused on looking for overlapping clusters (Airoldi et al. 2008; Latouche et al. 2011) or on the modeling of dynamic networks (Yang et al. 2011; Xu and Hero 2013; Bouveyron et al. 2016; Matias and Miele 2016).
The inference of SBM-like models is usually done using variational expectation maximization (VEM) (Daudin et al. 2008), variational Bayes EM (VBEM) (Latouche et al. 2012), or Gibbs sampling (Nowicki and Snijders 2001). Moreover, we emphasize that various strategies have been derived to estimate the number of clusters using model selection criteria (Daudin et al. 2008; Latouche et al. 2012), an allocation sampler (Mc Daid et al. 2013), greedy search (Côme and Latouche 2015), or nonparametric schemes (Kemp et al. 2006). We refer to Salter-Townshend et al. (2012) for an overview of statistical models for network analysis.
1.2 Statistical models for text analytics
On the other hand, the statistical modeling of texts appeared at the end of the last century with an early model described by Papadimitriou et al. (1998) for latent semantic indexing (LSI) (Deerwester et al. 1990). LSI is known in particular for its ability to recover linguistic notions such as synonymy and polysemy from “term frequency - inverse document frequency” (tf-idf) data. Hofmann (1999) proposed an alternative model for LSI, called probabilistic latent semantic analysis (pLSI), which models each word within a document using a mixture model. In pLSI, each mixture component is modeled by a multinomial random variable and the latent groups can be viewed as “topics.” Thus, each word is generated from a single topic and different words in a document can be generated from different topics. However, pLSI has no model at the document level and may suffer from overfitting. Notice that pLSI can also be viewed as an extension of the mixture of unigrams, proposed by Nigam et al. (2000).
The model which finally concentrates the most desired features was proposed by Blei et al. (2003) and is called latent Dirichlet allocation (LDA). The LDA model has rapidly become a standard tool in statistical text analytics and is even used in different scientific fields such as image analysis (Lazebnik et al. 2006) or transportation research (Côme et al. 2014) for instance. The idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA is therefore similar to pLSI except that the topic distribution in LDA has a Dirichlet distribution. Several inference procedures have been proposed in the literature ranging from VEM (Blei et al. 2003) to collapsed VBEM (Teh et al. 2006).
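The word-level generative process of LDA described above can be sketched as follows. This is a toy illustration with made-up topic and word distributions; in full LDA the proportions `theta` are themselves drawn from a Dirichlet prior, whereas they are fixed here for brevity.

```python
import random

def sample_lda_document(theta, beta, n_words, rng):
    """Draw one document of n_words words: for each word, pick a topic
    from the document's topic proportions theta, then a word index from
    that topic's distribution over the vocabulary (rows of beta)."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(range(len(beta[topic])), weights=beta[topic])[0]
        words.append((topic, word))
    return words

rng = random.Random(1)
# Toy corpus: 2 topics over a 4-word vocabulary (illustrative values).
beta = [[0.4, 0.4, 0.1, 0.1],   # topic 0 favours words 0-1
        [0.1, 0.1, 0.4, 0.4]]   # topic 1 favours words 2-3
doc = sample_lda_document(theta=[0.7, 0.3], beta=beta, n_words=20, rng=rng)
```

Each word keeps its own latent topic, which is exactly the property that distinguishes LDA and pLSI from the mixture of unigrams.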
Note that a limitation of LDA is its inability to take into account possible topic correlations. This is due to the use of the Dirichlet distribution to model the variability among the topic proportions. To overcome this limitation, the correlated topic model (CTM) was developed by Blei and Lafferty (2006). Similarly, the relational topic model (RTM) (Chang and Blei 2009) models the links between documents as binary random variables conditioned on their contents, but ignores the community ties between the authors of these documents. Notice that the “itopic” model (Sun et al. 2009) extends RTM to weighted networks. The reader may refer to Blei (2012) for an overview of probabilistic topic models.
1.3 Statistical models for the joint analysis of texts and networks
Finally, a few recent works have focused on the joint modeling of texts and networks. Those works are mainly motivated by the desire to analyze social networks, such as Twitter or Facebook, or electronic communication networks. Some of them are partially based on LDA: the author-topic (AT) (Steyvers et al. 2004; Rosen-Zvi et al. 2004) and the author-recipient-topic (ART) (McCallum et al. 2005) models. The AT model extends LDA to include authorship information, whereas the ART model includes authorships and information about the recipients. Although potentially powerful, these models do not take into account the network structure (communities, stars, ...), even though the concept of community is very important in the context of social networks, in the sense that a community is a group of users sharing similar interests.
Among the most advanced models for the joint analysis of texts and networks, the first models which explicitly take into account both text contents and network structure are the community-user-topic (CUT) models proposed by Zhou et al. (2006). Two models are proposed, CUT1 and CUT2, which differ in the way they construct the communities. Indeed, CUT1 determines the communities based only on the network structure, whereas CUT2 determines the communities based on the content information solely. The CUT models therefore each deal with only part of the problem we are interested in. It is also worth noticing that the authors of these models rely on Gibbs sampling for inference, which may prohibit their use on large networks.
A second attempt was made by Pathak et al. (2008), who extended the ART model by introducing the community-author-recipient-topic (CART) model. The CART model extends ART by assuming that authors and recipients belong to latent communities, which allows it to recover groups of nodes that are homogeneous with regard to both the network structure and the message contents. Notice that CART allows the nodes to be part of multiple communities and each couple of actors to have a specific topic. Thus, though extremely flexible, CART is also a highly parametrized model. In addition, the recommended inference procedure based on Gibbs sampling may also prohibit its application to large networks.
More recently, the topic-link LDA (Liu et al. 2009) also performs topic modeling and author community discovery in a unified framework. As its name suggests, topic-link LDA extends LDA with a community layer where the link between two documents (and consequently their authors) depends on both topic proportions and author latent features through a logistic transformation. However, whereas CART focuses only on directed networks, topic-link LDA is only able to deal with undirected networks. On the positive side, the authors derive a variational EM algorithm for inference, allowing topic-link LDA to eventually be applied to large networks.
Finally, a family of four topic-user-community models (TUCM) were proposed by Sachan et al. (2012). The TUCM models are designed such that they can find topic-meaningful communities in networks with different types of edges. This is in particular relevant in social networks such as Twitter where different types of interactions (followers, tweet, re-tweet, ...) exist. Another specificity of the TUCM models is that they allow both multiple community and topic memberships. Inference is also done here through Gibbs sampling, implying a possible scale limitation.
1.4 Contributions and organization of the paper
We propose here a new generative model for the clustering of networks with textual edges, such as communication or co-authorship networks. In contrast to existing works, whose models for the network structure are either too simple or highly parametrized, our model relies for the network modeling on the SBM model, which offers sufficient flexibility with a reasonable complexity. This model is one of the few able to recover different topological structures such as communities, stars, or disassortative clusters (see Latouche et al. 2012 for instance). Regarding the topic modeling, our approach is based on the LDA model, in which the topics are conditioned on the latent groups. Thus, the proposed modeling will be able to exhibit node partitions that are meaningful both regarding the network structure and the topics, with a model of limited complexity, highly interpretable, and for both directed and undirected networks. In addition, the proposed inference procedure—a classification-VEM algorithm—allows the use of our model on large-scale networks.
The proposed model, named stochastic topic block model (STBM), is introduced in Sect. 2. The model inference is discussed in Sect. 3 as well as model selection. Section 4 is devoted to numerical experiments highlighting the main features of the proposed approach and proving the validity of the inference procedure. Two applications to real-world networks (the Enron email and the Nips co-authorship networks) are presented in Sect. 5. Section 6 finally provides some concluding remarks.
2 The model
This section presents the notations used in the paper and introduces the STBM model. The joint distributions of the model to create edges and the corresponding documents are also given.
2.1 Context and notations
A directed network with M vertices, described by its \(M \times M\) adjacency matrix A, is considered. Thus, \(A_{ij}=1\) if there is an edge from vertex i to vertex j, 0 otherwise. The network is assumed not to have any self-loop and therefore \(A_{ii}=0\) for all i. If an edge from i to j is present, then it is characterized by a set of \(D_{ij}\) documents, denoted \(W_{ij}=(W_{ij}^{d})_d\). Each document \(W_{ij}^d\) is made of a collection of \(N_{ij}^{d}\) words \(W_{ij}^d=(W_{ij}^{dn})_n\). In the directed scenario considered, \(W_{ij}\) can model for instance a set of emails or text messages sent from actor i to actor j. Note that all the methodology proposed in this paper easily extends to undirected networks. In such a case, \(A_{ij}=A_{ji}\) and \(W_{ij}^{d}=W_{ji}^{d}\) for all i and j. The set \(W_{ij}^{d}\) of documents can then model for example books or scientific papers written by both i and j. In the following, we denote \(W=(W_{ij})_{ij}\) the set of all documents exchanged, for all the edges present in the network.
Our goal is to cluster the vertices into Q latent groups sharing homogeneous connection profiles, i.e., to find an estimate of the set \(Y=(Y_1, \dots , Y_M)\) of latent variables \(Y_i\) such that \(Y_{iq}=1\) if vertex i belongs to cluster q, and 0 otherwise. Although discrete or continuous edges are sometimes taken into account, the network literature focuses on modeling the presence of edges as binary variables. The clustering task then consists in building groups of vertices having similar trends to connect to others. In this paper, the connection profiles are characterized both by the presence of edges and by the documents exchanged between pairs of vertices. Therefore, we aim at uncovering clusters by integrating these two sources of information. Two nodes in the same cluster should have the same trend to connect to others and, when connected, the documents they are involved in should be made of words related to similar topics.
2.2 Modeling the presence of edges
2.3 Modeling the construction of documents
As mentioned previously, if an edge is present from vertex i to vertex j, then a set of documents \(W_{ij}=(W_{ij}^d)_d\), characterizing the oriented pair (i, j), is assumed to be given. Thus, in a generative perspective, the edges in A are first sampled as described in the previous section. Given A, the documents in \(W=(W_{ij})_{ij}\) are then constructed. The generative process we consider to build documents is strongly related to the latent Dirichlet allocation (LDA) model of Blei et al. (2003). The link between STBM and LDA is made clear in the following section. The STBM model relies on two concepts at the core of the SBM and LDA models, respectively. On the one hand, a generalization of the SBM model would assume that any kind of relationship between two vertices can be explained by their latent clusters only. In the LDA model, on the other hand, the main assumption is that words in documents are drawn from a mixture distribution over topics, each document d having its own vector of topic proportions \(\theta _d\). The STBM model combines these two concepts to introduce a new generative procedure for documents in networks.
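Combining the two mechanisms, the document-generation step of STBM might be sketched as follows. This is an illustrative sketch only: the values of `theta` and `beta` are made up, and in STBM the proportions \(\theta_{qr}\) are drawn from a Dirichlet prior rather than fixed. The key point is that the topic mixture depends only on the latent groups (q, r) of the two vertices.

```python
import random

def sample_stbm_documents(A, groups, theta, beta, n_docs=1, n_words=10, seed=2):
    """For each edge (i, j) present in A, generate n_docs documents whose
    words follow an LDA-like scheme, except that the topic proportions
    theta[q][r] depend only on the latent groups q of i and r of j."""
    rng = random.Random(seed)
    K, V = len(beta), len(beta[0])
    W = {}
    for i in range(len(A)):
        for j in range(len(A)):
            if A[i][j] == 1:
                q, r = groups[i], groups[j]
                docs = []
                for _ in range(n_docs):
                    doc = []
                    for _ in range(n_words):
                        k = rng.choices(range(K), weights=theta[q][r])[0]
                        doc.append(rng.choices(range(V), weights=beta[k])[0])
                    docs.append(doc)
                W[(i, j)] = docs
    return W

# Toy run: 2 groups, 2 topics, 3-word vocabulary (illustrative values).
A = [[0, 1], [1, 0]]
groups = [0, 1]
theta = [[[1.0, 0.0], [0.5, 0.5]],   # theta[q][r]: topic mix for pair (q, r)
         [[0.5, 0.5], [0.0, 1.0]]]
beta = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
W = sample_stbm_documents(A, groups, theta, beta)
```

Documents only exist where edges do, which is why the graph A must be sampled first.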
2.4 Link with LDA and SBM
As mentioned in Sect. 2.2, the second part of Eq. (4) involves the sampling of the clusters and the construction of binary variables describing the presence of edges between pairs of vertices. Interestingly, it corresponds exactly to the complete data likelihood of the SBM model, as considered in Zanghi et al. (2008) for instance. Such a likelihood term only involves the model parameters \(\rho \) and \(\pi \).
3 Inference
As mentioned previously, we introduce our methodology in the directed case. However, we emphasize that the STBM package for R that we developed implements the inference strategy for both directed and undirected networks.
3.1 Variational decomposition
3.2 Model decomposition
3.3 Optimization
In this section, we derive the optimization steps of the C-VEM algorithm we propose, which aims at maximizing the lower bound \({\mathcal {L}}\). The algorithm alternates between the optimization of \(R(Z, \theta ), Y\), and \((\rho , \pi , \beta )\) until convergence of the lower bound.
Estimation of \(R(Z,\theta )\). The following propositions give the update formulae of the E step of the VEM algorithm applied to Eq. (7).
Proposition 1
Proposition 2
Estimation of the model parameters
The lower bound \({\mathcal {L}}\) in Eq. (7) is maximized to provide estimates of the model parameters \((\rho , \pi , \beta )\). We recall that \(\beta \) is only involved in \(\tilde{{\mathcal {L}}}\), while \((\rho , \pi )\) only appear in the SBM complete data log-likelihood. The derivation of \(\tilde{{\mathcal {L}}}\) is given in Appendix 3.
Proposition 3
Estimation of Y. At this step, the model parameters \((\rho , \pi , \beta )\) along with the distribution \(R(Z, \theta )\) are held fixed. Therefore, the lower bound \({\mathcal {L}}\) in (7) only involves the set Y of cluster membership vectors. Looking for the optimal solution Y maximizing this bound is not feasible since it involves testing the \(Q^M\) possible cluster assignments. However, heuristics are available to provide local maxima for this combinatorial problem. These so-called greedy methods have been used, for instance, to look for communities in networks by Newman (2004) and Blondel et al. (2008), as well as for the SBM model (Côme and Latouche 2015). They are sometimes referred to as online clustering methods (Zanghi et al. 2008).
The algorithm cycles randomly through the vertices. At each step, a single vertex is considered and all membership vectors \(Y_{j}\) are held fixed, except \(Y_{i}\). If i is currently in cluster q, then the method looks for every possible label swap, i.e., removing i from cluster q and assigning it to a cluster \(r \ne q\). The corresponding change in the SBM complete data log-likelihood is then computed. If no label swap induces an increase in the SBM complete data log-likelihood, then \(Y_{i}\) remains unchanged. Otherwise, the label swap that yields the maximal increase is applied, and \(Y_{i}\) is changed accordingly.
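The greedy swap procedure above can be sketched as follows. The `score` callable stands in for the SBM complete data log-likelihood (the real criterion is more involved), and the toy score at the bottom, which simply rewards agreement with a fixed target partition, is purely illustrative.

```python
import random

def greedy_label_swaps(Y, Q, score, seed=0, max_sweeps=20):
    """Cycle randomly through the vertices; for each vertex, try every
    label swap q -> r, keep the swap yielding the largest increase in
    score(Y) (if any), and stop when a full sweep brings no change."""
    rng = random.Random(seed)
    for _ in range(max_sweeps):
        improved = False
        order = list(range(len(Y)))
        rng.shuffle(order)
        for i in order:
            current = Y[i]
            base = score(Y)
            best, best_gain = current, 0.0
            for r in range(Q):
                if r == current:
                    continue
                Y[i] = r                      # tentative swap
                gain = score(Y) - base
                if gain > best_gain:
                    best, best_gain = r, gain
            Y[i] = best                       # keep best swap (or revert)
            if best != current:
                improved = True
        if not improved:                      # local maximum reached
            break
    return Y

# Toy score rewarding agreement with a fixed target partition.
target = [0, 0, 1, 1, 2, 2]
score = lambda Y: sum(y == t for y, t in zip(Y, target))
labels = greedy_label_swaps([0] * 6, Q=3, score=score)
```

In practice, the change in the SBM complete data log-likelihood induced by a single swap can be computed incrementally, which is what makes such greedy methods fast.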
3.4 Initialization strategy and model selection
- 1.
The VEM algorithm (Blei et al. 2003) for LDA is applied on the aggregation of all documents exchanged from vertex i to vertex j, for each pair (i, j) of vertices, in order to characterize the type of interaction from i to j. Thus, an \(M \times M\) matrix X is first built such that \(X_{ij}=k\) if k is the majority topic used by i when discussing with j.
- 2.
The \(M \times M\) distance matrix \(\varDelta \) is then computed as follows:$$\begin{aligned} \varDelta (i,j)= & {} \sum _{h=1}^{M}\delta (X_{ih}\ne X_{jh})A_{ih}A_{jh}\nonumber \\&+ \sum _{h=1}^{M}\delta (X_{hi}\ne X_{hj})A_{hi}A_{hj}. \end{aligned}$$(8)
The first term looks at all possible edges from i and j toward a third vertex h. If both i and j are connected to h, i.e., \(A_{ih}A_{jh}=1\), the edge types \(X_{ih}\) and \(X_{jh}\) are compared. By symmetry, the second term looks at all possible edges from a vertex h to both i and j, and compares their types. Thus, the distance counts the number of discordances in the way i and j connect to other vertices, or other vertices connect to them.
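A direct implementation of the distance in Eq. (8) could read as follows (the small A and X below are made-up illustrative values):

```python
def distance_matrix(X, A):
    """Discordance distance of Eq. (8): for each pair (i, j), count the
    third vertices h such that i and j both connect to h (or h connects
    to both) but with different majority topics in X."""
    M = len(A)
    D = [[0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            d = 0
            for h in range(M):
                if A[i][h] and A[j][h] and X[i][h] != X[j][h]:
                    d += 1                    # discordant out-edges toward h
                if A[h][i] and A[h][j] and X[h][i] != X[h][j]:
                    d += 1                    # discordant in-edges from h
            D[i][j] = d
    return D

# Tiny example: vertices 0 and 1 disagree only about vertex 2.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
X = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]   # X[i][j]: majority topic from i to j
D = distance_matrix(X, A)
```

The resulting matrix is symmetric with a zero diagonal, so it can be fed directly to a standard distance-based clustering routine for the initialization.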
Proposition 4
Parameter values for the three simulation scenarios (see text for details)
Scenario | A | B | C |
---|---|---|---|
M (nb of nodes) | 100 | ||
K (topics) | 4 | 3 | 3 |
Q (groups) | 3 | 2 | 4 |
\(\rho \) (group prop.) | \((1/Q,\ldots ,1/Q)\) | ||
\(\pi \) (connection prob.) | \(\left\{ \begin{array}{l} \pi _{qq}= 0.25\\ \pi _{qr,\, r\ne q}= 0.01 \end{array}\right. \) | \(\pi _{qr,\,\forall q,r}=0.25\) | \(\left\{ \begin{array}{l} \pi _{qq}= 0.25\\ \pi _{qr,\, r\ne q}= 0.01 \end{array}\right. \) |
\(\theta \) (prop. of topics) | \({\left\{ \begin{array}{ll} \theta _{111}=\theta _{222}= \quad 1\\ \theta _{333}= \quad 1\\ \theta _{qr4,\, r\ne q}= \quad 1\\ \text {otherwise} \quad 0 \end{array}\right. }\) | \({\left\{ \begin{array}{ll} \theta _{111}=\theta _{222}= \quad 1\\ \theta _{qr3,\, r\ne q}= \quad 1\\ \text {otherwise} \quad 0 \end{array}\right. }\) | \({\left\{ \begin{array}{ll} \theta _{111}=\theta _{331}= \quad 1\\ \theta _{222}=\theta _{442}= \quad 1\\ \theta _{qr3,\, r\ne q}= \quad 1\\ \text {otherwise} \quad 0 \end{array}\right. }\) |
4 Numerical experiments
This section aims at highlighting the main features of the proposed approach on synthetic data and at proving the validity of the inference algorithm presented in the previous section. Model selection is also considered to validate the criterion choice. Numerical comparisons with state-of-the-art methods conclude this section.
4.1 Experimental setup
First, regarding the parametrization of our approach, we chose \(\alpha _{k}=1\) for all k, which induces a uniform distribution over the topic proportions \(\theta _{qr}\).
Scenario A consists of networks with \(Q=3\) groups, corresponding to clear communities, where persons within a group talk preferentially about a unique topic and use a different topic when talking with persons of other groups. Thus, those networks contain \(K=4\) topics.
Scenario B consists of networks with a unique community, where the \(Q=2\) groups are only differentiated by the way they discuss within and between groups. Persons within groups 1 and 2 talk preferentially about topics 1 and 2, respectively. A third topic is used for the communications between persons of different groups.
Scenario C, finally, consists of networks with \(Q=4\) groups which use \(K=3\) topics to communicate. Among the 4 groups, two groups correspond to clear communities where persons talk preferentially about a unique topic within the communities. The two other groups correspond to a single community and are only discriminated by the topic used in the communications. People from group 3 use topic 1 and those of group 4 use topic 2. The third topic is used for communications between groups.
4.2 Introductory example
Percentage of selections by ICL for each STBM model (Q, K) on 50 simulated networks for each of the three scenarios
\(K \backslash Q\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Scenario A (\(Q=3, K=4\)) | ||||||
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 12 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 82 | 2 | 0 | 2 |
5 | 0 | 0 | 2 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 |
Scenario B (\(Q=2\),\(K=3\)) | ||||||
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 12 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 88 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 |
Scenario C (\(Q=4, K=3\)) | ||||||
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 2 | 82 | 0 | 0 |
4 | 0 | 0 | 0 | 16 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 |
4.3 Model selection
This experiment focuses on the ability of the ICL criterion to select appropriate values for Q and K. To this end, we simulated 50 networks according to each of the three scenarios and applied STBM to those networks for values of Q and K ranging from 1 to 6. Table 2 presents the percentage of selections by ICL for each STBM model (Q, K).
In the three different situations, ICL succeeds most of the time in identifying the actual combination of the number of groups and topics. For scenarios A and B, when ICL does not select the correct values for Q and K, the criterion seems to underestimate them, whereas it tends to overestimate them in the case of scenario C. One can also notice that wrongly selected models are usually close to the simulated one. Let us also recall that, since the data are not strictly simulated according to a STBM model, the set of tested models does not contain the model which generated the data. This experiment allows us to validate ICL as a model selection tool for STBM.
4.4 Benchmark study
This third experiment aims at comparing the ability of STBM to recover the network structure, both in terms of node partition and topics. STBM is here compared to SBM, using the mixer package (Ambroise et al. 2010), and LDA, using the topicmodels package (Grun and Hornik 2013). Obviously, SBM and LDA will only be able to recover either the node partition or the topics. We chose here to evaluate the results by comparing the resulting node and topic partitions with the actual ones (the simulated partitions). In the clustering community, the adjusted Rand index (ARI) (Rand 1971) serves as a widely accepted criterion for the difficult task of clustering evaluation. The ARI looks at all pairs of nodes and checks whether they are classified in the same group or not in both partitions. As a result, an ARI value close to 1 means that the partitions are similar. Notice that the actual values of Q and K are provided to the three algorithms.
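For reference, the ARI can be computed from the pair-counting contingency table of the two partitions; the sketch below follows the standard chance-corrected form of the index.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items,
    computed from the pair-counting contingency table; 1.0 means the
    partitions are identical (up to label names), and values near 0
    indicate chance-level agreement."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Identical partitions (up to label names) give ARI = 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

Because the index is invariant to label permutations, it is directly applicable to comparing an estimated clustering with the simulated one.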
Clustering results for the SBM, LDA and STBM on 20 networks simulated according to the three scenarios
Method | Scenario A | Scenario B | Scenario C | |||
---|---|---|---|---|---|---|
Node ARI | Edge ARI | Node ARI | Edge ARI | Node ARI | Edge ARI | |
Easy | ||||||
SBM | 1.00 ± 0.00 | – | 0.01 ± 0.01 | – | 0.69 ± 0.07 | – |
LDA | – | 0.97 ± 0.06 | – | 1.00 ± 0.00 | – | 1.00 ± 0.00 |
STBM | 0.98 ± 0.04 | 0.98 ± 0.04 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 |
Hard 1 | ||||||
SBM | 0.01 ± 0.01 | – | 0.01 ± 0.01 | – | 0.01 ± 0.01 | – |
LDA | – | 0.90 ± 0.17 | – | 1.00 ± 0.00 | – | 0.99 ± 0.01 |
STBM | 1.00 ± 0.00 | 0.90 ± 0.13 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.98 ± 0.03 |
Hard 2 | ||||||
SBM | 1.00 ± 0.00 | – | \(-0.01\pm 0.01\) | – | 0.65 ± 0.05 | – |
LDA | – | 0.21 ± 0.13 | – | 0.08 ± 0.06 | – | 0.09 ± 0.05 |
STBM | 0.99 ± 0.02 | 0.99 ± 0.01 | 0.59 ± 0.35 | 0.54 ± 0.40 | 0.68 ± 0.07 | 0.62 ± 0.14 |
In the “Easy” situation, the results are coherent with our initial guess when building the simulation scenarios. Indeed, besides the fact that SBM and LDA are only able to recover one of the two partitions, scenario A is an easy situation for all methods since the clusters perfectly match the topic partition. Scenario B, which has no communities and where groups only depend on topics, is obviously a difficult situation for SBM but does not disturb LDA, which perfectly recovers the topics. In scenario C, LDA still succeeds in identifying the topics, whereas SBM recognizes the two communities well but fails to discriminate the two groups hidden in a single community. Here, STBM obtains in all scenarios the best performance on both nodes and edges (Table 3).
The “Hard 1” situation considers the case where the communities are actually not well differentiated. Here, LDA is barely affected (only in scenario A), whereas SBM is no longer able to distinguish the groups of nodes. Conversely, STBM relies on the found topics to correctly identify the node groups and obtains, here again, excellent ARI values in all three scenarios.
The last situation, the so-called “Hard 2” case, aims at highlighting the effect of the word sampling on the recovery of the topics used. On the one hand, SBM now achieves a satisfying classification of nodes for scenarios A and C, while LDA fails to recover the majority topic used for simulation. On those two scenarios, STBM performs well on both nodes and topics. This proves that STBM is also able to recover the topics in a noisy situation by relying on the network structure. On the other hand, scenario B presents an extremely difficult situation where topics are noised and there are no communities. Here, although both LDA and SBM fail, STBM achieves a satisfying result on both nodes and edges. This is, once again, an illustration of the fact that the joint modeling of network structure and topics makes it possible to recover complex hidden structures in a network with textual edges.
5 Application to real-world problems
5.1 Analysis of the Enron email network
The dataset considered here contains 20 940 emails sent between the \(M=149\) employees. All messages sent between two individuals were merged into a single meta-message. Thus, we end up with a dataset of 1 234 directed edges between employees, each edge carrying the text of all messages between two persons.
Topic 1 seems to refer to the financial and trading activities of Enron.
Topic 2 is concerned with Enron activities in Afghanistan (Enron and the Bush administration were suspected to work secretly with the Taliban up to a few weeks before the 9/11 attacks).
Topic 3 contains elements related to the California electricity crisis, in which Enron was involved, and which almost caused the bankruptcy of SCE-corp (Southern California Edison Corporation) early 2001.
Topic 4 is about usual logistic issues (building equipment, computers, ...).
Topic 5 refers to technical discussions on gas deliveries (mmBTU represents 1 million of British thermal unit, which is equal to 1055 joules).
Figure 10 presents a visual summary of connection probabilities between groups (the estimated \(\pi \) matrix) and majority topics for group interactions. A few elements deserve to be highlighted in view of this summary. First, group 10 contains a single individual who has a central place in the network and who mostly discusses logistic issues (topic 4) with groups 4, 5, 6, and 7. Second, group 8 is made of 6 individuals who mainly communicate about Enron activities in Afghanistan (topic 2) among themselves and with other groups. Finally, groups 4 and 6 seem to be more focused on trading activities (topic 1), whereas groups 1, 3, and 9 are dealing with technical issues on gas deliveries (topic 5).
As a comparison, the network has also been processed with SBM, using the mixer package (Ambroise et al. 2010). The number of groups selected by SBM was 8. Figure 11 compares the partitions of nodes provided by SBM and STBM. One can observe that the two partitions differ on several points. On the one hand, some clusters found by SBM (the bottom-left one for instance) have been split by STBM since some nodes use different topics than the rest of the community. On the other hand, SBM isolates two “hubs” which seem to have similar behaviors. Conversely, STBM identifies a unique “hub,” while the second node is gathered with other nodes using similar discussion topics. STBM has therefore allowed a better and deeper understanding of the Enron network through the combination of text contents with network structure.
5.2 Analysis of the Nips co-authorship network
Topic 1 seems to be focused on neural network theory, which was and still is a central topic in Nips.
Topic 2 is concerned with phoneme classification and recognition.
Topic 3 is a more general topic about statistical learning and artificial intelligence.
Topic 4 is about Neuroscience and focuses on experimental works about the visual cortex.
Topic 5 deals with network learning theory.
Topic 6 is also about Neuroscience but seems to be more focused on EEG.
Topic 7 is finally devoted to neural coding, i.e., characterizing the relationship between the stimulus and the individual responses.
As a conclusive remark on this network, STBM has proved its ability to bring out concise and relevant analyses on the structure of a large and dense network. In this view, the meta-network of Fig. 13 is a great help since it summarizes several model parameters of STBM.
6 Conclusion
This work has introduced a probabilistic model, named the stochastic topic block model (STBM), for the modeling and clustering of vertices in networks with textual edges. The proposed model allows the modeling of both directed and undirected networks, enabling its application to networks of various types (communication, social media, co-authorship, ...). A classification variational EM (C-VEM) algorithm has been proposed for model inference, and model selection is done through the ICL criterion. Numerical experiments on simulated datasets have proved the effectiveness of the proposed methodology. Two real-world networks (a communication and a co-authorship network) have also been studied using the STBM model and insightful results have been exhibited. It is worth noticing that STBM has been applied to a large co-authorship network with thousands of vertices, demonstrating the scalability of our approach.
Further work may include the extension of the STBM model to dynamic networks and to networks with covariate information on the nodes and/or edges. The extension to the dynamic framework could be achieved by adding, for instance, a state-space model over the group and topic proportions. Such an approach has already been used successfully on SBM-like models, as in Bouveyron et al. (2016). It would also be possible to take into account covariate information available on the nodes by adopting a mixture-of-experts approach, as in Gormley and Murphy (2010). Extending the STBM model to overlapping clusters of nodes would be another natural direction. It is indeed commonplace in social analysis to allow individuals to belong to multiple groups (family, work, friends, ...). One possible choice would be to derive an extension of the MMSBM model (Airoldi et al. 2008); however, this would significantly increase the parameterization of the model. Finally, STBM could also be adapted to take into account the intensity or the type of the communications between individuals.
Acknowledgments
The authors would like to warmly thank the editor and the two reviewers for their helpful remarks on the first version of this paper, and Laurent Bergé for his kind suggestions and the development of visualization tools.
References
- Airoldi, E., Blei, D., Fienberg, S., Xing, E.: Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014 (2008)
- Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, pp. 267–281 (1973)
- Ambroise, C., Grasseau, G., Hoebeke, M., Latouche, P., Miele, V., Picard, F.: The mixer R package (version 1.8). http://cran.r-project.org/web/packages/mixer/ (2010)
- Bickel, P., Chen, A.: A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl Acad. Sci. 106(50), 21068–21073 (2009)
- Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 7, 719–725 (2000)
- Biernacki, C., Celeux, G., Govaert, G.: Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal. 41(3–4), 561–575 (2003)
- Bilmes, J.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Int. Comput. Sci. Inst. 4, 126 (1998)
- Blei, D., Lafferty, J.: Correlated topic models. Adv. Neural Inf. Process. Syst. 18, 147 (2006)
- Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
- Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
- Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. 10, 10008–10020 (2008)
- Bouveyron, C., Latouche, P., Zreik, R.: The dynamic random subgraph model for the clustering of evolving networks. Comput. Stat. (2016)
- Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Q. 2(1), 73–82 (1991)
- Chang, J., Blei, D.M.: Relational topic models for document networks. In: International Conference on Artificial Intelligence and Statistics, pp. 81–88 (2009)
- Côme, E., Randriamanamihaga, A., Oukhellou, L., Aknin, P.: Spatio-temporal analysis of dynamic origin-destination data using latent Dirichlet allocation: application to the Vélib' bike sharing system of Paris. In: Proceedings of the 93rd Annual Meeting of the Transportation Research Board (2014)
- Côme, E., Latouche, P.: Model selection and clustering in stochastic block models with the exact integrated complete data likelihood. Stat. Model. doi:10.1177/1471082X15577017 (2015)
- Daudin, J.-J., Picard, F., Robin, S.: A mixture model for random graphs. Stat. Comput. 18(2), 173–183 (2008)
- Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391 (1990)
- Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39(1), 1–38 (1977)
- Fienberg, S., Wasserman, S.: Categorical data analysis of single sociometric relations. Sociol. Methodol. 12, 156–192 (1981)
- Girvan, M., Newman, M.: Community structure in social and biological networks. Proc. Natl Acad. Sci. 99(12), 7821 (2002)
- Gormley, I.C., Murphy, T.B.: A mixture of experts latent position cluster model for social network data. Stat. Methodol. 7(3), 385–405 (2010)
- Grün, B., Hornik, K.: The topicmodels R package (version 0.2-3). http://cran.r-project.org/web/packages/topicmodels/ (2013)
- Handcock, M., Raftery, A., Tantrum, J.: Model-based clustering for social networks. J. R. Stat. Soc. A 170(2), 301–354 (2007)
- Hathaway, R.: Another interpretation of the EM algorithm for mixture distributions. Stat. Prob. Lett. 4(2), 53–56 (1986)
- Hoff, P., Raftery, A., Handcock, M.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002)
- Hofman, J., Wiggins, C.: Bayesian approach to network modularity. Phys. Rev. Lett. 100(25), 258701 (2008)
- Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM, New York (1999)
- Jernite, Y., Latouche, P., Bouveyron, C., Rivera, P., Jegou, L., Lamassé, S.: The random subgraph model for the analysis of an ecclesiastical network in Merovingian Gaul. Ann. Appl. Stat. 8(1), 55–74 (2014)
- Kemp, C., Tenenbaum, J., Griffiths, T., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. Proc. Natl Conf. Artif. Intell. 21, 381–391 (2006)
- Latouche, P., Birmelé, E., Ambroise, C.: Overlapping stochastic block models with application to the French political blogosphere. Ann. Appl. Stat. 5(1), 309–336 (2011)
- Latouche, P., Birmelé, E., Ambroise, C.: Variational Bayesian inference and complexity control for stochastic block models. Stat. Model. 12(1), 93–115 (2012)
- Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178. IEEE, Piscataway (2006)
- Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and author community. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 665–672. ACM, New York (2009)
- Mariadassou, M., Robin, S., Vacher, C.: Uncovering latent structure in valued graphs: a variational approach. Ann. Appl. Stat. 4(2), 715–742 (2010)
- Matias, C., Miele, V.: Statistical clustering of temporal networks through a dynamic stochastic block model. Preprint HAL n.01167837 (2016)
- Matias, C., Robin, S.: Modeling heterogeneity in random graphs through latent space models: a selective review. ESAIM Proc. Surv. 47, 55–74 (2014)
- McCallum, A., Corrada-Emmanuel, A., Wang, X.: The author-recipient-topic model for topic and role discovery in social networks, with application to Enron and academic email. In: Workshop on Link Analysis, Counterterrorism and Security, pp. 33–44 (2005)
- McDaid, A., Murphy, T., Friel, N., Hurley, N.: Improved Bayesian inference for the stochastic block model with application to large networks. Comput. Stat. Data Anal. 60, 12–31 (2013)
- Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133 (2004)
- Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39(2–3), 103–134 (2000)
- Nowicki, K., Snijders, T.: Estimation and prediction for stochastic blockstructures. J. Am. Stat. Assoc. 96(455), 1077–1087 (2001)
- Papadimitriou, C., Raghavan, P., Tamaki, H., Vempala, S.: Latent semantic indexing: a probabilistic analysis. In: Proceedings of the Tenth ACM PODS, pp. 159–168. ACM, New York (1998)
- Pathak, N., DeLong, C., Banerjee, A., Erickson, K.: Social topic models for community extraction. In: The 2nd SNA-KDD Workshop, vol. 8. Citeseer (2008)
- Rand, W.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971)
- Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494. AUAI Press, Arlington (2004)
- Sachan, M., Contractor, D., Faruquie, T., Subramaniam, L.: Using content and interactions for discovering communities in social networks. In: Proceedings of the 21st International Conference on World Wide Web, pp. 331–340. ACM, New York (2012)
- Salter-Townshend, M., White, A., Gollini, I., Murphy, T.B.: Review of statistical network analysis: models, algorithms, and software. Stat. Anal. Data Min. 5(4), 243–264 (2012)
- Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
- Steyvers, M., Smyth, P., Rosen-Zvi, M., Griffiths, T.: Probabilistic author-topic models for information discovery. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 306–315. ACM, New York (2004)
- Sun, Y., Han, J., Gao, J., Yu, Y.: iTopicModel: information network-integrated topic modeling. In: Ninth IEEE International Conference on Data Mining (ICDM '09), pp. 493–502. IEEE, Piscataway (2009)
- Teh, Y., Newman, D., Welling, M.: A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Adv. Neural Inf. Process. Syst. 18, 1353–1360 (2006)
- Than, K., Ho, T.: Fully sparse topic models. In: Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 7523, pp. 490–505. Springer, Berlin (2012)
- Wang, Y., Wong, G.: Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc. 82, 8–19 (1987)
- White, H., Boorman, S., Breiger, R.: Social structure from multiple networks. I. Blockmodels of roles and positions. Am. J. Sociol. 81, 730–780 (1976)
- Xu, K., Hero III, A.: Dynamic stochastic blockmodels: statistical models for time-evolving networks. In: Social Computing, Behavioral-Cultural Modeling and Prediction, pp. 201–210. Springer, Berlin (2013)
- Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R.: Detecting communities and their evolutions in dynamic social networks: a Bayesian approach. Mach. Learn. 82(2), 157–189 (2011)
- Zanghi, H., Ambroise, C., Miele, V.: Fast online graph clustering via Erdős–Rényi mixture. Pattern Recognit. 41, 3592–3599 (2008)
- Zanghi, H., Volant, S., Ambroise, C.: Clustering based on random graph model embedding vertex features. Pattern Recognit. Lett. 31(9), 830–836 (2010)
- Zhou, D., Manavoglu, E., Li, J., Giles, C., Zha, H.: Probabilistic models for discovering e-communities. In: Proceedings of the 15th International Conference on World Wide Web, pp. 173–182. ACM, New York (2006)