1 Introduction

Large-scale systems of various kinds including social, informational or biological systems are pervasive in human life. Prominent examples of such systems are online social networks, the Internet, power grids or neural networks in the brain. Naturally, there is a strong interest in analyzing and understanding these systems, e.g., to estimate the effects of interventions on parts of the system, to alter or preserve their functionality, or to predict their future evolution. Commonly, such systems are abstracted as networks, where nodes represent the entities in the system, and links between nodes represent the (form of) interaction between the entities.

Feasible empirical research on large networks has three basic requirements: (i) identification of functional groups, frequently called communities (Girvan and Newman 2002; Radicchi et al. 2004), (ii) models for the interaction within and between these functional groups, and (iii) visualization of large networks and their dynamics in a human-graspable form.

Community detection (Girvan and Newman 2002; Fortunato 2010; Fortunato and Hric 2016) is an established tool that satisfies these three basic requirements: the identification of functional groups in networks is often facilitated by implicitly or explicitly assuming specific interaction models, and it further allows one to visualize the network at a more granular level. Consequently, researchers have proposed numerous community detection algorithms in recent years (Girvan and Newman 2002; Clauset et al. 2004; Rosvall and Bergstrom 2008; Rosvall et al. 2009; Pons and Latapy 2005; Blondel et al. 2008; Raghavan et al. 2007; Reichardt and Bornholdt 2006; Peixoto 2014a).

This plethora of algorithms frequently leaves us wondering: which is the “best” community detection algorithm for a given practical application? Typically, researchers compare algorithms based on their ability to identify ground truth communities in artificially generated benchmark networks (Orman and Labatut 2009; Yang et al. 2016) or ground truth extracted from node metadata in empirical networks. However, Peel et al. (2017) recently showed that such an evaluation is more delicate: they provide a No Free Lunch theorem, stating that there can be no single best algorithm for all possible detection scenarios, and furthermore, community evaluation on empirical networks based on node metadata is contestable in the general case. Hence, as different algorithms (potentially) uncover different structural aspects when applied to a given network, the choice of method depends, among other criteria, on the type of community one is looking for.

One prominent class of community detection methods characterizes communities based on random walks on a network, with Infomap (Rosvall and Bergstrom 2008; Rosvall et al. 2009) and Walktrap (Pons and Latapy 2005) being two popular representatives (see Sect. 2 for a compact description). Along the lines of the No Free Lunch theorem, both methods have their strengths and their weaknesses. Whereas Infomap accurately uncovers communities that are strongly connected internally (as characterized by the mixing parameter; see Sect. 3.1), it fails to do so for loosely connected communities (cf. Yang et al. 2016, Sect. 5.2). On the other hand, while Walktrap delivers reasonable results over a broader spectrum of community structure strengths, we find that its performance strongly depends on the degree distribution of the network. In addition, Walktrap requires the selection of a hyper-parameter that is commonly chosen empirically. As one typically does not know the community structure of a network a priori, it is unclear for practitioners how to select between these two random walk methods.

This raises an interesting question: can we combine the strengths of both methods to arrive at a community detection method that is more robust across a wider range of problem settings? In this paper, we tackle this question by presenting Synwalk—a community detection method where we model community properties by designing a synthetic random walk model. Specifically, Synwalk assumes a class of random walks with independent and identically distributed (i.i.d.) movements within and between candidate communities. It then simultaneously optimizes the distribution parameters of these i.i.d. movements (closed-form solution) and the candidate community structure (combinatorial optimization) such that the thus synthesized random walk resembles the random walk induced by the network under consideration. Due to the structure of the i.i.d. movements and the aim to synthesize an existing random walk, Synwalk thus shares ideas from both Infomap and stochastic block modelling (Sect. 4.2). We discuss the properties of the resulting Synwalk objective in Sect. 4.1 and thoroughly investigate and compare the behavior of Synwalk to Infomap and Walktrap, the Louvain method, and stochastic block model inference on generated benchmark graphs in Sect. 5.2. Furthermore, we illustrate the applicability of our method on empirical undirected networks with non-overlapping communities (Sect. 5.3).

In this work we present a novel instance of community detection via random walk modelling, which adapts the concept of (stochastic) block modelling to random walk-based community detection. At the same time, Synwalk combines the strengths of the popular random walk-based community detection algorithms Infomap and Walktrap, achieving more robust results across a range of generated and empirical networks without the need for hyper-parameter optimization. We believe that our method and results can initiate future theoretical and practical work to fully unlock the potential of synthetic random walk models for community detection by, e.g., (i) designing objective functions that enable robust detection of communities on specific classes of networks, (ii) designing random walk models tailored for detecting specific types of communities, or (iii) employing different notions of graph-induced random walks to discover different aspects/communities of a network.

2 Related work

Different approaches to community detection have been inspired by different definitions of communities (see Fortunato and Hric 2016 for an excellent survey). Accordingly, Rosvall et al. (2019) argue that different approaches to community detection can be categorized into four big groups, i.e., cut-based community detection, clustering, stochastic block modelling, and community detection based on network flows or random walks. We will now briefly summarize the concepts and approaches relevant for this work.

In a classical view, communities are densely connected subnetworks of a network that are well separated, which resonates with cut- or clustering-based community detection. This view takes the internal and external node degrees w.r.t. an assumed community structure into account. Popular metrics for measuring the existence and strength of a community structure are the mixing parameter (Lancichinetti et al. 2008; Lancichinetti and Fortunato 2009a) and the modularity (Newman and Girvan 2004; Newman 2006). The mixing parameter \(\mu \) of a node is the ratio between the number of links to nodes outside of its community and the total number of its links. The related quantity modularity compares the density of links within communities to links between communities and is used as an objective function for community detection (Clauset et al. 2004; Blondel et al. 2008).

While we will use the mixing parameter and modularity in setting up and evaluating our experiments in Sects. 5.2 and 5.3, the method we propose falls into the category of random walk-based community detection methods. Random walks provide a simple proxy for diffusion processes describing the dynamics of a network. Here, the notion of a community is related to the average time a random walker spends within a certain subgroup of nodes of a network, and community detection becomes equivalent to finding an appropriate state space partition of the corresponding random walk.

This connection was utilized by Piccardi (2011), who proposed finding a community structure such that in the next time step the random walker stays within its current candidate community with high probability. The similar notion of Markov stability discussed by Lambiotte et al. (2014) has connections to modularity maximization and Infomap’s map equation. Other approaches to the aggregation of random walks include non-negative matrix factorization of the random walk’s transition probability matrix to obtain a low-rank representation (Ghasemi et al. 2020), spectral aggregation techniques (Zhang and Wang 2020), or information-theoretic approaches to Markov chain aggregation, cf. Amjad et al. (2020), Faccin et al. (2020) and Deng et al. (2011). All these approaches can, under appropriate circumstances and settings, be utilized to partition a network into overlapping or non-overlapping communities.

Hurley and Duriakova (2015, 2016) proposed an information-theoretic method for community detection that combines a random walk-based approach with (classic) block modelling. More specifically, the authors aim to find a candidate clustering of the network under investigation such that the random walk induced on this clustering is similar to an arbitrarily chosen target random walk, where similarity is measured by the Kullback–Leibler divergence and optimized using a Hartigan-style optimization procedure.

Similarly to Faccin et al. (2020) and Piccardi (2011), the method proposed in Hurley and Duriakova (2015, 2016) is predominantly based on modelling random walks on clusters. In contrast, Synwalk is based on the random walk induced by the network under investigation, i.e., a random walk on the network’s nodes.

Another recent method proposed by Peixoto and Rosvall (2017) detects communities on possibly dynamic networks by combining elements from Markov aggregation and stochastic block model inference. Specifically, they try to co-cluster the states and preceding trajectories (“memories”) of multiple realizations of a random walk with the aim of minimizing its description length. Similarly to our approach, this work assumes a synthetic random walk model. However, whereas they try to infer its parameters (i.e., its transition probabilities) jointly with the optimal (co-)clustering in a Bayesian approach, our method makes explicit assumptions about the synthetic random walk’s parameters and tries to find the optimal clustering by comparing the synthetic to a graph-induced random walk.

We close this section by reviewing two prominent examples for random walk-based community detection methods, Infomap (Rosvall and Bergstrom 2008; Rosvall et al. 2009) and Walktrap (Pons and Latapy 2005), against which we will compare our approach experimentally.

Assuming a certain clustering (i.e., a candidate community structure), Infomap encodes the movements of a random walker on a network with a two-level codebook scheme. Each cluster has its own codebook with codewords for each member node, plus a dedicated exit codeword. Additionally, there is a global index codebook with codewords for each cluster. For every move of the random walker, Infomap records the codeword of the next node from the codebook of its containing cluster. Moreover, whenever the random walker changes clusters, Infomap records the exit codeword of the old cluster’s codebook and the codeword of the new cluster from the index codebook before it records the new node. By minimizing the average description length of realizations of such a random walk, Infomap obtains a clustering that compactly describes the network dynamics and hence should fit the true community structure well. Notably, it is not necessary to actually simulate random walks: the movements of the random walker are characterized by the network topology, which allows the average description length to be computed via the map equation (Rosvall and Bergstrom 2008; Rosvall et al. 2009).
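To make the two-level coding scheme concrete, the following minimal Python sketch evaluates the expanded form of the map equation as it is commonly stated for undirected networks without teleportation. The function names, the toy network (two triangles joined by one link) and the use of base-2 logarithms are our own illustrative assumptions; this is not Infomap's actual implementation, which additionally performs the search over partitions.

```python
import math

def plogp(x):
    """Return x * log2(x), using the convention 0 * log 0 = 0."""
    return x * math.log2(x) if x > 0 else 0.0

def map_equation(p, P, m, K):
    """Average description length (bits per step) of partition m (assumed form)."""
    p_i = [0.0] * K   # probability of being in cluster i
    q_i = [0.0] * K   # probability of exiting cluster i in one step
    for a, pa in enumerate(p):
        p_i[m[a]] += pa
        for b, pab in enumerate(P[a]):
            if m[b] != m[a]:
                q_i[m[a]] += pa * pab
    q = sum(q_i)      # total rate of index codebook use
    return (plogp(q)
            - 2.0 * sum(plogp(x) for x in q_i)
            - sum(plogp(x) for x in p)
            + sum(plogp(q_i[i] + p_i[i]) for i in range(K)))

# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = [[0.0] * n for _ in range(n)]
for a, b in edges:
    A[a][b] = A[b][a] = 1.0
k = [sum(row) for row in A]                       # node degrees
P = [[w / k[a] for w in A[a]] for a in range(n)]  # transition probabilities
p = [ka / sum(k) for ka in k]                     # stationary distribution (undirected case)

L_two = map_equation(p, P, [0, 0, 0, 1, 1, 1], 2)  # the two triangles as clusters
L_one = map_equation(p, P, [0] * n, 1)             # everything in one cluster
```

On this toy network the two-triangle partition yields a shorter description length than the trivial one-cluster partition, whose description length reduces to the entropy of the stationary distribution.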

Walktrap formulates a random walk-based distance measure between clusters. Given a fixed number of steps, a random walker starting at a certain node will visit a neighboring node with a given probability. These probabilities hold information about how well two nodes are connected. Now, assuming two nodes are within the same community (i.e., well-connected), their probabilities to reach any other node within the network for a given number of steps should be similar. This observation yields a distance measure based on a weighted mean squared difference of such probabilities. Walktrap greedily merges nodes/clusters based on the described distance to arrive at a suitable clustering (Pons and Latapy 2005).
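As an illustration of this distance, the sketch below computes the degree-weighted squared difference between the t-step visiting distributions of two nodes, which is the node-level form of the distance as we understand it from Pons and Latapy (2005). The helper names and the toy network are hypothetical, and the greedy merging step of Walktrap is deliberately omitted.

```python
import math

def walk_distribution(P, start, t):
    """Distribution over nodes after t steps of a walk started at `start`."""
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(t):
        dist = [sum(dist[a] * P[a][b] for a in range(n)) for b in range(n)]
    return dist

def walktrap_distance(P, degree, a, b, t=4):
    """Degree-weighted Euclidean distance between t-step distributions of a and b."""
    pa = walk_distribution(P, a, t)
    pb = walk_distribution(P, b, t)
    return math.sqrt(sum((x - y) ** 2 / d for x, y, d in zip(pa, pb, degree)))

# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = [[0.0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1.0
k = [sum(row) for row in A]                       # node degrees
P = [[w / k[a] for w in A[a]] for a in range(n)]  # transition probabilities

d_same = walktrap_distance(P, k, 0, 1)   # two nodes of the same triangle
d_other = walktrap_distance(P, k, 0, 5)  # nodes of different triangles
```

Nodes within the same triangle end up much closer to each other than nodes in different triangles, which is the property the agglomerative merging exploits.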

While Infomap and Walktrap predict clusterings by analyzing these random walks, our method predicts clusterings by synthesizing the network-induced random walk from a restricted class of candidate random walks. Searching for a proper random walk within this class makes our method robust across different network types. Additionally, being able to design the candidate class opens up possibilities for exploring alternative designs in future research.

3 Preliminaries

3.1 Networks and clusterings

Let \({\mathcal {G}}:=(\mathcal {X},E,W)\) be a weighted network with nodes \(\mathcal {X}=\{1,\dots ,N\}\), links \(E\subseteq \mathcal {X}^2\) and weight matrix W. The weight matrix is given by \(W:=[w_{\alpha \rightarrow \beta }]_{\alpha ,\beta \in \mathcal {X}}\) where \(w_{\alpha \rightarrow \beta }\ge 0\) denotes the weight of the link \((\alpha ,\beta )\in E\) starting at node \(\alpha \) and pointing at node \(\beta \). (We use Greek letters to indicate nodes.) For an undirected network we set \((\alpha ,\beta ) = (\beta ,\alpha )\) and require that either \((\alpha ,\beta )\in E\) or \((\beta ,\alpha )\in E\) to avoid the double counting of edges. A set \(C\subseteq \mathcal {X}\) is a clique if it is a complete subnetwork of \({\mathcal {G}}\), i.e., for any two distinct nodes \(\alpha ,\beta \in C\) there exists a connecting link \((\alpha ,\beta ) \in E\).

For an undirected network, the degree \(k_\alpha \) of node \(\alpha \) is the number of links connected to it. We denote the average degree of the network as \({\overline{k}}\). The network density \(\rho \) is defined as

$$\begin{aligned} \rho = 2 \cdot \frac{|E|}{|\mathcal {X}| (|\mathcal {X}| - 1)} = \frac{{\overline{k}}}{|\mathcal {X}| - 1}. \end{aligned}$$
(1)

Consider a clustering \(\mathcal {Y}\) of \(\mathcal {X}\) into a set of K nonempty elements \(\mathcal {Y}_i\), i.e., \(\mathcal {Y}=\{\mathcal {Y}_i\}_{i\in \mathcal {K}^\mathcal {Y}}\), where \(\mathcal {K}^\mathcal {Y}:=\{1,\dots ,K\}\) denotes the index set of \(\mathcal {Y}\). We index the elements of such a clustering by Roman letters and refer to them as clusters or communities. If the clusters are disjoint we call them non-overlapping and the clustering a partition. A partition induces a mapping function \(m{:}\ \mathcal {X}\rightarrow \mathcal {K}^\mathcal {Y}\), mapping each node of \({\mathcal {G}}\) to the index of its containing cluster, i.e., \(m(\alpha ) = i\) iff \(\alpha \in \mathcal {Y}_i\). For the remainder of this paper we assume all clusterings to be partitions.

For an undirected, unweighted network and a candidate clustering \(\mathcal {Y}\), the mixing parameter of node \(\alpha \) is defined as (Lancichinetti et al. 2008; Lancichinetti and Fortunato 2009a)

$$\begin{aligned} \mu (\alpha ) = \frac{k^{ext}_\alpha }{k_\alpha } \end{aligned}$$
(2)

where \(k^{ext}_\alpha \) is the number of links between \(\alpha \) and nodes outside of its community \(\mathcal {Y}_{m(\alpha )}\). A cluster \(\mathcal {Y}_i\) is a strong community (Radicchi et al. 2004) if for all of its nodes \(\mu (\alpha ) < 0.5\), but communities can be defined in a weak sense also for larger values (Lancichinetti and Fortunato 2009a). Similarly, for an undirected, unweighted network and a candidate clustering \(\mathcal {Y}\), we can define the modularity of the clustering as (Newman and Girvan 2004; Newman 2006)

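$$\begin{aligned} \mathrm {Mod}(\mathcal {Y}) = \sum _{i\in \mathcal {K}^\mathcal {Y}} \left[ \frac{|E_i|}{|E|} - \left( \frac{\sum _{\alpha \in \mathcal {Y}_i} k_\alpha }{2|E|}\right) ^2 \right] \end{aligned}$$
(3)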

where \(|E_i|\) denotes the number of internal edges in cluster \(\mathcal {Y}_i\). Brandes et al. (2008) showed that the modularity ranges from \(-1/2\) to 1, with small values indicating weak community structures of the candidate clustering \(\mathcal {Y}\).
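The quantities of this subsection can be computed directly from an edge list. The sketch below does so for a small hypothetical network; for the modularity it assumes the standard form of Newman and Girvan (2004), i.e., the fraction of internal edges of each cluster minus the squared fraction of degree attached to it. Nodes are 0-indexed here, in contrast to the 1-indexed node set in the text.

```python
from collections import defaultdict

def degrees(edges, n):
    """Degree of every node of an undirected, unweighted network."""
    k = [0] * n
    for a, b in edges:
        k[a] += 1
        k[b] += 1
    return k

def density(edges, n):
    """Network density as in Eq. (1)."""
    return 2 * len(edges) / (n * (n - 1))

def mixing_parameters(edges, n, m):
    """Per-node mixing parameter as in Eq. (2); m[a] is the cluster of node a."""
    k = degrees(edges, n)
    k_ext = [0] * n
    for a, b in edges:
        if m[a] != m[b]:          # link leaves the community on both ends
            k_ext[a] += 1
            k_ext[b] += 1
    return [k_ext[a] / k[a] for a in range(n)]

def modularity(edges, n, m):
    """Modularity of partition m (assumed standard Newman-Girvan form)."""
    k = degrees(edges, n)
    internal = defaultdict(int)    # |E_i|: internal edges per cluster
    degree_sum = defaultdict(int)  # total degree per cluster
    for a, b in edges:
        if m[a] == m[b]:
            internal[m[a]] += 1
    for a in range(n):
        degree_sum[m[a]] += k[a]
    E = len(edges)
    return sum(internal[i] / E - (degree_sum[i] / (2 * E)) ** 2
               for i in degree_sum)

# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
m = [0, 0, 0, 1, 1, 1]

rho = density(edges, n)
mu = mixing_parameters(edges, n, m)
mod = modularity(edges, n, m)
```

In this example only the two bridge nodes have a nonzero mixing parameter (1/3 each), so both triangles are strong communities in the sense defined above.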

3.2 Random walks

We consider random walks \(\{X_t\}_{t\in {\mathbb {N}}}\) on the network \({\mathcal {G}}\), i.e., \(\{X_t\}\) is a first-order Markov chain on \(\mathcal {X}\). We assume that its stationary transition probability matrix \(P:=[p_{\alpha \rightarrow \beta }]_{\alpha ,\beta \in \mathcal {X}}\) is derived from the network’s weight matrix W via

$$\begin{aligned} p_{\alpha \rightarrow \beta } =\frac{w_{\alpha \rightarrow \beta }}{\sum _{\beta '} w_{\alpha \rightarrow \beta '}}. \end{aligned}$$
(4)

While there are other notions of random walks on networks, cf. Lambiotte et al. (2014) and Masuda et al. (2017) for an overview, we selected this formulation due to its simplicity and its connection to modularity maximization and Infomap’s map equation, cf. the discussion around equations (6) and (34) in Lambiotte et al. (2014). We furthermore initialize the random walk with an invariant state distribution \(p := [p_\alpha ]_{\alpha \in \mathcal {X}}\) that satisfies

$$\begin{aligned} p_\beta = \sum _{\alpha } p_\alpha p_{\alpha \rightarrow \beta }. \end{aligned}$$
(5)

We assume that the network is strongly connected, thus p is unique and positive.
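A minimal sketch of Eqs. (4) and (5) follows: the transition matrix is obtained by row-normalizing the weight matrix, and the invariant distribution is approximated by a fixed-point iteration. The iteration averages the current distribution with its one-step update (a "lazy" walk), which is our own choice to avoid oscillations on periodic networks; it has the same fixed point as Eq. (5). All names and the toy weight matrix are illustrative assumptions.

```python
def transition_matrix(W):
    """Row-normalize the weight matrix W as in Eq. (4)."""
    P = []
    for row in W:
        total = sum(row)  # positive for strongly connected networks
        P.append([w / total for w in row])
    return P

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Approximate the invariant distribution of Eq. (5) by lazy power iteration."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # Lazy update: average of p and pP; same fixed point, no oscillation.
        nxt = [0.5 * p[b] + 0.5 * sum(p[a] * P[a][b] for a in range(n))
               for b in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
W = [[0.0] * n for _ in range(n)]
for a, b in edges:
    W[a][b] = W[b][a] = 1.0

P = transition_matrix(W)
p = stationary_distribution(P)
```

For undirected networks the result can be cross-checked against the known closed form in which the invariant probability of a node is proportional to its (weighted) degree.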

Setting \(Y_t:=m(X_t)\) defines a stationary process \(\{Y_t\}\) on the clusters. Specifically, the marginal and joint probabilities describing \(\{Y_t\}\) are obtained as

$$\begin{aligned} p_i&:= {\mathbb {P}}(Y_t=i)=\sum _{\alpha \in i }p_\alpha , \quad i \in \mathcal {K}^\mathcal {Y} \end{aligned}$$
(6a)

and

$$\begin{aligned} p_{i,j}&:={\mathbb {P}}(Y_{t+1}=j,Y_{t}=i) = \sum _{\alpha \in i}\sum _{\beta \in j}p_\alpha p_{\alpha \rightarrow \beta }, \quad i,j \in \mathcal {K}^\mathcal {Y}. \end{aligned}$$
(6b)

We further abbreviate \(p_{\lnot i} := 1 - p_i = {\mathbb {P}}(Y_t \ne i)\) for the marginal complement, \(p_{i,\lnot j} := p_i - p_{i,j} = {\mathbb {P}}(Y_{t+1}\ne j,Y_{t}=i)\) for the joint complement, and \( p_{i\rightarrow j} := {\mathbb {P}}(Y_{t+1}=j|Y_{t}=i)\), respectively \(p_{i\not \rightarrow j} := 1- p_{i\rightarrow j} = {\mathbb {P}}(Y_{t+1}\ne j|Y_{t}=i)\) for the conditional and its complement.
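The cluster-level probabilities in Eqs. (6a) and (6b) are plain sums over node-level quantities, as the following sketch shows. The function name and the toy network are hypothetical; the stationary distribution is taken to be proportional to the node degrees, which holds for undirected networks.

```python
def cluster_probabilities(p, P, m, K):
    """Marginals p_i (Eq. 6a) and joints p_{i,j} (Eq. 6b) of the cluster process."""
    p_i = [0.0] * K
    p_ij = [[0.0] * K for _ in range(K)]
    for a, pa in enumerate(p):
        p_i[m[a]] += pa
        for b, pab in enumerate(P[a]):
            p_ij[m[a]][m[b]] += pa * pab
    return p_i, p_ij

# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = [[0.0] * n for _ in range(n)]
for a, b in edges:
    A[a][b] = A[b][a] = 1.0
k = [sum(row) for row in A]                       # node degrees
P = [[w / k[a] for w in A[a]] for a in range(n)]  # Eq. (4)
p = [ka / sum(k) for ka in k]                     # invariant distribution (undirected case)

m = [0, 0, 0, 1, 1, 1]
p_i, p_ij = cluster_probabilities(p, P, m, 2)

# Derived quantities used in the abbreviations above.
p_stay = [p_ij[i][i] / p_i[i] for i in range(2)]  # conditional p_{i -> i}
p_leave = [1.0 - x for x in p_stay]               # complement p_{i -/-> i}
```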

3.3 Information theory

We make use of the following quantities from information theory that are well-described by Cover and Thomas (2006, Chapter 2). Let Y, Z denote random variables (RVs), then we call \(H(Z)\) the entropy of Z, \(H(Y|Z)\) the conditional entropy of Y given Z and \(I(Y;Z)\) the mutual information between Y and Z. Furthermore, let p and q denote discrete probability distributions over the same alphabet. Then we call \(D(p\Vert q)\) the Kullback–Leibler divergence between p and q. If p, q are Bernoulli distributions, i.e., \(p=[p_1,1-p_1]\) and \(q=[q_1,1-q_1]\), then we abbreviate \(D(p_1\Vert q_1) :=D(p\Vert q)\). Furthermore, let \(P:=[p_{\alpha \rightarrow \beta }]_{\alpha ,\beta \in \mathcal {Z}}\) and \(Q:=[q_{\alpha \rightarrow \beta }]_{\alpha ,\beta \in \mathcal {Z}}\) be transition probability matrices of equal size. The Kullback–Leibler divergence rate \({\overline{D}}(\cdot \Vert \cdot )\) between two stationary Markov chains governed by P and Q is

$$\begin{aligned} {\overline{D}}(P\Vert Q) := \sum _{\alpha \in \mathcal {Z}}\sum _{\beta \in \mathcal {Z}} p_\alpha p_{\alpha \rightarrow \beta }\log \frac{p_{\alpha \rightarrow \beta }}{q_{\alpha \rightarrow \beta }} \end{aligned}$$
(7)

given that the Markov chains are irreducible (Rached et al. 2004, Th. 1).
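Eq. (7) translates into a double sum over states. The sketch below assumes base-2 logarithms (the base is not fixed in the text), treats terms with a zero transition probability in P as zero, and returns infinity if Q assigns zero probability to a transition that P allows. The two-state chain is a hypothetical example.

```python
import math

def kl_divergence_rate(p, P, Q):
    """Kullback-Leibler divergence rate of Eq. (7), in bits per step."""
    rate = 0.0
    for a, pa in enumerate(p):
        for b, pab in enumerate(P[a]):
            if pab > 0.0:              # convention: 0 * log 0 = 0
                if Q[a][b] == 0.0:     # P allows a move that Q forbids
                    return math.inf
                rate += pa * pab * math.log2(pab / Q[a][b])
    return rate

# Hypothetical two-state chain with invariant distribution p = [2/3, 1/3].
P = [[0.9, 0.1],
     [0.2, 0.8]]
p = [2.0 / 3.0, 1.0 / 3.0]
Q = [[0.5, 0.5],
     [0.5, 0.5]]

rate_self = kl_divergence_rate(p, P, P)   # identical chains
rate_unif = kl_divergence_rate(p, P, Q)   # compared to a memoryless uniform chain
```

Note that the rate is weighted by the invariant distribution of the first chain only, so it is not symmetric in P and Q.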

4 Community detection via random walk modelling

We now introduce the Synwalk objective, derive some of its properties, and discuss its relations to Infomap, stochastic block modelling, and model reduction techniques for random walks. For the sake of readability we defer proofs to Appendix A.

4.1 Derivation and properties of the Synwalk objective

Assume a network \({\mathcal {G}}= (\mathcal {X}, E, W)\) with an inherent community structure \(\mathcal {Y}^{\mathrm {true}}\). Consider further a random walker moving on \({\mathcal {G}}\) governed by the transition probability matrix P, which is derived from the weight matrix W. We refer to this random walker as the network-induced random walker, as its movements depend on the topology of \({\mathcal {G}}\) (i.e., implicitly its community structure). In the next step we design a synthetic random walker, governed by some transition probability matrix \(Q^\mathcal {Y}\) which, in contrast to P, explicitly depends on some candidate partition \(\mathcal {Y}\). In essence, our approach then aims to find a partition \(\mathcal {Y}\) such that the synthetic random walker behaves (stochastically) as similarly to the network-induced walker as possible. Intuitively, the resulting partition will resemble the intrinsic partition \(\mathcal {Y}^{\mathrm {true}}\) very closely. We formalize this concept in the following.

The transition probability matrix that governs the synthetic random walker has a particular structure that depends on a candidate partition \(\mathcal {Y}\). Specifically, suppose that at a given time step the synthetic random walker is at node \(\alpha \) in cluster \(\mathcal {Y}_i\in \mathcal {Y}\). We decide whether to leave or to stay in the current cluster \(\mathcal {Y}_i\) in the next time step based on a cluster-specific Bernoulli distribution \([s_i, 1-s_i]\). In case of a cluster change, we choose a new cluster \(\mathcal {Y}_j \ne \mathcal {Y}_i\) according to a distribution over clusters \([u_i]_{i\in \mathcal {K}^\mathcal {Y}}\). Finally, we choose the next node \(\beta \) lying in the new cluster \(\mathcal {Y}_j\) by a cluster-specific distribution over nodes \([r_\beta ^j]_{\beta \in \mathcal {Y}_j}\) (note that \(\mathcal {Y}_j=\mathcal {Y}_i\) if we stay in the current cluster). This particular structure yields the transition probability matrix \(Q^\mathcal {Y}=[q_{\alpha \rightarrow \beta }]_{\alpha ,\beta \in \mathcal {X}}\) where

$$\begin{aligned} q_{\alpha \rightarrow \beta } = {\left\{ \begin{array}{ll} r_\beta ^{m(\beta )} \cdot (1-s_{m(\alpha )}), &{} m(\alpha )=m(\beta ),\\ r_\beta ^{m(\beta )} \cdot s_{m(\alpha )} \cdot \frac{u_{m(\beta )}}{1 - u_{m(\alpha )}}, &{}\text {otherwise.} \end{array}\right. } \end{aligned}$$
(8)

Note that when switching clusters we have to normalize the distribution over clusters by \(1 - u_{m(\alpha )} = \sum _{k\ne m(\alpha )} u_k\) since we exclude the current cluster \(m(\alpha )\) as a choice.
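The structure of Eq. (8) can be checked numerically. The following sketch assembles the synthetic transition matrix from the three families of parameters and is meant as an illustration only: the function name is hypothetical, the node-specific probabilities are stored as one flat list indexed by node, and at least two clusters are assumed so that the normalization is well defined.

```python
def synthetic_transition_matrix(m, r, s, u):
    """Build Q^Y as in Eq. (8).

    m[a]: cluster index of node a
    r[b]: probability of node b within its own cluster
    s[i]: probability of leaving cluster i
    u[i]: cluster i's entry in the distribution over clusters
    """
    n = len(m)
    Q = [[0.0] * n for _ in range(n)]
    for a in range(n):
        i = m[a]
        for b in range(n):
            j = m[b]
            if i == j:   # stay in the current cluster
                Q[a][b] = r[b] * (1.0 - s[i])
            else:        # leave, pick cluster j among the others, pick node b
                Q[a][b] = r[b] * s[i] * u[j] / (1.0 - u[i])
    return Q

# Hypothetical parameters for two clusters of three nodes each.
m = [0, 0, 0, 1, 1, 1]
r = [2 / 7, 2 / 7, 3 / 7, 3 / 7, 2 / 7, 2 / 7]  # sums to one within each cluster
s = [1 / 7, 1 / 7]
u = [0.5, 0.5]

Q = synthetic_transition_matrix(m, r, s, u)
```

Each row of the resulting matrix sums to one, and all rows belonging to nodes of the same cluster are identical, which reflects the i.i.d. movements within and between clusters.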

The aim is now to find a candidate partition \(\mathcal {Y}\) and corresponding parameters of (8) that maximize the similarity between the synthetic and the network-induced random walk. We quantify this similarity via the Kullback–Leibler divergence rate \({\overline{D}}(P\Vert Q^\mathcal {Y})\), i.e., the lower \({\overline{D}}(P\Vert Q^\mathcal {Y})\), the more similar are P and \(Q^\mathcal {Y}\), and the more likely it is that the synthetic random walker produces realizations of random walks that are also typical for the network-induced random walker (Kesidis and Walrand 1993). Hence, the optimal partition \(\mathcal {Y}^*\) satisfies

$$\begin{aligned} \mathcal {Y}^* \in {{\,\mathrm{arg\,min}\,}}_{\mathcal {Y}} \left[ \min _{\{[r_\alpha ^i]_{\alpha \in \mathcal {Y}_i},\ s_i,\ u_i\}_{i\in \mathcal {K}^\mathcal {Y}}} {\overline{D}}(P\Vert Q^\mathcal {Y}) \right] . \end{aligned}$$
(9)

The cluster-specific distributions over nodes and the cluster-specific Bernoulli distributions minimizing (9) can be shown to be

$$\begin{aligned} r_\alpha ^{i,*}&= \frac{p_\alpha }{p_i} = {\mathbb {P}}(X_t=\alpha |Y_t=i) \end{aligned}$$
(10)
$$\begin{aligned} s_i^*&= p_{i\not \rightarrow i} = {\mathbb {P}}(Y_{t+1}\ne i|Y_{t}=i). \end{aligned}$$
(11)

Regarding the distribution over clusters \([u_i]_{i\in \mathcal {K}^\mathcal {Y}}\) there exists no closed-form solution to the best of our knowledge. Nevertheless, by choosing

$$\begin{aligned} u_i&= p_i = {\mathbb {P}}(Y_t=i) \end{aligned}$$
(12)

as a sub-optimal solution we can relax the original optimization problem in (9) to arrive at (see Proposition 1 in Appendix A.1)

$$\begin{aligned} \mathcal {Y}^* \in {{\,\mathrm{arg\,max}\,}}_{\mathcal {Y}} \sum _{i\in \mathcal {K}^\mathcal {Y}} p_iD(p_{i\rightarrow i}\Vert p_i). \end{aligned}$$
(13)

We hence define the Synwalk objective as follows.

Definition 1

The Synwalk objective for a given partition \(\mathcal {Y}\) is

$$\begin{aligned} \mathcal {J}(\mathcal {Y}) :&= \sum _{i\in \mathcal {K}^\mathcal {Y}} p_{i,i}\log \frac{p_{i\rightarrow i}}{p_i} + p_{i,\lnot i}\log \frac{p_{i\not \rightarrow i}}{p_{\lnot i}} = \sum _{i\in \mathcal {K}^\mathcal {Y}} p_iD(p_{i\rightarrow i}\Vert p_i) \end{aligned}$$
(14)

As we show in Proposition 2 in Appendix A.2, \(\mathcal {J}(\mathcal {Y})\) is bounded via

$$\begin{aligned} 0 \le \mathcal {J}(\mathcal {Y}) \le I(Y_t;Y_{t-1}) \le I(X_t;X_{t-1}). \end{aligned}$$
(15)

Note that for a given candidate clustering \(\mathcal {Y}\) all probabilities in (14) can be computed from the fixed transition probabilities and the invariant state distribution induced by a network’s weight matrix according to Sect. 3.2. Hence, optimizing the Synwalk objective is a combinatorial problem over all possible clusterings of a given network. Since the number of possible clusterings grows super-exponentially in the number of nodes, an exact solution is intractable. We therefore employ a suitable search algorithm to find a near-optimal clustering w.r.t. the Synwalk objective. See Sect. 5.1 and Appendix B for further details.

The following observation supports the rationale behind Synwalk’s aptness as a community detection method. Consider an unweighted network of disconnected cliques. As we show in Appendix A.3, the Synwalk objective achieves its global maximum for a community structure identical to the clique structure of this network. Although isolated cliques are an unrealistic scenario for community detection, they carry the intuition of the concept of a community, i.e., strong internal and weak external connections. Synwalk’s optimal behaviour in this idealized edge case theoretically grounds our strong experimental results in Sect. 5.2. For additional insights based on theoretical considerations and synthetic toy data we refer the reader to Toth (2020, Sections 3.2 & 5.1).

4.2 Relation to Infomap and (stochastic) block modelling

The design of our random walk model was inspired by Infomap’s coding scheme. Recall Infomap’s two-level codebook structure described in Sect. 2, i.e., the cluster codebooks with node and exit codewords, and the global index codebook. The distributions assembling the dynamics of our synthetic random walker in (8) correspond to these codebooks: (i) the cluster-specific distributions over nodes \([r_\alpha ^i]_{\alpha \in \mathcal {Y}_i}\) correspond to the cluster codebooks, (ii) the cluster-specific Bernoulli distributions \(\{s_i\}_{i\in \mathcal {K}^\mathcal {Y}}\) determining a cluster change correspond to the exit codewords, and (iii) the distribution over clusters \([u_i]_{i\in \mathcal {K}^\mathcal {Y}}\) corresponds to the index codebook.

Thus, while Infomap takes an analytic approach to community detection by applying the minimum description length principle with a specific codebook structure, Synwalk takes a synthetic approach by trying to mimic the network-induced random walk with our synthetic random walk model.

The definition of \(Q^\mathcal {Y}\) in (8) and of the optimization problem (9) are reminiscent of stochastic block modelling under Kullback–Leibler divergence. The main difference is that in stochastic block modelling, one tries to infer model parameters (e.g., community structure, inter- and intra-community edge probabilities) such that the likelihood of a given graph is maximized. In other words, block modelling infers the parameters of a random graph model, i.e., a generative model from which graphs can be drawn, such that the likelihood of the graph under consideration is maximized. An essential point for stochastic block models is that these models have limited degrees of freedom, and that a good fit between the model and the graph is achieved by selecting an appropriate candidate clustering for the former. In contrast, Synwalk first transforms the graph under consideration to a random walk model, characterized by the transition probability matrix P. Then, the aim of Synwalk is to infer the parameters (i.e., the community structure and parameters of \(Q^\mathcal {Y}\)) of another random walk model such that the resulting random walk is “close” to the original one in a well-defined sense. Furthermore, it is essential that \(Q^\mathcal {Y}\) has fewer degrees of freedom than P: for N nodes and K candidate clusters, P has \(N(N-1)\) degrees of freedom, whereas the degrees of freedom of \(Q^\mathcal {Y}\) are limited to \(K + (K-1) + (N-K)=N+K-1\). Thus, Synwalk can adequately be interpreted as an approach to “random walk modelling”.

Finally, Hurley and Duriakova (2015, 2016) proposed a method that combines random walks on networks with (generalized) block modelling. While they consider the network-induced random walk on clusters rather than on nodes, they also use the Kullback–Leibler divergence to measure the similarity with a target random walk and, thus, the fitness of the candidate clustering. For a specific target random walk it can be shown that their approach becomes equivalent to the goal of maximizing \(I(Y_t;Y_{t-1})\) (Hurley and Duriakova 2015, Sec. III.A). The same cost function was also proposed by Deng et al. (2011) for Markov chain aggregation, where it was shown that the bipartition of states is related to spectral partition via the Fiedler vector. It is also a special case of the cost function \(I(Y_t;Y_{t+T})\) proposed by Faccin et al. (2020), who showed that maximizing this quantity for \(T=1\) and a random walk on an unweighted and undirected network is equivalent to maximizing the likelihood of a degree-corrected stochastic block model. By assuming a less restrictive structure of the distribution over clusters in (8) the optimization problem in (9) becomes equivalent to maximizing \(I(Y_t;Y_{t-1})\) over the possible clusterings (see Appendix A.2 for a concrete derivation). Thus, we can achieve this cost function by designing a suitable synthetic random walk model.

5 Experimental evaluation

In the following experiments we compare Synwalk to four well-established community detection methods, namely, Infomap (Rosvall and Bergstrom 2008; Rosvall et al. 2009) and Walktrap (Pons and Latapy 2005) (both random walk-based), Louvain (Blondel et al. 2008) (based on modularity maximization), and stochastic block model (SBM) inference (Peixoto 2014a). The source code for reproducing these experiments can be found at https://github.com/synwalk/synwalk-analysis.

5.1 Implementations

To find a near-optimal clustering w.r.t. our Synwalk objective we reuse Infomap’s stochastic and recursive search algorithm (Rosvall and Bergstrom 2010, Appendix S1). See Appendix B for additional information about our implementation. The resulting framework used in the course of this work can be found at https://github.com/synwalk/synwalk.

For Walktrap and Louvain we use the implementations provided by igraph (Csardi and Nepusz 2006). Note that for Walktrap we assume a default value of \(T=4\) where T is the hyper-parameter describing the random walk length used to compute the node and cluster distances. We use GraphTool (Peixoto 2014b) for inferring a degree-corrected SBM for a given network. Hereafter we will refer to the SBM inference method simply as GraphTool. Unless otherwise noted, we use default parameters of these implementations. We use the same setup in both our experiments with LFR benchmark graphs and with empirical networks.

5.2 Experiments on the LFR benchmark

To validate and compare the results of community detection methods it is common practice to evaluate their performance on benchmark networks (Yang et al. 2016; Fortunato and Hric 2016; Newman and Girvan 2004; Lancichinetti and Fortunato 2009b; Orman and Labatut 2009) where the ground truth community structure is known. The prevalent benchmark in more recent studies (Yang et al. 2016; Orman and Labatut 2009) is the LFR benchmark (Lancichinetti et al. 2008; Lancichinetti and Fortunato 2009a) and hence, we adopt it in our experiments. We generate the LFR benchmark networks with parameters as given in Table 1.

Table 1 Parameter setup for generating the LFR benchmark networks

We employ the adjusted mutual information (AMI, Vinh et al. 2010) as a performance measure when comparing the partitions found by different community detection algorithms with the ground truth community structure. AMI values close to 1 indicate high similarity between the found partition and the ground truth, whereas values around 0 reflect low similarity. Let \(\mathcal {Y}^{\mathrm {true}}\) denote the ground truth clustering and \(\mathcal {Y}\) any predicted clustering, then the AMI is defined as

$$\begin{aligned} I^{adj}(\mathcal {Y}^{\mathrm {true}}, \mathcal {Y}) = \frac{I(\mathcal {Y}^{\mathrm {true}}; \mathcal {Y}) - \mathrm {E}\{I(\mathcal {Y}^{\mathrm {true}}; \mathcal {Y})\}}{\frac{1}{2}[ H(\mathcal {Y}^{\mathrm {true}}) + H(\mathcal {Y})] - \mathrm {E}\{I(\mathcal {Y}^{\mathrm {true}}; \mathcal {Y})\}}, \end{aligned}$$
(16)

where \(\mathrm {E}\{\cdot \}\) denotes the expectation operator with respect to a chosen permutation model. We normalize the AMI by the arithmetic mean of the two entropies, as in (16).
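As an illustration of (16), the sketch below computes the AMI with arithmetic-mean normalization using only the Python standard library. It assumes the hypergeometric permutation model of Vinh et al. (2010) for the expectation, which is our assumption about the "chosen permutation model"; it is not the evaluation code used for the experiments. Function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def _entropy(counts, n):
    """Shannon entropy (in nats) of a partition given its cluster sizes."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def _log_factorial(k):
    return lgamma(k + 1)


def adjusted_mutual_information(labels_true, labels_pred):
    """AMI as in (16); labels are equal-length sequences of cluster labels."""
    n = len(labels_true)
    a = Counter(labels_true)                      # ground truth cluster sizes
    b = Counter(labels_pred)                      # predicted cluster sizes
    joint = Counter(zip(labels_true, labels_pred))

    mi = sum(nij / n * log(n * nij / (a[i] * b[j]))
             for (i, j), nij in joint.items())

    # Expected mutual information under the hypergeometric model (assumed).
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_prob = (_log_factorial(ai) + _log_factorial(bj)
                            + _log_factorial(n - ai) + _log_factorial(n - bj)
                            - _log_factorial(n) - _log_factorial(nij)
                            - _log_factorial(ai - nij)
                            - _log_factorial(bj - nij)
                            - _log_factorial(n - ai - bj + nij))
                emi += nij / n * log(n * nij / (ai * bj)) * exp(log_prob)

    mean_entropy = 0.5 * (_entropy(a.values(), n) + _entropy(b.values(), n))
    denominator = mean_entropy - emi
    if abs(denominator) < 1e-15:
        # Degenerate case (e.g. both partitions are a single cluster);
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (mi - emi) / denominator
```

The ratio in (16) does not depend on the base of the logarithm, so natural logarithms are used throughout.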

5.2.1 AMI as a function of the mixing parameter

In this experiment we use parameter set A (see Table 1) to generate the LFR benchmark networks. We fix the network size and average degree of the generated LFR networks while varying their mixing parameter \(\mu \) between 0.2 and 0.8. Experiments with varying network sizes are shown in Appendix C.

The results for the AMI as a function of the mixing parameter are shown in Fig. 1. As can be seen, Infomap correctly identifies the communities for sufficiently small values of \(\mu \) and transitions to vanishing AMI around \(\mu \approx 0.5\). This behavior reflects the definition of communities in a strong and weak sense as proposed by Radicchi et al. (2004) and was also observed by Yang et al. (2016). We explain this behavior by looking at Infomap’s coding scheme. If \(\mu >0.5\), then the random walker will have a higher probability of exiting a community than staying within it. Hence, for the ground truth community structure, the coding overhead due to sending exit codewords will dominate, and clusterings resulting in more efficient encodings can be found, e.g., by putting all nodes into a single common cluster. Indeed, we observed exactly this behavior for Infomap in our experiments.
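The threshold at \(\mu = 0.5\) can be made explicit with a short calculation. This is a sketch under idealized assumptions that are ours: the network is unweighted and undirected, every node \(\alpha \) has exactly a fraction \(\mu \) of its \(k_\alpha \) links leaving its community, the walker at node \(X_t\) picks an incident link uniformly at random, and the stationary distribution is \(p_\alpha = k_\alpha /(2|E|)\), where \(|E|\) denotes the number of links.

$$\begin{aligned} \Pr (\text {exit} \mid X_t = \alpha ) = \frac{\mu k_\alpha }{k_\alpha } = \mu , \qquad \Pr (\text {exit}) = \sum _\alpha p_\alpha \, \mu = \mu . \end{aligned}$$

Under these assumptions the walker leaves its current community with probability \(\mu \) and stays with probability \(1-\mu \) in every step, so that exits are more likely than stays exactly when \(\mu > 0.5\).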

Fig. 1

Comparison of Infomap, Synwalk, Walktrap, Louvain and GraphTool on LFR benchmark networks with given average degree and network size. The lines and shaded areas show the mean and standard deviation of AMI as a function of the mixing parameter, obtained from 100 different network realizations. Synwalk outperforms Infomap for sufficiently high mixing parameter and network density. Performance of Synwalk and Walktrap increases with higher average degrees while holding the network density fixed (Color figure online)

Unlike Infomap, Synwalk does not penalize frequent transitions between communities, although our random walk model resembles Infomap’s coding scheme (cp. Sect. 4.2). Thus, the performance transitions of Synwalk, similarly to Walktrap, occur at increasing values of \(\mu \) for increasing network densities (Fig. 1, columns from top to bottom and rows from right to left). Intriguingly, for roughly the same network density the transition phases shift to higher values of the mixing parameter as the average degree increases (cp. Appendix C). Hence, neither the mixing parameter nor the network density sufficiently characterizes the AMI performance of Synwalk and Walktrap. We analyze this phenomenon further in Sect. 5.2.2. In contrast, our experiments show that even as we vary the average degree, Infomap’s performance mainly depends on the mixing parameter of the networks.

Overall, Synwalk outperforms Infomap in terms of AMI on sufficiently dense networks or networks with mixing parameters \(\mu \gtrapprox 0.5\). Synwalk performs approximately on par with Walktrap, where we see slightly better performance on networks with lower density (Fig. 1, top row) and a slight disadvantage on networks with higher density (Fig. 1, bottom row).

Considering the methods not based on random walks, we see that GraphTool does not perform well w.r.t. the AMI metric. We observed that GraphTool detects many small communities, apparently capturing a different aspect of the network structure. We think that inferring a hierarchical SBM may lead to better AMI values when looking at a clustering on a suitable hierarchy level (e.g. by choosing the clustering with the highest modularity score, similar to Walktrap).

Finally, Louvain (modularity maximization) yields the highest AMI values on networks with lower densities. For denser networks the performance of Synwalk and Walktrap comes close to, or is even slightly better than, that of Louvain (cp. Fig. 1d, g, h).

We want to point out that the comparison between the different methods, in particular between the random walk-based and non-random walk-based methods, should not be based solely on the AMI values achieved on this benchmark. For example, although GraphTool does not achieve good AMI performance on this benchmark, it certainly yields interesting results when applied to empirical networks (cf. Sect. 5.3). These results, however, will differ in their characteristics from those of methods based on other paradigms.

5.2.2 Classification analysis using node statistics

As we have seen in Sect. 5.2.1, the AMI performance of Synwalk and Walktrap on LFR networks transitions smoothly for varying values of the mixing parameter. To get deeper insights into the behavioral differences between Synwalk and Walktrap (see Footnote 1) we analyze the different qualities of their predictions in these transition phases.

For this purpose we analyze networks with varying network sizes and average degrees that are generated with parameter set A (see Table 1) while trying to keep the network density and the AMI (by appropriately setting the mixing parameter) constant (cp. main diagonal in Fig. 1). We align any predicted partitions to their respective ground truth partitions using a greedy matching algorithm as described in Appendix D. We then consider the nodes in the intersection of the ground truth communities with their aligned counterparts as correctly classified nodes, whereas the residual set of nodes form the group of misclassified nodes.
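The alignment and classification step can be sketched as follows. The exact procedure is described in Appendix D; the variant below is only one plausible greedy matching (largest overlap first, each cluster matched at most once) and is an assumption on our part, as are the function name and the input format.

```python
def greedy_align(true_clusters, pred_clusters):
    """Greedily match predicted clusters to ground truth communities.

    true_clusters, pred_clusters: dict mapping label -> set of nodes.
    Returns (matching, correct), where matching maps a ground truth label
    to its aligned predicted label and correct is the set of nodes lying
    in the intersection of a community with its aligned counterpart.
    """
    # All overlapping pairs, largest overlap first (ties keep input order).
    overlaps = sorted(
        ((len(t_nodes & p_nodes), t, p)
         for t, t_nodes in true_clusters.items()
         for p, p_nodes in pred_clusters.items()
         if t_nodes & p_nodes),
        key=lambda item: item[0], reverse=True)

    matching, used_pred = {}, set()
    for _, t, p in overlaps:
        if t not in matching and p not in used_pred:
            matching[t] = p
            used_pred.add(p)

    correct = set()
    for t, p in matching.items():
        correct |= true_clusters[t] & pred_clusters[p]
    return matching, correct
```

All nodes outside the returned set `correct` form the group of misclassified nodes.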

Given this distinction, we can compare the degree distributions of correctly classified and misclassified nodes. In addition to the node degree \(k_\alpha \), we consider the normalized local degree (NLD) \({\hat{k}}_\alpha \), which we define as the ratio between the node degree and the maximum number of possible links in its containing cluster:

$$\begin{aligned} {\hat{k}}_\alpha = \frac{k_\alpha }{\left( \genfrac{}{}{0.0pt}{}{|\mathcal {Y}_{m(\alpha )}|}{2} \right) } \quad \text {for} \quad |\mathcal {Y}_{m(\alpha )}| \ge 2. \end{aligned}$$
(17)

Note that in general the NLD of a node will be different when computed w.r.t. its ground truth community or its predicted community.
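A minimal sketch of (17) is given below. The function name and the dictionary-based input format are illustrative assumptions; nodes in clusters with fewer than two members are skipped because (17) is not defined for them.

```python
from collections import Counter
from math import comb


def normalized_local_degrees(degrees, membership):
    """Normalized local degree (17) for every node in a cluster of size >= 2.

    degrees: dict mapping node -> node degree k_alpha.
    membership: dict mapping node -> cluster label; pass the ground truth
    or the predicted clustering, depending on the desired reference.
    """
    sizes = Counter(membership.values())
    return {
        node: degrees[node] / comb(sizes[membership[node]], 2)
        for node in degrees
        if sizes[membership[node]] >= 2
    }
```

Calling the function once with the ground truth membership and once with the predicted membership yields the two NLD variants that generally differ for the same node.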

The degree and NLD distributions are visualized in Figs. 2 and 3. Although in Figs. 1 and 10 in Appendix C we see an apparently strong dependence of the AMI performance on the average degree for Synwalk and Walktrap, for Synwalk a significant dependence is not visible in the class distributions (Fig. 2, top row). Nevertheless, the distributions of the NLDs w.r.t. the ground truth communities (Fig. 2, middle row) reveal that misclassified nodes are more likely to exhibit a low NLD than correctly classified ones.

Fig. 2

Degree distributions for correctly classified and misclassified nodes in Synwalk results, obtained from 100 different LFR networks with common average degree, network size and mixing parameter. The top row shows the distributions of the node degrees, the middle row shows the distribution of the normalized local degrees w.r.t. the ground truth communities and the bottom row shows the distributions of the normalized local degrees w.r.t. the predicted communities. Synwalk tends to misclassify nodes with low normalized local degree (w.r.t. the ground truth communities), whereas the influence of the absolute node degree is negligible. The statistics of the normalized local degrees w.r.t. predicted communities indicate that misclassified nodes are assigned to additional, small communities (Color figure online)

Fig. 3

Degree distributions for correctly classified and misclassified nodes in Walktrap results, obtained from 100 different LFR networks with common average degree, network size and mixing parameter. The top row shows the distributions of the node degrees, the middle row shows the distribution of the normalized local degrees w.r.t. the ground truth communities and the bottom row shows the distributions of the normalized local degrees w.r.t. the predicted communities. Walktrap tends to misclassify nodes with low normalized local degree (w.r.t. the ground truth communities) and/or low absolute node degree. The statistics of the normalized local degrees w.r.t. predicted communities resemble the ones w.r.t. the ground truth communities (Color figure online)

In contrast, although the latter observation holds for Walktrap as well (Fig. 3, middle row), the node degrees of its misclassified nodes appear to be smaller than those of correctly classified nodes (Fig. 3, top row). This behavior appears plausible when considering the mechanics of Walktrap: nodes are grouped based on cluster/node distances that are computed by considering random walks of a specified length T (in our setup \(T = 4\)). For low values of T, low-degree nodes are rarely visited, resulting in frequent ties in distance calculations, whereas in the limit of \(T\rightarrow \infty \) the distances are determined by the proportionality of the stationary distribution to the node degrees. Hence, it is necessary to make a trade-off regarding the random walk length T, which is typically chosen heuristically.
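The limiting behavior for \(T\rightarrow \infty \) can be illustrated with a small sketch. It uses the degree-normalized distance between T-step distributions as we understand it from Pons and Latapy (2005); this is not the igraph implementation, and the adjacency-list format and function names are assumptions. The sketch further assumes a connected, non-bipartite graph so that the random walk converges.

```python
from math import sqrt


def t_step_distribution(adj, start, steps):
    """Distribution of a simple random walk after `steps` steps from `start`.

    adj: dict mapping node -> list of neighbors (undirected, unweighted).
    """
    dist = {node: 0.0 for node in adj}
    dist[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in adj}
        for node, mass in dist.items():
            if mass:
                share = mass / len(adj[node])  # uniform choice among links
                for neighbor in adj[node]:
                    nxt[neighbor] += share
        dist = nxt
    return dist


def walk_distance(adj, i, j, steps):
    """Degree-normalized distance between the T-step distributions of i and j."""
    p_i = t_step_distribution(adj, i, steps)
    p_j = t_step_distribution(adj, j, steps)
    return sqrt(sum((p_i[k] - p_j[k]) ** 2 / len(adj[k]) for k in adj))
```

For large `steps`, every T-step distribution approaches the stationary distribution, which is proportional to the node degrees, so all pairwise distances shrink towards zero and carry no community information. For short walks, two nodes in the same densely connected group are closer than two nodes in different groups.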

These differing properties of Synwalk and Walktrap manifest in contrasting detection behaviors on the LFR networks (cf. Fig. 4). Synwalk identifies smaller communities with greater accuracy than larger ones (dependence on the normalized local degree), i.e., the majority of misclassified nodes occur in the largest communities. While Walktrap follows this trend, misclassified nodes occur in smaller communities with increasing frequency (stronger dependence on node degree).

Fig. 4

A sample LFR graph with communities as detected by Synwalk and Walktrap. The network has \(N=600\) nodes, an average degree of \({\overline{k}}= 25\) and a mixing parameter of \(\mu = 0.55\). Nodes are grouped according to their ground truth communities and share the same color if they belong to the same detected cluster. Misclassified nodes are highlighted with a black border. We aggregated nodes from predicted clusters that have no matching ground truth community into a single residual cluster. Synwalk places misclassified nodes into additional clusters, whereas Walktrap confuses node memberships between ground truth communities (Color figure online)

Another interesting difference appears when inspecting to which clusters misclassified nodes are assigned. Synwalk tends to place misclassified nodes in additional small clusters (i.e., clusters with no matching ground truth community). Such behavior is indicated by the NLD distributions w.r.t. predicted communities as well, where misclassified nodes exhibit a significantly higher NLD than correctly classified ones (cp. Fig. 2, bottom row). This results in detected ground truth communities being “pure”, i.e., they do not contain nodes from other ground truth communities. In contrast, Walktrap mainly confuses node memberships within clusters that do have a matching ground truth community. Again, these observations are supported by the NLD distributions w.r.t. predicted communities, where misclassified nodes exhibit a similar NLD to correctly classified ones (cp. Fig. 3, bottom row).

These behavioral differences between Synwalk and Walktrap are visible in the AMI performance as well: whereas both methods misclassify approximately the same number of nodes in the sample network in Fig. 4, Synwalk achieves a significantly higher AMI value.

The above insights make apparent two advantages of our method. First, whereas there is no general answer on how to determine the random walk length T for Walktrap, Synwalk does not require the tuning of any hyper-parameter. Secondly, consider a network with many small communities and low average degree. Following our earlier observations, Walktrap will have many misclassified nodes due to the low average degree. In contrast, the assumption of small communities implies a reasonably high normalized local degree for the majority of nodes and thus suggests a better performance of Synwalk when compared to Walktrap. Indeed, the results in Fig. 5 support this intuition. The benchmark networks underlying these results were generated with parameter set B (see Table 1), effectively lowering the average community size for a given average degree compared to networks generated with parameter set A. Moreover, Synwalk closes the AMI performance gap to Louvain in this setup for sufficiently dense networks (cf. also similarly dense networks generated with parameter set A in Fig. 1).

Fig. 5

Comparison of Infomap, Synwalk, Walktrap, Louvain and GraphTool on LFR benchmark networks with given average degree and network size. The lines and shaded areas show the mean and standard deviation of AMI as a function of the mixing parameter, obtained from 100 different network realizations. The networks were generated with parameter set B (see Table 1), simulating smaller communities with higher normalized local degrees. Synwalk outperforms Walktrap and closes the performance gap to Louvain on sufficiently dense networks in this setup (Color figure online)

Table 2 Properties of the examined real-world networks

5.3 Illustration on empirical networks

In this section we illustrate the applicability of Synwalk on a selection of empirical networks (see Table 2) by comparing the detection results of Synwalk, Infomap, Walktrap, Louvain and GraphTool. For this purpose, we report the number of detected clusters in Table 3, the number and fraction of non-trivial clusters thereof in Table 4, and the modularity score in Table 5.

Table 3 Number of detected clusters on the examined empirical networks
Table 4 Number of non-trivial clusters (at least three nodes) on the examined empirical networks
Table 5 Modularity scores on the examined empirical networks

Synwalk and Infomap behave similarly in terms of their single-number characteristics. An exception to this observation is the github network, where Synwalk detects a greater number of (non-trivial) clusters. Notably, Walktrap results consistently show a higher fraction of trivial clusters when compared to Synwalk and Infomap.

The random walk-based methods Infomap, Synwalk and Walktrap yield a significantly higher number of detected communities when compared to Louvain (based on modularity maximization) and GraphTool (based on SBM inference). GraphTool consistently detects almost no trivial clusters. Louvain detects no trivial clusters except on the pennsylvania-roads and wordnet networks. As expected, Louvain consistently yields the highest modularity scores.
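The single-number characteristics discussed above can be computed as in the following sketch. It assumes an unweighted, undirected network and the standard Newman–Girvan modularity; the function names, the input format and the size threshold parameter are illustrative assumptions rather than the code behind Tables 3 to 5.

```python
from collections import Counter


def modularity(edges, membership):
    """Newman-Girvan modularity of a clustering.

    edges: iterable of (u, v) pairs, each undirected link listed once.
    membership: dict mapping node -> cluster label.
    """
    edges = list(edges)
    m = len(edges)
    intra = Counter()       # number of links inside each cluster
    degree_sum = Counter()  # sum of node degrees per cluster
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            intra[membership[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


def cluster_summary(membership, min_size=3):
    """Return (number of clusters, number of non-trivial clusters, fraction).

    Clusters with fewer than `min_size` members count as trivial.
    """
    sizes = Counter(membership.values())
    non_trivial = sum(1 for size in sizes.values() if size >= min_size)
    return len(sizes), non_trivial, non_trivial / len(sizes)
```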

Additionally, we compare the different methods by looking at the distribution of cluster sizes and normalized local degrees in their detected clusterings. For all distribution plots we consider clusters with fewer than three members as trivial and we do not include their statistics in the distributions. We provide the results for further cluster and node properties in Appendix E for the sake of completeness.

Infomap and Synwalk again behave similarly given their cluster and node property distributions. A deviation from this pattern is apparent in the distribution of NLDs (see Fig. 6) for the github network, where Synwalk exhibits higher NLDs when compared to Infomap and Walktrap. The cluster size distributions (see Fig. 7) of Walktrap show a trend towards small clusters. As the empirical networks under consideration are significantly larger than the examined LFR networks in Sect. 5.2, a random walk length of \(T = 4\) might not be the optimal hyperparameter choice and thus could explain the many trivial clusters detected (cp. Table 4).

Fig. 6

Distributions of normalized local degrees w.r.t. the discovered communities on empirical networks for Infomap, Synwalk and Walktrap. The distributions generated by Synwalk resemble Infomap closely. An exception here is again the github network (Color figure online)

Fig. 7

Distributions of cluster sizes for the detection results on empirical networks for Infomap, Synwalk and Walktrap. Synwalk produces similar statistics to Infomap that are clearly distinguishable from Walktrap’s results (Color figure online)

Fig. 8

Distributions of cluster sizes for the detection results on empirical networks for Synwalk, Louvain and GraphTool. Synwalk consistently detects smaller clusters than GraphTool. Whereas cluster size distributions appear compact and unimodal for Synwalk and GraphTool, Louvain yields a wider spectrum of cluster sizes (Color figure online)

Interestingly, whereas Synwalk achieved similar AMI performance to Walktrap in Sect. 5.2, Synwalk shows similar qualitative behavior to Infomap regarding cluster and node property statistics on empirical networks. However, the differences in the qualitative detection behavior of Synwalk and Walktrap that we discussed in Sect. 5.2.2 could explain the different results on the larger empirical networks. We further conjecture that the common search heuristic (see Sect. 5.1) of Synwalk and Infomap acts as a “regularizer” on larger networks, i.e., the properties of predicted clusterings become more similar the larger the networks.

Last, we compare the detection results of Synwalk, Louvain and GraphTool w.r.t. the distributions of cluster sizes in Fig. 8 and normalized local degrees in Fig. 9. Cluster size distributions are compact and unimodal for Synwalk and GraphTool, although Synwalk yields consistently smaller communities than GraphTool, by roughly an order of magnitude. Louvain cluster sizes vary over a broader range.

Fig. 9

Distributions of normalized local degrees w.r.t. the discovered communities on empirical networks for Synwalk, Louvain and GraphTool. Synwalk yields highest normalized local degrees, followed by GraphTool and Louvain, which delivers the lowest normalized local degrees. This observation is consistent across all networks. Interestingly, the distribution centers differ by up to several orders of magnitude (Color figure online)

A consistent pattern is visible in the distributions of normalized local degrees: Synwalk delivers the highest and Louvain the lowest values/distribution centers with GraphTool in between. Remarkably, the distribution centers differ by up to several orders of magnitude between methods. These observations indicate that Synwalk detects smaller communities with higher normalized local degree when compared to Louvain and GraphTool, and they support our findings in Sect. 5.2.2.

Summarizing, in our comparison of the random walk-based methods (including Synwalk), Louvain and GraphTool, their fundamentally different approaches to community detection manifest in clear qualitative differences between their detected clusters. In light of the No Free Lunch theorem, each method captures different aspects of the networks’ structure, each of which may be interesting to an expert analyzing networks in their domain of expertise.

6 Conclusion

In this work, we introduced Synwalk, a community detection method based on random walk modelling, that is characterized by an information-theoretic objective function. Our experiments underline the solid theoretical basis of synthetic random walk-based models and show that we can achieve robust performance across a wide range of problem setups. For specific networks, e.g., networks with many small communities and low average degree, Synwalk outperforms Infomap and Walktrap, at least on generated LFR benchmark graphs.

We deem random walk modelling an interesting counterpart to (stochastic) block modelling for community detection that deserves more attention, as it opens up many interesting avenues for future research. For example, our present study was limited to undirected networks only, suggesting a closer investigation of Synwalk in directed and/or weighted networks or networks with special properties (e.g., small worlds, etc.). Note that, as does Infomap, we estimate the transition probabilities and invariant distribution in (4) and (5) using the PageRank (Brin and Page 1998) algorithm with a non-zero teleportation probability (cp. Appendix B, Rosvall and Bergstrom 2008; Rosvall et al. 2009). This avoids the problem that a random walk may end up in an absorbing state in directed networks and naturally supports weighted networks. Hence, the Synwalk objective and our implementation are perfectly applicable to directed and/or weighted networks. Given the results in Sect. 5.3 we surmise that on directed empirical networks Synwalk will also yield qualitatively similar results to Infomap. Further, a deeper understanding of the optimization landscape induced by the Synwalk objective and the influence of the optimization algorithm is required, as well as an extension of the approach to overlapping and hierarchical community structures.
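To illustrate how teleportation removes absorbing states, the sketch below estimates an invariant distribution with a plain PageRank power iteration on a directed, weighted network. It is a generic textbook variant and not the procedure of Appendix B or of the Infomap code base; the teleportation probability of 0.15, the uniform handling of nodes without outgoing links, and all names are assumptions.

```python
def pagerank(out_links, teleport=0.15, tol=1e-12, max_iter=1000):
    """Invariant distribution of a random walk with teleportation.

    out_links: dict mapping node -> dict of {target: positive link weight};
    every node must appear as a key, possibly with an empty dict.
    teleport: probability of jumping to a uniformly chosen node (assumed).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Mass on nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (teleport + (1.0 - teleport) * dangling) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = (1.0 - teleport) * rank[v] / sum(targets.values())
                for w, weight in targets.items():
                    new[w] += share * weight
        converged = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if converged:
            break
    return rank
```

Even for a directed chain that ends in a node without outgoing links, every node receives a strictly positive stationary probability, so the walk cannot get stuck.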

Another interesting avenue for future work is to consider alternative definitions of the graph-induced random walks to the one we chose in (4), such as biased random walks, maximum entropy random walks, or even continuous-time random walks (cf. Masuda et al. 2017; Lambiotte et al. 2014). Recall that Synwalk tries to find a clustering that yields a synthetic random walk as close as possible to the network-induced random walk. One could view the synthetic random walk as a model of what we try to find out about the network, and the network-induced random walk as the lens through which we view this network. In that sense, choosing a certain lens will determine what aspects we can and cannot see about the network. Note that choosing a different network-induced random walk will not change the form of our objective function in (14). Only the transition probabilities and invariant distribution in (4) and (5) need to be computed accordingly. Thus, only little implementation effort would be necessary to realize these variants (cp. Appendix B). Yet, they may capture very different and interesting aspects of one and the same network.

Finally, since one can change not only the network-induced random walk but also the synthetic random walk model, future research should investigate random walk modelling approaches with a different synthetic random walk design from that in (8). As discussed above, the synthetic random walk is a model of what we are trying to find out about a network. Hence, the question arises whether such approaches can be tailored to detect communities of specific types or within specific network classes. Note that changing the synthetic model necessarily changes the resulting objective function relative to the one derived in this work, which in turn may entail new challenges in the optimization procedure.