Abstract
The connectivity structure of graphs is typically related to the attributes of the vertices. In social networks for example, the probability of a friendship between any pair of people depends on a range of attributes, such as their age, residence location, workplace, and hobbies. The high-level structure of a graph can thus possibly be described well by means of patterns of the form ‘the subgroup of all individuals with certain properties X are often (or rarely) friends with individuals in another subgroup defined by properties Y’, ideally relative to their expected connectivity. Such rules present potentially actionable and generalizable insight into the graph. Prior work has already considered the search for dense subgraphs (‘communities’) with homogeneous attributes. The first contribution in this paper is to generalize this type of pattern to densities between a pair of subgroups, as well as between all pairs from a set of subgroups that partition the vertices. Second, we develop a novel information-theoretic approach for quantifying the subjective interestingness of such patterns, by contrasting them with prior information an analyst may have about the graph’s connectivity. We demonstrate empirically that in the special case of dense subgraphs, this approach yields results that are superior to the state-of-the-art. Finally, we propose algorithms for efficiently finding interesting patterns of these different types.
1 Introduction
Real-life graphs (also known as networks) often contain attributes for the vertices. In social networks for example, where vertices correspond to individuals, vertex attributes can include the individuals’ interests, education, residency, and more. The connectivity of the network is usually highly related to those attributes (Fond and Neville 2010; McPherson et al. 2001; Aral et al. 2009; Li et al. 2017). The attributes of individuals affect the likelihood of them meeting in the first place, and, if they meet, of becoming friends. Hence, it appears likely that it should be possible to understand the connectivity of a graph in terms of those attributes, at least to a certain extent.
One approach to identify the relations between the connectivity and the attributes is to train a link prediction classifier, with as input the attribute values of a vertex pair, predicting the edge as present or absent (Gong et al. 2014; Yin et al. 2010; Barbieri et al. 2014; Wei et al. 2017). Such global models often fail to provide insight, though. To address this, the local pattern mining community introduced the concept of subgroup discovery, where the aim is to identify subgroups of data points for which a target attribute has homogeneous and/or outstanding values (Herrera et al. 2011; Atzmueller 2015). Such subgroup rules are local patterns, in that they provide information only about a certain part of the data.
Research on local pattern mining in attributed graphs has so far focused on identifying dense vertex-induced subgraphs, dubbed communities, that are also coherent in terms of attributes. There are two complementary approaches, as stated in Atzmueller et al. (2016). The first explores the space of communities that meet certain criteria in terms of density, in search for those that are also homogeneous with respect to some of the attributes (Moser et al. 2009; Mougel et al. 2010). The second explores the space of rules over the attributes, in search for those that define subgroups (of vertices) that form a dense community (Pool et al. 2014; Galbrun et al. 2014; Atzmueller et al. 2016). This is effectively a subgroup discovery approach to dense subgraph mining.
Limitations of the state-of-the-art Both these approaches hinge on the existence of attribute homophily in the network: the tendency of links to exist between vertices with similar attributes (McPherson et al. 2001). Yet, while the assumption of homophily is often reasonable, it limits the scope of application of prior work. A first limitation of the state-of-the-art is thus its inability to find, e.g., sparse subgraphs.
A second limitation is the fact that the interestingness of such patterns has invariably been quantified using objective measures, i.e. measures that do not depend on the data analyst’s prior knowledge. Yet, the most ‘interesting’ patterns found are often obvious and implied by such prior knowledge (e.g. communities involving high-degree vertices, or, in a student friendship network, communities of individuals practicing the same sport). Not only may uninteresting patterns appear interesting when prior knowledge is ignored; interesting patterns may also appear uninteresting and hence remain unfound. E.g., a pattern in a student friendship network indicating that tennis lovers are rarely connected may be due to the lack of suitable facilities or of a tennis club.
A third limitation of prior work is that the patterns describe only the connectivity within a single group, and not between two potentially distinct groups. As an obvious example, this excludes patterns that describe friendships between a particular subgroup of female and a subgroup of male individuals in a social network; as we will show in the experiments, real-life networks contain many less obvious examples.
Contributions We depart from the existing literature in formalizing a subjective interestingness measure, rather than an objective one, and this for sparse as well as for dense subgraph patterns. In this way, we overcome the first and second limitations of prior work discussed above. More specifically, we build on the ideas of the exploratory data mining framework FORSIED (De Bie 2011a, 2013). This framework stipulates in abstract terms how to formalize the subjective interestingness of patterns. Basically, a background distribution is constructed to model the prior beliefs the analyst holds about the data. Given that, one can identify patterns which contrast strongly with this background knowledge and are thus highly surprising to the analyst. Moreover, this interestingness measure is naturally applicable to patterns describing a pair of subgroups, to which we will refer as bi-subgroup patterns. Hence, our method overcomes the third limitation of prior work. Finally, apart from a local pattern mining strategy which is used to identify interesting patterns one by one, we also propose a strategy to mine patterns globally, that is, to summarize the whole graph in a meaningful way such that all the interesting patterns can immediately be seen. The resulting summarization can be considered a type of global pattern. Our specific contributions are:

Novel definitions of single-subgroup patterns and bi-subgroup patterns, as well as patterns that are global summaries for attributed graphs. (Sect. 3)

A quantification of their Subjective Interestingness (SI), based on what prior beliefs an analyst holds, or what information an analyst gains when observing a pattern. (Sect. 4)

An algorithm to mine bi-subgroup patterns based on beam search. (Sect. 5)

An algorithm to mine global (or summarization) patterns from which a series of interesting single-subgroup and bi-subgroup patterns can be revealed. (Sect. 5)

An empirical evaluation of our method on real-world data, to investigate its ability to encode the analyst’s prior beliefs and identify subjectively interesting patterns. (Sect. 6)
This manuscript is a significant extension of Deng et al. (2020). The main additions include the further generalization of the single-subgroup and bi-subgroup patterns (both local pattern types) to global patterns, and the quantification of the SI as well as the search algorithm for global patterns. Moreover, we substantially extend the experimental section by analyzing the sensitivity of our beam search methods to the beam width, further investigating research questions already proposed in Deng et al. (2020) on more real-world datasets, and evaluating the performance of our global pattern mining method.
2 Related work
In this section, we first briefly review graph modelling work (Sect. 2.1), more specifically work based on formulating a statistical ensemble of networks, i.e., the collection of all possible realizations into which the considered network may reasonably evolve, each with a probability (Fronczak 2012). The numerical and analytical study of such ensembles provides the foundation for model fitting and model selection in various applications, including pattern mining (Casiraghi et al. 2016). We then review related work dedicated to pattern mining in attributed graphs, along two dimensions: local patterns (Sect. 2.2.1) and global patterns (Sect. 2.2.2).
2.1 Graph modelling
Graph modelling typically considers a given network (i.e., the one we observe) as merely one realization among a large number of possibilities. All possible realizations (including the observed one) that are consistent with some given aggregate statistics form the so-called statistical ensemble of networks.
A well-founded probabilistic framework for such graph modelling is provided by exponential random graph models (ERGMs) (Holland and Leinhardt 1981; Harris 2013). In ERGMs, each graph has a probability that depends on a number of chosen statistics of the network. Such models allow one to sample random graphs that match certain graph properties as closely as possible, without the need to know the underlying network generation process (Fronczak 2012). Nevertheless, a downside of ERGMs is that fitting them on large, finite networks is intractable. Recently, Casiraghi et al. (2016) introduced a broad class of analytically tractable statistical ensembles of finite, directed and weighted networks, referred to as generalized hypergeometric ensembles.
Unlike ERGMs, which aim to be an accurate and objective probabilistic model for the data, the aim of our method is to provide the data analyst with subjectively interesting insights into the data. To do that, first, intelligible pattern syntaxes need to be designed to represent the data’s local or global information. Second, the found patterns must be contrasted with a model of the data analyst’s belief state about the data (called the background distribution) to quantify their interestingness to the data analyst (this is what makes our approach a subjective one). A further distinction from ERGMs is that our method is naturally iterative, allowing the data analyst to gain new insights from one or a few patterns at a time.
2.2 Pattern mining in attributed graphs
Real-life graphs often have attributes on the vertices. Pattern mining that considers both the structural aspect and the attribute information promises more meaningful results, and has thus received increasing research attention.
2.2.1 Local pattern mining
The problem of mining cohesive patterns was introduced by Moser et al. (2009). They define a cohesive pattern as a connected subgraph whose edge density exceeds a given threshold and whose vertices exhibit sufficient homogeneity in the attribute space. Mougel et al. (2010) computes all maximal homogeneous clique sets that satisfy some user-defined constraints. All these works emphasize the graph structure and consider attributes as complementary information. Rather than assuming attributes to be complementary, descriptive community mining, introduced by Pool et al. (2014), aims to identify cohesive communities that have a concise description in the vertices’ attribute space. They propose a cohesiveness measure based on counting erroneous links (i.e., connections that are either missing or superfluous w.r.t. the ‘ideal’ community given the induced subgraph). To a limited extent, their method can be driven by the user’s domain-specific background knowledge, more specifically a preliminary description or a set of vertices that are expected to be part of a community; the search is then seeded by those candidates. Our proposed SI, in contrast, is more versatile in the sense that it allows incorporating more general background knowledge. Galbrun et al. (2014) pursues a similar goal to Pool et al.’s, but relies on a different density measure, which is essentially the average degree. Atzmueller et al. (2016) introduces description-oriented community detection. In this work, a subgroup discovery approach is applied to mine patterns in the description space, so it comes naturally that the identified communities have a succinct description.
All previous works quantify the interestingness in an objective manner, in the sense that they cannot consider a data analyst’s prior beliefs and thus operate regardless of context. Also, all previous works focus on a set of communities or dense subgraphs, overlooking other meaningful structures such as a sparse or dense subgraph between two different subgroups of vertices.
2.2.2 Global pattern mining by summarizing or clustering
Methods for discovering global patterns that uncover useful insights in attributed graphs are typically tailored to a graph summarization or a clustering task. Although both tasks can output a graph summary, their goals (even when solely considering the structural aspect) are fundamentally different. Graph summarization seeks to group together vertices that connect with the rest of the graph in a similar way, whereas clustering simply groups vertices that are densely connected to each other and well separated from other groups (Liu et al. 2018).
Graph summarization Tian et al. (2008) proposes SNAP and k-SNAP for controlled and intuitive graph summarization. These methods can produce customized summaries based on user-selected attributes and relationships that are of interest. Furthermore, the resolutions of the resulting summaries can also be controlled by users. Zhang et al. (2010) further builds on this work by addressing two key limitations. First, they allow automatic categorization of numeric attributes (which is a common scenario). Second, they propose a measure to assess the interestingness of summaries, so that the user does not have to manually inspect a large number of summaries to find the interesting ones. However, their interestingness measure is not subjective, simply considering the trade-off among diversity, coverage and conciseness. Chen et al. (2009) proposes SUMMARIZE-MINE, a framework that performs the detection of frequent subgraphs on randomized summaries for multiple iterations, so that a lossy compression can be effectively turned into a virtually lossless one. In addition to pattern discovery, graph summarization on attributed graphs can serve several applications, including compression (Hassanlou et al. 2013; Wu et al. 2014) and influence analysis (Shi et al. 2015; Adhikari et al. 2017). For a more comprehensive review of existing publications regarding these goals, we refer the interested readers to the survey paper by Liu et al. (2018).
Graph clustering Prior methods for clustering attributed graphs seek to partition the given graph into clusters with cohesive intra-cluster structures and homogeneous attribute values. Some enforce homogeneity in all attributes (Akoglu et al. 2012; Zhou et al. 2009; Xu et al. 2012; Cheng et al. 2011). However, these are not guaranteed to reveal meaningful patterns without careful attribute selection, since irrelevant attributes can strongly obfuscate clusters. More recently, subspace clustering has been used to loosen this constraint (Günnemann et al. 2010, 2011). Perozzi et al. (2014) detects focused clusters and outliers based on user preferences, allowing the user to control the relevance of attributes and, as a consequence, the graph mining results. Wang et al. (2016) proposes a novel non-negative matrix factorization (NMF) model in which a sparsity penalty is introduced to select the most related attributes for each cluster.
Unlike all previous graph summarization or clustering methods, where the resulting vertex groups are forced to satisfy some pre-specified topologies or edge structures (e.g., being more densely connected within the group), the patterns revealed by our summarization approach are not limited in that way, as their interestingness is quantified by a subjective measure depending on the user’s prior expectations.
3 Subgroup pattern and summary syntaxes for graphs
In this section we introduce both single-subgroup and bi-subgroup patterns, along with summaries for graphs. First, we introduce some notation.
An attributed graph is denoted as a triplet \(G=(V, E, A)\) where V is a set of \(n=|V|\) vertices, \(E\subseteq \left( {\begin{array}{c}V\\ 2\end{array}}\right) \) is a set of \(m=|E|\) undirected edges,^{Footnote 1} and A is a set of attributes \(a\in A\) defined as functions \(a:V\rightarrow \text{ Dom}_a\), where \(\text{ Dom}_a\) is the set of values the attribute can take over V. For each attribute \(a\in A\) with categorical \(\text{ Dom}_a\) and for each \(y\in \text{ Dom}_a\), we introduce a Boolean function \(s_{a,y}: V \rightarrow \{\text {true},\text {false}\}\), with \(s_{a,y}(v)\triangleq \text {true}\) for \(v\in V\) iff \(a(v)=y\). Analogously, for each \(a\in A\) with \(\text{ Dom}_a\subseteq {\mathbb {R}}\) and for each \(l,u\in \text{ Dom}_a\) such that \(l<u\), we define \(s_{a,[l,u]}: V \rightarrow \{\text {true},\text {false}\}\), with \(s_{a,[l,u]}(v)\triangleq \text {true}\) iff \(a(v)\in [l,u]\). We call these Boolean functions selectors, and denote the set of all selectors as S. A description or rule W is a conjunction of a subset of selectors: \(W = s_1\wedge s_2\wedge \ldots \wedge s_{|W|}\). The extension \(\varepsilon (W)\) of a rule W is defined as the subset of vertices that satisfy it: \(\varepsilon (W) \triangleq \{v\in V \mid W(v) = \text {true}\}\). We also informally refer to the extension as the subgroup. Now a description-induced subgraph can be formally defined as:
Definition 1
(Description-induced subgraph) Given an attributed graph \(G=(V,E,A)\), and a description W, we say that a subgraph \(G[W]=(V_W,E_W,A)\) where \(V_W \subseteq V, E_W \subseteq E\), is induced by W if the following two properties hold,

(i)
\(V_W=\varepsilon (W)\), i.e., the set of vertices from V that is the extension of the description W;

(ii)
\(E_W=\left( {\begin{array}{c}V_W\\ 2\end{array}}\right) \cap E\), i.e., the set of edges from E that have both endpoints in \(V_W\).
Example 1
Figure 1 displays an example attributed graph \(G=(V, E, A)\) with \(n=9\) vertices, \(m=12\) edges (Graph in Fig. 1a, vertex attributes in Fig. 1b). Each vertex is annotated with one real-valued attribute (i.e., a) and three nominal (or for simplicity, binary) attributes (i.e., b, c, d). Consider a description \(W=s_{a,[0,3]}\wedge s_{b,1}\). The extension of this description is the set of vertices with attribute a valued from 0 to 3 and attribute b equal to 1, i.e., \(\varepsilon (W)=\{0,1,2,3\}\). The subgraph induced by W is formed from \(\varepsilon (W)\) and all the edges connecting pairs of vertices in that set (highlighted with red (dark in greyscale) in Fig. 1a).
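To make the notation concrete, the following minimal sketch (with a hypothetical toy graph, not the exact graph of Fig. 1; all function names are our own illustration) implements selectors, description extensions, and description-induced subgraphs as in Definition 1:

```python
# Hypothetical toy attributed graph: vertex -> values of a real-valued
# attribute 'a' and a binary attribute 'b'.
attrs = {
    0: {"a": 1.0, "b": 1}, 1: {"a": 2.5, "b": 1},
    2: {"a": 0.5, "b": 1}, 3: {"a": 3.0, "b": 1},
    4: {"a": 5.0, "b": 0}, 5: {"a": 6.0, "b": 0},
}
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)}

def sel_num(a, l, u):
    """Selector s_{a,[l,u]}: true iff the numeric attribute a lies in [l, u]."""
    return lambda v: l <= attrs[v][a] <= u

def sel_cat(a, y):
    """Selector s_{a,y}: true iff the categorical attribute a equals y."""
    return lambda v: attrs[v][a] == y

def extension(description):
    """Extension eps(W): vertices satisfying every selector in the conjunction."""
    return {v for v in attrs if all(s(v) for s in description)}

def induced_subgraph(description):
    """Description-induced subgraph G[W]: eps(W) plus all edges inside it."""
    V_W = extension(description)
    E_W = {e for e in edges if set(e) <= V_W}
    return V_W, E_W

# W = s_{a,[0,3]} AND s_{b,1}, mirroring Example 1.
W = [sel_num("a", 0, 3), sel_cat("b", 1)]
V_W, E_W = induced_subgraph(W)
```

On this toy graph the description selects vertices 0 to 3 and the four edges among them, mirroring the structure of Example 1.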
3.1 Local pattern
3.1.1 Single-subgroup pattern
A first pattern syntax we consider, and which has already been studied in prior work, informs the analyst about the density of a description-induced subgraph G[W]. We assume the analyst is satisfied by knowing whether the density is unusually small or unusually large, and given this does not expect to know the precise density. It thus suffices for the pattern syntax to indicate whether the density is either smaller than, or larger than, a specified value. We thus formally define the single-subgroup pattern syntax as a triplet \((W,I,k_W)\), where W is a description and \(I\in \{0,1\}\) indicates whether the number of edges \(|E_W|\) in the subgraph G[W] induced by W is greater (or less) than \(k_W\). Thus, \(I=0\) indicates the induced subgraph is dense, whereas \(I=1\) characterizes a sparse subgraph. The maximum number of edges in G[W] is denoted by \(n_W\), equal to \(\frac{1}{2}|\varepsilon (W)|(|\varepsilon (W)|-1)\) for undirected graphs without self-edges. One example of a single-subgroup pattern in Fig. 1 is \((s_{a,[0,3]}\wedge s_{b,1}, 0, 6)\), corresponding to the dense subgraph highlighted in red (dark in greyscale).
Remark 1
(Difference to the dense subgraph pattern in van Leeuwen et al. (2016)) Though the syntax of our single-subgroup pattern seems similar to that of the dense subgraph pattern (i.e., \((W, k_W)\)) proposed by van Leeuwen et al. (2016), they are essentially different definitions serving different data mining tasks. In van Leeuwen et al. (2016), the aim is to identify subjectively interesting subgraphs based merely on link information. For this aim, W in the dense subgraph pattern syntax represents the set of vertices in the subgraph, which has no association with node attributes. Moreover, an indicator I is included in our pattern syntax. This makes it possible to regard not only surprisingly dense subgraphs but also surprisingly sparse ones as interesting. In contrast, van Leeuwen et al. (2016) focuses on surprisingly dense subgraphs. Because of these differences in W and I, \(k_W\) differs accordingly.
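The single-subgroup pattern syntax above can be sketched as follows. Note that the `expected_density` threshold below is a purely illustrative stand-in for the background model of Sect. 4, used only to decide the indicator I; it is not part of the paper's method.

```python
def max_edges(ext_size):
    """n_W: maximum number of edges in an induced subgraph with |eps(W)|
    vertices (undirected, no self-edges): |eps(W)|(|eps(W)| - 1) / 2."""
    return ext_size * (ext_size - 1) // 2

def single_subgroup_pattern(description, ext_size, observed_edges, expected_density):
    """Form the triplet (W, I, k_W). Per Remark 2 we set k_W to the actual
    edge count; I = 0 flags a (surprisingly) dense subgraph, I = 1 a sparse
    one, here decided against an illustrative expected-density threshold."""
    n_W = max_edges(ext_size)
    I = 0 if observed_edges >= expected_density * n_W else 1
    return (description, I, observed_edges)
```

For the subgroup of Example 1 (4 vertices, 6 edges), this reproduces the pattern \((s_{a,[0,3]}\wedge s_{b,1}, 0, 6)\) mentioned in the text.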
3.1.2 Bi-subgroup pattern
We also define a pattern syntax informing the analyst about the edge density between two potentially different subgroups. More formally, we define a bi-subgroup pattern as a quadruplet \((W_1,W_2,I,k_W)\), where \(W_1\) and \(W_2\) are two descriptions, and \(I\in \{0,1\}\) indicates whether the number of connections between \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\) is upper bounded (1) or lower bounded (0) by the threshold \(k_W\). The maximum number of connections between the extensions \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\) is denoted by \(n_{W}\triangleq |\varepsilon (W_1)||\varepsilon (W_2)|-\frac{1}{2}|\varepsilon (W_1\wedge W_2)|(|\varepsilon (W_1\wedge W_2)|+1)\) for undirected graphs without self-edges. For example, the bi-subgroup pattern \((s_{a,[0,3]}\wedge s_{b,1}, s_{b,0},1, 0)\) in Fig. 1 expresses a sparse (more precisely, zero) connection between the red vertex group (i.e., \(\{0,1,2,3\}\)) and the blue one (i.e., \(\{4,5,6\}\)). Note that single-subgroup patterns are a special case of bi-subgroup patterns where \(W_1\equiv W_2\).
Remark 2
(Setting of \(k_W\)) Although \(k_W\) for a pattern \((W_1,W_2,I,k_W)\) can be any value by which the number of connections between \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\) (or within \(\varepsilon (W_1)\) when \(W_1\equiv W_2\)) is bounded, our work focuses on identifying patterns whose \(k_W\) is the actual number of connections between these two subgroups (or within this single subgroup when \(W_1\equiv W_2\)), as such patterns are maximally informative.
3.2 Global pattern: summarization for graphs
Here we define a global pattern syntax, which describes the edge density between any pair of subgroups selected from a set of subgroups that form a partition of the vertices. We first define the notion of a summarization rule, before introducing the global pattern syntax itself.
Definition 2
(Summarization rule for an attributed graph) Given an attributed graph \(G=(V, E, A)\), a summarization rule \({\mathbb {S}}\) of G is a set of descriptions whose extensions are vertex clusters that form a partition of the whole vertex set. That is, \({\mathbb {S}}= \{W_{i} \mid i = 1,2, \ldots , c \}\), where \(c\in {\mathbb {N}}\) is the number of disjoint vertex clusters, \(\cup ^{c}_{i=1}\varepsilon (W_i)=V\), \(\forall W_i\in {\mathbb {S}}\) it holds that \(\varepsilon (W_i) \ne \emptyset \), and \(\forall W_i, W_j \in {\mathbb {S}}, i \ne j\) it holds that \( \varepsilon (W_i) \cap \varepsilon (W_j) = \emptyset \).
Definition 3
(Summary for an attributed graph based on a summarization rule) A summary \({\mathcal {S}}\) for an attributed graph \(G=(V, E, A)\) based on a summarization rule \({\mathbb {S}}= \{W_{i} \mid i = 1,2, \ldots , c \}\) is a complete weighted graph \({\mathcal {S}}=(V^{{\mathbb {S}}}, E^{{\mathbb {S}}},w)\) with weight function \(w:E^{{\mathbb {S}}}\rightarrow {\mathbb {R}}\), whereby \(V^{{\mathbb {S}}}=\{\varepsilon (W)\mid W\in {\mathbb {S}}\}\) is the set of vertices (referred to as super-vertices of the original graph G, i.e. each vertex from \({\mathcal {S}}\) is a set of vertices from G), \(E^{{\mathbb {S}}}=\left( {\begin{array}{c}V^{{\mathbb {S}}}\\ 2\end{array}}\right) \cup V^{{\mathbb {S}}}\) is the set of edges (to which we refer as super-edges; the super-edges in \(\left( {\begin{array}{c}V^{{\mathbb {S}}}\\ 2\end{array}}\right) \) represent the undirected edges between distinct super-vertices, and the super-edges in \(V^{{\mathbb {S}}}\) represent the self-loops). The weight \(w(\{\varepsilon (W_{i}),\varepsilon (W_{j})\})\) for each super-edge \(\{\varepsilon (W_{i}),\varepsilon (W_{j})\}\in E^{{\mathbb {S}}}\) will be denoted shorthand by \(d_{i,j}\), and is defined as the number of edges between vertices from \(\varepsilon (W_{i})\) and those from \(\varepsilon (W_{j})\).
We define a global pattern syntax informing the analyst about the summarization of an attributed graph \(G=(V,E,A)\) with c disjoint vertex clusters. More formally, we define a summarization pattern as a tuple \(({\mathbb {S}}, {\mathcal {S}})\) where \({\mathbb {S}}\) is the summarization rule and \({\mathcal {S}}\) is the corresponding summary. Note that when a summarization pattern \(({\mathbb {S}}, {\mathcal {S}})\) is revealed to an analyst, she or he gets access to its related local subgroup patterns: c single-subgroup patterns and \(c(c-1)/2\) bi-subgroup patterns. An example of a global pattern for Fig. 1 is \((\{s_{a,[0,3]}\wedge s_{b,1}, \lnot s_{a,[0,3]}\wedge s_{b,1}, s_{b,0}\}, {\mathcal {S}}^{*})\) where \({\mathcal {S}}^{*}\) represents the corresponding summary (see Fig. 2).
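Computing a summary from a summarization rule amounts to counting, for every pair of clusters (including each cluster with itself), the original edges between them. A minimal sketch, assuming the extensions have already been computed from the descriptions:

```python
from itertools import combinations_with_replacement

def build_summary(clusters, edges):
    """Summary of Definition 3: one super-vertex per cluster and, for every
    super-edge {i, j} (i == j being the self-loop), the weight d_{i,j} =
    number of original edges between cluster i and cluster j."""
    # Map each original vertex to the index of its cluster (a partition,
    # so the mapping is well defined).
    index = {v: i for i, cl in enumerate(clusters) for v in cl}
    weights = {key: 0 for key in
               combinations_with_replacement(range(len(clusters)), 2)}
    for u, v in edges:
        i, j = sorted((index[u], index[v]))
        weights[(i, j)] += 1
    return weights
```

Revealing these \(c(c+1)/2\) weights to the analyst is exactly what grants access to the c single-subgroup patterns (self-loop weights) and \(c(c-1)/2\) bi-subgroup patterns (off-diagonal weights) mentioned above.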
4 Formalizing the subjective interestingness
4.1 General approach
We follow the approach as outlined by De Bie (2011b) to quantify the subjective interestingness of a pattern, which enables us to account for prior beliefs a data analyst may hold about the data. In this framework, the analyst’s belief state is modeled by a background distribution P over the data space. This background distribution represents any prior beliefs the analyst may have by assigning a probability (density) to each possible value for the data according to how plausible the analyst thinks this value is. As such, the background distribution also makes it possible to evaluate the probability for any given pattern to be present in the data, and thus to assess the surprise of the analyst when informed about its presence. It was argued that a good choice for the background distribution is the maximum entropy distribution subject to some particular constraints that represent the analyst’s prior beliefs about the data. As the analyst is informed about a pattern, the knowledge about the data will increase, and the background distribution will change. For details see Sect. 4.2.
Given a background distribution, the Subjective Interestingness (SI) of a pattern can be quantified as the ratio of the Information Content (IC) and the Description Length (DL) of the pattern. The IC is defined as the amount of information gained when informed about the pattern’s presence, which can be computed as the negative log probability of the pattern w.r.t. the background distribution P. The DL is quantified as the length of the code needed to communicate the pattern to the analyst. These are discussed in more detail in Sect. 4.3, but first we further explain the background distribution (Sect. 4.2).
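For intuition on the IC, here is a minimal sketch computing the exact negative log-probability of a dense or sparse pattern under a product-of-Bernoullis background distribution, via the Poisson-binomial tail (all function names are our own illustration; as discussed in Sect. 4.3.1, the paper instead uses a Chernoff/Hoeffding upper bound, since this exact computation is only feasible for small \(n_W\)):

```python
import math

def tail_prob_at_least(ps, k):
    """P[at least k successes] for independent Bernoulli trials with success
    probabilities ps (exact Poisson-binomial tail via dynamic programming)."""
    dist = [1.0]  # dist[j] = P[j successes among the trials seen so far]
    for p in ps:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1 - p)      # this trial fails
            new[j + 1] += q * p        # this trial succeeds
        dist = new
    return sum(dist[k:])

def information_content(ps, k_W, I):
    """IC = -log P(pattern). For I = 0 the event is 'at least k_W edges
    present'; for I = 1 it is 'at most k_W edges present', i.e. at least
    n_W - k_W edge absences, each absent with probability 1 - p_{u,v}."""
    if I == 0:
        prob = tail_prob_at_least(ps, k_W)
    else:
        prob = tail_prob_at_least([1 - p for p in ps], len(ps) - k_W)
    return -math.log(prob)
```

Dividing this IC by a description length yields the SI of the pattern, as formalized in Sect. 4.3.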
Remark 3
(Positioning with respect to directly related literature) Here we clarify how previous work is leveraged, and which concepts are newly introduced in our work. We define single-subgroup, bi-subgroup, and global patterns in an attributed graph. To quantify the SI measure for such patterns, we follow the framework outlined by De Bie (2011b). As mentioned above, in this framework, the SI is computed as the ratio of the IC and the DL w.r.t. the background distribution, which models the analyst’s belief state. This framework also provides the general idea for deriving the initial background distribution and updating it to reflect newly acquired knowledge. Adriaens et al. (2017) later introduced a new type of graph-related prior that the background distribution can incorporate, and this prior is considered in our work. In van Leeuwen et al. (2016), this framework was used to identify subjectively interesting dense subgraphs, based merely on link information. In our work, we leverage some computational results from van Leeuwen et al. (2016) (i.e., in updating the background distribution and approximating the IC), and make further adaptations such that the framework proposed by De Bie (2011b) can serve our newly proposed patterns based on attribute information (i.e., single-subgroup patterns, bi-subgroup patterns and global patterns).
4.2 The background distribution
4.2.1 The initial background distribution
To derive the initial background distribution, we need to assume what prior beliefs the data analyst may have. Here we discuss three types of prior beliefs which are common in practice: (1) on individual vertex degrees; (2) on the overall graph density; (3) on densities between bins (particular subsets of vertices).
 (1–2):

Prior beliefs on individual vertex degrees and on the overall graph density. Given the analyst’s prior beliefs about the degree of each vertex, De Bie (2011b) showed that the maximum entropy distribution is a product of independent Bernoulli distributions, one for each of the random variables \(b_{u,v}\), which equals 1 if \((u,v)\in E\) and 0 otherwise. Denoting the probability that \(b_{u,v}=1\) by \(p_{u,v}\), this distribution is of the form:
$$\begin{aligned} P(E)&=\prod _{u,v} {p_{u,v}}^{b_{u,v}}\cdot (1-p_{u,v})^{1-b_{u,v}},\\ \text {where}\quad p_{u,v}&= \frac{\exp (\lambda ^{r}_{u}+\lambda ^{c}_{v})}{1+\exp (\lambda ^{r}_{u}+\lambda ^{c}_{v})}. \end{aligned}$$This can be conveniently expressed as:
$$\begin{aligned} P(E)=\prod _{u,v} \frac{\exp ((\lambda ^{r}_{u}+\lambda ^{c}_{v})\cdot b_{u,v})}{1+\exp (\lambda ^{r}_{u}+\lambda ^{c}_{v})}. \end{aligned}$$The parameters \(\lambda ^{r}_{u}\) and \(\lambda ^{c}_{v}\) can be computed efficiently. For a prior belief on the overall density, every edge probability \(p_{u,v}\) simply equals the assumed density.
 (3):

Additional prior beliefs on densities between bins. We can partition vertices in an attributed graph into bins according to their value for a particular attribute. For example, vertices representing people in a university social network can be partitioned by class year. Then expressing prior beliefs regarding the edge density between two bins is possible. This would allow the data analyst to express, for example, an expectation about the probability that people in class year \(y_1\) are connected to those in class year \(y_2\). If the analyst believes that people in different class years are less likely to connect with each other, a discovered pattern would be more informative if it contrasts more with this kind of belief, i.e. if it reveals a high density between two sets of people from different class years. As shown in Adriaens et al. (2017), the resulting background distribution is also a product of Bernoulli distributions, one for each of the random variables \(b_{u,v}\in \{0,1\}\):
$$\begin{aligned} P(E)=\prod _{u,v}\frac{\exp ((\lambda ^{r}_{u}+\lambda ^{c}_{v}+\gamma _{k_{u,v}})\cdot b_{u,v})}{1+\exp (\lambda ^{r}_{u}+\lambda ^{c}_{v}+\gamma _{k_{u,v}})}, \end{aligned}$$(1)where \(k_{u,v}\) is the index of the block corresponding to the intersection of the two bins to which vertices u and v belong, respectively. \(\lambda ^{r}_{u}\), \(\lambda ^{c}_{v}\) and \(\gamma _{k_{u,v}}\) are parameters that can be computed efficiently. Note that our model is not limited to incorporating this type of belief for a single attribute: vertices can be partitioned differently by another attribute, and our model can consider multiple attributes, so that analysts can express prior beliefs regarding the edge densities between bins resulting from multiple partitions^{Footnote 2}.
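Once the parameters have been fitted, evaluating the background distribution reduces to a logistic function per vertex pair. A sketch of this evaluation step (fitting the \(\lambda \) and \(\gamma \) parameters themselves is a separate optimization, not shown):

```python
import math

def edge_probability(lam_r_u, lam_c_v, gamma_block=0.0):
    """Success probability p_{u,v} of the Bernoulli variable b_{u,v} in the
    maximum-entropy background distribution (Eq. 1): a logistic function of
    the row/column parameters and, when bin-density priors are used, the
    block parameter gamma_{k_{u,v}}."""
    x = lam_r_u + lam_c_v + gamma_block
    return math.exp(x) / (1.0 + math.exp(x))
```

With only an overall-density prior, all parameters coincide and every \(p_{u,v}\) equals the assumed density; larger degree parameters or block parameters push \(p_{u,v}\) toward 1, encoding the belief that such a pair is more likely to be connected.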
4.2.2 Updating the background distribution
Upon being presented with a pattern, the background distribution should be updated to reflect the data analyst's newly acquired knowledge. The beliefs attached to any value for the data that does not contain the pattern should become zero. In the present context, once we present a subgroup pattern \((W_1, W_2, I, k)\) to the analyst, the updated background distribution \(P'\) should be such that \(\phi _W(E)\ge k_W\) (if \(I=0\)) or \(\phi _W(E)\le k_W\) (if \(I=1\)) holds with probability one, where \(\phi _W(E)\) denotes a function counting the number of edges between \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\). De Bie (2011a) presented an argument for choosing \(P'\) as the I-projection of the previous background distribution onto the set of distributions consistent with the presented pattern. van Leeuwen et al. (2016) then showed that the resulting \(P'\) is again a product of Bernoulli distributions:
How to compute \(\lambda _{W}\) is also given in van Leeuwen et al. (2016).
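A minimal sketch of this update, under the assumption that the I-projection takes the usual exponential-tilting form for product-of-Bernoulli distributions (the log-odds of every pair in the pattern's block is shifted by \(\lambda_W\); the function names are ours, and a dict mapping vertex pairs to probabilities is an illustrative representation):

```python
import math

def tilt(p, lam):
    """Shift the log-odds of an edge probability by lam; this is the
    Bernoulli form of an exponential tilting P'(b) proportional to
    P(b) * exp(lam * b)."""
    return p * math.exp(lam) / (1.0 - p + p * math.exp(lam))

def update_background(P, block_pairs, lam_w):
    """Update p_{u,v} for every vertex pair covered by the presented
    pattern; probabilities outside the block are unchanged."""
    P2 = dict(P)
    for (u, v) in block_pairs:
        P2[(u, v)] = tilt(P[(u, v)], lam_w)
    return P2
```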
Remark 4
(Updating P if a summarization pattern is presented) In the case that a summarization pattern \(({\mathbb {S}}, {\mathcal {S}})\) is presented to the analyst, we simply update the background distribution as if all the subgroup patterns related to \(({\mathbb {S}}, {\mathcal {S}})\) were presented, and we denote the resulting updated background distribution by \(P_{({\mathbb {S}}, {\mathcal {S}})}\).
4.3 The subjective interestingness measure
We now discuss how the SI measure can be formalized by relying on the background distribution, first for local and then for global patterns.
4.3.1 The SI measure for a local pattern
The information content (IC) Given a pattern \((W_1,W_2,I,k_W)\), and a background distribution defined by P, the probability of the presence of the pattern is the probability of getting more than \(k_W\) (for \(I=0\)) or \(n_W-k_W\) (for \(I=1\)) successes in \(n_W\) trials with possibly different success probabilities \(p_{u,v}\) (for \(I=0\)) or \(1-p_{u,v}\) (for \(I=1\)). More specifically, for the case \(I=0\) we consider a success to be the presence of an edge between a pair of vertices (u, v) with \(u \in \varepsilon (W_1)\), \(v \in \varepsilon (W_2)\), and \(p_{u,v}\) is the corresponding success probability. Conversely, for the case \(I=1\) the absence of an edge between a pair of vertices (u, v) is deemed a success, with probability \(1-p_{u,v}\). van Leeuwen et al. (2016) proposed to tightly upper bound the probability of a similar dense subgraph pattern by applying the general Chernoff/Hoeffding bound (Chernoff 1952; Hoeffding 1963). Here, we can use the same approach, which gives:
where
\(\mathbf {KL}\bigg (\frac{k_W}{n_W}\parallel p_W\bigg )\) is the Kullback-Leibler divergence between two Bernoulli distributions with success probabilities \(\frac{k_W}{n_W}\) and \(p_W\), respectively. Note that:
We can thus write, regardless of I:
The information content is the negative log probability of the pattern being present under the background distribution. Thus, using the above:
The description length (DL) A pattern with larger IC is more informative. Yet, it is sometimes harder for the analyst to assimilate because its description is more complex. A good SI measure should trade off IC with DL. The DL should capture the length of the description needed to communicate a pattern. Intuitively, the cost for the data analyst to assimilate a description W depends on the number of selectors in W, i.e., \(|W|\). Let us assume communicating each selector in a description W has a constant cost of \(\alpha \), and that the cost for I and \(k_W\) is fixed. The total description length of a pattern \((W_1,W_2,I,k_W)\) can then be written as
The subjective interestingness (SI) In summary, we obtain:
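For concreteness, these quantities can be sketched as follows. The IC follows the Chernoff/Hoeffding bound discussed above (IC \(= n_W \cdot \mathbf{KL}(k_W/n_W \parallel p_W)\)); the affine DL (\(\alpha\) times the number of selectors plus \(\beta\)) and the ratio form SI = IC/DL are our assumptions, consistent with the roles of \(\alpha\) and \(\beta\) described here:

```python
import math

def kl_bernoulli(q, p):
    """Kullback-Leibler divergence KL(Bernoulli(q) || Bernoulli(p))."""
    if q == 0.0:
        return math.log(1.0 / (1.0 - p))
    if q == 1.0:
        return math.log(1.0 / p)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def information_content(k_w, n_w, p_w):
    """IC from the Chernoff/Hoeffding bound: n_W * KL(k_W/n_W || p_W)."""
    return n_w * kl_bernoulli(k_w / n_w, p_w)

def subjective_interestingness(k_w, n_w, p_w, n_selectors, alpha=0.6, beta=1.0):
    """SI as the ratio of IC to DL, with n_selectors = |W_1| + |W_2|.
    The DL form alpha * n_selectors + beta is an assumption."""
    return information_content(k_w, n_w, p_w) / (alpha * n_selectors + beta)
```

A pattern whose observed density \(k_W/n_W\) matches the expected density \(p_W\) has zero IC, and thus zero SI, regardless of its description length.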
Remark 5
(Justification of the choices of \(\alpha \) and \(\beta \)) In all our experiments for the use cases, we apply \(\alpha =0.6, \beta =1\). We state the reason for this choice here.
In practice, the absolute value of the SI from Eq. 5 is largely irrelevant, as it is only used for ranking the patterns, or even just for finding a single pattern (i.e., the one most interesting to the analyst). Thus, we can set \(\beta =1\) without loss of generality, such that the only remaining parameter is \(\alpha \).
Tuning \(\alpha \) biases the results toward more or fewer selectors in the description of a subgroup pattern. Note that the optimal extent of this bias cannot be determined by model selection in the statistical sense, but should rather be chosen based on aspects of human cognition (e.g., a larger \(\alpha \) should be used when the analyst prefers patterns in a more succinct form). In this work, we set \(\alpha =0.6\) throughout all use cases, which gives good qualitative results. However, \(\alpha \) can be tuned flexibly to adapt to the analyst's preferences.
4.3.2 The SI measure for a global pattern
The information content (IC) The probability of a global summarization pattern turns out to be harder to formulate analytically, and thus so is the negative log probability of the pattern, which is the subjective amount of information gained by observing it. However, it is relatively straightforward to quantify the (subjective) amount of information in the connectivity of the graph both prior to observing the pattern and after observing it. The difference between these two is the information gained. More formally, we thus define the IC of a summarization pattern \(({\mathbb {S}}, {\mathcal {S}})\) as the difference between the log probability of the connectivity in the graph (i.e., the edge set E) under \(P_{({\mathbb {S}}, {\mathcal {S}})}\) and that under P:
This quantity is straightforward to compute, as \(P_{({\mathbb {S}}, {\mathcal {S}})}\) is the updated background distribution obtained as if all the subgroup patterns related to \(({\mathbb {S}}, {\mathcal {S}})\) were presented (as mentioned in Remark 4 in Sect. 4.2.2).
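Since both distributions are products of Bernoullis, this IC can be computed directly; a minimal sketch, representing each background distribution as a dict mapping vertex pairs to edge probabilities:

```python
import math

def log_prob_edges(P, edges, all_pairs):
    """Log probability of the observed edge set E under a product of
    Bernoulli distributions P (dict: vertex pair -> probability)."""
    lp = 0.0
    for pair in all_pairs:
        p = P[pair]
        lp += math.log(p) if pair in edges else math.log(1.0 - p)
    return lp

def global_ic(P_updated, P_prior, edges, all_pairs):
    """IC of a summarization pattern: the gain in log probability of
    the connectivity after updating the background distribution."""
    return (log_prob_edges(P_updated, edges, all_pairs)
            - log_prob_edges(P_prior, edges, all_pairs))
```

A summary whose updated distribution fits the observed connectivity better than the prior one yields a positive IC.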
The description length (DL). We search for the optimal \({\mathbb {S}}\) by a strategy based on splitting a binary search tree (for details see Sect. 5.2.1). Thus, the cost for the data analyst to assimilate \({\mathbb {S}}\) is linear in the number of descriptions in \({\mathbb {S}}\), i.e., c. Assimilating \({\mathcal {S}}\), on the other hand, has a cost quadratic in c, because \({\mathcal {S}}\) is essentially a complete graph with c vertices and \(c(c+1)/2\) edges. The total description length of a pattern \(({\mathbb {S}}, {\mathcal {S}})\) can be written as
where \(\theta \) is a constant term that mitigates the quadratically increasing drop in SI value caused by an increasing c, which helps to avoid early stopping.
The subjective interestingness (SI) In summary, we obtain:
Remark 6
(Justification of the choices of \(\zeta \), \(\eta \) and \(\theta \)) In all our experiments, we use \(\zeta =0.02, \eta =0.02, \theta =1\). As stated in Remark 5 in Sect. 4.3.1, the parameters of the DL indicate how much the data analyst prefers patterns that can be described succinctly, and should thus be determined based on aspects of human cognition instead of statistical model selection. We follow the same reasoning in choosing the DL parameters for global patterns (i.e., \(\zeta \), \(\eta \) and \(\theta \) in Eq. 8). Note that we set a high value for \(\theta \) (i.e., 1) in comparison with \(\zeta \) (i.e., 0.02) and \(\eta \) (i.e., 0.02). This is a safe choice to avoid early stopping (i.e., the iteration stopping before the analyst observes a suitable global pattern).
5 Algorithms
This section describes the algorithms for mining interesting patterns locally and globally, in Sects. 5.1 and 5.2 respectively, followed by an outline of the implementation in Sect. 5.3.
5.1 Local pattern mining
Since the proposed SI interestingness measure is more complex than most objective measures, we apply heuristic search strategies to maintain tractability. For searching single-subgroup patterns, we used beam search (see Sect. 5.1.1). To search for bi-subgroup patterns, however, a traditional beam search over both \(W_1\) and \(W_2\) simultaneously turned out to be difficult to apply effectively. We thus propose a nested beam search strategy to handle this case. More details about this strategy are given in Sect. 5.1.2.
5.1.1 Beam search
In the case of mining single-subgroup patterns, we applied beam search, a classical heuristic search strategy over the space of descriptions. The general idea is to store only a certain number (called the beam width) of the best partial description candidates of a certain length (number of selectors) according to the SI measure, and to expand those next with a new selector. This is then iterated. This approach is standard practice in subgroup discovery, being the search algorithm implemented in popular packages such as Cortana (Meeng and Knobbe 2011), One Click Miner (Boley et al. 2013), and pysubgroup (Lemmerich and Becker 2018).
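The idea above can be sketched as follows; the `score` function is a stand-in for the SI measure, and descriptions are represented as sets of selectors (a simplification of the actual implementation):

```python
def beam_search(selectors, score, width=5, depth=2):
    """Generic beam search over conjunctive descriptions: keep the
    `width` best candidates of each length, expand each with one more
    selector, and return the best (score, description) seen overall."""
    beam = [frozenset()]  # start from the empty description
    best = (score(frozenset()), frozenset())
    for _ in range(depth):
        # Expand every candidate in the beam with one extra selector.
        candidates = {cand | {s} for cand in beam for s in selectors if s not in cand}
        ranked = sorted(candidates, key=score, reverse=True)
        beam = ranked[:width]
        if beam and score(beam[0]) > best[0]:
            best = (score(beam[0]), beam[0])
    return best
```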
5.1.2 Nested beam search
The basic idea of this approach is to nest one beam search inside another: the outer search branches based on a 'beam' of promising selector candidates for the description \(W_1\), and the inner search expands those for \(W_2\). The detailed procedure for this nested beam search is shown in Algorithm 1, with the related notation displayed in Table 1.
The total number of interesting patterns identified by Algorithm 1 is \(x_1\cdot x_2\). Note that we deliberately constrain the beam to contain at least \(x_1\) different \(W_1\) descriptions, so that sufficient diversity among the discovered patterns is guaranteed (see lines 22–23 in Algorithm 1).
5.2 Global pattern mining
To identify the most interesting global (or summarization) pattern, a greedy search strategy (see Sect. 5.2.1) equipped with some speedup strategies (see Sect. 5.2.2) is adopted.
5.2.1 The basic search strategy
The algorithm begins by checking each possible summarization rule containing only a single-selector description and its negation. Applying such a rule at the start cuts the whole vertex set into two non-overlapping clusters, each satisfying one of the descriptions in the rule. The rule whose corresponding summarization pattern has the maximal SI value is selected as a seed set for \({\mathbb {S}}\). The algorithm then iterates in the following way to greedily grow that set: for each existing description in the set, it again checks the application of an additional single-selector description and its negation. This further separates a particular vertex cluster into two subclusters, one of which additionally satisfies this description while the other does not. The optimal combination of the existing description to further specify and the additional single-selector description is selected. The search stops when some search budget is reached (e.g., the maximum number of iterations). The detailed procedure for this search is displayed in Algorithm 2.
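The greedy splitting procedure can be sketched as follows. Here `si` is a stand-in for the SI of a candidate summarization pattern, `satisfies(v, s)` tests whether vertex v satisfies selector s, and a simple `max_rules` budget replaces Algorithm 2's stopping criterion:

```python
def greedy_summarize(vertices, selectors, satisfies, si, max_rules=3):
    """Greedily grow a partition: start from the best single-selector
    split of the whole vertex set, then repeatedly split whichever
    cluster's refinement most improves the summary's (stand-in) SI."""
    def split(cluster, s):
        pos = {v for v in cluster if satisfies(v, s)}
        return pos, cluster - pos

    partition = [set(vertices)]
    for _ in range(max_rules):
        best = None
        for i, cluster in enumerate(partition):
            for s in selectors:
                pos, neg = split(cluster, s)
                if not pos or not neg:
                    continue  # only proper splits are considered
                cand = partition[:i] + [pos, neg] + partition[i + 1:]
                val = si(cand)
                if best is None or val > best[0]:
                    best = (val, cand)
        if best is None:
            break  # no cluster can be split further
        partition = best[1]
    return partition
```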
5.2.2 Speedup strategies
Parallel processing Our search strategy is trivially parallelizable. To gain some speedup, the search process for each attribute and its related selectors (lines 10–24 in Algorithm 2) is executed simultaneously on multiple processors.
Reusing some computations We further speed up the search by circumventing redundant computations when computing the SI for each candidate summarization pattern. As mentioned in Sect. 4.2.2, \(P_{({\mathbb {S}}, {\mathcal {S}})}\) is computed as the updated background distribution as if all the subgroup patterns related to \(({\mathbb {S}}, {\mathcal {S}})\) were presented, which requires determining \(\lambda _{W}\) for each related subgroup pattern. However, when branching in different ways during the search (i.e., using different pairs of a selector and its negation to extend a given description), the extensions do not interfere with subgroup patterns whose descriptions are not extended. Hence, their \(\lambda _W\) do not need to be recomputed, providing a speedup.
We illustrate this by taking the attributed network in Fig. 1 as an example (see Fig. 3, which visualizes the corresponding adjacency matrix with the vertex indices arranged on the left and at the bottom; entries are omitted for simplicity). Assume the network is currently divided into two vertex subgroups satisfying \(b=1\) and \(b=0\) respectively, and the search is at the step of finding the optimal selector to specify the description \(b=1\) (the indices of the corresponding vertices are highlighted in red (dark in greyscale) in Fig. 3a). Although the adjacency matrix is cut in two different ways, refining the description \(b=1\) into two more specific ones by adding \(a\le 3\) and \(a>3\) (Fig. 3b), or by adding \(c=0\) and \(c=1\) (Fig. 3c), neither interferes with the subgroup satisfying \(b=0\) (the blue striped area).
5.3 Implementation
For mining patterns locally, pysubgroup (Lemmerich and Becker 2018), a Python package for subgroup discovery written by Florian Lemmerich, was used as a base to build upon. We integrated our nested beam search algorithm and SI measure (along with other state-of-the-art interestingness measures for comparison) into the original interface. A Python implementation of all the algorithms and the experiments is available at https://bitbucket.org/ghentdatascience/globalessd_public. All experiments were conducted on a PC with Ubuntu OS, Intel(R) Core(TM) i7-7700K 4.20GHz CPUs, and 32 GB of RAM.
6 Experiments
We evaluate our methods on six real-world networks. In the following, we first describe the datasets (Sect. 6.1). Then we present the conducted experiments and discuss the results, addressing the following questions:
RQ1: Are our local pattern mining algorithms sensitive to the beam width? (Sect. 6.2)
RQ2: Does our SI measure outperform state-of-the-art objective interestingness measures? (Sect. 6.3)
RQ3: Is the SI truly subjective, in the sense of being able to consider a data analyst's prior beliefs? (Sect. 6.4)
RQ4: How can optimizing the SI help avoid redundancy between iteratively mined patterns? (Sect. 6.5)
RQ5: Is our global pattern mining approach able to summarize the whole graph in a meaningful way, such that all the interesting patterns can be revealed? (Sect. 6.6)
RQ6: How do the algorithms scale? (Sect. 6.7)
6.1 Data
Basic data information is summarized in Table 2.
Caltech36 and Reed98 Two Facebook social networks from the Facebook100 (Traud et al. 2012) data set, gathered in September 2005: one of Caltech Facebook users, and one of Reed College users. Vertex attributes describe the person's status (faculty or student), gender, major, minor, dorm/house, graduation year, and high school.
Lastfm A social network of friendships between Lastfm.com users, generated from the publicly available dataset (Cantador et al. 2011) from the HetRec 2011 workshop. In this dataset, each user's tag assignments for a list of their most-listened musical artists are given as [user, tag, artist] tuples, where the tags are unstructured text labels that users applied to artists' songs. We took the tags that a user had ever assigned to any artist and attached them to the user as binary attributes expressing the user's music interests. This dataset has been used in many publications to evaluate local pattern mining methods (Pool et al. 2014; Atzmueller et al. 2016; Galbrun et al. 2014).
DBLPtopics A citation network generated from the DBLP citation data V11^{Footnote 3} (Tang et al. 2008; Sinha et al. 2015) by choosing a random subset of publications from 20 conferences^{Footnote 4}, selected to cover 4 research areas: Machine Learning, Databases, Information Retrieval, and Data Mining. Vertices represent publications, and directed edges represent citation relationships. Each publication is annotated with 50 attributes (denoted by \(a_1, a_2, \ldots , a_{50}\)) whose values indicate the relevance of the paper to certain topics. These attributes are obtained by computing the first 50 latent semantic indexing (LSI) components of the original paper-topic matrix (of size \(10837 \times 9074\)), in which each entry indicates the relevance of a paper (row) to a field of study (column); these values are provided by the original DBLP data. In our work, the selector space over which the search is carried out does not include every attribute-value pair. Instead, a discretization is applied: the values of each attribute are sorted and discretized into 4 partitions of equal size by the 3 quartiles. This gives \(3\times 2=6\) selectors for each attribute (\(6\times 50=300\) selectors in total), three of which respectively assign true to vertices with a value smaller than the first, second, or third quartile of the values for this attribute, while the other three are the corresponding negations. We denote the ith quartile of the values of attribute a by \(Q_{i}^{a}\).
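The quartile-based construction of selectors can be sketched as follows; this is a minimal illustration, and `statistics.quantiles` is just one of several possible quartile estimators:

```python
import statistics

def quartile_selectors(name, values):
    """Turn a numeric attribute into 6 selectors: for each quartile
    Q1, Q2, Q3, a predicate 'value < Q_i' and its negation."""
    qs = statistics.quantiles(values, n=4)  # returns [Q1, Q2, Q3]
    selectors = []
    for i, q in enumerate(qs, start=1):
        # Bind q via a default argument so each lambda keeps its own cut.
        selectors.append((f"{name} < Q{i}", lambda v, q=q: v < q))
        selectors.append((f"{name} >= Q{i}", lambda v, q=q: v >= q))
    return selectors
```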
DBLPaffs A DBLP citation network based on the same random subset of publications as above. Only papers for which the authors' country (or state, in the USA) of affiliation is available are included as vertices. The resulting 116 countries/states are included as binary vertex attributes, set to 1 iff one of the paper's authors is affiliated with an institute in that country/state.
MPvotes The Twitter social network generated from friendships between Members of Parliament (MPs) in the UK (Chen et al. 2020). Their voting records on Brexit from 12th June 2018 to 3rd April 2019 are included as 39 binary vertex attributes, set to 1 or \(-1\) iff the MP voted for/abstained or against/abstained, respectively. Note that we include abstention on both the positive and negative sides, rather than treating abstain (or not abstain) alone as a value, because a selector that describes a subgroup of MPs abstaining (or not abstaining) in a particular vote is not very meaningful in practice.
6.2 Parameter sensitivity (RQ1)
For mining local patterns, we used the standard beam search for single-subgroup patterns, and the nested beam search for bi-subgroup patterns. In all experiments, we set the search depth \(D=2\) (because patterns described by more than 2 selectors often appear less interesting in practice, and they add unnecessary difficulty for interpretation). The performance of these beam search methods then ultimately depends on the beam width.
6.2.1 Experimental setup
Choice of datasets We used Lastfm to investigate the effect of the beam width on the performance of single-subgroup pattern mining, as it involves the largest search space (given by the largest number of selectors, i.e., 21695). For bi-subgroup pattern mining, because the search is more time-consuming, we used Lastfm while considering only the 100 most frequently used tags as attributes (i.e., giving 200 selectors as the search space). We also used Reed98, as it involves the largest search space among the datasets used in our bi-subgroup pattern mining experiments.
Other settings Although we applied the SI measure with \(\alpha =0.6\), \(\beta =1\) in all use cases of local pattern mining (as mentioned in Remark 5 in Sect. 4.3.1), to investigate the parameter sensitivity more meaningfully in this experiment, we set \(\alpha \) to a smaller value, i.e., \(\alpha =0.1\).^{Footnote 5}
6.2.2 Results
Effect of the beam width on single-subgroup pattern mining First, we analyze the sensitivity of the standard beam search w.r.t. the beam width for single-subgroup pattern mining. How the search performance changes with the beam width (denoted by x) is illustrated below (see Fig. 4a for the SI value of the identified best pattern and Fig. 4b for the run time).
Clearly, increasing x from 1 to 40 results in the same best pattern (with an SI value of 258.7 and the description 'IDM = 1'), along with a gentle increase in the run time. Although this shows a greedy search (i.e., \(x=1\)) can already perform well, this is not guaranteed.
As a further investigation indicates, increasing the beam width is rendered useless by the existence of a dominant pattern with a single selector (i.e., 'IDM = 1'), such that no other patterns have a higher SI value than it and its children. Once our method incorporates this dominant pattern into the background distribution for a subsequent iteration, to reflect the data analyst's newly acquired knowledge, the advantage of a larger beam width appears: the best pattern is identified once x increases to 3 (see Fig. 5a). The run time grows linearly as x increases (see Fig. 5b).
Effect of the beam width on bi-subgroup pattern mining To study the effects of the beam width, we ran all cases with \(x_1\) and \(x_2\) set to 1, 2, 3, 4, or 7.
For Lastfm, Fig. 6a clearly shows that small beam widths (e.g., \(x_1=1\) with \(x_2=3\)) are sufficient for our algorithm to identify the best bi-subgroup pattern (i.e., the one with an SI of 194.8). This is even more the case for the Reed98 network, as our bi-subgroup pattern mining method always identifies the same best pattern (i.e., the one with an SI of 728) when gradually increasing \(x_1\) and \(x_2\).
For bi-subgroup pattern mining on either Lastfm or Reed98, the run time grows approximately linearly as \(x_1\) or \(x_2\) increases while the other beam width is fixed (see Fig. 6b and c for Lastfm, and Fig. 7b and c for Reed98).
Summary This empirical analysis suggests that overall our algorithms are not sensitive to the beam width. A small beam width is usually sufficient, particularly if there is a dominant pattern. When that is not the case, slightly increasing the beam width was sufficient in our experiments.
We recommend an initial setting of \(x=5\) for single-subgroup pattern discovery and \(x_1 =2, x_2=3\) for bi-subgroup pattern discovery, which is usually more than sufficient. If it is not, the analyst can increment x (or either \(x_1\) or \(x_2\)) by 1 iteratively until satisfactory results are obtained.
6.3 Comparative evaluation (RQ2)
6.3.1 Experimental setup
A comparison between the SI and other objective interestingness measures can only be made on single-subgroup pattern discovery (or, more precisely, dense subgraph mining), because the existing objective measures are limited to quantifying the interestingness of a dense subgraph community.
Choice of datasets and prior beliefs To constrain the search using our SI measure to identify only dense subgraphs, we applied individual vertex degrees as the prior beliefs, and chose sparse networks (i.e., Lastfm and DBLPaffs) for this comparative task. When using the individual vertex degrees as priors, a single-subgroup pattern's density will not be explainable merely from the individual degrees of the constituent vertices. Given the sparsity common in real-world networks, incorporating this prior leads to a background distribution with a low average connection probability. In this case, our algorithm identifies mostly dense clusters (i.e., \(I=0\)), as these are more informative in the sense of strongly contrasting with the expectation of sparsity. Lastfm, DBLPtopics and DBLPaffs are all evidently sparse networks. Among them, Lastfm and DBLPaffs were chosen because their attributes and the discovered patterns are more readily understood.
Baselines For this comparative evaluation, we consider the following baselines:

- Edge density. The number of edges divided by the maximal number of edges.
- Average degree. The degree sum over all vertices divided by the number of vertices.
- Pool's community score (Pool et al. 2014). The reduction in the number of erroneous links between treating each vertex as a single community and treating all vertices as a whole.
- Edge surplus (Tsourakakis et al. 2013). The number of edges exceeding the expected number of edges, assuming each edge is present with the same probability \(\alpha \).
- Segregation index (Freeman 1978). The difference between the expected and observed numbers of inter-edges, normalized by the expectation.
- Modularity of a single community (Newman 2006; Nicosia et al. 2009). The modularity measure of a single community, based on transforming the definition of modularity into a local measure.
- Inverse average-ODF (out-degree fraction) (Yang and Leskovec 2015). 1 minus the average fraction of each vertex's out-degree to its degree.
- Inverse conductance. The number of edges inside the cluster divided by the number of edges leaving the cluster.
More detailed descriptions along with mathematical definitions for these baselines can be found in Table 11 in “Appendix A”.
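A minimal sketch of three of these baselines, for an undirected graph represented as a set of vertex pairs; the computations follow the prose descriptions above, while the exact normalizations are those in Table 11:

```python
def edge_density(nodes, edges):
    """Edges inside the subgraph divided by the maximal possible number."""
    n = len(nodes)
    inside = sum(1 for (u, v) in edges if u in nodes and v in nodes)
    return inside / (n * (n - 1) / 2) if n > 1 else 0.0

def average_degree(nodes, edges):
    """Degree sum within the subgraph divided by the number of vertices."""
    inside = sum(1 for (u, v) in edges if u in nodes and v in nodes)
    return 2 * inside / len(nodes)

def inverse_conductance(nodes, edges):
    """Internal edges divided by edges leaving the cluster."""
    inside = sum(1 for (u, v) in edges if u in nodes and v in nodes)
    leaving = sum(1 for (u, v) in edges if (u in nodes) != (v in nodes))
    return inside / leaving if leaving else float('inf')
```

Note how a cluster with no outgoing edges (e.g., the whole graph) maximizes the inverse conductance, illustrating the triviality issue discussed below.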
Other settings For single-subgroup pattern discovery on both the Lastfm and DBLPaffs networks, we use beam search with beam width 5 and search depth 2.
6.3.2 Results
The four most interesting patterns w.r.t. the SI and these baseline measures on Lastfm are presented in Tables 3 and 4, respectively. For each pattern, we display values for the elements that constitute the pattern syntax, including W, I, and \(k_W\), as well as other statistics including its rank, \(\varepsilon (W)\), and #inter-edges. #inter-edges is the number of connections between \(\varepsilon (W)\) and \(V \setminus \varepsilon (W)\), which indicates how isolated a particular group of members is. For patterns discovered using the SI in particular, we also display \(p_W \cdot n_W\), the expected number of connections within \(\varepsilon (W)\) w.r.t. the background distribution. Comparing \(p_W \cdot n_W\) to \(k_W\) gives a direct sense of how much the analyst's expectation differs from the truth (recall \(p_W\) from Eq. 2).
Here, we summarize the main findings.
Using baselines Each of these objective measures exhibits a particular bias that arguably makes the obtained patterns less useful in practice. The edge density is easily maximized to a value of 1 simply by considering very small subgraphs; this is why the patterns identified using this measure are all composed of only 2 vertices with 1 connecting edge. In contrast, using the average degree tends to find very large communities, because in a large community there are many other vertices to which each vertex can possibly be connected. Although Pool et al. argued that their measure may be larger for larger communities than for smaller ones, in their own experiments on the Lastfm network, as well as in our results, it yields relatively small communities (Pool et al. 2014). As they explained, the reason is that Lastfm's attribute data is extremely sparse, with a density of merely \(0.15\%\). Note that the patterns with the top 10 edge surplus values are the same as those for Pool's measure. Although these two measures are defined in different ways, Pool's measure can be simplified to a form essentially the same as the edge surplus. Pursuing a larger segregation index essentially targets communities that have far fewer cross-community links than expected. This measure emphasizes the number of cross-community links more strongly, and yields extremely small or large communities with few inter-edges on Lastfm. Using the modularity of a single community tends to find rather large communities representing audiences of mainstream music. The results for the inverse average-ODF and the inverse conductance are not displayed in the supplement, because the largest values for these two measures can easily be achieved by a community with no edges leaving it, a trivial example of which is the whole network.
Using the SI We argue that the patterns extracted using our SI measure are the most insightful, striking the right balance between coverage (sufficiently large) and specificity (not conveying overly generic or trivial information). The top one characterises a group of 78 IDM (i.e., intelligent dance music) fans. Audiences in this group are connected more frequently than expected (96 vs. 8.93), and altogether they have only 496 connections to people not into IDM, which is much sparser than the connectivity within the IDM group (the connectivity densities across and within the group are respectively \(496/(78\times 1814)\approx 0.0035\) and \(96/(78\times (78-1)/2)\approx 0.0320\)).
Remark 7
(Results on DBLPaffs) For DBLPaffs, the same conclusion as above can be reached. See the top 4 single-subgroup patterns on DBLPaffs w.r.t. our SI and the other measures in Tables 12 and 13, respectively, in “Appendix A”.
Summary Unlike state-of-the-art objective interestingness measures, each of which exhibits a particular bias, the proposed SI measure achieves a natural balance between coverage and specificity, arguably leading to more insightful patterns.
6.4 The effects of different prior beliefs: a subjective evaluation (RQ3)
6.4.1 Experimental setup
To demonstrate the SI's subjectiveness, we consider different prior beliefs when searching for patterns w.r.t. the SI. We deliberately perform this evaluation on bi-subgroup pattern discovery, as it is a more generic and interesting setting.
Choice of datasets In the following, we analyze the results on Caltech36 and Reed98. These two networks are chosen because their straightforward domain knowledge makes it easy to formulate prior beliefs. People, even those who are not social scientists, normally hold prior beliefs about this sort of friendship network (e.g., they commonly believe that students of different class years are less likely to know each other than students of the same class year).
Other settings For bi-subgroup pattern discovery, we applied the nested beam search with \(x_1 = 2, x_2=3\), and \(D=2\). Moreover, we constrain the target descriptions \(W_1\) and \(W_2\) to include at least one common attribute, but with different values, so that the corresponding subgroups \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\) do not overlap. Under this setting, the obtained patterns are more explainable, and the results are easier to evaluate.
6.4.2 Results
The 4 most subjectively interesting patterns under each prior belief are presented in Table 6 (for Caltech36) and Table 7 (for Reed98), with the associated notation summarized in Table 5.
Incorporating Prior 1 We first incorporated the prior belief on the individual vertex degrees (i.e., Prior 1). In general, the identified patterns reflect knowledge commonly held by people, and are not very useful. The top 4 patterns on Caltech36 all reveal that people graduating in different years rarely know each other (rows for Prior 1 in Table 6), in particular between those in the class of 2006 and those in the class of 2008 (indicated by the most interesting pattern). Although \(W_2\) of the second pattern (i.e., status = alumni) does not contain the attribute graduation year, it implicitly represents people who graduated in earlier years. For Reed98, the patterns discovered under Prior 1 also express the negative influence of different graduation years on connections (rows for Prior 1 in Table 7).
Incorporating Prior 1 and Prior 2 We then incorporated prior beliefs on the densities between bins for different graduation years (i.e., Prior 2). All of the top 4 extracted patterns on Caltech36 indicate rare connections between people living in different dormitories, which is also not surprising (rows for Prior 1 + Prior 2 in Table 6).
For Reed98, incorporating Prior 1 and Prior 2 yields interesting patterns (rows for Prior 1 + Prior 2 in Table 7). The top one indicates that people living in dormitory 88 are friends with many in dormitory 89. In contrast, people commonly believe that those living in different dormitories are less likely to know each other. For an analyst who holds such a preconceived notion, this pattern is interesting. Both the fourth and the seventh patterns reveal that a certain person knew more people in the class of 2009 than expected.
Incorporating Prior 1, Prior 2 and Prior 3 For Caltech36, by additionally incorporating prior beliefs on the dependency of the connection probability on the difference in dormitories (i.e., Prior 3), patterns characterizing some interesting dense connections are discovered (rows for Prior 1 + Prior 2 + Prior 3 in Table 6). For instance, the top pattern indicates that three people in the class of 2004 connect with many in the class of 2008. In fact, these three people's graduation had been postponed, as their status was still 'student' rather than 'alumni' in 2005. Furthermore, the starting year of the 2008 cohort is exactly when these three people should have graduated, so these two groups had opportunities to become friends. The fourth pattern indicates that an alumnus who had studied at a certain high school knew almost all the students living in a certain dormitory. The reason behind this pattern might be worth investigating; it could be, for instance, that this alumnus worked in this dormitory.
Summary As the results show, incorporating different prior beliefs leads to the discovery of different patterns that strongly contrast with those beliefs. The proposed SI measure thus succeeds in quantifying interestingness in a subjective manner.
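A degree prior such as Prior 1 constrains the background distribution so that the expected degree of every vertex matches its observed degree, in the maximum-entropy spirit of De Bie (2011a). The following is a minimal sketch, assuming the standard product-form model \(p_{uv}=\sigma(\lambda_u+\lambda_v)\) fitted by coordinate-wise gradient ascent; the function name, step count, and learning rate are our illustrative choices, not taken from the paper.

```python
import math

def fit_degree_background(adj, n_steps=500, lr=0.1):
    """Fit a maximum-entropy edge model p_uv = sigmoid(l_u + l_v) whose
    expected degrees match the observed ones (a sketch of a degree prior)."""
    n = len(adj)
    deg = [sum(row) for row in adj]      # observed degrees
    lam = [0.0] * n                      # one Lagrange multiplier per vertex

    def sig(x):
        return 1.0 / (1.0 + math.exp(-x))

    for _ in range(n_steps):
        for u in range(n):
            expected = sum(sig(lam[u] + lam[v]) for v in range(n) if v != u)
            lam[u] += lr * (deg[u] - expected)   # ascend the log-likelihood
    # resulting background edge probabilities (no self-edges)
    return [[0.0 if u == v else sig(lam[u] + lam[v]) for v in range(n)]
            for u in range(n)]
```

Under this background, a dense subgraph between two high-degree groups is unsurprising, which is why patterns like "alumni of different years rarely know each other" only become interesting relative to richer priors.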
6.5 Evaluation on iterative pattern mining (RQ4)
6.5.1 Experimental setup
Our method is naturally suited to iterative pattern mining: each newly obtained pattern is incorporated into the background distribution for subsequent iterations. We demonstrate this on the search for bi-subgroup patterns, as they are the more generic type.
Choice of datasets The datasets DBLPaffs and Lastfm are used, as their attributes have clear and straightforward meanings, which makes the discovered patterns easy to explain.
Other settings Other settings for this task are the same as for RQ2. The nested beam search with \(x_1 = 2, x_2=3\), and \(D=2\) was applied. The target descriptions \(W_1\) and \(W_2\) are constrained to include at least one common attribute but with different values, so that the corresponding pair of subgroups \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\) do not overlap.
6.5.2 Results
Results for Lastfm are displayed and discussed in “Appendix B”. Here we only analyze the results on DBLPaffs. Table 8 displays the top 3 patterns found in each of the four iterations on DBLPaffs.
Iteration 1 Initially, we incorporated a prior on the overall graph density. The resulting top pattern indicates that papers from institutes in the USA seldom cite those from other countries.
Iteration 2 After incorporating the top pattern from iteration 1, a set of dense patterns was identified. All of the top 3 patterns reveal a highly-cited subgroup of papers whose authors are affiliated with institutes in California and New Jersey. This agrees with the fact that many of the world's largest high-tech corporations and reputable universities are located in these regions: examples include Silicon Valley and Stanford University in CA, and NEC Laboratories and AT&T Laboratories in NJ.
Iteration 3 The top 3 patterns in iteration 3 reveal that papers by authors with Chinese affiliations are rarely cited by papers with authors from other countries. However, they are frequently cited by papers with Chinese authors, as indicated by the top single-subgroup pattern we identified in DBLPaffs (see Table 12 in “Appendix A”). This suggests that researchers with Chinese affiliations are surprisingly isolated, and the reason might be interesting to investigate.
Iteration 4 The top patterns in iteration 4 reveal that papers from institutions in Washington state are highly cited by others, in particular by papers from California. Closer inspection revealed that the majority of these papers are written by authors from Microsoft Corporation and the University of Washington.
Summary By incorporating the newly obtained patterns into the background distribution for subsequent iterations, our method can identify patterns that strongly contrast with this knowledge. This results in a set of patterns that are non-redundant and highly surprising to the data analyst. Note that the lack of redundancy arises naturally, without the need to explicitly constrain the overlap between patterns in consecutive iterations. In fact, some amount of overlap may still occur, as long as the non-redundant part of the information is sufficiently large.
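The iterative scheme above can be sketched as a simple loop: start from a background distribution at the overall graph density, pick the candidate subgroup pair with the highest SI, and update the background so that it reproduces the density of the found pattern. The sketch below is a toy version under stated simplifications: the SI formula is a KL-based proxy with a crude description-length term, and the candidate subgroups are given explicitly rather than found by beam search.

```python
import math
from itertools import combinations

def si(adj, A, B, bg, alpha=0.1):
    """SI proxy for a bi-subgroup pattern: information content of the
    observed density w.r.t. the background, over a crude DL term."""
    pairs = [(u, v) for u in A for v in B if u != v]
    p = sum(adj[u][v] for u, v in pairs) / len(pairs)
    p = min(max(p, 1e-9), 1 - 1e-9)
    ic = 0.0
    for u, v in pairs:
        q = min(max(bg[(u, v)], 1e-9), 1 - 1e-9)
        ic += p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    return ic / (1 + alpha * (len(A) + len(B)))

def mine_iteratively(adj, subgroups, n_iter=2):
    """Repeatedly pick the most interesting pair of candidate subgroups
    and fold the found pattern's density into the background."""
    n = len(adj)
    m = sum(adj[u][v] for u in range(n) for v in range(n) if u != v)
    bg = {(u, v): m / (n * (n - 1))           # overall graph density
          for u in range(n) for v in range(n) if u != v}
    found = []
    for _ in range(n_iter):
        best_A, best_B = max(combinations(subgroups, 2),
                             key=lambda ab: si(adj, ab[0], ab[1], bg))
        found.append((best_A, best_B))
        # incorporate the pattern: the background now reproduces its density
        pairs = [(u, v) for u in best_A for v in best_B if u != v]
        d = sum(adj[u][v] for u, v in pairs) / len(pairs)
        for u, v in pairs:
            bg[(u, v)] = bg[(v, u)] = d
    return found
```

Once a dense (or sparse) pair has been incorporated, its SI drops to near zero in later iterations, which is exactly why the reported patterns across iterations are non-redundant.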
6.6 Empirical results on the discovered global patterns (RQ5)
To demonstrate the use of our method for mining interesting global patterns, we illustrate and analyze the experimental results on DBLPaffs (in Sect. 6.6.1), DBLPtopics (in Sect. 6.6.2) and MP (in “Appendix C”). Each of these datasets serves as an interesting case study for evaluating our method.
6.6.1 Case study on DBLPaffs
Task Paper citations relate to the authors' affiliations to some extent. For example, institutions in particular countries or regions are reputable and often produce highly-cited research. Also, collaborations and mutual citations may frequently occur among institutions from certain countries or regions. Of particular interest could thus be patterns describing that a subgroup of papers from affiliations A frequently (or rarely) cites papers in another subgroup from affiliations B. We show that such patterns can be revealed by a summarization yielded by our approach.
The resulting summarization By running our algorithm for 6 iterations, this citation network is summarized into 7 subgroups, each consisting of papers satisfying a particular description of their authors' affiliations. These 7 subgroups are respectively defined by

1. USA = 1 and WA (Washington) = 1;
2. USA = 1 and WA = 0 and China = 1;
3. USA = 1 and WA = 0 and China = 0 and CA (California) = 1 and NJ (New Jersey) = 1;
4. USA = 1 and WA = 0 and China = 0 and CA = 1 and NJ = 0;
5. USA = 1 and WA = 0 and China = 0 and CA = 0;
6. USA = 0 and China = 1;
7. USA = 0 and China = 0.
The summary is displayed in Fig. 8. In the following, we discuss properties of the local subgroup patterns revealed by our summarization to assess its validity.
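The conjunctive descriptions above assign every paper to exactly one subgroup, and the summary is characterized by the citation densities between the subgroups. A minimal sketch of both steps (the helper names and the toy data are ours, purely for illustration):

```python
def matches(paper, description):
    """Does a paper's binary attribute vector satisfy a conjunctive
    description such as {'USA': 1, 'WA': 0}?"""
    return all(paper.get(a, 0) == v for a, v in description.items())

def assign(papers, descriptions):
    """Assign each paper to the first matching description; the listed
    descriptions are mutually exclusive and exhaustive by construction."""
    groups = [[] for _ in descriptions]
    for i, paper in enumerate(papers):
        for g, d in enumerate(descriptions):
            if matches(paper, d):
                groups[g].append(i)
                break
    return groups

def citation_density(cites, A, B):
    """Fraction of ordered pairs (a in A, b in B) such that a cites b;
    cites is a set of directed (citing, cited) index pairs."""
    pairs = [(u, v) for u in A for v in B if u != v]
    return sum((u, v) in cites for u, v in pairs) / len(pairs)
```

Evaluating `citation_density` for every ordered pair of subgroups yields exactly the matrix visualized as the heatmap in Fig. 9.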
Remark 8
(Redundancy in the descriptions) One may notice that some subgroup descriptions could be more concise. For example, the first subgroup description “USA = 1 and WA = 1” induces the same extension as just “WA = 1”. There is no mechanism in our global pattern mining approach that would prefer the shorter of two descriptions of the same subgroup. Yet, such redundancy can easily be identified and removed in post-processing. Moreover, this issue does not affect our single-subgroup and bi-subgroup pattern mining approaches, where each iteration of the search identifies an optimal pattern rather than a split (as in the global pattern mining approach), and the shorter description of the same subgroup would have a larger SI value owing to its smaller DL value.
Discussion A series of interesting local subgroup patterns emerge from the resulting summarization. The density matrix, whose entry in the ith row and jth column is the citation density from papers in the ith subgroup to those in the jth, is visualized as a heatmap, with a dendrogram along its left side illustrating the splitting hierarchy (see Fig. 9).
Obviously, the most cohesive subgroup is that of papers from institutions in Washington state in the USA, as they cite papers within this subgroup most frequently (indicated by the darkest green square in the top left). Closer inspection revealed that the majority of these papers are written by authors from Microsoft Corporation and the University of Washington.
The most highly-cited subgroup is the third one (indicated by the dark color of all squares along the third column except the one in the third row). This subgroup contains only 15 papers, whose authors are affiliated with institutes in California and New Jersey, and neither in Washington nor China. Note that this also agrees with the bi-subgroup patterns found in the previous experiment on iterative mining (Iteration 2 in Sect. 6.5). As already pointed out, many of the world's largest high-tech corporations and reputable universities are located in these regions, including Silicon Valley and Stanford University in CA, and NEC Laboratories and AT&T Laboratories in NJ.
Another interesting subgroup is the second one, whose authors have affiliations in both China and the USA (outside Washington). Researchers related to this subgroup are surprisingly isolated: their papers are seldom cited by those from other subgroups, but very frequently (more precisely, the second most frequently) within this subgroup (indicated by the light color of all squares along the second column except the one in the second row). In fact, most papers in this subgroup are co-authored by researchers affiliated with organisations in China and researchers affiliated with organisations in the USA. The reason for their isolation might be interesting for data analysts to investigate. Again, this coincides with what we found in the experiment on iterative mining (Iteration 3 in Sect. 6.5). The difference is that the subgroup identified here is more specific (i.e., additionally requiring an affiliation in the USA outside Washington).
A follow-up experiment The remaining subgroup, defined by USA = 0 and China = 0 (i.e., the 7th one), contains a considerable number of members (indicated by the largest circle in Fig. 8). Running our algorithm for subsequent iterations tends to split this subgroup further, revealing some cohesive groups affiliated with organisations in other countries. For example, subgroups related to affiliations in Singapore, Canada, and the Netherlands emerge in the first 3 subsequent iterations (see the corresponding splitting hierarchy highlighted by red dashed lines in Fig. 10). They all cite papers within the same subgroup, or those from the third subgroup (i.e., the overall most highly-cited one), very frequently (see rows 7, 8, 9 of the heatmap in Fig. 10).
6.6.2 Case study on DBLPtopics
Task A data analyst working for an academic organization may want to obtain a high-level view of citation vitality among different research fields. Given the DBLPtopics dataset, we show here that the global pattern identified by our summarization approach can provide such a high-level view, revealing interesting local subgroup patterns of the form ‘papers of research field A frequently (or rarely) cite those of field B’. We also show that the obtained global pattern can provide the data analyst with further insights by linking it with information on the distribution of papers over different conferences.
The resulting summarization The summarization of DBLPtopics is generated by running our algorithm for 4 iterations, and the resulting summarization rule divides all papers into the following 5 subgroups:

1. \(a_1<Q_2^{a_1} \wedge a_8\ge Q_1^{a_8}\) (Theoretical machine learning);
2. \(a_1<Q_2^{a_1}\wedge a_8<Q_1^{a_8}\) (Practical machine learning);
3. \(a_1\ge Q_2^{a_1}\wedge a_5<Q_3^{a_5} \wedge a_3<Q_3^{a_3}\) (Data mining);
4. \(a_1\ge Q_2^{a_1}\wedge a_5<Q_3^{a_5} \wedge a_3 \ge Q_3^{a_3}\) (Information retrieval);
5. \(a_1\ge Q_2^{a_1}\wedge a_5\ge Q_3^{a_5}\) (Database).
For each subgroup, we list its original description and a corresponding short interpretation (in brackets) based on summarizing the attributes' meanings. As mentioned previously (in Sect. 6.1), an attribute is essentially one of the first 50 LSI components of the original paper-topic matrix. Its meaning can thus be described by its 5 subcomponents with the highest absolute weights (shown in Table 9). A higher weight means the attribute's meaning is closer to (positive sign) or more contrasting with (negative sign) the research field. We use these short interpretations rather than the original descriptions in what follows, as they are more straightforward. Overall, this summarization not only successfully captures the 4 research areas that publications in DBLPtopics are intended to cover (i.e., Machine Learning, Database, Information Retrieval, and Data Mining), but also identifies a deeper-level structure (i.e., the partition of machine learning papers into two subgroups according to the aspects they emphasize: more practical or more theoretical).
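In these descriptions, \(Q_k^{a_i}\) denotes the kth sample quartile of attribute \(a_i\), and each description is a conjunction of threshold selectors of the form \(a_i < Q_k^{a_i}\) or \(a_i \ge Q_k^{a_i}\). A small sketch of how such selectors can be built and evaluated (the helper names and the toy data are ours, not from the paper):

```python
from statistics import quantiles

def quartile_selectors(name, values):
    """Turn one numeric attribute into threshold selectors of the form
    'a<Qk' / 'a>=Qk', keyed by a string label."""
    q1, q2, q3 = quantiles(values, n=4)          # sample quartiles
    sels = {}
    for k, cut in (('Q1', q1), ('Q2', q2), ('Q3', q3)):
        sels[f'{name}<{k}'] = lambda x, c=cut: x < c
        sels[f'{name}>={k}'] = lambda x, c=cut: x >= c
    return sels

def satisfies(point, conjunction, selectors):
    """Check a conjunctive description such as ['a1>=Q2', 'a5<Q3'] on a
    data point given as a dict of attribute values."""
    return all(selectors[s](point[s.split('<')[0].split('>=')[0]])
               for s in conjunction)
```

Note that `statistics.quantiles` defaults to the 'exclusive' method; other quantile conventions would shift the cut points slightly without changing the idea.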
The summary of DBLPtopics based on the resulting summarization rule is displayed in Fig. 11. To highlight the citation vitality between each pair of subgroups, the corresponding citation density matrix is visualized as a heatmap, lined up with a dendrogram on the left illustrating the splitting hierarchy (see Fig. 12).
Discussion As shown in Fig. 12, the citation density within the same subgroup is often high, indicating that papers in the same research field often cite each other.
Exceptions are the second subgroup (practical machine learning) and the third (data mining), which respectively cite the fifth (database) and the fourth (information retrieval) most frequently. This accords with the fact that solving practical machine learning or data mining research questions often requires database or information retrieval techniques for some subtasks.
Clearly, the fourth and the fifth subgroups are the most cohesive (indicated by the two evidently dark green squares in the fourth and fifth positions on the diagonal). Also, these two groups cite each other and the data mining subgroup very frequently.
One downstream task: learning more about conferences The summarization generated by our approach can be useful in downstream analysis tasks. Here we show an example of using it to learn more about conferences, simply by linking it with the distribution of publications over the 20 selected conferences within each subgroup (displayed in Fig. 13).
First, by merely looking at the distribution for each subgroup, a data analyst can learn the relationship between research fields and conferences, e.g., answering questions such as which research field is dominated by which conference. As can be seen, a noticeably large proportion of the publications related to information retrieval (the fourth subgroup) are in SIGIR and CIKM, and the database publications (the fifth subgroup) are mostly in ICDE, VLDB, and SIGMOD. The data mining subgroup (the third one) is special in the sense that its publications are distributed quite evenly: WWW holds only a slim lead, and KDD, AAAI, ICDM, and CIKM each contribute a little more than the remaining venues. Moreover, it is interesting to notice that KDD and ICDM appear to be more interdisciplinary, accepting papers surprisingly evenly across these research areas compared to other conferences (as there is no noticeably longer dark brown or light green rectangle in any of these 5 horizontal bins in Fig. 13).
Also, the data analyst can combine Figs. 12 and 13 to deduce the citation vitality among different conferences. For example, publications in SIGIR and CIKM often cite those in these same two conferences (as the fourth subgroup is very cohesive), and they also often cite publications in WWW, AAAI, KDD, and CIKM (those dominating the third subgroup).
Summary As shown by these case studies on different datasets, the global patterns identified by our method can not only directly provide insights by revealing a series of interesting single-subgroup and bi-subgroup patterns, but can also be utilized to facilitate downstream analysis tasks.
6.7 Scalability evaluation (RQ6)
6.7.1 Experimental setup
Choice of datasets We used Lastfm to investigate scalability with respect to the number of selectors, as it yields the largest number of selectors (i.e., 21,695) as the search space.
Other settings As in the other experiments, in the scalability evaluation we applied the beam search with \(x=5\) (for single-subgroup pattern discovery), the nested beam search with \(x_1 = 2\), \(x_2 = 3\), and \(D = 2\) (for bi-subgroup pattern discovery), and 8 processors running in parallel (for global pattern mining).
6.7.2 Results
Effect of S Figure 14 displays the run time on Lastfm w.r.t. the number of selectors in the search space (i.e., S). Clearly, for both single-subgroup and global pattern mining, the run time grows linearly as we gradually double S (from 10 to 20,480), whereas the run time for bi-subgroup pattern mining increases more than linearly, exceeding 1 day when S is larger than 2560.
Run time The run times of our experiments for RQ2 to RQ5, as well as the statistics of S and V, are listed in Table 10. The influence of S and V on the run time is evident.
Summary The run time grows linearly in the number of selectors for both single-subgroup and global pattern mining, whereas it grows faster than linearly for bi-subgroup pattern mining.
7 Conclusion
Prior work on pattern mining in attributed graphs typically searches only for dense subgraphs (‘communities’) with homogeneous attributes. We generalized this type of pattern to densities within a single subgroup (whether dense or sparse; referred to as single-subgroup patterns), between a pair of different subgroups (bi-subgroup patterns), and between all pairs from a set of subgroups that partition the whole vertex set (global patterns).
We developed a novel information-theoretic approach for quantifying the interestingness of such patterns in a subjective manner, with respect to a flexible type of prior knowledge the analyst may have about the graph, including insights gained from previously found patterns.
The empirical results show that our method can efficiently find interesting patterns of these different new types. In the standard problem of dense subgraph mining, our method yields results that are superior to the state-of-the-art. We also demonstrated empirically that our method succeeds in taking prior knowledge into account in a meaningful way.
The proposed SI interestingness measure has considerable advantages, but the price to pay for this is computational time. To maintain tractability, we resorted to accurate heuristic search strategies. Useful future work would be to devise a search strategy with performance guarantees and to further speed up the search (e.g., by branch and bound).
Notes
We consider undirected graphs without self-edges for the sake of presentation and consistency with most of the literature. However, we note that all our results can easily be extended to directed graphs and graphs with self-edges.
Simply by replacing \(\gamma _{k_{u,v}}\) in Eq. 1 with \(\sum _{i=1}^{h}\gamma _{k_{u,v}^{i}}\), where h is the number of attributes considered (also the number of partitions).
This citation dataset is extracted from the DBLP website (https://dblp.uni-trier.de/), and contains 4,107,340 publications (from an unknown year until May 2019) and 36,624,464 citation relationships. It can be accessed at https://aminer.org/citation.
AAAI, CIKM, ECIR, ECML-PKDD, ICDE, ICDM, ICDT, ICLR, ICML, IJCAI, KDD, NIPS, PAKDD, PODS, SDM, SIGIR, SIGMOD, VLDB, WSDM, WWW.
In this sensitivity investigation, applying a relatively large \(\alpha \) (e.g., \(\alpha =0.6\)) is more likely to lead to positive results (i.e., showing insensitivity, as the same best pattern is always identified while varying the beam width), but for the wrong reason: a larger \(\alpha \) in the SI measure penalizes more complex patterns more heavily, which makes the best pattern found before further branching in the beam search dominate more easily, yielding less credible positive results. We therefore conservatively chose \(\alpha = 0.1\) in this experiment.
References
Adhikari B, Zhang Y, Bharadwaj A, Prakash BA (2017) Condensing temporal networks using propagation, pp 417–425. https://doi.org/10.1137/1.9781611974973.47
Adriaens F, Lijffijt J, De Bie T (2017) Subjectively interesting connecting trees. In: Ceci M, Hollmén J, Todorovski L, Vens C (eds) Machine learning and knowledge discovery in databases: European conference, ECML PKDD 2017, Skopje, Macedonia, Sept 18–22, 2017, Proceedings, Part II, Springer, vol 10535, pp 53–69. https://doi.org/10.1007/9783319712468_4
Akoglu L, Tong H, Meeder B, Faloutsos C (2012) PICS: parameter-free identification of cohesive subgroups in large attributed graphs, pp 439–450. https://doi.org/10.1137/1.9781611972825.38
Aral S, Muchnik L, Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proc Natl Acad Sci 106(51):21544–21549. https://doi.org/10.1073/pnas.0908800106
Atzmueller M (2015) Subgroup discovery. WIREs Data Min Knowl Discov 5(1):35–49. https://doi.org/10.1002/widm.1144
Atzmueller M, Doerfel S, Mitzlaff F (2016) Description-oriented community detection using exhaustive subgroup discovery. Inf Sci 329:965–984. https://doi.org/10.1016/j.ins.2015.05.008
Barbieri N, Bonchi F, Manco G (2014) Who to follow and why: link prediction with explanations. In: The 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’14, New York, NY, USA, Aug 24–27, 2014, pp 1266–1275. https://doi.org/10.1145/2623330.2623733
Boley M, Mampaey M, Kang B, Tokmakov P, Wrobel S (2013) One click mining: interactive local pattern discovery through implicit preference and performance learning. In: IDEA ’13 proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics, ACM, New York, NY, USA 2013, pp 27–35. https://doi.org/10.1145/2501511.2501517
Cantador I, Brusilovsky P, Kuflik T (2011) 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011) In: Proceedings of the 5th ACM conference on recommender systems. ACM, New York, NY, USA, RecSys 2011
Casiraghi G, Nanumyan V, Scholtes I, Schweitzer F (2016) Generalized hypergeometric ensembles: statistical hypothesis testing in complex networks. arXiv:1607.02441
Chen C, Lin CX, Fredrikson M, Christodorescu M, Yan X, Han J (2009) Mining graph patterns efficiently via randomized summaries. Proc VLDB Endow 2:742–753
Chen X, Kang B, Lijffijt J, De Bie T (2020) ALPINE: active link prediction using network embedding. arXiv eprints arXiv:2002.01227
Cheng H, Zhou Y, Yu JX (2011) Clustering large attributed graphs: a balance between structural and attribute similarities. ACM Trans Knowl Discov Data (TKDD) 5(2):12:1–12:33. https://doi.org/10.1145/1921632.1921638
Chernoff H (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann Math Stat 23(4):493–507. https://doi.org/10.1214/aoms/1177729330
De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’11, pp 564–572. https://doi.org/10.1145/2020408.2020497
De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446. https://doi.org/10.1007/s1061801002093
De Bie T (2013) Subjective interestingness in exploratory data mining. In: Proceedings of the 12th international symposium on advances in intelligent data analysis XII—volume 8207, Springer, Berlin, IDA 2013, pp 19–31. https://doi.org/10.1007/9783642413988_3
Deng J, Kang B, Lijffijt J, Bie TD (2020) Explainable subgraphs with surprising densities: a subgroup discovery approach. In: Proceedings of the 2020 SIAM international conference on data mining, Cincinnati, Ohio, USA
Fond TL, Neville J (2010) Randomization tests for distinguishing social influence and homophily effects. In: Proceedings of the 19th international conference on world wide web, WWW ’10, ACM, pp 601–610
Freeman LC (1978) Segregation in social networks. Sociol Methods Res 6(4):411–429. https://doi.org/10.1177/004912417800600401
Fronczak A (2012) Exponential random graph models. arxiv:1210.7828
Galbrun E, Gionis A, Tatti N (2014) Overlapping community detection in labeled graphs. Data Min Knowl Discov 28(5–6):1586–1610. https://doi.org/10.1007/s106180140373y
Gong NZ, Talwalkar A, Mackey L, Huang L, Shin ECR, Stefanov E, Shi ER, Song D (2014) Joint link prediction and attribute inference using a socialattribute network. ACM Trans Intell Syst Technol 5(2):1–20. https://doi.org/10.1145/2594455
Günnemann S, Färber I, Boden B, Seidl T (2010) Subspace clustering meets dense subgraph mining: a synthesis of two paradigms. In: 2010 IEEE international conference on data mining, pp 845–850. https://doi.org/10.1109/ICDM.2010.95
Günnemann S, Boden B, Seidl T (2011) DBCSC: a density-based approach for subspace clustering in graphs with feature vectors. In: Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, Heidelberg, pp 565–580
Harris JK (2013) An introduction to exponential random graph modeling, vol 173. Sage Publications, Beverly Hills
Hassanlou N, Shoaran M, Thomo A (2013) Probabilistic graph summarization. In: Wang J, Xiong H, Ishikawa Y, Xu J, Zhou J (eds) Web-age information management. Springer, Berlin, pp 545–556
Herrera F, Carmona CJ, González P, del Jesus MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525. https://doi.org/10.1007/s1011501003562
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58(301):13–30. https://doi.org/10.1080/01621459.1963.10500830
Holland PW, Leinhardt S (1981) An exponential family of probability distributions for directed graphs. J Am Stat Assoc 76(373):33–50. https://doi.org/10.1080/01621459.1981.10477598
Lemmerich F, Becker M (2018) pysubgroup: easy-to-use subgroup discovery in python. In: Joint European conference on machine learning and knowledge discovery in databases, pp 658–662
Li J, Wu L, Zaïane O, Liu H (2017) Toward personalized relational learning. In: Proceedings of the 17th SIAM international conference on data mining, SDM 2017, Society for Industrial and Applied Mathematics Publications, United States, pp 444–452
Liu Y, Safavi T, Dighe A, Koutra D (2018) Graph summarization methods and applications: a survey. ACM Comput Surv (CSUR) 51(3):62:1–62:34. https://doi.org/10.1145/3186727
McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social networks. Annu Rev Sociol 27(1):415–444. https://doi.org/10.1146/annurev.soc.27.1.415
Meeng M, Knobbe A (2011) Flexible enrichment with cortana–software demo. In: Proceedings of BeneLearn, pp 117–119
Moser F, Colak R, Rafiey A, Ester M (2009) Mining cohesive patterns from graphs with feature vectors. In: Proceedings of the 2009 SIAM international conference on data mining, pp 593–604. https://doi.org/10.1137/1.9781611972795.51
Mougel PN, Plantevit M, Rigotti C, Gandrillon O, Boulicaut JF (2010) Constraintbased mining of sets of cliques sharing vertex properties. In: Workshop on analysis of complex networks ACNE’10 colocated with ECML PKDD 2010, Barcelona, Spain, pp 48–62. https://hal.archivesouvertes.fr/hal01381539
Newman M (2006) Modularity and community structure in networks. Proc Natl Acad Sci 103(23):8577–8582. https://doi.org/10.1073/pnas.0601602103
Nicosia V, Mangioni G, Carchiolo V, Malgeri M (2009) Extending the definition of modularity to directed graphs with overlapping communities. J Stat Mech Theory Exp 03:P03024. https://doi.org/10.1088/17425468/2009/03/p03024
Perozzi B, Akoglu L, Iglesias Sánchez P, Müller E (2014) Focused clustering and outlier detection in large attributed graphs. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’14, pp 1346–1355. https://doi.org/10.1145/2623330.2623682
Pool S, Bonchi F, van Leeuwen M (2014) Description-driven community detection. ACM Trans Intell Syst Technol (TIST) 5(2):28:1–28:28. https://doi.org/10.1145/2517088
Shi L, Tong H, Tang J, Lin C (2015) Vegas: visual influence graph summarization on citation networks. IEEE Trans Knowl Data Eng 27(12):3417–3431. https://doi.org/10.1109/TKDE.2015.2453957
Sinha A, Shen Z, Song Y, Ma H, Eide D, Hsu BP, Wang K (2015) An overview of Microsoft Academic Service (MAS) and applications. In: Proceedings of the 24th international conference on world wide web, ACM, pp 243–246
Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z (2008) ArnetMiner: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 990–998
Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’08, pp 567–580. https://doi.org/10.1145/1376616.1376675
Traud AL, Mucha PJ, Porter MA (2012) Social structure of facebook networks. Physica A Stat Mech Appl 391(16):4165–4180. https://doi.org/10.1016/j.physa.2011.12.021
Tsourakakis C, Bonchi F, Gionis A, Gullo F, Tsiarli M (2013) Denser than the densest subgraph: extracting optimal quasicliques with quality guarantees. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’13, pp 104–112. https://doi.org/10.1145/2487575.2487645
van Leeuwen M, De Bie T, Spyropoulou E, Mesnage C (2016) Subjective interestingness of subgraph patterns. Mach Learn 105(1):41–75. https://doi.org/10.1007/s1099401555393
Wang X, Jin D, Cao X, Yang L, Zhang W (2016) Semantic community identification in large attribute networks. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, AAAI Press, AAAI’16, pp 265–271. http://dl.acm.org/citation.cfm?id=3015812.3015851
Wei X, Xu L, Cao B, Yu PS (2017) Cross view link prediction by learning noise-resilient representation consensus. In: Proceedings of the 26th international conference on world wide web, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, WWW ’17, pp 1611–1619. https://doi.org/10.1145/3038912.3052575
Wu Y, Zhong Z, Xiong W, Jing N (2014) Graph summarization for attributed graphs. In: 2014 International conference on information science, electronics and electrical engineering, vol 1, pp 503–507. https://doi.org/10.1109/InfoSEEE.2014.6948163
Xu Z, Ke Y, Wang Y, Cheng H, Cheng J (2012) A modelbased approach to attributed graph clustering. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’12, pp 505–516. https://doi.org/10.1145/2213836.2213894
Yang J, Leskovec J (2015) Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42(1):181–213. https://doi.org/10.1007/s101150130693z
Yin Z, Gupta M, Weninger T, Han J (2010) A unified framework for link recommendation using random walks. In: 2010 international conference on advances in social networks analysis and mining, pp 152–159. https://doi.org/10.1109/ASONAM.2010.27
Zhang N, Tian Y, Patel JM (2010) Discovery-driven graph summarization. In: 2010 IEEE 26th international conference on data engineering (ICDE 2010), pp 880–891. https://doi.org/10.1109/ICDE.2010.5447830
Zhou Y, Cheng H, Yu JX (2009) Graph clustering based on structural/attribute similarities. Proc VLDB Endow 2(1):718–729. https://doi.org/10.14778/1687627.1687709
Acknowledgements
The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC Grant Agreement No. 615517, from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme, from the FWO (Project Nos. G091017N, G0F9816N, 3G042220), and from the European Union’s Horizon 2020 research and innovation programme and the FWO under the Marie Sklodowska-Curie Grant Agreement No. 665501.
Additional information
Responsible editor: M. J. Zaki
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
For Section 6.3: A comparative evaluation on DBLPaffs network (RQ2)
The objective interestingness measures we used for comparison, together with their explanations, are listed in Table 11.
We consider undirected graphs for the sake of presentation and consistency with most literature. However, we note that the generalization to directed graphs is straightforward.
For Section 6.5: Evaluation on the iterative pattern mining on Lastfm dataset (RQ4)
Table 14 displays the top 3 patterns found in each of the five iterations on Lastfm. The description search space is built from only the 100 most frequently used tags, i.e., \(S=100\times 2\).
Iteration 1 Initially, we incorporate the prior belief on individual vertex degrees. The most interesting extracted pattern reflects a conflict between aggressive heavy metal fans and mainstream pop lovers who do not listen to heavy metal at all.
Iteration 2 After incorporating the top pattern identified in iteration 1, the new top pattern again expresses a conflict between mainstream and non-mainstream music preferences, but of another kind (i.e., pop with no indie, and experimental with no pop). Also, only the second pattern from iteration 1 remains in the iteration 2 top list, and with a lower rank (third). Under the updated background distribution, the interestingness of any sparse pattern associated with the newly incorporated one is expected to decrease, as the data analyst would no longer be surprised by such a pattern.
Iteration 3 In iteration 3, our method tends to identify interesting dense patterns, mainly related to the synth pop and new wave genres. The top one states that synth pop fans frequently connect with many people who listen to new wave but not synth pop. This pattern appears fallacious at first glance. Nevertheless, synth pop is a subgenre of new wave music. Moreover, the latter group may well listen to synth pop but use a different tag, ‘synthpop’ instead of ‘synth pop’: as many as 102 listeners tag synth pop only as ‘synthpop’ (see the third pattern). Hence, this pattern makes sense, as it describes dense connections between two groups that resemble each other.
Iteration 4 The top 3 patterns in iteration 4 all express negative associations between new wave and some sort of catchy mainstream music (e.g., pop, rnb, or hiphop, among several others).
Iteration 5 Once we incorporate the most interesting pattern, patterns characterizing positively associated genres stand out. For example, the top pattern in iteration 5 indicates that instrumental listeners are friends with many ambient listeners who do not listen to instrumental music. These two genres are not opposite concepts and have much in common (e.g., recordings in both typically contain no lyrics). In fact, ambient music can be regarded as a slow form of instrumental music.
Summary By incorporating the newly obtained patterns into the background distribution for subsequent iterations, our method can identify patterns that strongly contrast with this knowledge. This results in a set of patterns that are not redundant and are highly surprising to the data analyst. Note that this does not mean we restrict patterns from different iterations to be unassociated with each other; in fact, overlap can occur when it is informative.
For Section 6.6: A further case study on MPvotes for the evaluation of global pattern mining
Task Brexit is a hot topic of debate in the UK. MPs’ voting behaviour on Brexit might affect the likelihood of their connections. Using this information to summarize the MPs’ friendship network thus has the potential to provide insight into the Brexit saga. We here investigate whether our approach can achieve this.
The resulting summarization The summarization of MPvotes generated by running our algorithm for 4 iterations splits all MPs into 5 subgroups, which are respectively defined by:

1. \(\text{I1}=-1\) or 0 \(\wedge \) \(\text{I10 V3}=-1\) or 0 \(\wedge \) \(\text{I10 V4}=1\) or 0;
2. \(\text{I1}=-1\) or 0 \(\wedge \) \(\text{I10 V3}=-1\) or 0 \(\wedge \) \(\text{I10 V4}=-1\);
3. \(\text{I1}=-1\) or 0 \(\wedge \) \(\text{I10 V3}=1\);
4. \(\text{I1}=1\) \(\wedge \) \(\text{I7 V4}=-1\) or 0;
5. \(\text{I1}=1\) \(\wedge \) \(\text{I7 V4}=1\).
where ‘Ii Vj’ represents the jth vote on the ith issue. For an issue around which there exists only one vote, say the 1st issue, it is simply represented as I1. Detailed interpretations of all voting issues related to our summarization are displayed in Table 15. The summary of MPvotes is illustrated in Fig. 15. For a dedicated view of the connectivity density between each subgroup pair, the corresponding density matrix is visualized as a heatmap, aligned with a dendrogram illustrating the splitting hierarchy on the left (see Fig. 16).
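To make the notation concrete, a subgroup description is simply a conjunction of vote-value constraints. The sketch below is our own illustration, not the authors’ code; it assumes votes are coded 1 for ‘aye’, -1 for ‘no’, and 0 for abstention, and checks whether an MP’s votes satisfy such a description:

```python
# Illustrative only: evaluating a subgroup description as a conjunction
# of (vote identifier, allowed values) constraints. Vote coding assumed:
# 1 = aye, -1 = no, 0 = abstain/absent.

def satisfies(mp_votes, description):
    """True iff every conjunct's vote value is among its allowed values."""
    return all(mp_votes.get(vote_id) in allowed
               for vote_id, allowed in description)

# The fifth subgroup: I1 = 1 AND I7 V4 = 1.
subgroup5 = [("I1", {1}), ("I7 V4", {1})]

assert satisfies({"I1": 1, "I7 V4": 1}, subgroup5)
assert not satisfies({"I1": -1, "I7 V4": 1}, subgroup5)
```

Assigning each MP to the unique description he or she satisfies yields the partition into the five subgroups described above.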
Discussion As Fig. 16 clearly shows, our summarization identifies several crucial votes that partition the MPs into cohesive subgroups. That is, MPs taking the same side in these votes connect more frequently to each other (i.e., to those within the same subgroup) than to MPs voting differently (i.e., those in other subgroups). The only exception is the 2nd subgroup, which connects most frequently to the 3rd subgroup. More interpretation of these patterns is provided in the following.
Combining with political parties The data analyst can utilize our summarization of MPvotes to obtain insights about the Brexit saga. Here, we provide one example. More specifically, we show that, combined with the distribution of MPs’ party affiliations within each subgroup (illustrated in Fig. 17), our summarization can:

(a) reveal crucial voting issues over which MPs from different parties take different sides;
(b) provide a high-level view of connectivity densities among different political parties.
Now we trace the partition process based on our summarization in order to show (a). The first split is the vote on I1, whose ‘Ayes’ sided with the government to keep a no-deal Brexit on the table as a possibility (see the dendrogram in Fig. 17). A clear conflict of opinion between the parties can be observed. More specifically, all MPs from the Scottish National Party (SNP), Liberal Democrats (LD), Sinn Féin (SF), Plaid Cymru (PC), and Green (Grn), as well as the majority of Labour (Lab) MPs, voted against I1 or abstained (the aggregation of the first, second, and third subgroups). All but two Conservative (Con) MPs and all Democratic Unionist Party (DUP) MPs were in favour (the aggregation of the fourth and fifth subgroups). The ‘Noes’ and abstainers on I1 are then divided according to their stances on Lab’s plan for a close economic relationship with the EU (i.e., I10 V3). The ‘Ayes’ of I10 V3 (i.e., the third subgroup) are dominated by Lab MPs. The others are further split over their votes on UK membership of Efta and the EEA (i.e., I10 V4), on which MPs from some non-mainstream parties voted for or abstained (i.e., the first subgroup) and 15 Lab MPs voted against. In the fourth split, on the vote on I7 V4, MPs affiliated with Con and those with the DUP are clearly separated from each other, leading to the fourth and fifth subgroups respectively.
Then we show (b) by combining our summarization (Fig. 16) with the party affiliation distribution (Fig. 17). Here we highlight some interesting findings. As mentioned previously, one bi-subgroup pattern reveals frequent connections between the second subgroup and the third. The second subgroup can be interpreted as a group of unrepresentative Lab MPs, whereas the third subgroup corresponds to a representative group: closer inspection shows that MPs in either of these two subgroups are mostly affiliated with Lab, though the second subgroup is much smaller. Also, MPs affiliated with some non-mainstream parties (e.g., SNP, LD, SF, PC) connect much more to those affiliated with Lab than to those with Con, especially to the Lab MPs in the second subgroup. Although the fourth subgroup consists almost purely of Con MPs, its relatively small self-connectivity, in comparison with its connectivity to the first and third subgroups, indicates that not many Con MPs build friendships with each other.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Deng, J., Kang, B., Lijffijt, J. et al. Mining explainable local and global subgraph patterns with surprising densities. Data Min Knowl Disc 35, 321–371 (2021). https://doi.org/10.1007/s10618-020-00721-9