# Regular Decomposition of Large Graphs: Foundation of a Sampling Approach to Stochastic Block Model Fitting


## Abstract

We analyze the performance of regular decomposition, a method for compression of large and dense graphs. This method is inspired by Szemerédi’s regularity lemma (SRL), a generic structural result on large and dense graphs. In our method, a stochastic block model (SBM) is used as a model in maximum likelihood fitting to find a regular structure similar to the one predicted by SRL. Another ingredient of our method is Rissanen’s minimum description length principle (MDL). We consider scaling of the algorithms to extremely large graphs by sampling a small subgraph. We continue our previous work on the subject by proving some experimentally found claims. Our theoretical setting does not assume that the graph is generated from a SBM. The task is to find a SBM that is optimal for modeling the given graph in the sense of MDL. This setting matches real-life situations in which no random generative model is appropriate. Our aim is to show that regular decomposition is a viable and robust method for large graphs emerging, say, in the Big Data area.

## Keywords

Community detection, Sampling, Consistency, Martingales

## 1 Introduction

In the conference paper [1] we conjectured that our regular decomposition algorithm [2] could be applied, using a sampling approach, to very large graphs for which the full adjacency information cannot be processed. In this paper we develop the corresponding theory. Using martingale techniques, we prove claims of the preceding paper and give precise conditions under which they are true. This method allows us to abandon the customary assumption that the graph is generated by a SBM.

The so-called Big Data is a hot topic in science and applications. Revealing and understanding various relations embedded in such large data sets is of special interest. In mathematical terms, such relations form a huge graph. The size of the graph may be so big that sampling of subgraphs is the only practically feasible operation. Furthermore, part of the information may be inaccessible, faulty or missing. Our method suggests a way to overcome such hurdles in the case of dense data.

One example is the case of semantic relations between words in large corpora of natural language data. Relations can also exist between various data sets, forming large higher-order tensors, thus requiring integration of various data sets. In such a scenario, algorithms based on stochastic block models (SBMs, also known as generalized random graphs [3]; for a review see, e.g., [4]) and their extensions are very attractive solutions, instead of simple clustering like k-means.

A strong result, less known among practitioners, is Szemerédi’s regularity lemma (SRL) [5]. SRL in a way supports a SBM-based approach to large data analysis, because it proves the existence of a SBM-like structure for any large graph, and has generalizations to similar objects like hypergraphs. SRL has been a rich source in pure mathematics—it appears as a key element of important results in many fields. We think that in more practical research it is also good to be aware of the broad context of SRL as a guiding principle and a source of inspiration. To emphasize this point, we call this kind of approach to SBM and a concrete way to fit data to it regular decomposition (RD).

Traditionally, SBM has been used for the so-called community detection in networks, like partitioning a network into blocks that are internally well connected and almost isolated from each other. From the point of view of SRL, this is too restrictive. According to SRL, the relations between the groups are more informative than those inside the groups, traditionally emphasized in community detection. Regular decomposition takes both aspects into account on an equal footing. In [6] a more general scope of spectral clustering is introduced: the discrepancy-based spectral clustering that favors regular structures.

The methodology of RD has been developed in our previous works, first introduced in [7] and later refined and extended in [1, 2, 8, 9, 10]. In [2] we used Rissanen’s minimum description length principle (MDL) [11] in RD. In [10] we extend RD to the case of large and sparse graphs using graph distance matrix as a basis.

In the current paper, we restrict ourselves to the case of very large and dense simple graphs. Other cases, like rectangular matrices with real entries, can probably be treated in a very similar manner using the approach described in [2]. We aim to show that large-scale block structures of such graphs can be learned from bounded size, usually quite small, samples. The block structure found from a sample can then be extended to the rest of the graph in time that is linear in *n*, the number of nodes.

The block label of a given node can be found independently of all other nodes, provided a good enough sample was used. This allows parallel computations in RD algorithm. The labeling of a node requires only the information on links from that node to the nodes of the fixed sample. As a result, the labeling can be done independently in many processing cores sharing the same graph and the sample.

The revealed structure is useful for many applications like those that involve counting numbers of small subgraphs or finding motifs. It also helps in finding a comprehensive overall understanding of graphs when the graph size is far too large for complete mapping and visualization. We also introduce a new version of the RD algorithm for simple graphs that tolerates missing data and noise.

## 2 Related Work

Our theoretical description is partly overlapping with SBM literature using the apparatus of statistical physics, see, e.g., extensive works by T.P. Peixoto. In particular, the use of MDL to find the number of blocks has been studied earlier by Peixoto [12]. We think that such multitude of approaches is only beneficial since it brings in results and ideas of different fields and is understandable by a wider community of researchers.

Recently, impressive progress has been achieved in the theory of nonparametric statistical estimation of large and sparse graphs [13, 14]. Under certain conditions, latent models like the SBM can be found from samples of links whose size is linear in the number of nodes. Our work could be seen as a special case when the target is to find a block structure from a large graph using a specific algorithm. We show that in the dense case, the sample size is just a finite constant, instead of a linear function of the number of nodes, needed in the sparse case. Moreover, the algorithm in [14] is much more complicated than ours and may be difficult to use in practice. In a recent mostly experimental work [10], we attempted to use regular decomposition in a sparse graph case using graph distance as a similarity measure between nodes.

A clustering method that can find the clusters from large enough samples is called consistent. As stated in [15], there are quite few results in proving consistency of clustering methods. The consistency of classical k-means clustering was proven by Pollard [16].

von Luxburg et al. [15] proved consistency of spectral clustering in a special case. It was shown that under some mild conditions, the spectral clustering of a sampled similarity matrix of data with two clusters converges to the spectral clustering of the limit corresponding to infinite data. In this respect the so-called normalized approach that uses the normalized Laplacian (see Sect. 6.2) was found preferable. Spectral clustering can be seen as an alternative to regular decomposition, since it can reveal similar block structures in adjacency or similarity matrices [17].

## 3 Regular Decomposition of Graphs and Matrices

Regular decomposition finds a structure that mimics the regular partitions promised by SRL. SRL says, roughly speaking, that the nodes of any large enough graph can be partitioned into a bounded number, *k*, of equally sized ‘clusters,’ and one small exceptional set, in such a way that links between most pairs of clusters look like those in a random bipartite graph with independent links and with a link probability equal to the link density. SRL is significant for large and dense graphs. However, similar results hold also in many other cases and for other structures, see the references in [18].

In RD, we simply replace the random-like bipartite graphs of SRL by a set of truly random bipartite graphs and use it as a modeling space in the MDL theory. We disregard the requirement of having blocks of equal sizes and also model the internal links of regular clusters by a random graph. The next step is to find an optimal model that explains the graph in a most economic way using MDL [2].

In the case of a matrix with nonnegative entries, we replace random graph models with a kind of bipartite Poissonian block models: A matrix element \(a_{i,j}\) between row *i* and column *j* is thought to correspond to a random multi-link between nodes *i* and *j*, with the number of links distributed as a Poisson random variable with mean \(a_{i,j}\) (see [9]). The bipartition is formed by the sets of rows and columns, respectively. Then the resulting RD is very similar to the RD of binary graphs. This approach also allows the analysis of several interrelated data sets and corresponding graphs, for instance, using corresponding tensor models. In all cases, the RD algorithm is very compact, containing just a few lines of code in high-level languages like Python, and uses standard matrix-algebraic operations.

*V* has a length at most (and, for large graphs, typically close to)

*m* and

*p*) distribution. A corresponding representation exists also for the coding length of a matrix, interpreted as an average of a Poissonian block model (see details in [2]).

### 3.1 Matrix Formulation of the RD Algorithm for Graphs

### Definition 1

For a graph *G* with *n* nodes, adjacency matrix *A* and a partition matrix \(R\in {{{\mathcal {R}}}}_k\), we denote:

*R* is by definition a binary \(n\times k\) matrix with rows that are indicators for regular group membership of the corresponding nodes. For instance, if the row *i* of *R* is (0, 1, 0, 0), this means that node *i* belongs to the regular group number 2.

*R* and are denoted *P*:

By (1), the length of the code \(l_k\) that uniquely describes the graph *G*(*A*) corresponding to *A*, using the model (*R*, *P*), \(R\in {{{\mathcal {R}}}}\), can be computed as follows:

### Definition 2

### Algorithm 1

**Greedy Two-part MDL**

**Input:** \(G=G(A)\) *is a simple graph of size* *n*.

**Output:** \((k^*,R_{k^*}\in {{{\mathcal {R}}}}_{k^*}), k^*\in \{1,2,\ldots ,n\}\), *such that the two-part code for* *G* *is approximately equal to the shortest possible for all possible block models with number of blocks in the range from* 1 *to* *n*.

*Start:* \(k=1\), \(l^*=\infty\), \(R\in {{{\mathcal {R}}}}_1=\left\{ \mathbf{{1}}\right\}\), \(k^*=1\), *where* \(\mathbf {1}\) *is the* \(n\times 1\) *matrix with all elements equal to* 1.

**1. Find** \({\hat{R}}_k(G)\) *using subroutine* **ARGMAX k** (*Algorithm* 2).

**2. Compute** \(l_k(G)= \lceil l_k(G\mid {\hat{R}}(G))\rceil +l_k ({\hat{R}}(G))\)

**3. If** \(l_k(G)< l^*\) *then* \(l^*=l_k(G)\), \(R_{k^*}={\hat{R}}_k(G)\), \(k^*=k\)

**4.** \(k=k+1\)

**5. If** \(k>n\), **Print** \((R_{k^*},k^*)\) *and* **STOP** *the program.*

**6.** **GoTo 1**.

### Definition 3

*L*, \(\alpha ,\beta \in \{1,2,\ldots ,k\}\),

*R*:

The mapping \(\varPhi (R)\) defines a greedy optimization of the partition *R*, in which each node is reassigned to a new block independently of all other placements, hence the term ‘greedy algorithm.’

### Algorithm 2

**ARGMAX k**

*Algorithm for finding regular decomposition for fixed* *k*.

**Input**: *A*: *the adjacency matrix of a graph (an* \(n\times n\) *symmetric binary matrix with zero trace)*; *N*: *an integer (the number of iterations in the search of a global optimum)*; *k*: *a positive integer* \(\le n\).

*Start:* \(m=1\).

**1.** \(i:=0\); *generate a uniformly random element* \(R^i\in {\mathcal {R}}_k\).

**2.** *If any of the column sums of* \(R^i\) *is zero*, **GoTo 1**; *if not, then compute* \(R^{i+1}:=\varPhi (R^i)\).

**3.** **If** \(R^{i+1}\ne R^i\), *set* \(i:=i+1\) *and* **GoTo 2**.

**4.** \(R(m):=R^i\); \(l(m):= \sum _{i=1}^n \min _{1\le \alpha \le k}(L(R(m)))_{i,\alpha }\); \(m:=m+1\).

**5.** *If* \(m\le N\), **GoTo 1**.

**6.** \(M:=\{m: l(m)\le l(i),i=1,2,\ldots ,N\}\); \(m^*:=\inf M\).

**Output** *Solution:* \(R(m^*)\).

In **ARGMAX k** the outer loop 1–5 runs several (*N*) optimization rounds finding each time a local optimum, and finally, the best local optimum is the output of Algorithm 2. At each optimization round, the inner loop 2–3 improves the initial random partition until a fixed point is reached and no further improvements of the partition are possible. This fixed point is an indication that a local optimum is found.
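The loop structure described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the cost matrix is taken to be the Bernoulli negative log-likelihood of each node's links given the current block densities, the helper names `block_densities`, `cost_matrix` and `argmax_k` are hypothetical, a round that ends with an empty group is simply discarded, and the iteration cap `max_iter` is an added safeguard.

```python
import numpy as np

def block_densities(A, labels, k, eps=1e-9):
    """Link densities between and inside the k groups (assumed form)."""
    R = np.eye(k)[labels]                      # n x k partition matrix
    sizes = R.sum(axis=0)
    links = R.T @ A @ R                        # diagonal blocks count each link twice
    pairs = np.outer(sizes, sizes) - np.diag(sizes)   # ordered node pairs per block pair
    P = links / np.maximum(pairs, 1)
    return np.clip(P, eps, 1 - eps), R         # clip to keep the logarithms finite

def cost_matrix(A, R, P):
    """Entry (i, a): assumed coding cost of node i's links if i is placed in group a."""
    non_links = 1 - A - np.eye(A.shape[0])     # exclude the node itself
    return -(A @ R @ np.log(P)) - (non_links @ R @ np.log(1 - P))

def argmax_k(A, k, n_rounds=10, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    best_labels, best_cost = None, np.inf
    for _ in range(n_rounds):                  # outer loop: random restarts
        labels = rng.integers(k, size=n)       # uniformly random initial partition
        for _ in range(max_iter):              # inner loop: greedy reassignment
            if np.bincount(labels, minlength=k).min() == 0:
                break                          # an empty group appeared
            P, R = block_densities(A, labels, k)
            new_labels = cost_matrix(A, R, P).argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break                          # fixed point: local optimum
            labels = new_labels
        if np.bincount(labels, minlength=k).min() == 0:
            continue                           # discard rounds with an empty group
        P, R = block_densities(A, labels, k)
        cost = cost_matrix(A, R, P).min(axis=1).sum()
        if cost < best_cost:                   # keep the best local optimum
            best_labels, best_cost = labels, cost
    return best_labels, best_cost
```

All nodes are reassigned simultaneously in each inner step, which mirrors the independent placement of nodes; because such updates are not guaranteed to terminate in general, the sketch caps the number of inner iterations.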

For very large graphs, the program may not be solvable in the sense that it is not possible or reasonable to go through all possible values of \(k\in \{1,2,\ldots , n\}\). One option is to limit the range of *k*. If no minimum is found within this range, the model found for the largest *k* in the range is used as the optimal choice. Another option is to find the first minimum with the smallest *k* and stop. When the graph is extremely large, it makes sense to use only a randomly sampled subgraph as an input; indeed, when \(k^*\ll n\), a large-scale structure can be estimated from a sample, as we shall show.
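The two stopping options for *k* can be illustrated with a short Python sketch. The function `select_k` and its argument `code_length` (a callable returning the two-part code length for a given *k*) are hypothetical names introduced here for illustration; how the code length itself is computed is left to the caller.

```python
import math

def select_k(code_length, k_max, stop_at_first_minimum=False):
    """Choose k in 1..k_max from a user-supplied code length function."""
    best_k, best_len = 1, math.inf
    prev = math.inf
    for k in range(1, k_max + 1):
        length = code_length(k)
        if stop_at_first_minimum and length > prev:
            return k - 1                 # the previous k was the first minimum
        if length < best_len:
            best_k, best_len = k, length
        prev = length
    # If the code length kept decreasing, this is the largest k in the range.
    return best_k
```

With a restricted range, a code length that decreases all the way to `k_max` yields `k_max`, which matches the fallback to the largest *k* within the range.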

### 3.2 Regular Decomposition with Missing Link Information

Now assume that part of the link information is lost. We only consider the simplest case when the link information is lost uniformly randomly over all node pairs. We also exclude the possibility of a sparse representation of graphs, when only those node pairs that have links are listed, since this would lead to unsolvable ambiguity in our setting; indeed, if some node pair is not in the list, it could mean one of two cases: There is no link, or the link information is lost.

We formulate how the matrices used in the previous section are modified. The RD algorithms themselves are unaltered.

Denote by *A* the adjacency matrix with missing link values replaced by \(-1\)’s.

*A* is symmetric, and we also assume that we know its true dimension (the number of nodes). We put \(-1\) on the diagonal of *A*, just for the convenience of further formulas. Define

*D* has the same elements as *A*, except that the entries equal to \(-1\) are replaced by 0’s. Next define a ‘mask matrix’

*B* is the indicator of known link data in *A*. The matrix \(P_1\) is now:

*P*-matrix is defined as the link densities in the observed data:

*P* are close to the corresponding actual link densities in the case that we would have complete link data. This also assumes that the graph and the blocks must have large size.

The cost function in Definition 2 remains unaltered. Now the *P*-matrix elements are only estimates based on observed data and not the exact ones. The block sizes are the true ones that are known since we assume that we know the actual number of nodes in the graph. The interpretation is that the contributions of missing data entries in a block are replaced by the average contribution found from the observed data of that block. It should be noted that similar scaling must be done in the RD algorithms that find the optimum of the cost function. For example, *A* should be replaced with *D*, and summing over elements of *D* should be rescaled so that the contribution is proportional to the true size of the corresponding block.
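The matrices described in this section can be illustrated with a small Python sketch. It follows the verbal definitions only (*D* zeroes the unknown entries, *B* marks the known ones) and assumes that the observed block density is the number of observed links divided by the number of observed node pairs; the function name `observed_block_densities` is hypothetical.

```python
import numpy as np

def observed_block_densities(A, labels, k):
    """Block link densities estimated from the observed node pairs only.

    A is symmetric with entries 1 (link), 0 (no link) and -1 (unknown),
    with -1 on the diagonal.
    """
    D = np.where(A == -1, 0, A)           # unknown entries replaced by 0
    B = (A != -1).astype(int)             # mask: 1 where the pair was observed
    R = np.eye(k, dtype=int)[labels]      # n x k partition matrix
    links = R.T @ D @ R                   # observed links per block pair
    observed = R.T @ B @ R                # observed node pairs per block pair
    P = links / np.maximum(observed, 1)   # guard against blocks with no data
    return D, B, P
```

Putting \(-1\) on the diagonal means that self-pairs are excluded by the mask automatically. Sums over *D* can then be rescaled to the true block size by multiplying the estimated density by the true number of node pairs in the block.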

## 4 Sparse Sampling Scheme Formulation

Assume that a large graph has a ‘regular structure’ that minimizes the regular decomposition MDL program with *k* regular groups, \(\xi :=\{V_1,V_2,\ldots , V_k\}\) having sizes \(n_1,n_2,\ldots , n_k\) and irreducible link density matrix with elements \(0<d_{i,j}<1\). The number of nodes in *G* is denoted as *N*, and it is assumed that the relative sizes \(r_i:=n_i/N\) of the groups are such that \(k r_i \ge c\), where *c* is a positive constant and not too small, say, \(c=1/10\). This condition means that none of the regular groups is very small. Such small groups would be ‘artifacts’ and are excluded by MDL criteria or by the practical RD algorithm that refuses forming too small groups. Here the aim of this condition is that when we make a uniformly random sample of *n* nodes, then all numbers \(r_in\), the expected number of sample nodes from each group, are big enough. On the other hand, this condition could probably be relaxed so that even small groups could be allowed. Their role in the classification problem we formulate is not decisive since the big groups are normally most decisive in this respect. However, we use this limitation in the sequel, for simplicity.

*v* into group *j*

*v* and nodes in group *i*. This means that the cost of changing a group membership is positive for all nodes:

*G* and its optimal regular decomposition has the property:

*N* is very large; if \(\delta\) is not comparable with *N*, then in a reasonable sample the corresponding difference becomes hopelessly small. We see in the sequel that the expected value of such a difference is large and it seems reasonable to place this condition. Also in the case of a SBM as the generator of *G* we would get such a condition on \(\delta\) with exponential probability. We could relax this condition by assuming that it holds for most of the nodes *v*. For simplicity we use this condition as it is.

Consider making a uniformly random sample of *n* nodes of *G*, and retrieve the corresponding induced subgraph \(G_n\). The claim is that if *n* is sufficiently large, then \(G_n\) has a regular structure that has almost the same densities and relative sizes of the groups and correct membership in regular groups. Further, if \(G_n\) is used as a classifier, the misclassification probability is small, provided the regular structure of \(G_n\) has large enough groups to allow good approximation of the parameters of the regular structure of \(G_n\). Compare this claim with Lemma 2.3 from [18]:

### Lemma 1

*Let* *X* *and* *Y* *be vertex subsets of a graph* *G*. *Let* \(X' \subseteq X\) *and* \(Y'\subseteq Y\) *be picked uniformly at random with* \(|X'|=|Y'|=k\). *Then*

The proof is based on Azuma’s inequality for the edge exposure martingale. In our case we use a similar idea for the proofs.
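The concentration of the sampled link density can be checked numerically. The following Python sketch compares the density between two random *k*-subsets with the density between the full vertex sets; the graph is an independent-link random graph generated only for illustration, and the helper names are hypothetical.

```python
import numpy as np

def pair_density(A, X, Y):
    """Fraction of node pairs (x, y), x in X and y in Y, that are linked."""
    return A[np.ix_(X, Y)].mean()

def sampled_density_gap(A, X, Y, k, rng):
    """|d(X', Y') - d(X, Y)| for uniformly random k-subsets X' of X and Y' of Y."""
    Xs = rng.choice(X, size=k, replace=False)
    Ys = rng.choice(Y, size=k, replace=False)
    return abs(pair_density(A, Xs, Ys) - pair_density(A, X, Y))

def random_graph(n, density, rng):
    """Symmetric 0/1 adjacency matrix with independent links and zero diagonal."""
    upper = np.triu(rng.random((n, n)) < density, 1)
    return (upper + upper.T).astype(float)

rng = np.random.default_rng(0)
A = random_graph(400, 0.3, rng)
X, Y = np.arange(200), np.arange(200, 400)
gaps = [sampled_density_gap(A, X, Y, 50, rng) for _ in range(20)]
```

Repeating the experiment with larger `k` should shrink the gaps, which is the qualitative behavior the lemma quantifies.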

## 5 Sampled Cost Function Estimation

First we show that in expectation the sampled cost function resolves well the correct optimum. Next, we study a harder problem of estimating the probability of a deviation of the randomly sampled cost function from its expected value. Finally, it is shown that, with a probability that is exponentially close to one in *n*, the sampled cost function is able to resolve the correct block structure.

### 5.1 Expectation of the Sampled Cost Function

*G* using the regular structure as the link probability distribution function. Select a node *v* uniformly at random. Then the cost function is a random variable. If \(v\in V_i\), the cost function of placing \(v \in V_\alpha\) has the value

*i* and *j*, or inside a regular group *i* if \(i=j\). As a result:

### Proposition 3

*Requiring that all groups are large and that for all pairs* \((i,\alpha ), i\ne \alpha\), *there is at least one index* *j* *with* \(d_{i,j}\ne d_{j,\alpha }\), *we have:*

*where* *c* *is a constant.*

### *Proof*

It is easy to check that in the above formula for \(C_{i,\alpha }\), all terms in the sum, \(-d_{i,j}\log d_{j,\alpha }-(1-d_{i,j})\log (1-d_{j,\alpha })\), are nonnegative and each of them has an absolute minimum on argument \(d_{j,\alpha }\) when the corresponding densities are equal, \(d_{i,j}=d_{j,\alpha }\). Therefore, in \(C_{i,\alpha }-C_{i,i}\) the coefficients of \(n_j\), \(-d_{i,j}\log d_{j,\alpha }-(1-d_{i,j})\log (1-d_{j,\alpha })-(-d_{i,j}\log d_{i,j}-(1-d_{i,j})\log (1-d_{i,j}))\), are nonnegative for all *j*, and, according to the condition, at least one of them is strictly positive, because the corresponding densities are not equal. Since all \(n_j\) are proportional to *N*, the claim follows.\(\square\)
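The nonnegativity used in the proof is Gibbs' inequality: the coefficient of \(n_j\) is the Kullback-Leibler divergence between two Bernoulli distributions with parameters \(d_{i,j}\) and \(d_{j,\alpha }\). A small Python check, using natural logarithms (the base only rescales the values) and hypothetical function names:

```python
import math

def cross_entropy(d, q):
    """-d*log(q) - (1-d)*log(1-q): expected cost per node pair when the
    true link density d is coded with density q."""
    return -d * math.log(q) - (1 - d) * math.log(1 - q)

def coefficient(d_ij, d_j_alpha):
    """Coefficient of n_j in C_{i,alpha} - C_{i,i}."""
    return cross_entropy(d_ij, d_j_alpha) - cross_entropy(d_ij, d_ij)

# Evaluate the coefficient on a grid of densities strictly inside (0, 1).
grid = [x / 20 for x in range(1, 20)]
smallest = min(coefficient(d, q) for d in grid for q in grid)
```

On the grid, the coefficient vanishes only when the two densities coincide and is positive otherwise.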

In a general setting, the difference \(L_{i,\alpha }(v)- L_{i,i}(v)\) is not necessarily as large as the expected value for all nodes *v*. The only thing that can be guaranteed is that the value is positive (definition of an optimum). In the case of a true stochastic block model as a generator of the graph *G*, almost all differences \(L_{i,\alpha }(v)- L_{i,i}(v)\) are of the order of the expectation (positive and linear in *N*). Sub-linear terms could be called ‘outliers,’ since there is not much difference in terms of cost, in whichever group we place the corresponding nodes.

*n* nodes from *V*, denoted as \({\hat{V}}\subset V\). Denote by \({\hat{\xi }}\) (we shall use systematically ‘hat,’ \(\hat{}\), over a symbol of a variable to denote a corresponding sampled value) the partition of \(V_n\) that is a subpartition of \(\xi\), meaning that two nodes are in the same part of \({\hat{\xi }}\) if and only if they are in the same part of \(\xi\). Assume that all parts of \({\hat{\xi }}\) are ‘large,’ meaning that their relative sizes are close to those in \(\xi\). The probability of this condition can be made close to one, taking *n* large enough. This probability will be estimated later on, see Proposition 9. Take a uniformly random node \(v\in V\). Consider the following classifier based on \({\hat{\xi }}:\)

*j* of \({\hat{\xi }}\) and node *v*. The sizes of regular groups in \({\hat{\xi }}\) are denoted as \({\hat{n}}_i\), and the groups of \({\hat{\xi }}\) are denoted as \({\hat{V}}_i\) for every \(i=1,2,\ldots ,k\), using the same indexing as in \(\xi\). Note that here we use the link densities between the regular groups of the large graph. Later, we shall show that the link densities in \({\hat{\xi }}\) are close enough with a probability close to 1. This is another claim that needs to be justified (see Sect. 5.3).

*v*, then there are \(e_j(v)\) favorable outcomes out of \(n_i\) possible choices. According to a well-known result, conditionally on sample size, \({\hat{n}}_j\), the expectation is

### Proposition 4

### 5.2 Deviations of the Sampled Block Sizes from the Expected Values

Assume that all link densities fulfill \(0<d_{i,j}<1\). Excluding the cases of density zero or one is not a limitation, since in both cases the sampled structures are deterministic (for density 1, all sampled pairs have links, and in the other case, no pairs have links).

*G* and otherwise \(x_j(i)=0\).

### Proposition 5

\(E_j(\cdot )\) *is a martingale.*

### *Proof*

*t* is discrete time from 0 to *n*. First, define the following ramp functions for each \(1\le i\le k\):

### Proposition 6

*is a martingale and*

### *Proof*

At each discrete time, only one of the subprocesses \(E_j(\cdot )\) is ‘active.’ By this, we mean that for any moment of time *t* only one subprocess \(E_j(\cdot )\) is progressing, the others are either in the final state when all links to *v* are revealed, or are completely un-revealed, and the corresponding value equals the expectation. Therefore, Proposition 5 is sufficient to show the conditional expectation property. The normalization coefficient \(c_\alpha\) guarantees the upper bound for the absolute values of step-wise changes of the process. \(\square\)

As a result, we get the following:

### Proposition 7

*For all* \(\alpha , i\in \{1,2,\ldots , k\}\):

### *Proof*

Use the standard Azuma’s inequality (e.g., Theorem 2.25 in [19]). The inequality is true for all indices \(\alpha\) and *i* since there is only one link exposure process, and for any of them the inequality is true with right-hand side that is independent of those indices. \(\square\)

Next, define the following vector process. It corresponds to uniformly random sampling of *n* nodes from *V*. Its components correspond to the numbers of nodes in the classes of \({\hat{\xi }}\). It has \(n+1\) time steps and starts from all components equal to zero, and at \(t=1\), the value of the *i*th component is \({\mathbf {E}}{\hat{n}}_i=n n_i/N\). After that the first node and its class in \(\xi\), say *i*, are revealed, and the corresponding component is changed to the expected value of \({\hat{n}}_i\) conditional on this information. This continues until the process ends with all components having their final values \({\hat{n}}_i\). The process normalized with the coefficient \(13\sqrt{k}\) is denoted as \({\mathbf {N}}(t)\).
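The exposure process for the sampled block sizes can be simulated directly. The Python sketch below computes the non-normalized conditional expectations of the sampled block sizes as the sampled nodes are revealed one by one, using the standard formula for sampling without replacement; it starts from the unconditional expectation, assumes \(n<N\), and the function name `block_size_martingale` is hypothetical.

```python
import numpy as np

def block_size_martingale(group_sizes, n, rng):
    """Rows t = 0..n: conditional expectation of the sampled block sizes
    given the classes of the first t sampled nodes (requires n < N)."""
    group_sizes = np.asarray(group_sizes, dtype=float)
    N = group_sizes.sum()
    classes = np.repeat(np.arange(len(group_sizes)), group_sizes.astype(int))
    sample = rng.choice(classes, size=n, replace=False)  # classes in reveal order
    seen = np.zeros_like(group_sizes)
    path = [n * group_sizes / N]                         # nothing revealed yet
    for t, cls in enumerate(sample, start=1):
        seen[cls] += 1
        # The remaining n - t draws come from the N - t unrevealed nodes.
        expected_rest = (n - t) * (group_sizes - seen) / (N - t)
        path.append(seen + expected_rest)
    return np.array(path)

rng = np.random.default_rng(0)
path = block_size_martingale([300, 500, 200], 50, rng)
```

Each row sums to `n`, the last row holds the realized sampled block sizes, and every component changes by at most 1 per step, which is the kind of bounded increment that Azuma-type inequalities require.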

### Proposition 8

*Assume that the graph size* *N* *is large enough and the sample size is small with* \(n\le \sqrt{n_i}\), \(\forall i\). *Then* \({\mathbf {N}}(\cdot )\) *is a vector martingale with maximal increment of its Euclidean norm equal to 1*.

### *Proof*

*i* at time step *t*. Then \(x_i(t)\) equals 1 if the *t*th node happens to belong to class *i* and equals zero otherwise. Using the obvious multivariate hypergeometric distribution we see that the non-normalized \(i\hbox {th}\) component of the process at time *t* is

*n* is much smaller than \(n_i\) or *N*. As a result we get:

Using the Azuma inequality for a vector-valued martingale, given in [20], we get:

### Proposition 9

*Under the same conditions as in Proposition* 8, *we have:*

### *Proof*

### 5.3 Deviations of Sampled Link Densities from Their Expectations

Finally, we estimate the deviations of link densities between sampled groups, \({\hat{d}}_{i,j}\), from their expected values \(d_{i,j}\), conditionally on sampled group sizes \({\hat{n}}_i\). We simply reuse Theorem 2.2 from [18].

### Proposition 10

*Provided* \(r_in/2\le {\hat{n}}_i\le 2 r_i n\), *we have:*

*where*

### *Proof*

Use Theorem 2.2 of [18] and reasoning therein for a given pair (*i*, *j*) and the conditions on sizes \(r_in/2\le {\hat{n}}_i\le 2 r_i n\), and finally the union bound to estimate that all densities are within the range. The coefficient *a* corresponds to the weakest exponent. \(\square\)

### 5.4 Probability of Deviation of Sampled Cost Function from the Expectation

Using the results of previous sections we can formulate and prove the main result:

### Theorem 1

*Provided* \(n\ge m\), \(m=f ^4\), *then uniformly random sampling of* *n* *nodes and grouping of these nodes as in* \(\xi\) *has the property: If* \(v\in V\) *is a random node from* \(V_i\), *outside the sample, then for all* \(\alpha \ne i\):

*where*

*and where* \({\hat{d}}_{i,j}\) *are the (random) link densities between sampled group pairs* \(({\hat{V}}_i,{\hat{V}}_j)\) *or inside a single sampled group.*

The meaning of Theorem 1 is that if we build a classifier based on an observed random sample of *n* nodes and its regular structure, then the rest of the nodes can be classified correctly with a probability that is exponentially close to one.

### *Proof*

*B* as in the claim of the theorem. Combining these estimates, we have

*n*. It can be seen that we can choose \(f_1\) as in the claim of the theorem. We had another condition \(\frac{1}{\root 4 \of {n}}\le \min (d_1/2,(1-d_2)/2)\), or \(\root 4 \of {n}\ge 1/ \min (d_1/2,(1-d_2)/2) :=f_2\).

As a result, if \(f=\max (f_1,f_2)\), then \(n\ge f^4\) implies \({\hat{L}}_{i,s}(v;\,{\hat{D}})-{\hat{L}}_{i,i}(v;\,{\hat{D}})>0\), with a probability that we estimate now. We use a simple union bound. Let \(A_1,A_2,\ldots\) denote some events that correspond to the violation of one of the conditions in our proof. Then the event that at least one of them happens is given by the union of these events. The probability of this event always fulfills \({\mathbf {P}}(\cup _i A_i)\le \sum _i{\mathbf {P}}(A_i)\). Thus, the probability that none of the conditions is violated is not less than \(1-\sum _i{\mathbf {P}}(A_i)\). All \({\mathbf {P}}(A_i)\) are exponentially small in *n*, and so is their sum, with an exponent that has the smallest positive constant. The result of this count is given in the formulation of the theorem. \(\square\)

## 6 Simulations for Large-Graph Analysis Based on Sampling

### 6.1 Regular Decomposition

*P*, that is, a \(k\times k\) symmetric matrix with entries in (0, 1). We also assume that no two rows are identical, to avoid a redundant structure. Then we generate links between blocks and within blocks using *P* as independent link probabilities. The size of the graph *G* is assumed large and denoted by *N*.

Now take a uniform random sample of size *n* of the nodes. Denote the graph induced by sampling as \(G_n\). The relative sizes of blocks are denoted as \(r_i:=\frac{|V_i|}{N}\). The probability of sampling a node from block *i* is \(r_i\). We want to test whether the block structure of *G* can be reconstructed from a small sample, the size of which does not depend on *N* and with time complexity \(\varTheta (N)\). We believe that the answer is positive.

For simplicity, assume that we know the block structure of \(G_n\) by having run an ideally working RD algorithm. This is not too far from reality, although some of the nodes may in reality be misclassified. However, there is not a big difference, if most of the nodes are correctly classified and *n* is not very small. The real graphs, however, are not generated from any SBMs, and this is a more serious drawback. In the latter case we would like to show that it is not necessary to run RD on *G*; rather, it is sufficient to run RD on \(G_n\) and just classify the rest of the nodes based on that structure.

In \(G_n\), we denote the blocks as \({\hat{V}}_1,{\hat{V}}_2,\ldots , {\hat{V}}_k\), and the empirical link densities between these blocks as \({\hat{d}}_{i,j}\). The relative sizes of the blocks as well as the link densities deviate in \(G_n\) from their counterparts in *G*. Now assume that we take a node \(i\in V_\beta\) outside \(G_n\) together with its links to \(G_n\). What is the probability that the RD classifier places this node into \({\hat{V}}_\beta\)?

*v* to \({\hat{V}}_\alpha\), \(n_\alpha :=|{\hat{V}}_\alpha |\). The RD classifier based on \(G_n\) is the following program:

*i* we need to compute and check a list of *k* numbers as well as compute the densities \(e_\alpha (i)\). The computation time is upper-bounded by \(\varTheta (k n)\). For a size *n* that is independent of *N*, the computation takes only a constant time. By this, we mean that it is enough to have some constant *n* regardless of how large the graph is. Such an *n* depends on the relative sizes of blocks and on *k*. This is because it is obviously necessary to have enough samples from each block to be able to obtain good estimates of link densities, and if some blocks are very small, more samples are needed.
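A classifier of this kind can be sketched in Python. The sketch assumes that the cost of placing a node into block \(\alpha\) is the Bernoulli negative log-likelihood of its links to the sampled blocks, in line with the per-pair terms \(-d\log d'-(1-d)\log (1-d')\) used in Sect. 5.1; the function names, the clipping constant `eps` and the simulation setup in `simulate` are illustrative assumptions rather than the authors' program.

```python
import numpy as np

def classify_node(links_to_sample, sample_labels, d_hat, eps=1e-9):
    """Place one outside node into the block with the smallest coding cost.

    links_to_sample: 0/1 vector of links from the node to the n sampled nodes.
    sample_labels:   block index of each sampled node.
    d_hat:           k x k matrix of link densities between sampled blocks.
    """
    k = d_hat.shape[0]
    d = np.clip(d_hat, eps, 1 - eps)                     # keep logarithms finite
    n_j = np.bincount(sample_labels, minlength=k)        # sampled block sizes
    e_j = np.bincount(sample_labels, weights=links_to_sample, minlength=k)
    cost = -(e_j @ np.log(d)) - ((n_j - e_j) @ np.log(1 - d))  # one cost per block
    return int(np.argmin(cost))

def simulate(k=10, n=300, n_test=200, seed=0):
    """Fraction of outside nodes classified correctly in one synthetic SBM."""
    rng = np.random.default_rng(seed)
    P = rng.random((k, k))
    P = np.triu(P) + np.triu(P, 1).T                     # symmetric link probabilities
    labels = rng.integers(k, size=n)                     # blocks of the sampled nodes
    upper = np.triu(rng.random((n, n)) < P[labels][:, labels], 1)
    A = (upper + upper.T).astype(float)                  # sampled graph G_n
    R = np.eye(k)[labels]
    sizes = R.sum(axis=0)
    pairs = np.outer(sizes, sizes) - np.diag(sizes)
    d_hat = (R.T @ A @ R) / np.maximum(pairs, 1)         # empirical block densities
    hits = 0
    for _ in range(n_test):
        b = rng.integers(k)                              # true block of the new node
        links = (rng.random(n) < P[b, labels]).astype(float)
        hits += classify_node(links, labels, d_hat) == b
    return hits / n_test
```

Each call to `classify_node` touches only the node's links to the fixed sample, so the cost per node is of order \(kn\) plus \(k^2\) for the final products, and different nodes can be labeled independently of one another.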

We made experiments with \(k=10\) and with equal relative sizes of blocks. The *P*-matrix elements were i.i.d. \(\hbox {Uniform}(0,1)\) r.v.’s. In this case, already \(n>200\) is such that no classification error was found in a long series of experiments with repetitions of random samples and classification instances. Of course, errors are always possible, but they become insignificant already when each block has about 20 members. A similar situation is met with non-equal block sizes, but then the small blocks dictate the required *n*. When the smallest block has more than a couple of dozen nodes, the errors become unnoticeable in experiments.

We conjecture that if such a block structure or close to it exists for a large graph, it can be found out using only a very small subgraph and a small number of links per node in order to place every node into right blocks with only very few errors. This would give a coarse view on the large graph useful for getting an overall idea of link distribution without actually retrieving any link information besides labeling of its nodes. For instance, if we have a large node set, then the obtained labeling of nodes and the revealed block densities would allow computation of, say, the number of triangles or other small subgraphs, or the link density inside a set. It would be possible to use this scheme in the case of a dynamical graph, to label new nodes and monitor the evolution of its large-scale structure.

We can also adapt our RD method to multilayer networks, tensors, and similar structures. A similar sampling scheme would be desirable there and is probably feasible. On the other hand, large sparse networks require fundamentally new solutions and the development of a sparse version of RD. These are major directions of our future work.

### 6.2 Spectral Clustering and Comparison with RD

For comparison, we repeated the experiments of the previous section using a spectral clustering method (see, e.g., [21]) to find a block structure in the sample. For the rigorous theory of spectral clustering and its variants, see [17].

The purpose is to compare the performance of spectral clustering with that of RD. The quality of the spectral block structure was evaluated as in the RD case, by measuring the success rate in classifying 1000 additional nodes of the graph with the same classifier as in the previous section. We also checked the accuracy of spectral clustering versus RD.

*n* nodes. Define a diagonal matrix *D* with elements

*L*.) Denote the *k* eigenvectors of \(W_D\) corresponding to the *k* eigenvalues largest in absolute value as \(X_1,\ldots ,X_k\). Form a matrix *X* whose columns are \(X_1,\ldots ,X_k\), so that *X* is an \(n\times k\) matrix. Each row of *X* is treated as a vector in *k*-dimensional Euclidean space. Next, run the *k*-means algorithm on these *n* vectors and cluster them into *k* clusters. The clustering of nodes is taken to be identical to the *k*-means clustering of the rows of *X*.

In our case, it turned out that taking \(k=10\) both for the matrix *X* and in the *k*-means algorithm gave very poor results, and the blocks were not found. Based on numerical experiments, we chose \(k=5\) for *X* and \(k=10\) in *k*-means to obtain the best results in finding the \(10\times 10\) block structure of *W*. In this way the method was able to find the blocks with reasonable accuracy.
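The procedure can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments. It assumes that \(W_D\) is the degree-normalized adjacency matrix \(D^{-1/2} W D^{-1/2}\), with *D* the diagonal matrix of node degrees, as is common in spectral clustering. It uses a basic Lloyd iteration with a deterministic farthest-first initialization in place of whatever *k*-means variant was actually used. The function names and the separate parameters `n_eig` and `n_clusters` are hypothetical.

```python
import numpy as np

def farthest_first_init(points, k):
    """Deterministic initial centers: start at row 0, then repeatedly add
    the point farthest from the centers chosen so far."""
    centers = [points[0]]
    d = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        idx = int(d.argmax())
        centers.append(points[idx])
        d = np.minimum(d, ((points - points[idx]) ** 2).sum(axis=1))
    return np.array(centers)

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iteration; returns one cluster label per row of points."""
    centers = farthest_first_init(points, k)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster empties
                new_centers[j] = points[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(W, n_eig, n_clusters):
    """Cluster nodes by k-means on the rows of the leading eigenvector matrix."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    W_D = inv_sqrt[:, None] * W * inv_sqrt[None, :]   # assumed normalization
    vals, vecs = np.linalg.eigh(W_D)                  # W_D is symmetric
    top = np.argsort(-np.abs(vals))[:n_eig]           # largest in absolute value
    X = vecs[:, top]                                  # n x n_eig matrix
    return kmeans(X, n_clusters)
```

With this interface, the setting described above corresponds to `spectral_clustering(W, n_eig=5, n_clusters=10)`, whereas the unsuccessful variant used `n_eig=10`.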

The spectral method performed somewhat worse than RD, as can be seen in Fig. 9. It needed a sample size of about 1000 nodes to reach a low average error rate, whereas 200 was enough for RD. Moreover, even for \(n=1000\), some random cases resulted in very poor performance, corresponding to a completely unsuccessful classifier. With *n* around one thousand and 100 repetitions, almost all cases were error-free, but in a couple of cases the success rate was only around 10%, which for ten blocks amounts to completely random classification. The blocks in our experiments did not have fixed sizes; only their expected sizes were fixed and equal. Within a series of experiments, the link probability matrix was kept constant. The random variations of block sizes may explain this instability. If *n* is not large enough, these variations can be very big, and one cluster can become very small just through sampling fluctuations. Such small clusters can be seen as outliers, since their nodes are very few and differ from all others. It is well known that spectral clustering suffers from such outliers, which should be removed from the data before clustering. RD does not suffer from this problem, and we did not observe sensitivity of its clustering to random fluctuations of the cluster sizes.

In the theory of spectral clustering there are usually balancing conditions on how the sizes of the clusters grow as \(n\rightarrow \infty\) [17]. For instance, the block sizes \(n_1,\ldots ,n_k\), with \(n=\sum _i n_i\), are required to satisfy \(n_i/n\ge c\) for some constant \(0<c\le 1/k\) as \(n\rightarrow \infty\). Such conditions are needed to prove consistency properties of spectral clustering methods [15, 17].

We conclude that RD is preferable and more stable for finding a block structure of a sample graph. RD is capable of finding blocks even when the block sizes are poorly balanced. Poor balancing can arise in a sample even when the blocks of the large graph are reasonably balanced, say, with the biggest block twice the size of the smallest. In the sample this can result in a fatal error for spectral clustering when one sampled block, by chance, has only a few representatives. RD seems to be more robust against such small samples of blocks and thus requires smaller sample sizes than spectral clustering.
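How often a sampled block ends up with only a few representatives can be gauged by a small Monte Carlo sketch. It assumes that the *n* sample nodes are drawn uniformly from a graph so large that the block counts of the sample are effectively multinomial; the function name, the threshold, and the number of repetitions are illustrative choices, not taken from the experiments above.

```python
import numpy as np

def prob_small_block(rel_sizes, n, threshold, n_rep=10000, seed=0):
    """Estimate the probability that some block has fewer than `threshold`
    representatives among n uniformly sampled nodes."""
    rng = np.random.default_rng(seed)
    p = np.asarray(rel_sizes, dtype=float)
    p = p / p.sum()                               # relative block sizes
    counts = rng.multinomial(n, p, size=n_rep)    # n_rep x k sampled block sizes
    return float((counts.min(axis=1) < threshold).mean())
```

For example, comparing `prob_small_block([0.1] * 10, 200, 10)` with `prob_small_block([0.1] * 10, 1000, 10)` indicates how strongly the risk of a nearly empty block depends on the sample size, and unequal `rel_sizes` show how the smallest block dominates that risk.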

## 7 Conclusions and Future Work

Our future work will be dedicated to the case of sparse graphs, which is the most important in Big Data. One idea we are currently studying is to use the matrix of hop-count distances between nodes instead of a sparse adjacency matrix [10].

Testable graph parameters, introduced by László Lovász and coauthors (see also [17]), are nonparametric statistics of large dense graphs that can be consistently estimated by appropriate sampling. We plan to show that the MDL statistics and the multiway discrepancies of [6] are testable, possibly with appropriate normalizations. If so, they can be inferred from a smaller part of the graph, which would reduce the computational complexity of our algorithms.

## Notes

### Acknowledgements

This work was supported by the Academy of Finland Project 294763 Stomograph and by the ECSEL MegaMaRT2 Project. The research of Marianna Bolla was supported by the BME-Artificial Intelligence FIKP grant of EMMI (BME FIKP-MI/SC).

## References

- 1. Reittu H, Norros I, Bazsó F (2017) Regular decomposition of large graphs and other structures: scalability and robustness towards missing data. In: Al Hasan M, Madduri K, Ahmed N (eds) Proceedings of fourth international workshop on high performance big graph data management, analysis, and mining (BigGraphs 2017), Boston, USA
- 2. Reittu H, Bazsó F, Norros I (2017) Regular decomposition: an information and graph theoretic approach to stochastic block models. arXiv:1704.07114 [cs.IT]
- 3. Decelle A, Krzakala F, Moore C, Zdeborová L (2011) Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys Rev E 84:066106
- 4. Abbe E (2017) Community detection and stochastic block models: recent developments. arXiv:1703.10146v1 [math.PR]
- 5. Szemerédi E (1976) Regular partitions of graphs. Problèmes Combinatoires et Théorie des Graphes, number 260 in Colloques Internationaux C.N.R.S., Orsay, pp 399–401
- 6. Bolla M (2016) Relating multiway discrepancy and singular values of nonnegative rectangular matrices. Discrete Appl Math 203:26–34
- 7. Nepusz T, Négyessy L, Tusnády G, Bazsó F (2008) Reconstructing cortical networks: case of directed graphs with high level of reciprocity. In: Bollobás B, Miklós D (eds) Handbook of large-scale random networks. Bolyai society of mathematical sciences, vol 18. Springer, Berlin, pp 325–368
- 8. Pehkonen V, Reittu H (2011) Szemerédi-type clustering of peer-to-peer streaming system. In: Proceedings of Cnet 2011, San Francisco, USA
- 9. Reittu H, Bazsó F, Weiss R (2014) Regular decomposition of multivariate time series and other matrices. In: Fränti P, Brown P, Loog M, Escolano F, Pelillo M (eds) Proceedings of S+SSPR 2014, vol 8621 in LNCS. Springer, pp 424–433
- 10. Reittu H, Leskelä L, Räty T, Fiorucci M (2018) Analysis of large sparse graphs using regular decomposition of graph distance matrices. In: Proceedings of IEEE BigData 2018, Seattle, USA, pp 3783–3791, Workshop: advances in high dimensional (AdHD) big data, Ed. Sotiris Tasoulis
- 11. Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge
- 12. Peixoto TP (2013) Parsimonious module inference in large networks. Phys Rev Lett 110:148701
- 13. Caron F, Fox B (2017) Sparse graphs using exchangeable random measures. J R Stat Soc B 79(Part 5):1295–1366
- 14. Borgs C, Chayes J, Lee CE, Shah D (2017) Iterative collaborative filtering for sparse matrix estimation. arXiv:1712.00710
- 15. von Luxburg U, Belkin M, Bousquet O (2008) Consistency of spectral clustering. Ann Stat 36(2):555–586
- 16. Pollard D (1981) Strong consistency of k-means clustering. Ann Stat 9:35–40
- 17. Bolla M (2013) Spectral clustering and biclustering: learning large graphs and contingency tables. Wiley, Hoboken
- 18. Fox J, Lovász LM, Zhao Y (2017) On regularity lemmas and their algorithmic applications. arXiv:1604.00733v3 [math.CO]
- 19. Janson S, Łuczak T, Ruciński A (2000) Random graphs. Wiley, Hoboken
- 20. Hayes TP (2005) A large-deviation inequality for vector-valued martingales. http://www.cs.unm.edu/hayes/papers/VectorAzuma/VectorAzuma20030207.pdf
- 21. Rohe K, Chatterjee S, Yu B (2011) Spectral clustering and high-dimensional stochastic blockmodel. Ann Stat 39(4):1878–1915

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.