1 Introduction

Recent advances in social and information science have shown that linked data pervades our society and the natural world around us. A graph is an important data structure for representing objects and their relationships, and many real-world applications can be naturally modeled as graphs [8, 12, 17, 20]. Network representation learning has therefore become a fundamental task in graph analytics [16]. It converts each graph node into a fixed-length vector such that the representation vectors preserve the inherent properties and structures of the graph. While graph nodes cannot be fed directly into well-studied vector-space methods, the representation vectors can easily be used with feature-based machine learning methods such as LIBLINEAR.

One pioneering work on network representation learning is DeepWalk [22]. It generates multiple node sequences by random walks and runs the SkipGram algorithm [18] to compute node representations. Following the introduction of DeepWalk, many network representation learning approaches have been proposed, such as LINE [28], PTE [27], and node2vec [10]. A previous study revealed that DeepWalk theoretically factorizes a matrix derived from the random walk process between nodes [14]. Therefore, DeepWalk and its variants can be viewed as implicitly factorizing a closed-form matrix obtained from random walks. NetMF [25] is a matrix factorization-based network representation learning approach built on this observation. It achieves higher node classification accuracy than DeepWalk and its variants. However, NetMF must factorize a dense \(n \times n\) proximity matrix between nodes, where n is the number of nodes, and thus needs \(O(n^3)\) time and \(O(n^2)\) space; since almost all pairs of nodes are reachable within the specified random-walk length, the matrix is dense, making it prohibitively expensive to directly construct and factorize for large-scale graphs. Although several approaches have been proposed to reduce the computational cost, such as RandNE [31], FastRP [5], FREDE [29], LightNE [23], and REFINE [32], their computation time remains high for large-scale graphs since they need to perform expensive matrix computations. On the other hand, several researchers have proposed exploiting node connectivity relationships to perform network representation learning effectively. For example, Bhowmick et al. proposed LouvainNE [2], which recursively constructs a hierarchy of subgraphs using the Louvain method, a graph clustering approach [3]. It represents clusters as random vectors and aggregates them to generate the representation vectors of the nodes. However, since it needs to iterate graph clustering to obtain the hierarchical structure, LouvainNE still incurs high computation costs for large-scale graphs. In addition, Mo et al. recently proposed truss2vec [19]. This approach is designed to represent the graph's local community properties and global network topology. Specifically, it uses the truss number to capture the local community properties effectively. The truss number of an edge is obtained from the maximum k-truss in which the edge is contained; a k-truss is a maximal connected subgraph in which each edge is in at least \(k-2\) triangles. Since the truss number is higher-order structural information, it can capture the local community properties more effectively than simple neighborhood information such as degree. In addition, truss2vec exploits a random walk strategy to capture the global network topology effectively. Unlike the sampling strategies used in previous approaches, this random walk strategy can effectively capture the similarity of nodes within the same community since it is based on the structural similarities obtained from the truss number. However, truss2vec requires significant computation time to obtain the structural vectors of nodes used to compute the structural similarities.

As described above, the technical challenge is that previous approaches suffer from high computational costs when performing network representation learning on large-scale graphs. The contribution of this paper is an efficient and accurate network representation learning approach for large-scale graphs. Specifically, we propose Graph Clustering-based Network Representation Learning, GC-NRL. The proposed approach combines matrix factorization with graph clustering. To achieve high efficiency and accuracy, it efficiently factorizes a small similarity matrix between clusters to effectively compute the representation vectors of nodes. Specifically, we perform graph clustering on a given graph and compute a similarity matrix between the resulting clusters. We then compute the representation vectors of the clusters by factorizing this similarity matrix with the help of sparse matrices, and we determine the representation vectors of the nodes by referring to those of the clusters. Compared with matrix factorization-based approaches, our approach computes the representation vectors efficiently because it uses the similarity matrix between clusters instead of the proximity matrix between nodes. Compared with graph clustering-based approaches, it improves efficiency because, unlike the previous approach, it performs graph clustering only once and uses the similarities between clusters. By referring to the representation vectors of the clusters, nodes of the same cluster are placed close to each other in the representation space; even nodes in different clusters are placed close together if their clusters are well connected and have high similarity. Consequently, we can effectively compute the representation vectors. The main contributions are as follows:

  • Our approach effectively computes representations of nodes by factorizing a similarity matrix between clusters to capture their structural relationships.

  • The proposed approach expands and reduces the dimensionality of the representation vectors of the clusters efficiently and effectively by exploiting sparse matrices.

  • Experiments confirm that our approach is several orders of magnitude faster than the previous approaches while achieving higher node classification accuracy.

In the remainder of this paper, Sect. 2 describes related work, Sect. 3 gives an overview of the background, Sect. 4 introduces our approach, Sect. 5 reviews our experimental results, and Sect. 6 provides conclusions. A preliminary version of this article was published at DASFAA 2023 [9].

2 Related Work

Recently, network representation learning has been extensively studied [13]. The success of network representation learning has driven a lot of downstream graph analytics. Inspired by the success of neural methods for learning word embeddings, particularly the SkipGram model [18], many network representation learning approaches have been proposed. They typically sample node pairs close to each other by performing random walks and compute node representations using the SkipGram algorithm. DeepWalk first proposed using truncated random walks to capture the network structure [22]. Specifically, it samples pairs of nodes from the graph by performing random walks and then feeds them to SkipGram to obtain the node representations. LINE adopts a similar idea with an explicit objective function by setting the walk length to one or two [28]. It uses the negative sampling strategy [18] to reduce the computation time. PTE is an extension of LINE for heterogeneous text networks [27]. node2vec generalizes these methods by taking potentially biased random walks for enhanced flexibility [10]. Since the random walk-based approaches usually rely on stochastic gradient descent in the learning process, a large number of node pairs must be sampled until convergence to ensure the quality of the node representations; this incurs high computation costs. Levy et al. proved that the SkipGram model with negative sampling implicitly factorizes a shifted pointwise mutual information matrix of word co-occurrences [14]. Based on a similar idea, Qiu et al. showed that DeepWalk, LINE, PTE, and node2vec implicitly approximate and factorize a node proximity matrix, which is usually some transformation of the high-order adjacency matrix. Following these analyses, NetMF was proposed for network representation learning [25]. This approach is a generalized matrix factorization framework that uses SVD to unify several random walk-based approaches. It uses eigendecomposition to efficiently compute the proximity matrix. GraRep is another matrix factorization-based approach that directly applies SVD to preserve a high-order proximity matrix [4], while HOPE uses generalized SVD to preserve asymmetric transitivity in directed graphs [21]. However, the matrix factorization-based approaches also suffer from high computation costs since they need to perform expensive matrix factorization operations.

Several works have been proposed to speed up network representation learning. NetSMF, proposed by Qiu et al., is an efficient network representation learning approach based on NetMF [24]. This approach uses theories from spectral sparsification to sparsify the proximity matrix between nodes. To reduce the computation time, it uses a path-sampling approach to efficiently obtain the proximity matrix used in NetMF. In addition, it uses a randomized approach to reduce the high computation cost of SVD. RandNE is a Gaussian random projection method for network representation learning [31]. Specifically, it maps the graph into a low-dimensional representation space using a Gaussian random projection approach while preserving the high-order proximity between nodes. It uses an iterative projection procedure to reduce the computational cost by avoiding explicit computation of the high-order proximities. FastRP is another efficient approach to network representation learning [5]. It explicitly constructs a node proximity matrix that captures the structural relationships in the graph and normalizes the matrix entries based on node degree. It then applies a random projection to the proximity matrix to efficiently obtain the node representations. REFINE is an iterative projection approach that adds an orthogonal constraint on the embedding vectors based on randomized blocked QR with power iteration [32]. However, these approaches still need significant computation time since they perform expensive matrix computations to obtain the representation vectors. FREDE uses personalized PageRank to obtain proximities between nodes, and it exploits the frequent directions sketching process to efficiently compute SVD [29]. However, it needs a high computational cost to compute personalized PageRank. LouvainNE recursively constructs a hierarchy of subgraphs using the Louvain method to coarsen a large graph into smaller clusters [2]; the Louvain method is a graph clustering approach [3]. It obtains a representation of each subgraph at different hierarchy levels and aggregates them to generate the embedding vectors. Since LouvainNE needs to iterate graph clustering to obtain the hierarchical structure, it still incurs high computational costs. To efficiently perform network representation learning, Fahrbach et al. proposed RandomContraction [6] and Lin et al. proposed GPA [15]. However, they are preprocessing and initialization approaches, respectively, and are not used stand-alone; they need to be used with other embedding approaches such as LINE, NetMF, NetSMF, node2vec, and DeepWalk. As a result, these approaches still incur high computational costs.

3 Preliminaries

We introduce the background of this paper below. Table 1 lists the main symbols and their definitions. The problem of network representation learning for general graphs is formalized as follows: given graph \(G=({\mathbb {V}}, {\mathbb {E}})\), where \({\mathbb {V}}\) is the set of n nodes and \({\mathbb {E}}\) is the set of m edges, network representation learning computes a low-dimensional representation vector \({\textbf{x}}_v\) of dimension d for each node \(v \in {\mathbb {V}}\), where \(d \ll n\) is the predefined number of dimensions. Representation vector \({\textbf{x}}_v\) is intended to capture the structural property of node v.

Table 1 Definitions of main symbols

NetMF is the most popular matrix factorization-based approach [25]. It is a generalized framework that uses SVD to unify several random walk-based approaches such as DeepWalk [22], LINE [28], and PTE [27]. NetMF computes the following \(n \times n\) high-order proximity matrix between nodes:

$$\begin{aligned} \textstyle {\textbf{M}} = \log (\max ( {\textbf{M}}^\prime ,1)), \; {\textbf{M}}^\prime = \frac{vol(G)}{bT} \left( \sum\limits _{r=1}^T ({\textbf{D}}^{-1}{\textbf{A}})^r \right) {\textbf{D}}^{-1} \end{aligned}$$
(1)

In this equation, \(\log (\cdot )\) is an elementwise logarithm, \(vol(G) = \sum\nolimits _{1 \le i,j \le n} A[i,j]\) is the volume of the graph, where A[i,j] is the (i, j) element of the adjacency matrix \({\textbf{A}}\) corresponding to the edge weight from the jth node to the ith node, b is the number of negative samples [25], T is the window size, and \({\textbf{D}}\) is the diagonal matrix \({\textbf{D}} = \textrm{diag}({\textbf{A}}{\textbf{1}}_n)\), where \({\textbf{1}}_n\) is a vector of length n with all ones. NetMF obtains the representation vectors by using the d left singular vectors and the first d singular values computed from the SVD of matrix \({\textbf{M}}\). However, it needs \(O(n^3)\) time and \(O(n^2)\) space to compute and factorize the \(n \times n\) matrix \({\textbf{M}}\). Furthermore, NetMF degrades the quality of the representation vectors by truncating the small nonzero elements of the proximity matrix.
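To make this cost concrete, the following is a minimal NumPy sketch of Eq. (1) for a small, dense adjacency matrix; it only illustrates the formula and is not NetMF's actual implementation (the function name and parameter defaults are our assumptions).

```python
# A dense illustration of Eq. (1); assumes an undirected graph with no
# isolated nodes so that D is invertible.
import numpy as np

def netmf_proximity_matrix(A, T=10, b=1):
    """Return the NetMF matrix M of Eq. (1) for a dense adjacency matrix A."""
    vol = A.sum()                          # vol(G): sum of all edge weights
    D_inv = np.diag(1.0 / A.sum(axis=1))   # D^{-1}, with D = diag(A 1_n)
    P = D_inv @ A                          # one-step transition matrix D^{-1} A
    acc = np.zeros_like(A, dtype=float)
    P_r = np.eye(A.shape[0])
    for _ in range(T):                     # sum_{r=1}^{T} (D^{-1} A)^r
        P_r = P_r @ P
        acc += P_r
    M_prime = (vol / (b * T)) * acc @ D_inv
    return np.log(np.maximum(M_prime, 1.0))  # elementwise log(max(M', 1))
```

Even for this small sketch, both the summation and the final matrix are dense, which is exactly why the \(O(n^2)\) space and \(O(n^3)\) factorization cost become prohibitive for large graphs.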

4 Proposed Method

Instead of proximities between nodes, we exploit similarities between clusters to reduce the computational cost since the number of clusters is much smaller than the number of nodes [26]. Since the similarities between clusters represent the structural property of the graph, we can effectively generate the representation vectors by using them. Let \(c_i\) be the ith cluster; we perform graph clustering and compute the representation vector of each \(c_i\) by using a similarity matrix between clusters. We then compute the representation vectors of the nodes from the representation vectors of the clusters. Let \({\textbf{W}}\) be the row-normalized adjacency matrix and \({\textbf{r}}_{c_i}\) be the representation vector of \(c_i\); we compute the representation vector of node v as \({\textbf{x}}_v = \sum\nolimits _{i=1}^l {\sum\nolimits _{w \in {\mathbb {C}}_i}} W[v,w] {\textbf{r}}_{c_i}\). In this equation, l is the number of clusters, \({\mathbb {C}}_i\) is the set of nodes included in \(c_i\), and W[v,w] is the normalized edge weight between nodes v and w.
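As a concrete illustration, this aggregation can be written in a few lines. The sketch below assumes a SciPy sparse adjacency matrix `A`, an integer array `labels` with `labels[w]` giving the cluster index of node w, and the cluster representation matrix `R`; these variable names are ours, not the paper's.

```python
# A minimal sketch of x_v = sum_i sum_{w in C_i} W[v,w] r_{c_i}.
import numpy as np
import scipy.sparse as sp

def node_vectors(A, labels, R):
    """A: n x n sparse adjacency matrix; labels: length-n int array; R: l x d."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0                    # guard against isolated nodes
    W = sp.diags(1.0 / deg) @ A            # row-normalized adjacency matrix
    return W @ R[labels]                   # row w of R[labels] is r_{cluster(w)}
```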

4.1 Similarity Between Clusters

To compute representation vector \({\textbf{r}}_{i}\), we factorize the similarity matrix between clusters. Our approach uses the IncMod method for graph clustering since it can compute clusters efficiently [26]. Note that, since the IncMod method can handle both undirected and directed graphs, our approach can handle both types of graphs. The IncMod method automatically determines the number of clusters l based on the graph structure; the number of clusters cannot be specified in advance.

Our approach determines the similarity between clusters from the difference between the given graph and a random graph. Specifically, letting \({\textbf{S}}\) be the \(l \times l\) similarity matrix between clusters, we define its elements as follows:

$$\begin{aligned} S[i,j] = \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {C}}_j} A[v,w] - \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_j} A[v,w] \end{aligned}$$
(2)

Since \(\frac{1}{vol[G]} \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {V}}} A[v,w]\) is the fraction of the total edge weight connected to cluster \(c_i\), if we assume that G is a random graph, the second term on the right side of Eq. (2), \(\frac{1}{vol[G]} \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {V}}} A[v,w] \sum\nolimits _{v \in {\mathbb {V}}} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w]\), corresponds to the expected sum of edge weights connecting cluster \(c_j\) to cluster \(c_i\). On the other hand, the first term, \(\sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w]\), corresponds to the sum of edge weights actually connecting \(c_j\) to \(c_i\). Therefore, S[i,j] is positive if \(c_i\) and \(c_j\) are more strongly connected than expected in a random graph; otherwise, it is negative. As a result, \({\textbf{S}}\) effectively represents the structural relationships between the clusters. Therefore, even if nodes are included in different clusters, our approach can place the nodes closely in the representation space if their clusters have high similarity. Note that we have \(0 \le \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w] \le vol[G]\), \(0 \le \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {V}}} A[v,w] \le vol[G]\), and \(0 \le \sum\nolimits _{v \in {\mathbb {V}}} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w] \le vol[G]\). Therefore, from Eq. (2), we have

$$\begin{aligned} S[i,j]&= \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {C}}_j} A[v,w] - \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_j} A[v,w] \\ &\ge 0 - \frac{1}{vol[G]} \cdot vol[G] \cdot vol[G] = -vol[G] \end{aligned}$$
(3)

and

$$\begin{aligned} S[i,j]&= \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {C}}_j} A[v,w] - \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_j} A[v,w] \\ &\le vol[G] - \frac{1}{vol[G]} \cdot 0 \cdot 0 = vol[G] \end{aligned}$$
(4)

As a result, the range of S[ij] is given as follows:

$$\begin{aligned} \textstyle -vol[G] \le S[i,j] \le vol[G] \end{aligned}$$
(5)
Fig. 1 Example of cluster similarity computation

We provide an example of computing the elements of \({\textbf{S}}\). As shown in Fig. 1(1-1), the example graph has seven nodes grouped into two clusters, \(c_1\) and \(c_2\). Cluster \(c_1\) contains nodes \(v_1\), \(v_2\), and \(v_3\), and cluster \(c_2\) contains nodes \(v_4\), \(v_5\), \(v_6\), and \(v_7\). Note that the nodes within each cluster are directly connected to each other, and the two clusters are connected by a single bidirectional edge, as shown in Fig. 1(1-1). We assume all edge weights are 1. The resulting adjacency matrix is given in Fig. 1(1-2). Therefore, from Fig. 1(1-2), we have

$$\begin{aligned} \textstyle vol(G) = \sum\limits _{1 \le i,j \le 7} A[i,j] = 20 \end{aligned}$$
(6)

In addition, from Fig. 1(1-2), we have

$$\begin{aligned} \textstyle \sum\limits _{v \in {\mathbb {C}}_1} \sum\limits _{w \in {\mathbb {C}}_1} A[v,w] = 6 \end{aligned}$$
(7)

and

$$\begin{aligned} \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_1} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_1} A[v,w] = \frac{1}{20} \cdot 7 \cdot 7 \end{aligned}$$
(8)

As a result,

$$\begin{aligned} \textstyle S[1,1] = 6 -\frac{1}{20} \cdot 7 \cdot 7 = 3.55 \end{aligned}$$
(9)

Similarly, \(S[2,2] = 12 -\frac{1}{20} \cdot 13 \cdot 13 = 3.55\) and \(S[1,2] = S[2,1] =1 -\frac{1}{20} \cdot 7 \cdot 13 = -3.55\). Consequently, for the graph of Fig. 1(1-1), we have

$$\begin{aligned} \textstyle {\textbf{S}} = \left[ \begin{array}{cc} 3.55 &{} -3.55\\ -3.55 &{} 3.55 \end{array} \right] \end{aligned}$$
(10)

As shown in Eq. (10), although \(c_1\) and \(c_2\) are clusters of different sizes, they have the same self-similarity: \(S[1,1] = S[2,2]\). This indicates that the similarity given by Eq. (2) captures the structural property of the clusters; \(c_1\) and \(c_2\) are the same in the sense that each is fully connected internally and shares a single bidirectional edge with the other cluster.
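The computation above can be checked with a few lines of NumPy. The sketch below rebuilds the graph of Fig. 1(1-1), assuming the single inter-cluster edge connects \(v_3\) and \(v_4\) (the concrete endpoints do not change the cluster-level sums), and reproduces the values of Eq. (10).

```python
# Verifying S[1,1], S[2,2], and S[1,2] for the example of Fig. 1.
import numpy as np

A = np.zeros((7, 7))
for i in range(3):                     # c1 = {v1, v2, v3}: fully connected
    for j in range(3):
        if i != j:
            A[i, j] = 1.0
for i in range(3, 7):                  # c2 = {v4, ..., v7}: fully connected
    for j in range(3, 7):
        if i != j:
            A[i, j] = 1.0
A[2, 3] = A[3, 2] = 1.0                # single bidirectional edge between clusters

clusters = [list(range(0, 3)), list(range(3, 7))]
vol = A.sum()                          # vol(G) = 20

def S(i, j):
    actual = A[np.ix_(clusters[i], clusters[j])].sum()   # first term of Eq. (2)
    out_i = A[clusters[i], :].sum()                      # weight incident to c_i
    in_j = A[:, clusters[j]].sum()                       # weight incident to c_j
    return actual - out_i * in_j / vol                   # subtract the random-graph expectation

print(S(0, 0), S(1, 1), S(0, 1))       # 3.55, 3.55, -3.55
```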

Our approach uses SVD on \({\textbf{S}}\) to compute the representation vectors of the clusters. Specifically, we decompose \({\textbf{S}}\) as \({\textbf{S}} = {\textbf{U}} \varvec{\Sigma } {\textbf{V}}^\top \) and compute representation matrix \({\textbf{R}}\) as \({\textbf{R}}={\textbf{U}} \varvec{\Sigma }^{\frac{1}{2}}\). If \({\textbf{r}}_{i}\) is the ith row vector of \({\textbf{R}}\), we use \({\textbf{r}}_{i}\) as the representation vector of the ith cluster. However, using row vectors in \({\textbf{R}}\) has a problem in computing the representation vectors of the clusters. Since the size of \({\textbf{S}}\) is \(l \times l\), the length of \({\textbf{r}}_{i}\) is l. Therefore, if \(l < d\), the representation vectors of the clusters would be shorter than the representation vectors of the nodes with length d. As a result, it is difficult to effectively use the representation vectors of the clusters to compute the representation vectors of the nodes if \(l < d\).

4.2 Dimensionality Expansion

If \(l < d\), our approach expands the dimensionality of the representation vectors of the clusters by exploiting a sparse matrix. Let \({\textbf{E}}\) be the \(l \times d\) expansion matrix and \({\textbf{e}}_i\) be the ith column vector of \({\textbf{E}}\); we set the elements of \({\textbf{e}}_i\) as follows:

$$\begin{aligned} e_i[j] = {\left\{ \begin{array}{ll} \sqrt{\frac{l}{\log l}} &{} \text {with probability } \frac{\log l}{2l} \\ 0 &{} \text {with probability } 1 - \frac{\log l}{l} \\ -\sqrt{\frac{l}{\log l}} &{} \text {with probability } \frac{\log l}{2l} \end{array}\right. } \end{aligned}$$
(11)

To obtain \({\textbf{R}}\), we project the low-dimensional matrix \({\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\) into a higher-dimensional space by using \({\textbf{E}}\). Specifically, we expand the dimensionality of the representation vectors by computing \({\textbf{R}} = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}} {\textbf{E}}\), as shown in Algorithm 1. Let \({\textbf{r}}_i\) be the ith column vector of matrix \({\textbf{R}}\); we have the following property for \({\textbf{R}}\):

Lemma 1

Let \(E(\cdot )\) denote expectation. If \(i \ne j\) holds, we have \(E({\textbf{r}}_i^\top {\textbf{r}}_j) =0\).

Proof

Let \({\textbf{U}}^\prime = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\). Since \({\textbf{r}}_i = {\textbf{U}}^\prime {\textbf{e}}_i\) from Algorithm 1, we have

$$\begin{aligned} {\textbf{r}}_i^\top {\textbf{r}}_j = \sum _{k=1}^l \Bigl ( \sum _{i^\prime =1}^l U^\prime [k,i^\prime ] e_i[i^\prime ] \Bigr ) \Bigl ( \sum _{j^\prime =1}^l U^\prime [k,j^\prime ] e_j[j^\prime ] \Bigr ) = \sum _{k=1}^l \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \end{aligned}$$
(12)

If \(i \ne j\) holds, from Eq. (11), we have \(e_i[i^\prime ] e_j[j^\prime ]=\frac{l}{\log l}\) with probability \(\frac{2(\log l)^2}{(2l)^2}\), \(e_i[i^\prime ] e_j[j^\prime ]=-\frac{l}{\log l}\) with probability \(\frac{2(\log l)^2}{(2l)^2}\), and \(e_i[i^\prime ] e_j[j^\prime ]=0\) otherwise. Therefore, we have \(E(e_i[i^\prime ] e_j[j^\prime ])=0\). As a result, if \(i \ne j\), we have

$$\begin{aligned} \textstyle E \bigg ( \sum\limits _{k=1}^l \sum\limits _{i^\prime =1}^l \sum\limits _{j^\prime =1}^l U^\prime [k, i^\prime ] U^\prime [k, j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \bigg ) = 0 \end{aligned}$$
(13)

Therefore, we have \(E({\textbf{r}}_i^\top {\textbf{r}}_j) =0\) from Eq. (12). \(\square \)

Algorithm 1 (pseudocode figure)
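Since the pseudocode figure is not reproduced here, the following is a minimal NumPy sketch of Algorithm 1 as described above; it assumes the similarity matrix \({\textbf{S}}\) of Eq. (2) has already been computed and that \(l \ge 2\) (the function name and seed handling are our assumptions).

```python
# A sketch of Algorithm 1: expand the l-dimensional cluster representations
# to d dimensions (l < d) with the sparse expansion matrix E of Eq. (11).
import numpy as np

def expand_representations(S, d, seed=0):
    """S: l x l cluster similarity matrix, l < d. Returns R (l x d)."""
    rng = np.random.default_rng(seed)
    l = S.shape[0]
    # Entries of E are +/- sqrt(l / log l) with probability (log l)/(2l) each.
    p = np.log(l) / (2 * l)
    val = np.sqrt(l / np.log(l))
    E = rng.choice([val, 0.0, -val], size=(l, d), p=[p, 1.0 - 2 * p, p])
    # Full SVD of the small l x l matrix S (the O(l^3) step of Lemma 3).
    U, sigma, _ = np.linalg.svd(S)
    U_prime = U * np.sqrt(sigma)           # U Sigma^{1/2} (scales each column)
    return U_prime @ E                     # R = U Sigma^{1/2} E
```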

The column vectors of \({\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\) are orthogonal to each other since SVD produces orthogonal matrices. Specifically, let \({\textbf{u}}_i^\prime \) be the ith column vector of \({\textbf{U}}^\prime = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\); we have \(({\textbf{u}}_i^\prime )^\top {\textbf{u}}_j^\prime =0\) for \(i \ne j\), which is a necessary condition for preserving pairwise similarities between vectors [1]. On the other hand, Lemma 1 shows that the columns of matrix \({\textbf{R}}\) are orthogonal to each other in expectation, the same as those of matrix \({\textbf{U}}^\prime \). Consequently, Lemma 1 indicates that this preferable property of the representation vectors is preserved even after dimensionality expansion. In terms of the quality of the dimensionality expansion, we have the following lemma:

Lemma 2

Let \(V(\cdot )\) denote variance. If \(i \ne j\) holds, the following equation holds: \(V({\textbf{r}}_i^\top {\textbf{r}}_j) = \sum\nolimits _{k=1}^l \sum\nolimits _{i^\prime =1}^l \sum\nolimits _{j^\prime =1}^l (U^\prime [k,i^\prime ])^2 (U^\prime [k,j^\prime ])^2\).

Proof

From Eq. (12), we have

$$\begin{aligned} \textstyle V({\textbf{r}}_i^\top {\textbf{r}}_j)= \sum\limits _{k=1}^l V \bigg ( \sum\limits _{i^\prime =1}^l \sum\limits _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \bigg ) \end{aligned}$$
(14)

Since \(i \ne j\) holds, we have

$$\begin{aligned} \Bigl ( \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr )^2&= \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l (U^\prime [k,i^\prime ])^2 (U^\prime [k,j^\prime ])^2 (e_i[i^\prime ])^2 (e_j[j^\prime ])^2 \\&\quad + 2 \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l \sum _{i^{\prime \prime } < i^\prime } \sum _{j^{\prime \prime } < j^\prime } U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] U^\prime [k,i^{\prime \prime }] U^\prime [k,j^{\prime \prime }] e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] \end{aligned}$$
(15)

From Eq. (11), \((e_i[i^\prime ])^2 = \frac{l}{\log l}\) with probability \(\frac{\log l}{l}\); \((e_i[i^\prime ])^2 = 0\), otherwise. As a result, since \(i \ne j\), we have \(E((e_i[i^\prime ])^2 (e_j[j^\prime ] )^2) =\frac{l}{\log l} \frac{\log l}{l} \cdot \frac{l}{\log l} \frac{\log l}{l}=1\). In addition, since \(i \ne j\), \(i^{\prime \prime } < i^\prime \), and \(j^{\prime \prime } < j^\prime \) hold, we have \(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] =\frac{l^2}{(\log l)^2}\) with probability \(\frac{8(\log l)^4}{(2l)^4}\), \(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] = -\frac{l^2}{(\log l)^2}\) with probability \(\frac{8(\log l)^4}{(2l)^4}\), and \(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] = 0\) otherwise. Therefore, \(E(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }]) =0\) holds. As a result, from Eq. (15), we have

$$\begin{aligned} E \biggl ( \Bigl (\sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr )^2 \biggr ) = \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l (U^\prime [k,i^\prime ] U^\prime [k,j^\prime ])^2 \end{aligned}$$
(16)

As a result, from Eqs. (13), (14), and (16), if \(i \ne j\), we have

$$\begin{aligned} V({\textbf{r}}_i^\top {\textbf{r}}_j)&= \sum _{k=1}^l \biggl ( E \Bigl ( \Bigl ( \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr )^2 \Bigr ) - \Bigl ( E \Bigl ( \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr ) \Bigr )^2 \biggr ) \\&= \sum _{k=1}^l \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l (U^\prime [k,i^\prime ])^2 (U^\prime [k,j^\prime ])^2 \end{aligned}$$
(17)

which completes the proof. \(\Box \)

As shown in Lemma 2, since \(V({\textbf{r}}_i^\top {\textbf{r}}_j)\) is expressed as a cumulative summation over the l clusters, it is small when the number of clusters l is small. Besides, as shown in Lemma 2, \(V({\textbf{r}}_i^\top {\textbf{r}}_j)\) is independent of the number of dimensions d. This indicates that the column vectors of \({\textbf{R}}\) retain the preferable property for the representation vectors regardless of the expanded dimensionality. Algorithm 1 has the following property:

Lemma 3

Algorithm 1 takes \(O(d \log l + l^3)\) time and O(ld) space for computing representation matrix \({\textbf{R}}\).

Proof

It would take \(O(d \log l)\) time to compute \({\textbf{E}}\). It takes \(O(l^2)\) time to compute \({\textbf{S}}\). Besides, it needs \(O(l^3)\) time to compute SVD on \({\textbf{S}}\). It requires \(O(l^2 \log l)\) time to compute \({\textbf{R}} = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}} {\textbf{E}}\) since \({\textbf{e}}_i\) has \(\log l\) nonzero elements. It needs \(O(l^2)\), \(O(d \log l)\), and O(ld) spaces to hold \({\textbf{S}}\), \({\textbf{E}}\), and \({\textbf{R}}\), respectively. Therefore, Algorithm 1 needs \(O(d \log l + l^3)\) time and O(ld) space. \(\Box \)

4.3 SVD Computation

The previous section described the approach for the case of \(l < d\). If \(l \ge d\), we can obtain the representation vectors of the clusters with length d by computing the rank-d SVD of \({\textbf{S}}\). However, since the computation cost of SVD is \(O(l^3)\), it is impractical to compute SVD if we have a large number of clusters. As shown in the previous network representation learning approach [24], a randomized approach can reduce the computation time of SVD [11]. This approach derives a basic matrix from a randomized matrix to compute SVD. However, it still incurs a high computation cost to obtain the basic matrix by performing orthonormalization [30]; it needs \(O(l^2d+ld^2)\) time if applied to \({\textbf{S}}\). Furthermore, the memory cost of this approach is \(O(l^2)\) to hold \({\textbf{S}}\), which is quadratic in the number of clusters.

To efficiently compute SVD, we use \(l \times d\) basic matrix \({\textbf{B}}\) whose ith row vector, \({\textbf{b}}_i\), is set as follows:

$$\begin{aligned} b_i[j] = {\left\{ \begin{array}{ll} \sqrt{\frac{1}{\log d}} &{} \text {with probability } \frac{\log d}{2d} \\ 0 &{} \text {with probability } 1 - \frac{\log d}{d} \\ -\sqrt{\frac{1}{\log d}} &{} \text {with probability } \frac{\log d}{2d} \end{array}\right. } \end{aligned}$$
(18)

Unlike the previous approach, our approach does not perform orthonormalization to obtain the basic matrix. We use matrix \({\textbf{B}}\) to compute SVD; Algorithm 2 details the procedure. It uses basic matrix \({\textbf{B}}\) to project the large \(l \times l\) matrix \({\textbf{S}}\) into an \(l \times d\) low-dimensional matrix \({\textbf{S}}^\prime \) in the form \({\textbf{S}}^\prime =\textbf{SB}\). However, since the size of \({\textbf{S}}\) is \(l \times l\), directly holding \({\textbf{S}}\) requires a high memory cost. To reduce the memory cost, our approach processes the row vectors of \({\textbf{S}}\) one by one. Specifically, let \({\textbf{s}}_i\) be the ith row vector of \({\textbf{S}}\) and \({\textbf{s}}_i^\prime \) be the ith row vector of \({\textbf{S}}^\prime \); we compute the row vectors as \({\textbf{s}}^\prime _i ={\textbf{s}}_i{\textbf{B}}\), as shown in Algorithm 2. Since it does not hold \({\textbf{S}}\) directly, our approach reduces the memory cost of computing SVD. Let \({\textbf{V}}^{\prime \top }={\textbf{V}}^\top {\textbf{B}}^\top \); Algorithm 2 computes the representation matrix by factorizing the following \(l \times l\) matrix \(\tilde{{\textbf{S}}}\):

$$\begin{aligned} \tilde{{\textbf{S}}} = {\textbf{S}} \textbf{BB}^\top = {\textbf{S}}^\prime {\textbf{B}}^\top = {\textbf{U}} \varvec{\Sigma } {\textbf{V}}^{\prime \top } \end{aligned}$$
(19)

We have the following property for matrix \(\tilde{{\textbf{S}}}\):

Algorithm 2 (pseudocode figure)

Lemma 4

For matrix \(\tilde{{\textbf{S}}}\), \(E({\tilde{S}}[i,j]) = S[i,j]\) holds.

Proof

From Eq. (19), we have

$$\begin{aligned} {\tilde{S}}[i,j] = {\textbf{s}}_i {\textbf{B}} {\textbf{b}}_j^\top = {\textbf{s}}_i [{\textbf{b}}_1 {\textbf{b}}_j^\top \; {\textbf{b}}_2 {\textbf{b}}_j^\top \; \ldots \; {\textbf{b}}_l {\textbf{b}}_j^\top ]^\top = \sum _{j^\prime =1}^l S[i,j^\prime ] \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr ) \end{aligned}$$
(20)

If \(j^\prime = j\), we have \(b_{j^\prime }[k] b_{j}[k] = (b_{j}[k])^2\). From Eq. (18), \((b_{j}[k])^2 = \frac{1}{\log d}\) with probability \(\frac{\log d}{d}\); \((b_{j}[k])^2 = 0\), otherwise. As a result, if \(j^\prime = j\), we have

$$\begin{aligned} \textstyle E \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) = E \bigg ( \sum\limits _{k=1}^d (b_{j}[k])^2 \bigg ) = d \frac{1}{\log d} \frac{\log d}{d} = 1 \end{aligned}$$
(21)

If \(j^\prime \ne j\), from Eq. (18), we have \(b_{j^\prime }[k] b_{j}[k] = \frac{1}{\log d}\) with probability \(\frac{2(\log d)^2}{(2d)^2}\), \(b_{j^\prime }[k] b_{j}[k] = -\frac{1}{\log d}\) with probability \(\frac{2(\log d)^2}{(2d)^2}\), and \(b_{j^\prime }[k] b_{j}[k] = 0\) otherwise. As a result, we have \(E(b_{j^\prime }[k] b_{j}[k])=0\). Therefore, if \(j^\prime \ne j\), we have

$$\begin{aligned} \textstyle E \bigl ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigl ) =0 \end{aligned}$$
(22)

As a result, we have

$$\begin{aligned} \textstyle E({\tilde{S}}[i,j]) = E \bigg ( \sum\limits _{j^\prime =1}^l S[i,j^\prime ] \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \bigg ) = S[i,j] \end{aligned}$$
(23)

which completes the proof. \(\Box \)

As shown in Eq. (19), Algorithm 2 exactly factorizes matrix \(\tilde{{\textbf{S}}}\). Therefore, this lemma indicates that we can effectively approximate \({\textbf{S}}\) by \(\tilde{{\textbf{S}}}\). Concerning the approximation quality, we have the following property:

Lemma 5

We have \(V({\tilde{S}}[i,j]) = \bigl ( \frac{1}{\log d} - \frac{1}{d} \bigl ) (S[i,j])^2 + \frac{1}{d} \sum\limits _{j^\prime \ne j} (S[i,j^\prime ])^2\).

Proof

From Eq. (20), we have

$$\begin{aligned} V({\tilde{S}}[i,j]) \textstyle = \sum\limits _{j^\prime =1}^l (S[i,j^\prime ])^2 V \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \end{aligned}$$
(24)

If \(j^\prime = j\), we have

$$\begin{aligned} \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr )^2 = \sum _{k=1}^d (b_{j}[k])^4 + 2 \sum _{k=1}^d \sum _{k^\prime <k} (b_{j}[k])^2 (b_{j}[k^\prime ])^2 \end{aligned}$$
(25)

From Eq. (18), we have \((b_{j}[k])^4 = \frac{1}{(\log d)^2}\) with probability \(\frac{\log d}{d}\); \((b_{j}[k])^4 = 0\), otherwise. Therefore, we have

$$\begin{aligned} \textstyle E( \sum\limits _{k=1}^d (b_{j}[k])^4 ) = d \frac{1}{(\log d)^2} \frac{\log d}{d} = \frac{1}{\log d} \end{aligned}$$
(26)

Besides, since \(k^\prime <k\) holds, we have \((b_{j}[k])^2 (b_{j}[k^\prime ])^2 = \frac{1}{(\log d)^2}\) with probability \(\frac{(\log d)^2}{d^2}\); \((b_{j}[k])^2 (b_{j}[k^\prime ])^2 = 0\), otherwise. Therefore, \(E((b_{j}[k])^2 (b_{j}[k^\prime ])^2)=\frac{1}{(\log d)^2}\frac{(\log d)^2}{d^2} = \frac{1}{d^2}\). As a result,

$$\begin{aligned} \textstyle E(2 \sum\limits _{k=1}^d \sum\limits _{k^\prime <k} (b_{j}[k])^2 (b_{j}[k^\prime ])^2) = d(d-1)\frac{1}{d^2}=1-\frac{1}{d} \end{aligned}$$
(27)

Therefore, if \(j^\prime = j\), we have

$$\begin{aligned} \textstyle E \bigl ( \bigl ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigl )^2 \bigl ) = \frac{1}{\log d} + 1 -\frac{1}{d} \end{aligned}$$
(28)

As a result, if \(j^\prime = j\), we have the following equation from Eqs. (21) and (28):

$$\begin{aligned} \begin{aligned}&\textstyle V \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \\&\quad \textstyle = E \bigg ( \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg )^2 \bigg ) - \bigg ( E \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \bigg )^2 = \frac{1}{\log d} - \frac{1}{d} \end{aligned} \end{aligned}$$
(29)

If \(j^\prime \ne j\), the following equation holds:

$$\begin{aligned} \begin{aligned}&\textstyle \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg )^2 \\&\quad = \textstyle \sum\limits _{k=1}^d (b_{j^\prime }[k])^{2} (b_{j}[k])^2 + 2 \sum _{k=1}^d \sum\limits _{k^\prime < k} b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ] \end{aligned} \end{aligned}$$
(30)

From Eq. (18), \((b_{j^\prime }[k])^{2} (b_{j}[k])^2 = \frac{1}{(\log d)^2}\) holds with probability \(\frac{(\log d)^2}{d^2}\); \((b_{j^\prime }[k])^{2} (b_{j}[k])^2 = 0\), otherwise. Besides, if \(j^\prime \ne j\), since \(k^\prime <k\), \(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ]= \frac{1}{(\log d)^2}\) holds with probability \(\frac{8(\log d)^4}{(2d)^4}\), we have \(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ]= -\frac{1}{(\log d)^2}\) with probability \(\frac{8(\log d)^4}{(2d)^4}\), and we have \(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ]= 0\) otherwise. As a result, we have \(E(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ])=0\). Therefore, if \(j^\prime \ne j\), we have

$$\begin{aligned} \textstyle E \bigg ( \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg )^2 \bigg ) = d \frac{1}{(\log d)^2} \frac{(\log d)^2}{d^2} = \frac{1}{d} \end{aligned}$$
(31)

As a result, from Eqs. (22) and (31), if \(j^\prime \ne j\), we have

$$\begin{aligned} V\Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr ) = E \Bigl ( \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr )^2 \Bigr ) - \Bigl (E \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr ) \Bigr )^2 = \frac{1}{d} \end{aligned}$$
(32)

Therefore, from Eqs. (24), (29), and (32), we have

$$\begin{aligned} \begin{aligned} V({\tilde{S}}[i,j])&\textstyle = \sum\limits _{j^\prime =1}^l (S[i,j^\prime ])^2 V \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \\&\textstyle = \bigg ( \frac{1}{\log d} - \frac{1}{d} \bigg ) (S[i,j])^2 + \frac{1}{d} \sum\limits _{j^\prime \ne j} (S[i,j^\prime ])^2 \end{aligned} \end{aligned}$$
(33)

which completes the proof.\(\Box \)

Lemma 5 indicates that \(V({\tilde{S}}[i,j])\) becomes small as the number of representation dimensions d increases. Therefore, Algorithm 2 can compute the representation vectors more effectively as d increases. The computational and memory costs of Algorithm 2 are as follows:

Lemma 6

Algorithm 2 needs \(O(l^2+ld^2)\) time and O(ld) space for computing representation matrix \({\textbf{R}}\).

Proof

Since \({\textbf{b}}_i\) would have \(\log d\) nonzero elements, it would take \(O(l \log d)\) time to compute \({\textbf{B}}\). It needs \(O(l^2)\) time to compute \({\textbf{S}}\). Since each column of \({\textbf{B}}\) has \(\log d\) nonzero elements, it needs \(O(l \log d)\) time to compute \({\textbf{S}}^\prime ={\textbf{S}}{\textbf{B}}\). It takes \(O(ld^2)\) time to compute SVD on \({\textbf{S}}^\prime \) and O(ld) time to compute \({\textbf{R}} = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\). Besides, it needs O(l) space to hold \({\textbf{s}}_i\) and O(ld) space to hold \({\textbf{B}}\), \({\textbf{S}}^\prime \), and \({\textbf{R}}\). As a result, Algorithm 2 takes \(O(l^2+ld^2)\) time and O(ld) space. \(\Box \)
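To make the procedure concrete before moving on, the following is a minimal NumPy sketch of Algorithm 2 as described in this section; it assumes \(l \ge d \ge 2\) and that `s_row(i)` returns the ith row of \({\textbf{S}}\), so that \({\textbf{S}}\) itself is never materialized (the function names are our assumptions).

```python
# A sketch of Algorithm 2: project S row by row with the sparse basic
# matrix B of Eq. (18), then factorize the small l x d matrix S'.
import numpy as np

def project_and_factorize(s_row, l, d, seed=0):
    """s_row(i): ith row of the l x l similarity matrix S. Returns R (l x d)."""
    rng = np.random.default_rng(seed)
    # Entries of B are +/- sqrt(1 / log d) with probability (log d)/(2d) each;
    # no orthonormalization is performed.
    p = np.log(d) / (2 * d)
    val = np.sqrt(1.0 / np.log(d))
    B = rng.choice([val, 0.0, -val], size=(l, d), p=[p, 1.0 - 2 * p, p])
    S_prime = np.empty((l, d))
    for i in range(l):                     # s'_i = s_i B, keeping O(ld) memory
        S_prime[i] = s_row(i) @ B
    U, sigma, _ = np.linalg.svd(S_prime, full_matrices=False)
    return U * np.sqrt(sigma)              # R = U Sigma^{1/2}
```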

4.4 Representation Learning Algorithm

Algorithm 3 gives a full description of our algorithm. It first identifies the clusters using the IncMod method (line 1). If the number of clusters, l, is smaller than the number of dimensions, d, it computes the representation vectors of the clusters using Algorithm 1 (lines 2-3). Otherwise, it computes them using Algorithm 2 (lines 4-5). It then computes the representation vectors of the nodes from the obtained representation vectors of the clusters (lines 6-7). The computational and memory costs of Algorithm 3 are given as follows:

Algorithm 3 (pseudocode figure)

Theorem 1

Our approach takes \(O(m + d \log l + l^3)\) time and \(O(nd+m)\) space if \(l<d\) holds. Otherwise, it requires \(O(m + l^2 + ld^2)\) time and \(O(nd+m)\) space.

Proof

The IncMod method needs O(m) time and O(m) space [26]. If \(l<d\), as shown in Lemma 3, it takes \(O(d \log l + l^3)\) time and O(ld) space to compute the representation vectors of the clusters. Otherwise, it needs \(O(l^2+ld^2)\) time and O(ld) space, as shown in Lemma 6. It needs O(m) time to compute the representation vectors of the nodes and O(nd) space to hold them. As a result, our approach needs \(O(m + d \log l + l^3)\) time and \(O(nd+m)\) space if \(l<d\) holds. Otherwise, it takes \(O(m + l^2 + ld^2)\) time and \(O(nd+m)\) space. \(\Box \)
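To tie the pieces together, the sketch below mirrors the overall flow of Algorithm 3. The IncMod implementation is not publicly packaged, so the cluster labels are assumed to be given by some modularity-based clustering, and the helper functions are the hypothetical sketches shown earlier in this section.

```python
# A sketch of the GC-NRL pipeline (Algorithm 3) under the assumptions above.
import numpy as np
import scipy.sparse as sp

def gc_nrl(A, labels, d):
    """A: n x n sparse adjacency matrix; labels: length-n cluster indices (int array)."""
    n = A.shape[0]
    l = int(labels.max()) + 1
    # Cluster indicator matrix C (n x l): C[v, i] = 1 iff node v is in cluster c_i.
    C = sp.csr_matrix((np.ones(n), (np.arange(n), labels)), shape=(n, l))
    # Similarity matrix S of Eq. (2), written with matrix products.
    first = (C.T @ A @ C).toarray()                     # sum_{v in C_i, w in C_j} A[v,w]
    out_w = np.asarray((C.T @ A).sum(axis=1)).ravel()   # sum_{v in C_i, w in V} A[v,w]
    in_w = np.asarray((A @ C).sum(axis=0)).ravel()      # sum_{v in V, w in C_j} A[v,w]
    S = first - np.outer(out_w, in_w) / A.sum()
    if l < d:
        R = expand_representations(S, d)                # Algorithm 1 sketch
    else:
        # Algorithm 2 sketch; S is materialized here only for brevity.
        R = project_and_factorize(lambda i: S[i], l, d)
    return node_vectors(A, labels, R)                   # aggregate to node vectors
```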

Table 2 Characteristics of the experimental graphs

5 Experimental Evaluation

This section compares our approach with the previous approaches: FastRP [5], REFINE [32], RandNE [31], FREDE [29], LightNE [23], LouvainNE [2], NetMF [25], NetSMF [24], and truss2vec [19]. As shown in Table 2, we used five real-world graphs: CoCit (CC), com-DBLP (DBLP), YouTube (YT), com-LiveJournal (LJ), and com-Friendster (FS). For NetMF, we set the target rank of the eigendecomposition to 1,024 as in [25]. For NetMF and NetSMF, we set the number of negative samples to 20, following [18]. We set the window size used in NetMF, NetSMF, FastRP, REFINE, RandNE, and LightNE to ten, following [25]. For FREDE, we set the number of nodes from which personalized PageRank is computed to 1,000. For RandNE and FastRP, we set the weights used in the high-order proximity matrices to one, the same as in [25]. For REFINE and LightNE, we set the number of diffusion steps to two, following the previous paper [32]. For LouvainNE, we set the damping parameter to 0.01 following [2]. For truss2vec, we set the number of hops to two, the length of random walks to 80, the number of random walks starting at each node to ten, and the attenuation factor to 0.6, following [19]. We implemented all examined approaches in the same programming language, C++. We conducted the experiments on a Linux server with an Intel Xeon Platinum 8280 CPU (2.70 GHz) and 1.5 TB of memory.

Fig. 2 Processing time of each approach

5.1 Network Representation Learning Time

We evaluated the network representation learning time of each approach. Figure 2 plots the processing time to compute the representation vectors from the given graphs. In this experiment, we set the number of dimensions to \(d=128\). For DBLP, YT, LJ, and FS, we omit the results of NetMF since it failed to compute the representation vectors due to insufficient memory.

As shown in Fig. 2, our approach offers higher efficiency than the previous approaches; it is up to 5.5, 43.7, 54.6, 77.3, 99.9, 111.5, 188.6, 601.0, and 293071.0 times faster than LouvainNE, FastRP, NetSMF, REFINE, RandNE, LightNE, truss2vec, FREDE, and NetMF, respectively. NetMF incurs a high computational cost in applying eigendecomposition to the proximity matrix since the matrix has \(O(n^2)\) nonzero elements. To reduce the computation cost of NetMF, NetSMF uses the path-sampling approach. However, the path-sampling approach needs a large number of random walks to ensure the approximation quality of the similarity matrix; the number of random walks is m. RandNE and REFINE incur high computation costs in obtaining the orthogonal matrix used in the iterative projection procedure. LightNE also incurs high computation costs in performing orthonormalization for the basic matrix used in SVD. FastRP incurs high computation costs since it recursively performs expensive matrix computations to obtain the representation vectors. FREDE needs a high computational cost to compute personalized PageRank and SVD iteratively. The computation cost of LouvainNE is high since the Louvain method is performed iteratively to obtain the hierarchical structure. If h is the number of hops used to extract the neighbors of nodes, truss2vec needs \(O(n (\frac{m}{n})^{h})\) time to obtain high-order structural information of nodes; this increases exponentially with the number of hops. Therefore, truss2vec requires high computation costs. On the other hand, the proposed approach factorizes the small \(l \times l\) similarity matrix and performs graph clustering only once, so it generates the representation vectors efficiently.

5.2 Multi-label Node Classification

Table 3 Node classification performance of each approach

This experiment performed node classification. We used the one-vs-rest logistic regression model implemented in LIBLINEAR. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. We adopted the assumption made in DeepWalk: the number of labels of each node in the test data is given [22]. Table 3 shows the Micro-F1 and Macro-F1 scores, where we set the training ratio to \(5\%\) and the number of dimensions to \(d=128\).

Table 3 indicates that our approach yields higher Micro-F1 and Macro-F1 scores than the previous approaches. This is because, as described in Sect. 4.1, we exploit the structural similarity matrix to capture the relationships between clusters and compute the representation vectors of the clusters by factorizing this similarity matrix. NetMF applies the elementwise matrix logarithm to the proximity matrix. However, this harms the quality of the representations by truncating small nonzero elements. Although NetSMF improves the efficiency of NetMF, the path-sampling approach used in NetSMF yields a sparse similarity matrix in which only m node pairs can have nonzero elements in the \(n \times n\) matrix. Therefore, NetSMF has difficulty effectively representing the similarities between nodes. Even though the base matrices used by FastRP, REFINE, and RandNE are orthogonal, they do not accurately capture the structural property of nodes since the obtained representation vectors are no longer orthogonal after the iterative projection procedure. Although FREDE uses personalized PageRank, it fails to capture the structural property of nodes. Since the path-sampling approach used in LightNE yields a sparse proximity matrix where at most m node pairs can have nonzero elements, it has difficulty in effectively representing the proximities between nodes. Since LouvainNE does not exploit the relationships between clusters, it separates nodes independently in the representation space according to the clusters. Due to the small-world property, most nodes can be reached from each other within a small number of hops [7]. Therefore, neighboring nodes have similar structural similarities in truss2vec. As a result, truss2vec fails to capture the relationships between nodes effectively.

5.3 Dimensionality Expansion

We expand the dimensionality of the representation vectors of the clusters if \(l < d\) holds. We evaluated Micro-F1 and Macro-F1 scores to show the effectiveness of this approach by performing an ablation study on the CoCit (CC) dataset with \(d=256\). Note that the number of clusters obtained from this dataset by the IncMod method is 131; thus, \(l < d\) holds. Table 4 shows the results, where "W/O expansion" represents the approach that does not expand the dimensionality by projecting the representation vectors. We set the training ratio to \(5\%\).

Table 4 shows that the expansion improves the Micro-F1 and Macro-F1 scores by 2.5 and \(3.0\%\), respectively. If \(l <d\) holds, the representation vector of each cluster is shorter than the representation vectors of the nodes, so the approach without expansion has difficulty in effectively exploiting the representation vectors of the clusters. On the other hand, we compute representation matrix \({\textbf{R}}\) as \({\textbf{R}}=\textbf{U}\varvec{\Sigma }^\frac{1}{2} {\textbf{E}}\), where \({\textbf{E}}\) is the expansion matrix. Note that, letting \({\textbf{U}}^\prime =\textbf{U}\varvec{\Sigma }^\frac{1}{2}\) and \({\textbf{u}}_i^\prime \) be the ith column vector of \({\textbf{U}}^\prime \), we have \(({\textbf{u}}_i^\prime )^\top {\textbf{u}}_j^\prime = 0\) if \(i \ne j\). Similarly, as shown in Lemma 1, the column vectors of \({\textbf{R}}\) are orthogonal to each other in expectation. This indicates that this property of the columns of \({\textbf{U}}^\prime \) is preserved in \({\textbf{R}}\) even after dimensionality expansion. Therefore, we can improve accuracy by expanding the dimensionality.

Table 4 Results of the expansion approach

5.4 SVD Computation

If \(l \ge d\), we compute the rank-d SVD of the similarity matrix between clusters. Our approach obtains the basic matrix without orthonormalization by exploiting a sparse matrix to compute SVD efficiently. In this experiment, we evaluated the processing time of SVD with \(d=256\). Figure 3 shows the results, where "SVD" represents the approach that naively computes SVD. "Randomized" is the existing efficient approach that computes the basic matrix from a randomized matrix by performing orthonormalization [11]. In this experiment, we used the com-DBLP (DBLP), YouTube (YT), com-LiveJournal (LJ), and com-Friendster (FS) datasets since the numbers of clusters in these datasets are 1245, 3391, 3079, and 14249, respectively; \(d \le l\) holds for these datasets. Table 5 shows the Micro-F1 and Macro-F1 scores of each approach. In this experiment, we set the training ratio to \(5\%\).

Figure 3 and Table 5 show that our approach reduces the computation time without sacrificing accuracy; it is up to 5.4 and 830.8 times faster than the existing randomized approach and the naive SVD approach, respectively. Naive SVD requires greater computation time than the other approaches since it takes \(O(l^3)\) time to compute the representation vectors of the clusters by factorizing the \(l \times l\) similarity matrix. To improve the efficiency of SVD, the randomized approach computes SVD on a smaller \(l \times d\) matrix after projecting the similarity matrix using the \(l \times d\) basic matrix. However, it needs \(O(l^2d + ld^2)\) time to obtain the basic matrix by performing orthonormalization. Instead, the proposed approach efficiently computes the basic matrix using the sparse matrix given in Eq. (18). Specifically, our approach obtains the basic matrix in \(O(l \log d)\) time and computes SVD on the smaller \(l \times d\) matrix \({\textbf{S}}^\prime \), as shown in Algorithm 2. Besides, the proposed approach can effectively approximate the similarity matrix, as shown in Lemmas 4 and 5. Therefore, our approach can efficiently and accurately compute SVD on the similarity matrix.

Fig. 3 Computation time of each SVD approach

Fig. 4 Efficiency of each graph clustering approach

Table 5 Micro-F1 and Macro-F1 of each SVD approach
Table 6 Micro-F1 and Macro-F1 of each graph clustering approach

5.5 Graph Clustering Approach

As described in Sect. 4.1, the proposed approach uses the IncMod method [26] to compute the representation vectors of the clusters. In this experiment, we used the Louvain method [3] instead of the IncMod method as the graph clustering approach. Figure 4 shows the network representation learning time, and Table 6 shows the Micro-F1 and Macro-F1 scores, where "Louvain" is the result of the Louvain method-based approach. In this experiment, we set \(d=128\) and the training ratio to \(5\%\).

As shown in Fig. 4, the proposed approach is up to 19.0 times faster than the Louvain method-based approach. This is because the IncMod method is faster than the Louvain method. On the other hand, in terms of accuracy, the Micro-F1 and Macro-F1 scores of the Louvain method-based approach are not so different from those of the proposed approach. This is because the IncMod and Louvain methods are both modularity-based graph clustering approaches and yield almost the same clustering results [26]. These results indicate that we can improve efficiency by using the IncMod method without sacrificing accuracy.

6 Conclusions

This paper addressed the problem of improving the efficiency and accuracy of network representation learning. Our approach performs graph clustering only once and factorizes the similarity matrix between clusters to capture the structural property of the graph. Experiments show that our approach is more efficient than existing approaches while achieving higher accuracy.