1 Introduction

Recent advances in social and information science have shown that linked data pervades our society and the natural world around us. A graph is an important data structure for representing objects and their relationships, and many real-world applications can be naturally modeled as graphs [8, 12, 17, 20]. Network representation learning has therefore become a fundamental task in graph analytics [16]. It converts each graph node into a fixed-length vector such that the representation vectors preserve the inherent properties and structures of the graph. While graph nodes cannot be fed directly into well-studied vector-space methods, the representation vectors can easily be used with feature-based machine learning methods such as LIBLINEAR.

One pioneering work on network representation learning is DeepWalk [22]. It generates multiple node sequences by random walks and runs the SkipGram algorithm [18] to compute node representations. Following the introduction of DeepWalk, many network representation learning approaches have been proposed, such as LINE [28], PTE [27], and node2vec [10]. A previous study revealed that DeepWalk theoretically factorizes a matrix derived from the random walk process between nodes [14]. Therefore, DeepWalk and its variants can be viewed as implicitly factorizing a closed-form matrix obtained from random walks. NetMF [25] is a matrix factorization-based network representation learning approach built on this observation. It achieves higher node classification accuracy than DeepWalk and its variants. However, NetMF must factorize a dense \(n \times n\) proximity matrix between nodes, where n is the number of nodes, and thus needs \(O(n^3)\) time and \(O(n^2)\) space; since almost all pairs of nodes are reachable within the specified random-walk length, the matrix is dense, making it prohibitively expensive to directly construct and factorize for large-scale graphs. Although several approaches have been proposed to reduce the computational cost, such as RandNE [31], FastRP [5], FREDE [29], LightNE [23], and REFINE [32], their computation time remains high for large-scale graphs since they need to perform expensive matrix computations. On the other hand, several researchers have proposed exploiting node connectivity relationships to perform network representation learning effectively. For example, Bhowmick et al. proposed LouvainNE [2], which recursively constructs a hierarchy of subgraphs using the Louvain method, a graph clustering approach [3]. It represents clusters as random vectors and aggregates them to generate the representation vectors of the nodes. However, since it needs to iterate graph clustering to obtain the hierarchical structure, LouvainNE still incurs high computation costs for large-scale graphs. In addition, Mo et al. recently proposed truss2vec [19]. This approach is designed to represent the graph's local community properties and global network topology. Specifically, it uses the truss number to capture the local community properties effectively. The truss number of an edge is obtained from the maximum k-truss in which the edge is contained; a k-truss is a maximal connected subgraph in which each edge is in at least \(k-2\) triangles. Since the truss number is higher-order structural information, it can capture the local community properties more effectively than simple neighborhood information such as degree. In addition, truss2vec exploits a random walk strategy to capture the global network topology effectively. Unlike the sampling strategies used in previous approaches, this random walk strategy can effectively capture the similarity of nodes within the same community since it is based on the structural similarities obtained from the truss number. However, truss2vec requires significant computation time to obtain the structural vectors of nodes used to compute the structural similarities.

As described above, the technical challenge is that previous approaches suffer from high computational costs when performing network representation learning on large-scale graphs. The contribution of this paper is an efficient and accurate network representation learning approach for large-scale graphs. Specifically, we propose Graph Clustering-based Network Representation Learning, GC-NRL. The proposed approach combines matrix factorization with graph clustering. To achieve high efficiency and accuracy, it efficiently factorizes a small similarity matrix between clusters to effectively compute the representation vectors of nodes. Specifically, we perform graph clustering on a given graph and compute a similarity matrix between the resulting clusters. We then compute the representation vectors of the clusters by factorizing this similarity matrix with the help of sparse matrices, and we determine the representation vectors of the nodes by referring to those of the clusters. Compared with matrix factorization-based approaches, our approach computes the representation vectors efficiently because it uses the similarity matrix between clusters instead of the proximity matrix between nodes. Compared with graph clustering-based approaches, it improves efficiency because, unlike the previous approach, it performs graph clustering only once and uses the similarities between clusters. By referring to the representation vectors of the clusters, nodes of the same cluster are placed close to each other in the representation space; even nodes in different clusters are placed close together if their clusters are well connected and have high similarity. Consequently, we can effectively compute the representation vectors. The main contributions are as follows:

  • Our approach effectively computes representations of nodes by factorizing a similarity matrix between clusters to capture their structural relationships.

  • The proposed approach expands and reduces the dimensionality of the representation vectors of the clusters efficiently and effectively by exploiting sparse matrices.

  • Experiments confirm that our approach is several orders of magnitude faster than the previous approaches while achieving higher node classification accuracy.

In the remainder of this paper, Sect. 2 describes related work, Sect. 3 gives an overview of the background, Sect. 4 introduces our approach, Sect. 5 reviews our experimental results, and Sect. 6 provides conclusions. A preliminary version of this article was published at DASFAA 2023 [9].

2 Related Work

Recently, network representation learning has been extensively studied [13]. The success of network representation learning has driven a lot of downstream graph analytics. Inspired by the success of neural methods for learning word embeddings, particularly the SkipGram model [18], many network representation learning approaches have been proposed. They typically sample node pairs close to each other by performing random walks and compute node representations using the SkipGram algorithm. DeepWalk first proposed using truncated random walks to capture the network structure [22]. Specifically, it samples pairs of nodes from the graph by performing random walks and then feeds them to SkipGram to obtain the node representations. LINE adopts a similar idea with an explicit objective function by setting the walk length to one or two [28]. It uses the negative sampling strategy [18] to reduce the computation time. PTE is an extension of LINE for heterogeneous text networks [27]. node2vec generalizes these methods by taking potentially biased random walks for enhanced flexibility [10]. Since the random walk-based approaches usually rely on stochastic gradient descent in the learning process, a large number of node pairs must be sampled until convergence to ensure the quality of the node representations; this incurs high computation costs. Levy et al. proved that the SkipGram model with negative sampling implicitly factorizes a shifted pointwise mutual information matrix of word co-occurrences [14]. Based on a similar idea, Qiu et al. showed that DeepWalk, LINE, PTE, and node2vec implicitly approximate and factorize a node proximity matrix, which is usually some transformation of the high-order adjacency matrix. Following these analyses, NetMF was proposed for network representation learning [25]. This approach is a generalized matrix factorization framework that uses SVD to unify several random walk-based approaches. It uses eigendecomposition to efficiently compute the proximity matrix. GraRep is another matrix factorization-based approach that directly applies SVD to preserve a high-order proximity matrix [4], while HOPE uses generalized SVD to preserve asymmetric transitivity in directed graphs [21]. However, the matrix factorization-based approaches also suffer from high computation costs since they need to perform expensive matrix factorization operations.

Several works have been proposed to speed up network representation learning. NetSMF, proposed by Qiu et al., is an efficient network representation learning approach based on NetMF [24]. This approach uses theories from spectral sparsification to sparsify the proximity matrix between nodes. To reduce the computation time, it uses a path-sampling approach to efficiently obtain the proximity matrix used in NetMF. In addition, it uses a randomized approach to reduce the high computation cost of SVD. RandNE is a Gaussian random projection method for network representation learning [31]. Specifically, it maps the graph into a low-dimensional representation space using a Gaussian random projection approach while preserving the high-order proximity between nodes. It uses an iterative projection procedure to reduce the computational cost by avoiding explicit computation of the high-order proximities. FastRP is another efficient approach to network representation learning [5]. It explicitly constructs a node proximity matrix that captures the structural relationships in the graph and normalizes the matrix entries based on node degree. It then applies a random projection to the proximity matrix to efficiently obtain the node representations. REFINE is an iterative projection approach that adds an orthogonal constraint on the embedding vectors based on randomized blocked QR with power iteration [32]. However, these approaches still need significant computation time since they perform expensive matrix computations to obtain the representation vectors. FREDE uses personalized PageRank to obtain proximities between nodes, and it exploits the frequent directions sketching process to efficiently compute SVD [29]. However, it needs a high computational cost to compute personalized PageRank. LouvainNE recursively constructs a hierarchy of subgraphs using the Louvain method to coarsen a large graph into smaller clusters [2]; the Louvain method is a graph clustering approach [3]. It obtains a representation of each subgraph at different hierarchy levels and aggregates them to generate the embedding vectors. Since LouvainNE needs to iterate graph clustering to obtain the hierarchical structure, it still incurs high computational costs. To efficiently perform network representation learning, Fahrbach et al. proposed RandomContraction [6] and Lin et al. proposed GPA [15]. However, they are preprocessing and initialization approaches, respectively, and are not used stand-alone; they need to be used with other embedding approaches such as LINE, NetMF, NetSMF, node2vec, and DeepWalk. As a result, these approaches still incur high computational costs.

3 Preliminaries

We introduce the background of this paper below. Table 1 lists the main symbols and their definitions. The problem of network representation learning for general graphs is formalized as follows: given graph \(G=({\mathbb {V}}, {\mathbb {E}})\), where \({\mathbb {V}}\) is the set of n nodes and \({\mathbb {E}}\) is the set of m edges, network representation learning computes a low-dimensional representation vector \({\textbf{x}}_v\) of dimension d for each node \(v \in {\mathbb {V}}\), where \(d \ll n\) is the predefined number of dimensions. Representation vector \({\textbf{x}}_v\) is intended to capture the structural property of node v.

Table 1 Definitions of main symbols

NetMF is the most popular matrix factorization-based approach [25]. It is a generalized framework that uses SVD to unify several random walk-based approaches such as DeepWalk [22], LINE [28], and PTE [27]. NetMF computes the following \(n \times n\) high-order proximity matrix between nodes:

$$\begin{aligned} \textstyle {\textbf{M}} = \log (\max ( {\textbf{M}}^\prime ,1)), \; {\textbf{M}}^\prime = \frac{vol(G)}{bT} \left( \sum\limits _{r=1}^T ({\textbf{D}}^{-1}{\textbf{A}})^r \right) {\textbf{D}}^{-1} \end{aligned}$$
(1)

In this equation, \(\log (\cdot )\) is an elementwise logarithm, \(vol(G) = \sum\nolimits _{1 \le i,j \le n} A[i,j]\) is the volume of the graph, where A[i,j] is the (i, j) element of the adjacency matrix \({\textbf{A}}\) corresponding to the edge weight from the jth node to the ith node, b is the number of negative samples [25], T is the window size, and \({\textbf{D}}\) is the diagonal matrix \({\textbf{D}} = \textrm{diag}({\textbf{A}}{\textbf{1}}_n)\), where \({\textbf{1}}_n\) is a vector of length n with all ones. NetMF obtains the representation vectors by using the d left singular vectors and the first d singular values computed from the SVD of matrix \({\textbf{M}}\). However, it needs \(O(n^3)\) time and \(O(n^2)\) space to compute and factorize the \(n \times n\) matrix \({\textbf{M}}\). Furthermore, NetMF degrades the quality of the representation vectors by truncating the small nonzero elements of the proximity matrix.
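To make this cost concrete, the following is a minimal NumPy sketch of Eq. (1) for a small, dense adjacency matrix; it only illustrates the formula and is not NetMF's actual implementation (the function name and parameter defaults are our assumptions).

```python
# A dense illustration of Eq. (1); assumes an undirected graph with no
# isolated nodes so that D is invertible.
import numpy as np

def netmf_proximity_matrix(A, T=10, b=1):
    """Return the NetMF matrix M of Eq. (1) for a dense adjacency matrix A."""
    vol = A.sum()                          # vol(G): sum of all edge weights
    D_inv = np.diag(1.0 / A.sum(axis=1))   # D^{-1}, with D = diag(A 1_n)
    P = D_inv @ A                          # one-step transition matrix D^{-1} A
    acc = np.zeros_like(A, dtype=float)
    P_r = np.eye(A.shape[0])
    for _ in range(T):                     # sum_{r=1}^{T} (D^{-1} A)^r
        P_r = P_r @ P
        acc += P_r
    M_prime = (vol / (b * T)) * acc @ D_inv
    return np.log(np.maximum(M_prime, 1.0))  # elementwise log(max(M', 1))
```

Even for this small sketch, both the summation and the final matrix are dense, which is exactly why the \(O(n^2)\) space and \(O(n^3)\) factorization cost become prohibitive for large graphs.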

4 Proposed Method

Instead of proximities between nodes, we exploit similarities between clusters to reduce the computational cost since the number of clusters is much smaller than the number of nodes [26]. Since the similarities between clusters represent the structural property of the graph, we can effectively generate the representation vectors by using them. Let \(c_i\) be the ith cluster; we perform graph clustering and compute the representation vector of each \(c_i\) by using a similarity matrix between clusters. We then compute the representation vectors of the nodes from the representation vectors of the clusters. Let \({\textbf{W}}\) be the row-normalized adjacency matrix and \({\textbf{r}}_{c_i}\) be the representation vector of \(c_i\); we compute the representation vector of node v as \({\textbf{x}}_v = \sum\nolimits _{i=1}^l {\sum\nolimits _{w \in {\mathbb {C}}_i}} W[v,w] {\textbf{r}}_{c_i}\). In this equation, l is the number of clusters, \({\mathbb {C}}_i\) is the set of nodes included in \(c_i\), and W[v,w] is the normalized edge weight between nodes v and w.
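As a concrete illustration, this aggregation can be written in a few lines. The sketch below assumes a SciPy sparse adjacency matrix `A`, an integer array `labels` with `labels[w]` giving the cluster index of node w, and the cluster representation matrix `R`; these variable names are ours, not the paper's.

```python
# A minimal sketch of x_v = sum_i sum_{w in C_i} W[v,w] r_{c_i}.
import numpy as np
import scipy.sparse as sp

def node_vectors(A, labels, R):
    """A: n x n sparse adjacency matrix; labels: length-n int array; R: l x d."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0                    # guard against isolated nodes
    W = sp.diags(1.0 / deg) @ A            # row-normalized adjacency matrix
    return W @ R[labels]                   # row w of R[labels] is r_{cluster(w)}
```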

4.1 Similarity Between Clusters

To compute representation vector \({\textbf{r}}_{i}\), we factorize the similarity matrix between clusters. Our approach uses the IncMod method for graph clustering since it can compute clusters efficiently [26]. Note that, since the IncMod method can handle both undirected and directed graphs, our approach can handle both types of graphs. The IncMod method automatically determines the number of clusters l based on the graph structure; the number of clusters cannot be specified in advance.

Our approach determines the similarity between clusters from the difference between the given graph and a random graph. Specifically, letting \({\textbf{S}}\) be the \(l \times l\) similarity matrix between clusters, we define its elements as follows:

$$\begin{aligned} S[i,j] = \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {C}}_j} A[v,w] - \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_j} A[v,w] \end{aligned}$$
(2)

Since \(\frac{1}{vol[G]} \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {V}}} A[v,w]\) is the fraction of the total edge weight connected to cluster \(c_i\), if we assume that G is a random graph, the second term on the right side of Eq. (2), \(\frac{1}{vol[G]} \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {V}}} A[v,w] \sum\nolimits _{v \in {\mathbb {V}}} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w]\), corresponds to the expected sum of edge weights connecting cluster \(c_j\) to cluster \(c_i\). On the other hand, the first term, \(\sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w]\), corresponds to the sum of edge weights actually connecting \(c_j\) to \(c_i\). Therefore, S[i,j] is positive if \(c_i\) and \(c_j\) are more strongly connected than expected in a random graph; otherwise, it is negative. As a result, \({\textbf{S}}\) effectively represents the structural relationships between the clusters. Therefore, even if nodes are included in different clusters, our approach can place the nodes closely in the representation space if their clusters have high similarity. Note that we have \(0 \le \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w] \le vol[G]\), \(0 \le \sum\nolimits _{v \in {\mathbb {C}}_i} \sum\nolimits _{w \in {\mathbb {V}}} A[v,w] \le vol[G]\), and \(0 \le \sum\nolimits _{v \in {\mathbb {V}}} \sum\nolimits _{w \in {\mathbb {C}}_j} A[v,w] \le vol[G]\). Therefore, from Eq. (2), we have

$$\begin{aligned} S[i,j]&= \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {C}}_j} A[v,w] - \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_j} A[v,w] \\ &\ge 0 - \frac{1}{vol[G]} \cdot vol[G] \cdot vol[G] = -vol[G] \end{aligned}$$
(3)

and

$$\begin{aligned} S[i,j]&= \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {C}}_j} A[v,w] - \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_i} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_j} A[v,w] \\ &\le vol[G] - \frac{1}{vol[G]} \cdot 0 \cdot 0 = vol[G] \end{aligned}$$
(4)

As a result, the range of S[ij] is given as follows:

$$\begin{aligned} \textstyle -vol[G] \le S[i,j] \le vol[G] \end{aligned}$$
(5)
Fig. 1 Example of cluster similarity computation

We provide an example of computing the elements of \({\textbf{S}}\). As shown in Fig. 1(1-1), the example graph has seven nodes grouped into two clusters, \(c_1\) and \(c_2\). Cluster \(c_1\) contains nodes \(v_1\), \(v_2\), and \(v_3\), and cluster \(c_2\) contains nodes \(v_4\), \(v_5\), \(v_6\), and \(v_7\). Note that the nodes within each cluster are directly connected to each other, and the two clusters are connected by a single bidirectional edge, as shown in Fig. 1(1-1). We assume all edge weights are 1. The resulting adjacency matrix is given in Fig. 1(1-2). Therefore, from Fig. 1(1-2), we have

$$\begin{aligned} \textstyle vol(G) = \sum\limits _{1 \le i,j \le 7} A[i,j] = 20 \end{aligned}$$
(6)

In addition, from Fig. 1(1-2), we have

$$\begin{aligned} \textstyle \sum\limits _{v \in {\mathbb {C}}_1} \sum\limits _{w \in {\mathbb {C}}_1} A[v,w] = 6 \end{aligned}$$
(7)

and

$$\begin{aligned} \frac{1}{vol[G]} \sum _{v \in {\mathbb {C}}_1} \sum _{w \in {\mathbb {V}}} A[v,w] \sum _{v \in {\mathbb {V}}} \sum _{w \in {\mathbb {C}}_1} A[v,w] = \frac{1}{20} \cdot 7 \cdot 7 \end{aligned}$$
(8)

As a result,

$$\begin{aligned} \textstyle S[1,1] = 6 -\frac{1}{20} \cdot 7 \cdot 7 = 3.55 \end{aligned}$$
(9)

Similarly, \(S[2,2] = 12 -\frac{1}{20} \cdot 13 \cdot 13 = 3.55\) and \(S[1,2] = S[2,1] =1 -\frac{1}{20} \cdot 7 \cdot 13 = -3.55\). Consequently, for the graph of Fig. 1(1-1), we have

$$\begin{aligned} \textstyle {\textbf{S}} = \left[ \begin{array}{cc} 3.55 &{} -3.55\\ -3.55 &{} 3.55 \end{array} \right] \end{aligned}$$
(10)

As shown in Eq. (10), although \(c_1\) and \(c_2\) are clusters of different sizes, they have the same self-similarity: \(S[1,1] = S[2,2]\). This indicates that the similarity given by Eq. (2) captures the structural property of the clusters; \(c_1\) and \(c_2\) are the same in the sense that each is fully connected internally and shares a single bidirectional edge with the other cluster.
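The computation above can be checked with a few lines of NumPy. The sketch below rebuilds the graph of Fig. 1(1-1), assuming the single inter-cluster edge connects \(v_3\) and \(v_4\) (the concrete endpoints do not change the cluster-level sums), and reproduces the values of Eq. (10).

```python
# Verifying S[1,1], S[2,2], and S[1,2] for the example of Fig. 1.
import numpy as np

A = np.zeros((7, 7))
for i in range(3):                     # c1 = {v1, v2, v3}: fully connected
    for j in range(3):
        if i != j:
            A[i, j] = 1.0
for i in range(3, 7):                  # c2 = {v4, ..., v7}: fully connected
    for j in range(3, 7):
        if i != j:
            A[i, j] = 1.0
A[2, 3] = A[3, 2] = 1.0                # single bidirectional edge between clusters

clusters = [list(range(0, 3)), list(range(3, 7))]
vol = A.sum()                          # vol(G) = 20

def S(i, j):
    actual = A[np.ix_(clusters[i], clusters[j])].sum()   # first term of Eq. (2)
    out_i = A[clusters[i], :].sum()                      # weight incident to c_i
    in_j = A[:, clusters[j]].sum()                       # weight incident to c_j
    return actual - out_i * in_j / vol                   # subtract the random-graph expectation

print(S(0, 0), S(1, 1), S(0, 1))       # 3.55, 3.55, -3.55
```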

Our approach uses SVD on \({\textbf{S}}\) to compute the representation vectors of the clusters. Specifically, we decompose \({\textbf{S}}\) as \({\textbf{S}} = {\textbf{U}} \varvec{\Sigma } {\textbf{V}}^\top \) and compute representation matrix \({\textbf{R}}\) as \({\textbf{R}}={\textbf{U}} \varvec{\Sigma }^{\frac{1}{2}}\). If \({\textbf{r}}_{i}\) is the ith row vector of \({\textbf{R}}\), we use \({\textbf{r}}_{i}\) as the representation vector of the ith cluster. However, using row vectors in \({\textbf{R}}\) has a problem in computing the representation vectors of the clusters. Since the size of \({\textbf{S}}\) is \(l \times l\), the length of \({\textbf{r}}_{i}\) is l. Therefore, if \(l < d\), the representation vectors of the clusters would be shorter than the representation vectors of the nodes with length d. As a result, it is difficult to effectively use the representation vectors of the clusters to compute the representation vectors of the nodes if \(l < d\).

4.2 Dimensionality Expansion

If \(l < d\), our approach expands the dimensionality of the representation vectors of the clusters by exploiting a sparse matrix. Let \({\textbf{E}}\) be the \(l \times d\) expansion matrix and \({\textbf{e}}_i\) be the ith column vector of \({\textbf{E}}\); we set the elements of \({\textbf{e}}_i\) as follows:

$$\begin{aligned} e_i[j] = {\left\{ \begin{array}{ll} \sqrt{\frac{l}{\log l}} &{} \text {with probability } \frac{\log l}{2l} \\ 0 &{} \text {with probability } 1 - \frac{\log l}{l} \\ -\sqrt{\frac{l}{\log l}} &{} \text {with probability } \frac{\log l}{2l} \end{array}\right. } \end{aligned}$$
(11)

To obtain \({\textbf{R}}\), we project the low-dimensional matrix \({\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\) into a higher-dimensional space by using \({\textbf{E}}\). Specifically, we expand the dimensionality of the representation vectors by computing \({\textbf{R}} = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}} {\textbf{E}}\), as shown in Algorithm 1. Let \({\textbf{r}}_i\) be the ith column vector of matrix \({\textbf{R}}\); we have the following property for \({\textbf{R}}\):

Lemma 1

Let \(E(\cdot )\) denote expectation. If \(i \ne j\) holds, we have \(E({\textbf{r}}_i^\top {\textbf{r}}_j) =0\).

Proof

Let \({\textbf{U}}^\prime = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\). Since \({\textbf{r}}_i = {\textbf{U}}^\prime {\textbf{e}}_i\) from Algorithm 1, we have

$$\begin{aligned} {\textbf{r}}_i^\top {\textbf{r}}_j = \sum _{k=1}^l \Bigl ( \sum _{i^\prime =1}^l U^\prime [k,i^\prime ] e_i[i^\prime ] \Bigr ) \Bigl ( \sum _{j^\prime =1}^l U^\prime [k,j^\prime ] e_j[j^\prime ] \Bigr ) = \sum _{k=1}^l \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \end{aligned}$$
(12)

If \(i \ne j\) holds, from Eq. (11), we have \(e_i[i^\prime ] e_j[j^\prime ]=\frac{l}{\log l}\) with probability \(\frac{2(\log l)^2}{(2l)^2}\), \(e_i[i^\prime ] e_j[j^\prime ]=-\frac{l}{\log l}\) with probability \(\frac{2(\log l)^2}{(2l)^2}\), and \(e_i[i^\prime ] e_j[j^\prime ]=0\) otherwise. Therefore, we have \(E(e_i[i^\prime ] e_j[j^\prime ])=0\). As a result, if \(i \ne j\), we have

$$\begin{aligned} \textstyle E \bigg ( \sum\limits _{k=1}^l \sum\limits _{i^\prime =1}^l \sum\limits _{j^\prime =1}^l U^\prime [k, i^\prime ] U^\prime [k, j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \bigg ) = 0 \end{aligned}$$
(13)

Therefore, we have \(E({\textbf{r}}_i^\top {\textbf{r}}_j) =0\) from Eq. (12). \(\square \)

Algorithm 1 (pseudocode figure)
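Since the pseudocode figure is not reproduced here, the following is a minimal NumPy sketch of Algorithm 1 as described above; it assumes the similarity matrix \({\textbf{S}}\) of Eq. (2) has already been computed and that \(l \ge 2\) (the function name and seed handling are our assumptions).

```python
# A sketch of Algorithm 1: expand the l-dimensional cluster representations
# to d dimensions (l < d) with the sparse expansion matrix E of Eq. (11).
import numpy as np

def expand_representations(S, d, seed=0):
    """S: l x l cluster similarity matrix, l < d. Returns R (l x d)."""
    rng = np.random.default_rng(seed)
    l = S.shape[0]
    # Entries of E are +/- sqrt(l / log l) with probability (log l)/(2l) each.
    p = np.log(l) / (2 * l)
    val = np.sqrt(l / np.log(l))
    E = rng.choice([val, 0.0, -val], size=(l, d), p=[p, 1.0 - 2 * p, p])
    # Full SVD of the small l x l matrix S (the O(l^3) step of Lemma 3).
    U, sigma, _ = np.linalg.svd(S)
    U_prime = U * np.sqrt(sigma)           # U Sigma^{1/2} (scales each column)
    return U_prime @ E                     # R = U Sigma^{1/2} E
```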

The column vectors of \({\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\) are orthogonal to each other since SVD produces orthogonal matrices. Specifically, let \({\textbf{u}}_i^\prime \) be the ith column vector of \({\textbf{U}}^\prime = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\); we have \(({\textbf{u}}_i^\prime )^\top {\textbf{u}}_j^\prime =0\) for \(i \ne j\), which is a necessary condition for preserving pairwise similarities between vectors [1]. On the other hand, Lemma 1 shows that the columns of matrix \({\textbf{R}}\) are orthogonal to each other in expectation, the same as those of matrix \({\textbf{U}}^\prime \). Consequently, Lemma 1 indicates that this preferable property of the representation vectors is preserved even after dimensionality expansion. In terms of the quality of the dimensionality expansion, we have the following lemma:

Lemma 2

Let \(V(\cdot )\) denote variance. If \(i \ne j\) holds, the following equation holds: \(V({\textbf{r}}_i^\top {\textbf{r}}_j) = \sum\nolimits _{k=1}^l \sum\nolimits _{i^\prime =1}^l \sum\nolimits _{j^\prime =1}^l (U^\prime [k,i^\prime ])^2 (U^\prime [k,j^\prime ])^2\).

Proof

From Eq. (12), we have

$$\begin{aligned} \textstyle V({\textbf{r}}_i^\top {\textbf{r}}_j)= \sum\limits _{k=1}^l V \bigg ( \sum\limits _{i^\prime =1}^l \sum\limits _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \bigg ) \end{aligned}$$
(14)

Since \(i \ne j\) holds, we have

$$\begin{aligned} \Bigl ( \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr )^2&= \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l (U^\prime [k,i^\prime ])^2 (U^\prime [k,j^\prime ])^2 (e_i[i^\prime ])^2 (e_j[j^\prime ])^2 \\&\quad + 2 \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l \sum _{i^{\prime \prime } < i^\prime } \sum _{j^{\prime \prime } < j^\prime } U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] U^\prime [k,i^{\prime \prime }] U^\prime [k,j^{\prime \prime }] e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] \end{aligned}$$
(15)

From Eq. (11), \((e_i[i^\prime ])^2 = \frac{l}{\log l}\) with probability \(\frac{\log l}{l}\); \((e_i[i^\prime ])^2 = 0\), otherwise. As a result, since \(i \ne j\), we have \(E((e_i[i^\prime ])^2 (e_j[j^\prime ] )^2) =\frac{l}{\log l} \frac{\log l}{l} \cdot \frac{l}{\log l} \frac{\log l}{l}=1\). In addition, since \(i \ne j\), \(i^{\prime \prime } < i^\prime \), and \(j^{\prime \prime } < j^\prime \) hold, we have \(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] =\frac{l^2}{(\log l)^2}\) with probability \(\frac{8(\log l)^4}{(2l)^4}\), \(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] = -\frac{l^2}{(\log l)^2}\) with probability \(\frac{8(\log l)^4}{(2l)^4}\), and \(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }] = 0\) otherwise. Therefore, \(E(e_i[i^\prime ] e_j[j^\prime ] e_i[i^{\prime \prime }] e_j[j^{\prime \prime }]) =0\) holds. As a result, from Eq. (15), we have

$$\begin{aligned} E \biggl ( \Bigl (\sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr )^2 \biggr ) = \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l (U^\prime [k,i^\prime ] U^\prime [k,j^\prime ])^2 \end{aligned}$$
(16)

As a result, from Eqs. (13), (14), and (16), if \(i \ne j\), we have

$$\begin{aligned} V({\textbf{r}}_i^\top {\textbf{r}}_j)&= \sum _{k=1}^l \biggl ( E \Bigl ( \Bigl ( \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr )^2 \Bigr ) - \Bigl ( E \Bigl ( \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l U^\prime [k,i^\prime ] U^\prime [k,j^\prime ] e_i[i^\prime ] e_j[j^\prime ] \Bigr ) \Bigr )^2 \biggr ) \\&= \sum _{k=1}^l \sum _{i^\prime =1}^l \sum _{j^\prime =1}^l (U^\prime [k,i^\prime ])^2 (U^\prime [k,j^\prime ])^2 \end{aligned}$$
(17)

which completes the proof. \(\Box \)

As shown in Lemma 2, since \(V({\textbf{r}}_i^\top {\textbf{r}}_j)\) is expressed as a cumulative summation over the l clusters, it is small when the number of clusters l is small. Besides, as shown in Lemma 2, \(V({\textbf{r}}_i^\top {\textbf{r}}_j)\) is independent of the number of dimensions d. This indicates that the column vectors of \({\textbf{R}}\) retain the preferable property for the representation vectors regardless of the expanded dimensionality. Algorithm 1 has the following property:

Lemma 3

Algorithm 1 takes \(O(d \log l + l^3)\) time and O(ld) space for computing representation matrix \({\textbf{R}}\).

Proof

It would take \(O(d \log l)\) time to compute \({\textbf{E}}\). It takes \(O(l^2)\) time to compute \({\textbf{S}}\). Besides, it needs \(O(l^3)\) time to compute SVD on \({\textbf{S}}\). It requires \(O(l^2 \log l)\) time to compute \({\textbf{R}} = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}} {\textbf{E}}\) since \({\textbf{e}}_i\) has \(\log l\) nonzero elements. It needs \(O(l^2)\), \(O(d \log l)\), and O(ld) spaces to hold \({\textbf{S}}\), \({\textbf{E}}\), and \({\textbf{R}}\), respectively. Therefore, Algorithm 1 needs \(O(d \log l + l^3)\) time and O(ld) space. \(\Box \)

4.3 SVD Computation

The previous section described the approach for the case of \(l < d\). If \(l \ge d\), we can obtain the representation vectors of the clusters with length d by computing the rank-d SVD of \({\textbf{S}}\). However, since the computation cost of SVD is \(O(l^3)\), it is impractical to compute SVD if we have a large number of clusters. As shown in the previous network representation learning approach [24], a randomized approach can reduce the computation time of SVD [11]. This approach derives a basic matrix from a randomized matrix to compute SVD. However, it still incurs a high computation cost to obtain the basic matrix by performing orthonormalization [30]; it needs \(O(l^2d+ld^2)\) time if applied to \({\textbf{S}}\). Furthermore, the memory cost of this approach is \(O(l^2)\) to hold \({\textbf{S}}\), which is quadratic in the number of clusters.

To efficiently compute SVD, we use \(l \times d\) basic matrix \({\textbf{B}}\) whose ith row vector, \({\textbf{b}}_i\), is set as follows:

$$\begin{aligned} b_i[j] = {\left\{ \begin{array}{ll} \sqrt{\frac{1}{\log d}} &{} \text {with probability } \frac{\log d}{2d} \\ 0 &{} \text {with probability } 1 - \frac{\log d}{d} \\ -\sqrt{\frac{1}{\log d}} &{} \text {with probability } \frac{\log d}{2d} \end{array}\right. } \end{aligned}$$
(18)

Unlike the previous approach, our approach does not perform orthonormalization to obtain the basic matrix. We use matrix \({\textbf{B}}\) to compute SVD; Algorithm 2 details the procedure. It uses basic matrix \({\textbf{B}}\) to project the large \(l \times l\) matrix \({\textbf{S}}\) into an \(l \times d\) low-dimensional matrix \({\textbf{S}}^\prime \) in the form \({\textbf{S}}^\prime =\textbf{SB}\). However, since the size of \({\textbf{S}}\) is \(l \times l\), directly holding \({\textbf{S}}\) requires a high memory cost. To reduce the memory cost, our approach processes the row vectors of \({\textbf{S}}\) one by one. Specifically, let \({\textbf{s}}_i\) be the ith row vector of \({\textbf{S}}\) and \({\textbf{s}}_i^\prime \) be the ith row vector of \({\textbf{S}}^\prime \); we compute the row vectors as \({\textbf{s}}^\prime _i ={\textbf{s}}_i{\textbf{B}}\), as shown in Algorithm 2. Since it does not hold \({\textbf{S}}\) directly, our approach reduces the memory cost of computing SVD. Let \({\textbf{V}}^{\prime \top }={\textbf{V}}^\top {\textbf{B}}^\top \); Algorithm 2 computes the representation matrix by factorizing the following \(l \times l\) matrix \(\tilde{{\textbf{S}}}\):

$$\begin{aligned} \tilde{{\textbf{S}}} = {\textbf{S}} \textbf{BB}^\top = {\textbf{S}}^\prime {\textbf{B}}^\top = {\textbf{U}} \varvec{\Sigma } {\textbf{V}}^{\prime \top } \end{aligned}$$
(19)

We have the following property for matrix \(\tilde{{\textbf{S}}}\):

Algorithm 2 (pseudocode figure)

Lemma 4

For matrix \(\tilde{{\textbf{S}}}\), \(E({\tilde{S}}[i,j]) = S[i,j]\) holds.

Proof

From Eq. (19), we have

$$\begin{aligned} {\tilde{S}}[i,j] = {\textbf{s}}_i {\textbf{B}} {\textbf{b}}_j^\top = {\textbf{s}}_i [{\textbf{b}}_1 {\textbf{b}}_j^\top \; {\textbf{b}}_2 {\textbf{b}}_j^\top \; \ldots \; {\textbf{b}}_l {\textbf{b}}_j^\top ]^\top = \sum _{j^\prime =1}^l S[i,j^\prime ] \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr ) \end{aligned}$$
(20)

If \(j^\prime = j\), we have \(b_{j^\prime }[k] b_{j}[k] = (b_{j}[k])^2\). From Eq. (18), \((b_{j}[k])^2 = \frac{1}{\log d}\) with probability \(\frac{\log d}{d}\); \((b_{j}[k])^2 = 0\), otherwise. As a result, if \(j^\prime = j\), we have

$$\begin{aligned} \textstyle E \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) = E \bigg ( \sum\limits _{k=1}^d (b_{j}[k])^2 \bigg ) = d \frac{1}{\log d} \frac{\log d}{d} = 1 \end{aligned}$$
(21)

If \(j^\prime \ne j\), from Eq. (18), we have \(b_{j^\prime }[k] b_{j}[k] = \frac{1}{\log d}\) with probability \(\frac{2(\log d)^2}{(2d)^2}\), \(b_{j^\prime }[k] b_{j}[k] = -\frac{1}{\log d}\) with probability \(\frac{2(\log d)^2}{(2d)^2}\), and \(b_{j^\prime }[k] b_{j}[k] = 0\) otherwise. As a result, we have \(E(b_{j^\prime }[k] b_{j}[k])=0\). Therefore, if \(j^\prime \ne j\), we have

$$\begin{aligned} \textstyle E \bigl ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigl ) =0 \end{aligned}$$
(22)

As a result, we have

$$\begin{aligned} \textstyle E({\tilde{S}}[i,j]) = E \bigg ( \sum\limits _{j^\prime =1}^l S[i,j^\prime ] \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \bigg ) = S[i,j] \end{aligned}$$
(23)

which completes the proof. \(\Box \)

As shown in Eq. (19), Algorithm 2 exactly factorizes matrix \(\tilde{{\textbf{S}}}\). Therefore, this lemma indicates that we can effectively approximate \({\textbf{S}}\) by \(\tilde{{\textbf{S}}}\). Concerning the approximation quality, we have the following property:

Lemma 5

We have \(V({\tilde{S}}[i,j]) = \bigl ( \frac{1}{\log d} - \frac{1}{d} \bigl ) (S[i,j])^2 + \frac{1}{d} \sum\limits _{j^\prime \ne j} (S[i,j^\prime ])^2\).

Proof

From Eq. (20), we have

$$\begin{aligned} V({\tilde{S}}[i,j]) \textstyle = \sum\limits _{j^\prime =1}^l (S[i,j^\prime ])^2 V \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \end{aligned}$$
(24)

If \(j^\prime = j\), we have

$$\begin{aligned} \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr )^2 = \sum _{k=1}^d (b_{j}[k])^4 + 2 \sum _{k=1}^d \sum _{k^\prime <k} (b_{j}[k])^2 (b_{j}[k^\prime ])^2 \end{aligned}$$
(25)

From Eq. (18), we have \((b_{j}[k])^4 = \frac{1}{(\log d)^2}\) with probability \(\frac{\log d}{d}\); \((b_{j}[k])^4 = 0\), otherwise. Therefore, we have

$$\begin{aligned} \textstyle E( \sum\limits _{k=1}^d (b_{j}[k])^4 ) = d \frac{1}{(\log d)^2} \frac{\log d}{d} = \frac{1}{\log d} \end{aligned}$$
(26)

Besides, since \(k^\prime <k\) holds, we have \((b_{j}[k])^2 (b_{j}[k^\prime ])^2 = \frac{1}{(\log d)^2}\) with probability \(\frac{(\log d)^2}{d^2}\); \((b_{j}[k])^2 (b_{j}[k^\prime ])^2 = 0\), otherwise. Therefore, \(E((b_{j}[k])^2 (b_{j}[k^\prime ])^2)=\frac{1}{(\log d)^2}\frac{(\log d)^2}{d^2} = \frac{1}{d^2}\). As a result,

$$\begin{aligned} \textstyle E(2 \sum\limits _{k=1}^d \sum\limits _{k^\prime <k} (b_{j}[k])^2 (b_{j}[k^\prime ])^2) = d(d-1)\frac{1}{d^2}=1-\frac{1}{d} \end{aligned}$$
(27)

Therefore, if \(j^\prime = j\), we have

$$\begin{aligned} \textstyle E \bigl ( \bigl ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigl )^2 \bigl ) = \frac{1}{\log d} + 1 -\frac{1}{d} \end{aligned}$$
(28)

As a result, if \(j^\prime = j\), we have the following equation from Eqs. (21) and (28):

$$\begin{aligned} \begin{aligned}&\textstyle V \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \\&\quad \textstyle = E \bigg ( \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg )^2 \bigg ) - \bigg ( E \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \bigg )^2 = \frac{1}{\log d} - \frac{1}{d} \end{aligned} \end{aligned}$$
(29)

If \(j^\prime \ne j\), the following equation holds:

$$\begin{aligned} \begin{aligned}&\textstyle \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg )^2 \\&\quad = \textstyle \sum\limits _{k=1}^d (b_{j^\prime }[k])^{2} (b_{j}[k])^2 + 2 \sum _{k=1}^d \sum\limits _{k^\prime < k} b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ] \end{aligned} \end{aligned}$$
(30)

From Eq. (18), \((b_{j^\prime }[k])^{2} (b_{j}[k])^2 = \frac{1}{(\log d)^2}\) holds with probability \(\frac{(\log d)^2}{d^2}\); \((b_{j^\prime }[k])^{2} (b_{j}[k])^2 = 0\), otherwise. Besides, if \(j^\prime \ne j\), since \(k^\prime <k\), \(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ]= \frac{1}{(\log d)^2}\) holds with probability \(\frac{8(\log d)^4}{(2d)^4}\), we have \(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ]= -\frac{1}{(\log d)^2}\) with probability \(\frac{8(\log d)^4}{(2d)^4}\), and we have \(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ]= 0\) otherwise. As a result, we have \(E(b_{j^\prime }[k] b_{j}[k] b_{j^\prime }[k^\prime ] b_{j}[k^\prime ])=0\). Therefore, if \(j^\prime \ne j\), we have

$$\begin{aligned} \textstyle E \bigg ( \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg )^2 \bigg ) = d \frac{1}{(\log d)^2} \frac{(\log d)^2}{d^2} = \frac{1}{d} \end{aligned}$$
(31)

As a result, from Eqs. (22) and (31), if \(j^\prime \ne j\), we have

$$\begin{aligned} V\Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr ) = E \Bigl ( \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr )^2 \Bigr ) - \Bigl (E \Bigl ( \sum _{k=1}^d b_{j^\prime }[k] b_{j}[k] \Bigr ) \Bigr )^2 = \frac{1}{d} \end{aligned}$$
(32)

Therefore, from Eqs. (24), (29), and (32), we have

$$\begin{aligned} \begin{aligned} V({\tilde{S}}[i,j])&\textstyle = \sum\limits _{j^\prime =1}^l (S[i,j^\prime ])^2 V \bigg ( \sum\limits _{k=1}^d b_{j^\prime }[k] b_{j}[k] \bigg ) \\&\textstyle = \bigg ( \frac{1}{\log d} - \frac{1}{d} \bigg ) (S[i,j])^2 + \frac{1}{d} \sum\limits _{j^\prime \ne j} (S[i,j^\prime ])^2 \end{aligned} \end{aligned}$$
(33)

which completes the proof.\(\Box \)

Lemma 5 indicates that \(V({\tilde{S}}[i,j])\) becomes small as the number of representation dimensions d increases. Therefore, Algorithm 2 can compute the representation vectors more effectively as d increases. The computational and memory costs of Algorithm 2 are as follows:

Lemma 6

Algorithm 2 needs \(O(l^2+ld^2)\) time and O(ld) space for computing representation matrix \({\textbf{R}}\).

Proof

Since \({\textbf{b}}_i\) would have \(\log d\) nonzero elements, it would take \(O(l \log d)\) time to compute \({\textbf{B}}\). It needs \(O(l^2)\) time to compute \({\textbf{S}}\). Since each column of \({\textbf{B}}\) has \(\log d\) nonzero elements, it needs \(O(l \log d)\) time to compute \({\textbf{S}}^\prime ={\textbf{S}}{\textbf{B}}\). It takes \(O(ld^2)\) time to compute SVD on \({\textbf{S}}^\prime \) and O(ld) time to compute \({\textbf{R}} = {\textbf{U}}\varvec{\Sigma }^{\frac{1}{2}}\). Besides, it needs O(l) space to hold \({\textbf{s}}_i\) and O(ld) space to hold \({\textbf{B}}\), \({\textbf{S}}^\prime \), and \({\textbf{R}}\). As a result, Algorithm 2 takes \(O(l^2+ld^2)\) time and O(ld) space. \(\Box \)
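To make the procedure concrete before moving on, the following is a minimal NumPy sketch of Algorithm 2 as described in this section; it assumes \(l \ge d \ge 2\) and that `s_row(i)` returns the ith row of \({\textbf{S}}\), so that \({\textbf{S}}\) itself is never materialized (the function names are our assumptions).

```python
# A sketch of Algorithm 2: project S row by row with the sparse basic
# matrix B of Eq. (18), then factorize the small l x d matrix S'.
import numpy as np

def project_and_factorize(s_row, l, d, seed=0):
    """s_row(i): ith row of the l x l similarity matrix S. Returns R (l x d)."""
    rng = np.random.default_rng(seed)
    # Entries of B are +/- sqrt(1 / log d) with probability (log d)/(2d) each;
    # no orthonormalization is performed.
    p = np.log(d) / (2 * d)
    val = np.sqrt(1.0 / np.log(d))
    B = rng.choice([val, 0.0, -val], size=(l, d), p=[p, 1.0 - 2 * p, p])
    S_prime = np.empty((l, d))
    for i in range(l):                     # s'_i = s_i B, keeping O(ld) memory
        S_prime[i] = s_row(i) @ B
    U, sigma, _ = np.linalg.svd(S_prime, full_matrices=False)
    return U * np.sqrt(sigma)              # R = U Sigma^{1/2}
```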

4.4 Representation Learning Algorithm

Algorithm 3 gives a full description of our algorithm. It first identifies the clusters using the IncMod method (line 1). If the number of clusters, l, is smaller than the number of dimensions, d, it computes the representation vectors of the clusters using Algorithm 1 (lines 2-3). Otherwise, it computes them using Algorithm 2 (lines 4-5). It then computes the representation vectors of the nodes from the obtained representation vectors of the clusters (lines 6-7). The computational and memory costs of Algorithm 3 are given as follows:

Algorithm 3 (pseudocode figure)

Theorem 1

Our approach takes \(O(m + d \log l + l^3)\) time and \(O(nd+m)\) space if \(l<d\) holds. Otherwise, it requires \(O(m + l^2 + ld^2)\) time and \(O(nd+m)\) space.

Proof

The IncMod method needs O(m) time and O(m) space [26]. If \(l<d\), as shown in Lemma 3, it takes \(O(d \log l + l^3)\) time and O(ld) space to compute the representation vectors of the clusters. Otherwise, it needs \(O(l^2+ld^2)\) time and O(ld) space, as shown in Lemma 6. It needs O(m) time to compute the representation vectors of the nodes and O(nd) space to hold them. As a result, our approach needs \(O(m + d \log l + l^3)\) time and \(O(nd+m)\) space if \(l<d\) holds. Otherwise, it takes \(O(m + l^2 + ld^2)\) time and \(O(nd+m)\) space. \(\Box \)
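To tie the pieces together, the sketch below mirrors the overall flow of Algorithm 3. The IncMod implementation is not publicly packaged, so the cluster labels are assumed to be given by some modularity-based clustering, and the helper functions are the hypothetical sketches shown earlier in this section.

```python
# A sketch of the GC-NRL pipeline (Algorithm 3) under the assumptions above.
import numpy as np
import scipy.sparse as sp

def gc_nrl(A, labels, d):
    """A: n x n sparse adjacency matrix; labels: length-n cluster indices (int array)."""
    n = A.shape[0]
    l = int(labels.max()) + 1
    # Cluster indicator matrix C (n x l): C[v, i] = 1 iff node v is in cluster c_i.
    C = sp.csr_matrix((np.ones(n), (np.arange(n), labels)), shape=(n, l))
    # Similarity matrix S of Eq. (2), written with matrix products.
    first = (C.T @ A @ C).toarray()                     # sum_{v in C_i, w in C_j} A[v,w]
    out_w = np.asarray((C.T @ A).sum(axis=1)).ravel()   # sum_{v in C_i, w in V} A[v,w]
    in_w = np.asarray((A @ C).sum(axis=0)).ravel()      # sum_{v in V, w in C_j} A[v,w]
    S = first - np.outer(out_w, in_w) / A.sum()
    if l < d:
        R = expand_representations(S, d)                # Algorithm 1 sketch
    else:
        # Algorithm 2 sketch; S is materialized here only for brevity.
        R = project_and_factorize(lambda i: S[i], l, d)
    return node_vectors(A, labels, R)                   # aggregate to node vectors
```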

Table 2 Characteristics of the experimental graphs

5 Experimental Evaluation

This section compares our approach with the previous approaches: FastRP [5], REFINE [32], RandNE [31], FREDE [29], LightNE [23], LouvainNE [2], NetMF [25], NetSMF [24], and truss2vec [19]. As shown in Table 2, we used five real-world graphs: CoCit (CC), com-DBLP (DBLP), YouTube (YT), com-LiveJournal (LJ), and com-Friendster (FS). For NetMF, we set the target rank of the eigendecomposition to 1,024 as in [25]. For NetMF and NetSMF, we set the number of negative samples to 20, following [18]. We set the window size used in NetMF, NetSMF, FastRP, REFINE, RandNE, and LightNE to ten, following [25]. For FREDE, we set the number of nodes from which personalized PageRank is computed to 1,000. For RandNE and FastRP, we set the weights used in the high-order proximity matrices to one, the same as in [25]. For REFINE and LightNE, we set the number of diffusion steps to two, following the previous paper [32]. For LouvainNE, we set the damping parameter to 0.01 following [2]. For truss2vec, we set the number of hops to two, the length of random walks to 80, the number of random walks starting at each node to ten, and the attenuation factor to 0.6, following [19]. We implemented all examined approaches in the same programming language, C++. We conducted the experiments on a Linux server with an Intel Xeon Platinum 8280 CPU (2.70 GHz) and 1.5 TB of memory.

Fig. 2 Processing time of each approach

5.1 Network Representation Learning Time

We evaluated the network representation learning time of each approach. Figure 2 plots the processing time to compute the representation vectors from the given graphs. In this experiment, we set the number of dimensions to \(d=128\). For DBLP, YT, LJ, and FS, we omit the results of NetMF since it failed to compute the representation vectors due to insufficient memory.

As shown in Fig. 2, our approach offers higher efficiency than the previous approaches; it is up to 5.5, 43.7, 54.6, 77.3, 99.9, 111.5, 188.6, 601.0, and 293071.0 times faster than LouvainNE, FastRP, NetSMF, REFINE, RandNE, LightNE, truss2vec, FREDE, and NetMF, respectively. NetMF incurs a high computational cost in applying eigendecomposition to the proximity matrix since the matrix has \(O(n^2)\) nonzero elements. To reduce the computation cost of NetMF, NetSMF uses the path-sampling approach. However, the path-sampling approach needs a large number of random walks to ensure the approximation quality of the similarity matrix; the number of random walks is m. RandNE and REFINE incur high computation costs in obtaining the orthogonal matrix used in the iterative projection procedure. LightNE also incurs high computation costs in performing orthonormalization for the basic matrix used in SVD. FastRP incurs high computation costs since it recursively performs expensive matrix computations to obtain the representation vectors. FREDE needs a high computational cost to compute personalized PageRank and SVD iteratively. The computation cost of LouvainNE is high since the Louvain method is performed iteratively to obtain the hierarchical structure. If h is the number of hops used to extract the neighbors of nodes, truss2vec needs \(O(n (\frac{m}{n})^{h})\) time to obtain high-order structural information of nodes; this increases exponentially with the number of hops. Therefore, truss2vec requires high computation costs. On the other hand, the proposed approach factorizes the small \(l \times l\) similarity matrix and performs graph clustering only once, so it generates the representation vectors efficiently.

5.2 Multi-label Node Classification

Table 3 Node classification performance of each approach

This experiment performed node classification. We used the one-vs-rest logistic regression model implemented in LIBLINEAR. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. We adopted the assumption made in DeepWalk: the number of labels of each node in the test data is given [22]. Table 3 shows the Micro-F1 and Macro-F1 scores, where we set the training ratio to \(5\%\) and the number of dimensions to \(d=128\).

Table 3 indicates that our approach yields higher Micro-F1 and Macro-F1 scores than the previous approaches. This is because, as described in Sect. 4.1, we exploit the structural similarity matrix to capture the relationships between clusters and compute the representation vectors of the clusters by factorizing this similarity matrix. NetMF applies the elementwise matrix logarithm to the proximity matrix. However, this harms the quality of the representations by truncating small nonzero elements. Although NetSMF improves the efficiency of NetMF, the path-sampling approach used in NetSMF yields a sparse similarity matrix in which only m node pairs can have nonzero elements in the \(n \times n\) matrix. Therefore, NetSMF has difficulty effectively representing the similarities between nodes. Even though the base matrices used by FastRP, REFINE, and RandNE are orthogonal, they do not accurately capture the structural property of nodes since the obtained representation vectors are no longer orthogonal after the iterative projection procedure. Although FREDE uses personalized PageRank, it fails to capture the structural property of nodes. Since the path-sampling approach used in LightNE yields a sparse proximity matrix where at most m node pairs can have nonzero elements, it has difficulty in effectively representing the proximities between nodes. Since LouvainNE does not exploit the relationships between clusters, it separates nodes independently in the representation space according to the clusters. Due to the small-world property, most nodes can be reached from each other within a small number of hops [7]. Therefore, neighboring nodes have similar structural similarities in truss2vec. As a result, truss2vec fails to capture the relationships between nodes effectively.

5.3 Dimensionality Expansion

We expand the dimensionality of the representation vectors of the clusters if \(l < d\) holds. We evaluated Micro-F1 and Macro-F1 scores to show the effectiveness of this approach by performing an ablation study on the CoCit (CC) dataset with \(d=256\). Note that the number of clusters obtained from this dataset by the IncMod method is 131; thus, \(l < d\) holds. Table 4 shows the results, where "W/O expansion" represents the approach that does not expand the dimensionality by projecting the representation vectors. We set the training ratio to \(5\%\).

Table 4 shows that the expansion improves the Micro-F1 and Macro-F1 scores by 2.5 and \(3.0\%\), respectively. If \(l <d\) holds, the representation vector of each cluster is shorter than the representation vectors of the nodes, so the approach without expansion has difficulty in effectively exploiting the representation vectors of the clusters. On the other hand, we compute representation matrix \({\textbf{R}}\) as \({\textbf{R}}=\textbf{U}\varvec{\Sigma }^\frac{1}{2} {\textbf{E}}\), where \({\textbf{E}}\) is the expansion matrix. Note that, letting \({\textbf{U}}^\prime =\textbf{U}\varvec{\Sigma }^\frac{1}{2}\) and \({\textbf{u}}_i^\prime \) be the ith column vector of \({\textbf{U}}^\prime \), we have \(({\textbf{u}}_i^\prime )^\top {\textbf{u}}_j^\prime = 0\) if \(i \ne j\). Similarly, as shown in Lemma 1, the column vectors of \({\textbf{R}}\) are orthogonal to each other in expectation. This indicates that this property of the columns of \({\textbf{U}}^\prime \) is preserved in \({\textbf{R}}\) even after dimensionality expansion. Therefore, we can improve accuracy by expanding the dimensionality.

Table 4 Results of the expansion approach

5.4 SVD Computation

If \(l \ge d\), we compute the rank-d SVD of the similarity matrix between clusters. Our approach obtains the basic matrix without orthonormalization by exploiting a sparse matrix to compute SVD efficiently. In this experiment, we evaluated the processing time of SVD with \(d=256\). Figure 3 shows the results, where "SVD" represents the approach that naively computes SVD. "Randomized" is the existing efficient approach that computes the basic matrix from a randomized matrix by performing orthonormalization [11]. In this experiment, we used the com-DBLP (DBLP), YouTube (YT), com-LiveJournal (LJ), and com-Friendster (FS) datasets since the numbers of clusters in these datasets are 1245, 3391, 3079, and 14249, respectively; \(d \le l\) holds for these datasets. Table 5 shows the Micro-F1 and Macro-F1 scores of each approach. In this experiment, we set the training ratio to \(5\%\).

Figure 3 and Table 5 show that our approach reduces the computation time without sacrificing accuracy; it is up to 5.4 and 830.8 times faster than the existing randomized approach and the naive SVD approach, respectively. Naive SVD requires greater computation time than the other approaches since it takes \(O(l^3)\) time to compute the representation vectors of the clusters by factorizing the \(l \times l\) similarity matrix. To improve the efficiency of SVD, the randomized approach computes SVD on a smaller \(l \times d\) matrix after projecting the similarity matrix using the \(l \times d\) basic matrix. However, it needs \(O(l^2d + ld^2)\) time to obtain the basic matrix by performing orthonormalization. Instead, the proposed approach efficiently computes the basic matrix using the sparse matrix given in Eq. (18). Specifically, our approach obtains the basic matrix in \(O(l \log d)\) time and computes SVD on the smaller \(l \times d\) matrix \({\textbf{S}}^\prime \), as shown in Algorithm 2. Besides, the proposed approach can effectively approximate the similarity matrix, as shown in Lemmas 4 and 5. Therefore, our approach can efficiently and accurately compute SVD on the similarity matrix.

Fig. 3 Computation time of each SVD approach

Fig. 4 Efficiency of each graph clustering approach

Table 5 Micro-F1 and Macro-F1 of each SVD approach
Table 6 Micro-F1 and Macro-F1 of each graph clustering approach

5.5 Graph Clustering Approach

As described in Sect. 4.1, the proposed approach uses the IncMod method [26] to compute the representation vectors of the clusters. In this experiment, we used the Louvain method [3] instead of the IncMod method as the graph clustering approach. Figure 4 shows the network representation learning time, and Table 6 shows the Micro-F1 and Macro-F1 scores, where "Louvain" is the result of the Louvain method-based approach. In this experiment, we set \(d=128\) and the training ratio to \(5\%\).

As shown in Fig. 4, the proposed approach is up to 19.0 times faster than the Louvain method-based approach. This is because the IncMod method is faster than the Louvain method. On the other hand, in terms of accuracy, the Micro-F1 and Macro-F1 scores of the Louvain method-based approach are not so different from those of the proposed approach. This is because the IncMod and Louvain methods are both modularity-based graph clustering approaches and yield almost the same clustering results [26]. These results indicate that we can improve efficiency by using the IncMod method without sacrificing accuracy.

6 Conclusions

This paper addressed the problem of improving the efficiency and accuracy of network representation learning. Our approach performs graph clustering only once and factorizes the similarity matrix between clusters to capture the structural property of the graph. Experiments show that our approach is more efficient than existing approaches while achieving higher accuracy.