Efficient Network Representation Learning via Cluster Similarity

Network representation learning is a de facto tool for graph analytics. The mainstream of previous approaches is to factorize a proximity matrix between nodes. However, if n is the number of nodes, the proximity matrix has size n × n, so these approaches need O(n^3) time and O(n^2) space to perform network representation learning; these costs are prohibitive for large-scale graphs. This paper introduces the novel idea of using similarities between clusters instead of proximities between nodes: the proposed approach computes the representations of the clusters from the similarities between clusters and derives the representations of nodes from them.
If l is the number of clusters, since l ≪ n, we can efficiently obtain the representations of clusters from a small l × l similarity matrix. Furthermore, since nodes in each cluster share similar structural properties, we can effectively compute the representation vectors of nodes. Experiments show that our approach performs network representation learning more efficiently and effectively than existing approaches.


Introduction
Recent advances in social and information science have shown that linked data pervades our society and the natural world around us. A graph is an important data structure for representing objects and their relationships, and many real-world applications can be naturally modeled as graphs [8,12,17,20]. Network representation learning converts each graph node to a fixed-length vector such that the representation vectors preserve the inherent properties and structures of the graph. Although it is difficult to directly feed the nodes of a graph into well-studied methods in the vector space, it is easy to subject the representation vectors to feature-based machine learning methods such as LIBLINEAR. Therefore, network representation learning has become a fundamental task in graph analytics [16].
One pioneering work on network representation learning is DeepWalk [22]. It generates multiple node sequences by random walks and runs the SkipGram algorithm [18] to compute node representations. Following the introduction of DeepWalk, many network representation learning approaches have been proposed, such as LINE [28], PTE [27], and node2vec [10]. A previous study revealed that DeepWalk theoretically factorizes a matrix derived from the random walk process between nodes [14]. Therefore, DeepWalk and its variants can be viewed as implicitly factorizing a closed-form matrix obtained by random walks. NetMF is a matrix factorization-based network representation learning approach based on this observation [25]. It achieves higher node classification accuracy than DeepWalk and its variants. However, NetMF needs to factorize a dense n × n proximity matrix between nodes, where n is the number of nodes, and thus needs O(n^3) time and O(n^2) space. This is because almost all pairs of nodes are reachable within a specified length of random walks, making it prohibitively expensive to directly construct and factorize the matrix for large-scale graphs. Although several approaches have been proposed to reduce the computational cost, such as RandNE [31], FastRP [5], FREDE [29], LightNE [23], and REFINE [32], their computation time is still high for large-scale graphs since they must perform expensive matrix computations. On the other hand, several researchers have proposed exploiting node connectivity relationships to perform network representation learning effectively. For example, Bhowmick et al.
proposed LouvainNE, which recursively constructs a hierarchy of subgraphs using the Louvain method, a graph clustering approach [3]. It represents clusters as random vectors and aggregates them to generate the representation vectors of the nodes. However, since it needs to iterate graph clustering to obtain the hierarchy structure, LouvainNE still incurs high computation costs for large-scale graphs. In addition, Mo et al. recently proposed truss2vec [19]. This approach is designed to represent the graph's local community properties and global network topology. Specifically, it uses the truss number to capture the local community properties effectively. The truss number of an edge is obtained from the maximum k-truss in which the edge is located; a k-truss is a maximal connected subgraph where each edge is in at least k − 2 triangles. Since the truss number is higher-order structural information, it captures local community properties more effectively than simple neighborhood information such as degree. Besides, truss2vec exploits a random walk strategy to capture the global network topology effectively. Unlike the sampling strategies used in previous approaches, this random walk strategy can effectively capture the similarity of nodes within the same community since it is based on the structural similarities obtained from the truss number. However, truss2vec needs a large computational time to obtain the structural vectors of nodes used to compute the structural similarities.
As described above, the technical challenge for the previous approaches is the high computational cost of performing network representation learning effectively on large-scale graphs. The contribution of this paper is an efficient and accurate network representation learning approach for large-scale graphs. In this paper, we propose Graph Clustering-based Network Representation Learning, GC-NRL. The proposed approach pays particular attention to matrix factorization and graph clustering. To achieve high efficiency and accuracy, our approach efficiently factorizes a small similarity matrix between clusters to effectively compute the representation vectors of nodes. Specifically, we perform graph clustering on a given graph and compute a similarity matrix between clusters. We then compute the representation vectors of the clusters by factorizing their similarity matrix with sparse matrices, and we determine the representation vectors of the nodes by referring to those of the clusters. Compared with matrix factorization-based approaches, since our approach uses the similarity matrix between clusters instead of the proximity matrix between nodes, it can compute the representation vectors efficiently. Compared with the previous graph clustering-based approach, since our approach performs graph clustering only once and uses the similarities between clusters, it improves efficiency. The proposed approach places nodes of the same cluster close together in the representation space by referring to the representation vectors of clusters. Even if nodes belong to different clusters, our approach can place them close together in the representation space if the clusters are well connected and have high similarity. Consequently, we can effectively compute the representation vectors. The main contributions are as follows:
• Our approach effectively computes representations of nodes by factorizing a similarity matrix between clusters to capture their structural relationships.
• The proposed approach expands and reduces the dimensionality of the representation vectors of the clusters efficiently and effectively by exploiting sparse matrices.
• Experiments confirm that our approach is several orders of magnitude faster than the previous approaches while achieving higher node classification accuracy.
In the remainder of this paper, Sect. 2 describes related work, Sect. 3 gives an overview of the background, Sect. 4 introduces our approach, Sect. 5 reviews our experimental results, and Sect. 6 provides conclusions. The preliminary version of this article was published in DASFAA 2023 [9].

Related Work
Recently, network representation learning has been extensively studied [13]. Its success has driven a wide range of downstream graph analytics tasks.
Inspired by the success of neural methods for learning word embeddings, particularly the SkipGram model [18], many network representation learning approaches have been proposed. They typically sample node pairs close to each other by performing random walks and compute node representations using the SkipGram algorithm. DeepWalk first proposed using truncated random walks to capture the network structure [22]. Specifically, it samples pairs of nodes from the graph by performing random walks and then feeds them to SkipGram to obtain node representations. LINE takes a similar idea with an explicit objective function by setting the walk length to one or two [28]. It uses the negative sampling strategy [18] to reduce the computation time. PTE is an extension of LINE to heterogeneous text networks [27]. node2vec generalizes these methods by taking potentially biased random walks for enhanced flexibility [10]. Since the random walk-based approaches usually rely on stochastic gradient descent in the learning process, a large number of node pairs must be sampled until convergence to ensure the quality of the node representations; this incurs high computation costs. Levy et al. proved that SkipGram with negative sampling implicitly factorizes a shifted pointwise mutual information matrix of word co-occurrences [14]. Based on a similar idea, Qiu et al.
showed that DeepWalk, LINE, PTE, and node2vec implicitly approximate and factorize a node proximity matrix, which is usually some transformation of the i-step adjacency matrix. Following these analyses, NetMF was proposed for network representation learning [25]. This approach is a generalized matrix factorization framework that uses SVD to unify several random walk-based approaches; it uses eigendecomposition to efficiently compute the proximity matrix. GraRep is another matrix factorization-based approach that directly applies SVD to preserve a high-order proximity matrix [4], while HOPE uses generalized SVD to preserve asymmetric transitivity in directed graphs [21]. However, matrix factorization-based approaches also suffer from high computation costs since they need to perform expensive matrix factorization operations.
Several works have been proposed to speed up network representation learning. NetSMF, proposed by Qiu et al., is an efficient network representation learning approach based on NetMF [24]. It uses theories from spectral sparsification to sparsify the proximity matrix between nodes. To reduce computation time, it uses a path-sampling approach to efficiently obtain the proximity matrix used in NetMF; in addition, it uses a randomized approach to reduce the high computation cost of SVD. RandNE is a Gaussian random projection method for network representation learning [31]. Specifically, it maps the graph into a low-dimensional representation space using Gaussian random projection while preserving the high-order proximities between nodes, and it uses an iterative projection procedure that avoids explicit computation of the high-order proximities to reduce the computational cost. FastRP is another efficient approach to network representation learning [5]. It explicitly constructs a node proximity matrix that captures the structural relationships in the graph and normalizes the matrix entries based on node degree; it then applies random projection to the proximity matrix to efficiently obtain node representations. REFINE is an iterative projection approach that adds an orthogonal constraint on the representation vectors based on randomized blocked QR with power iteration [32]. However, these approaches still need significant computation time since they perform expensive matrix computations to obtain the representation vectors. FREDE uses personalized PageRank to obtain proximities between nodes and exploits the frequent-directions sketching process to efficiently compute SVD [29]. However, it needs a high computational cost to compute personalized
PageRank. LouvainNE recursively constructs a hierarchy of subgraphs using the Louvain method to coarsen a large graph into smaller clusters [2]; the Louvain method is a graph clustering approach [3]. It obtains a representation of each subgraph at different hierarchy levels and aggregates them to generate the representation vectors. Since LouvainNE needs to iterate graph clustering to obtain the hierarchy structure, it still incurs high computational costs. To efficiently perform network representation learning, Fahrbach et al. proposed RandomContraction [6] and Lin et al. proposed GPA [15]. However, they are preprocessing and initialization approaches, respectively, and are not used stand-alone; they need to be combined with other representation learning approaches such as LINE, NetMF, NetSMF, node2vec, and DeepWalk. As a result, these approaches also incur high computational costs.

Preliminaries
We introduce the background to this paper below. Table 1 lists the main symbols and their definitions. The problem of network representation learning for general graphs is formalized as follows: given graph G = (V, E), where V is the set of n nodes and E is the set of m edges, network representation learning computes a low-dimensional representation vector x_v of dimension d for each node v ∈ V, where d ≪ n is the predefined number of dimensions. Representation vector x_v is set so as to capture the structural property of node v.
NetMF is the most popular matrix factorization-based approach [25]. It is a generalized framework that uses SVD to unify several random walk-based approaches such as DeepWalk [22], LINE [28], and PTE [27]. NetMF computes the following n × n high-order proximity matrix between nodes:

M = log(max(vol(G)/(bT) (∑_{r=1}^{T} (D^{-1}A)^r) D^{-1}, 1))   (1)

In this equation, log(⋅) is an elementwise logarithm and max(⋅, 1) is an elementwise truncation, vol(G) = ∑_{1≤i,j≤n} A[i, j] is the volume of the graph, where A[i, j] is the [i, j] element of the adjacency matrix A corresponding to the edge weight from the jth node to the ith node, b is the number of negative samples [25], T is the window size, and D is the diagonal degree matrix D = diag(A1_n), where 1_n is a vector of length n with all ones. NetMF obtains the representation vectors from the d left singular vectors and the first d singular values after computing the SVD of matrix M. However, it needs O(n^3) time and O(n^2) space to compute the n × n matrix M. Furthermore, NetMF degrades the quality of the representation vectors by truncating the small nonzero elements of the proximity matrix.
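As a concrete illustration, the NetMF computation described above can be sketched naively in a few lines of NumPy. This is only an illustrative sketch on a dense toy graph, not the authors' implementation; it makes the O(n^2)-space, O(n^3)-time bottleneck of the dense matrix M visible.

```python
import numpy as np

def netmf_embedding(A, T=10, b=1, d=2):
    """Naive sketch of Eq. (1): build the dense high-order proximity
    matrix M and factorize it with SVD.  Dense O(n^2) storage and
    O(n^3) SVD -- exactly the bottleneck discussed in the text."""
    n = A.shape[0]
    vol = A.sum()                          # vol(G): sum of all edge weights
    D_inv = np.diag(1.0 / A.sum(axis=1))   # D^{-1}, D = diag(A 1_n)
    P = D_inv @ A                          # one-step transition matrix
    # Sum the r-step transition matrices for r = 1..T (window size T).
    S = np.zeros((n, n))
    P_r = np.eye(n)
    for _ in range(T):
        P_r = P_r @ P
        S += P_r
    M = (vol / (b * T)) * S @ D_inv
    M = np.log(np.maximum(M, 1.0))         # truncated elementwise logarithm
    U, sigma, _ = np.linalg.svd(M)
    return U[:, :d] * np.sqrt(sigma[:d])   # d left singular vectors, scaled

# Toy usage: a 4-node cycle with unit weights.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = netmf_embedding(A, T=2, b=1, d=2)
print(X.shape)  # (4, 2)
```

On real graphs n is large, so M cannot even be materialized; this is the motivation for the cluster-level approach proposed next.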

Proposed Method
Instead of proximities between nodes, we exploit similarities between clusters to reduce the computational cost, since the number of clusters is much smaller than the number of nodes [26]. Since similarities between clusters represent the structural property of the graph, we can effectively generate the representation vectors by using them. Letting c_i be the ith cluster, we perform graph clustering and compute the representation vector of each c_i by using a similarity matrix between clusters. We then compute the representation vectors of the nodes from the representation vectors of the clusters. Let W be the row-normalized adjacency matrix and r_{c_i} be the representation vector of c_i; we compute the representation vector of node v as

x_v = ∑_{i=1}^{l} (∑_{w∈ℂ_i} W[v, w]) r_{c_i}

In this equation, l is the number of clusters, ℂ_i is the node set included in c_i, and W[v, w] is the edge weight between nodes v and w.
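The node-level aggregation above can be sketched as follows; the cluster assignment and the cluster vectors R are toy inputs assumed purely for illustration.

```python
import numpy as np

def node_vectors(W, clusters, R):
    """Sketch of the node-level step: W is the row-normalized n x n
    adjacency matrix, clusters[i] is the node set of cluster c_i, and
    R[i] is the cluster representation vector r_{c_i}.  Each node
    vector is the sum of cluster vectors weighted by the node's edge
    weights into each cluster, per the equation above."""
    n, d = W.shape[0], R.shape[1]
    X = np.zeros((n, d))
    for i, C in enumerate(clusters):
        # sum_{w in C_i} W[v, w] for every node v at once
        weight_into_ci = W[:, list(C)].sum(axis=1)
        X += np.outer(weight_into_ci, R[i])
    return X

# Toy usage: 4 nodes, 2 clusters, 3-dimensional cluster vectors.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)   # row normalization
clusters = [{0, 1}, {2, 3}]
R = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
X = node_vectors(W, clusters, R)
print(X[0])  # node 0 splits its weight evenly: [0.5 0.5 0. ]
```

Because W is row-normalized, each node vector is a convex combination of the cluster vectors, so nodes of the same cluster land close together in the representation space.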

Similarity Between Clusters
To compute representation vector r_{c_i}, we factorize the similarity matrix between clusters. Our approach uses the IncMod method for graph clustering since it can compute clusters efficiently [26]. Note that, since the IncMod method can handle both undirected and directed graphs, our approach can handle both as well. The IncMod method automatically sets the number of clusters l based on the graph structure; the number of clusters cannot be specified by the user.
Our approach determines the similarity matrix between clusters from the difference from a random graph. Specifically, if S is the l × l similarity matrix between clusters, we define its elements as follows:

S[i, j] = (1/vol(G)) ∑_{v∈ℂ_i} ∑_{w∈ℂ_j} A[v, w] − p_i p_j   (2)

In this equation, p_i = (1/vol(G)) ∑_{v∈ℂ_i} ∑_{w∈V} A[v, w] is the ratio of edge weights connected to cluster c_i. If we assume that G is a random graph, the second term on the right side of Eq. (2), p_i p_j, corresponds to the expected ratio of edge weights connecting cluster c_i to c_j. On the other hand, the first term corresponds to the ratio of edge weights actually connecting cluster c_i to c_j. Therefore, S[i, j] is positive if c_i and c_j are well connected compared to a random graph; otherwise, it is negative. As a result, S effectively represents the structural relationships between the clusters. Therefore, even if nodes are included in different clusters, our approach can place the nodes close together in the representation space if their clusters have high similarity. Note that both terms of Eq. (2) lie between 0 and 1, since 0 ≤ ∑_{v∈ℂ_i} ∑_{w∈ℂ_j} A[v, w] ≤ vol(G) and 0 ≤ p_i, p_j ≤ 1. As a result, the range of S[i, j] is given as −1 ≤ S[i, j] ≤ 1. We provide an example of computing the elements of S. As shown in Fig. 1(1-1), an example graph has seven nodes in two clusters, c_1 and c_2. Cluster c_1 includes nodes v_1, v_2, and v_3; cluster c_2 includes nodes v_4, v_5, v_6, and v_7. The nodes within c_1 and within c_2 are directly connected to each other, and the two clusters are joined by a single bidirectional edge, as shown in Fig. 1(1-1). We assume all edge weights are 1 in the graph, so the adjacency matrix is as given in Fig. 1(1-2). We then have vol(G) = 20, p_1 = 7/20, and p_2 = 13/20, and therefore S[1, 1] = 6/20 − (7/20)^2 = 0.1775 and S[2, 2] = 12/20 − (13/20)^2 = 0.1775. This indicates that the similarity given by Eq. (2) can capture the property of the clusters: c_1 and c_2 are the same in terms of having nodes directly connected to each other and a single bidirectional edge between the clusters, and they obtain the same similarity score.
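Assuming the modularity-style definition of S sketched above, the Figure 1 example can be checked numerically; the cluster memberships and unit edge weights follow the description in the text.

```python
import numpy as np

# Figure 1 example: cluster c1 = {v1, v2, v3} (fully connected),
# cluster c2 = {v4, ..., v7} (fully connected), plus one
# bidirectional edge between the two clusters.
n = 7
A = np.zeros((n, n))
c1, c2 = [0, 1, 2], [3, 4, 5, 6]
for C in (c1, c2):
    for v in C:
        for w in C:
            if v != w:
                A[v, w] = 1.0
A[2, 3] = A[3, 2] = 1.0            # the single inter-cluster edge

vol = A.sum()                       # vol(G) = 6 + 12 + 2 = 20
clusters = [c1, c2]
l = len(clusters)
# p_i: ratio of edge weights connected to cluster c_i
p = np.array([A[C, :].sum() / vol for C in clusters])
# S[i, j]: observed weight ratio between c_i and c_j minus its
# random-graph expectation p_i * p_j (the Eq. (2) form assumed here).
S = np.zeros((l, l))
for i, Ci in enumerate(clusters):
    for j, Cj in enumerate(clusters):
        S[i, j] = A[np.ix_(Ci, Cj)].sum() / vol - p[i] * p[j]
print(S)
```

Both clusters get the same self-similarity 0.1775 despite having different sizes, which is the "same structural property" behavior described in the text.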
Our approach uses SVD on S to compute the representation vectors of the clusters. Specifically, we decompose S as S = UΣV^⊤ and compute representation matrix R as R = UΣ^{1/2}. If r_i is the ith row vector of R, we use r_i as the representation vector of the ith cluster. However, using the row vectors of R directly poses a problem in computing the representation vectors of the clusters. Since the size of S is l × l, the length of r_i is l. Therefore, if l < d, the representation vectors of the clusters are shorter than the representation vectors of the nodes, which have length d. As a result, it is difficult to effectively use the representation vectors of the clusters to compute those of the nodes.

Dimensionality Expansion
If l < d, our approach expands the dimensionality of the representation vectors of the clusters by exploiting a sparse matrix. Let E be the l × d expansion matrix and e_i be the ith column vector of E; we set the elements of e_i as follows:

e_i[i'] = +√(l/log l) with probability (log l)/(2l), e_i[i'] = −√(l/log l) with probability (log l)/(2l), and e_i[i'] = 0 otherwise.   (11)

To obtain R, we project the low-dimensional matrix UΣ^{1/2} into a high-dimensional space by using E. Specifically, we expand the dimensionality of the representation vectors by computing R = UΣ^{1/2}E, as shown in Algorithm 1. Let r_i be the ith column vector of matrix R; we have the following property for R:

Lemma 1  If i ≠ j, we have E(r_i^⊤ r_j) = 0, where E(⋅) represents expectation.

Proof  Since the columns of U are orthonormal, we have

r_i^⊤ r_j = e_i^⊤ Σ e_j = ∑_{i'=1}^{l} σ_{i'} e_i[i'] e_j[i'],   (12)

where σ_{i'} is the i'th singular value of S. From Eq. (11), each product e_i[i']e_j[i'] takes the value ±l/log l with probability (log l)^2/(2l^2) each and 0 otherwise, so E(e_i[i']e_j[i']) = 0. Therefore, we have E(r_i^⊤ r_j) = 0 from Eq. (12). ◻
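A minimal sketch of the sparse expansion matrix, assuming the ±√(l/log l) entry distribution reconstructed above as Eq. (11); `expansion_matrix` is an illustrative helper name, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def expansion_matrix(l, d, rng):
    """Sparse l x d expansion matrix E (the Eq. (11) form assumed
    here): each entry is +/- sqrt(l / log l) with probability
    (log l) / (2l) each, and 0 otherwise, so entries are zero-mean
    with unit variance."""
    p = np.log(l) / (2 * l)
    u = rng.random((l, d))
    E = np.zeros((l, d))
    E[u < p] = np.sqrt(l / np.log(l))
    E[(u >= p) & (u < 2 * p)] = -np.sqrt(l / np.log(l))
    return E

l, d = 64, 256                     # l < d, so we expand
E = expansion_matrix(l, d, rng)
# E is sparse: about (log l / l) * l * d = d log l nonzeros in
# expectation, so the product U Sigma^{1/2} E is cheap to form.
nnz = np.count_nonzero(E)
print(E.shape, nnz)
```

Because the entries are independent and zero-mean, distinct columns of E (and hence of R = UΣ^{1/2}E) are orthogonal in expectation, which is the content of Lemma 1.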

Algorithm 1 Effective Vector Computation
Input: matrix UΣ^{1/2} of size l × l; number of dimensions d
Output: representation matrix R of size l × d
1: generate the sparse l × d expansion matrix E according to Eq. (11);
2: compute R = UΣ^{1/2}E;
3: return R;

The column vectors of UΣ^{1/2} are orthogonal to each other since SVD produces orthogonal matrices. Specifically, let u'_i be the ith column vector of U' = UΣ^{1/2}; we have (u'_i)^⊤u'_j = 0 if i ≠ j, which is a necessary condition for preserving pairwise similarities between vectors [1]. On the other hand, Lemma 1 shows that the columns of matrix R are orthogonal to each other in expectation, the same as those of matrix U'. Consequently, Lemma 1 indicates that we can preserve this preferable property of the representation vectors even after dimensionality expansion. In terms of the quality of the dimensionality expansion, we have the following lemma:

Lemma 2  Let V(⋅) represent variance; if i ≠ j, the following equation holds:

V(r_i^⊤ r_j) = ∑_{i'=1}^{l} σ_{i'}^2.

Proof  From Eq. (12), since the entries of e_i and e_j are mutually independent, we have

V(r_i^⊤ r_j) = ∑_{i'=1}^{l} σ_{i'}^2 V(e_i[i'] e_j[i']).

Since i ≠ j, e_i[i'] and e_j[i'] are independent, and from Eq. (11), (e_i[i'])^2 = l/log l with probability (log l)/l and (e_i[i'])^2 = 0 otherwise. As a result, we have

V(e_i[i'] e_j[i']) = E((e_i[i'])^2) E((e_j[i'])^2) = ((l/log l) ⋅ (log l)/l)^2 = 1,

which completes the proof. □

As shown in Lemma 2, since V(r_i^⊤ r_j) is the cumulative summation of l elements, it has a small value when the number of clusters l is small. Besides, as shown in Lemma 2, V(r_i^⊤ r_j) is independent of the number of dimensions d. This indicates that the column vectors of R retain this preferable property for the representation vectors regardless of the expanded dimensionality.

SVD Computation
The previous section described the approach for the case of l < d. If l ≥ d, we can obtain the representation vectors of the clusters with length d by computing the rank-d SVD of S. However, since the computation cost of SVD is O(l^3), it is impractical to compute the SVD if the number of clusters is large. As shown in the previous network representation learning approach [24], a randomized approach can reduce the computation time of SVD [11]. This approach derives a basic matrix from a randomized matrix to compute the SVD. However, it still incurs a high computation cost to obtain the basic matrix by performing orthonormalization [30]; it needs O(l^2 d + ld^2) time if applied to S. Furthermore, its memory cost is O(l^2) to hold S, which is quadratic in the number of clusters.
To efficiently compute the SVD, we use an l × d basic matrix B whose ith row vector, b_i, is set as follows:

b_i[j] = +1/√(log d) with probability (log d)/(2d), b_i[j] = −1/√(log d) with probability (log d)/(2d), and b_i[j] = 0 otherwise.   (18)

Unlike the previous approach, our approach does not perform orthonormalization to obtain the basic matrix. We use matrix B to compute the SVD; Algorithm 2 details the procedure. It uses basic matrix B to project the l × l large matrix S into an l × d low-dimensional matrix S' as S' = SB. However, since the size of S is l × l, directly holding S requires a high memory cost. To reduce the memory cost, our approach processes the row vectors of S one by one. Specifically, let s_i be the ith row vector of S and s'_i be the ith row vector of S'; we compute the row vectors as s'_i = s_i B, as shown in Algorithm 2. Since it does not directly hold S, we can reduce the memory cost of computing the SVD. Let S' = UΣV^⊤ be the SVD of S' and V' = BV; then Algorithm 2 implicitly factorizes the following l × l matrix S̃:

S̃ = UΣV'^⊤ = S'B^⊤ = SBB^⊤.   (19)

We have the following property for matrix S̃:

Lemma 4  E(S̃) = S.

Proof  From Eq. (19), we have S̃[i, j] = ∑_{j'=1}^{l} S[i, j'] (BB^⊤)[j', j], where (BB^⊤)[j', j] = ∑_{k=1}^{d} b_{j'}[k] b_j[k]. From Eq. (18), the entries of B are independent and zero-mean with E((b_j[k])^2) = (1/log d) ⋅ (log d)/d = 1/d. Therefore, if j' = j, we have E((BB^⊤)[j, j]) = ∑_{k=1}^{d} E((b_j[k])^2) = 1. If j' ≠ j, we have E(b_{j'}[k] b_j[k]) = E(b_{j'}[k]) E(b_j[k]) = 0, and thus E((BB^⊤)[j', j]) = 0. As a result, E(S̃[i, j]) = S[i, j], which completes the proof. □

As shown in Eq. (19), Algorithm 2 exactly computes matrix S̃, and this lemma indicates that we can effectively approximate S by S̃. Concerning the approximation quality, we have the following property:

Lemma 5  The variance of each element of S̃ is bounded and decreases as the number of dimensions d increases.

Proof sketch  Expanding V(S̃[i, j]) over the entries of B as in the proof of Lemma 4 and using the moments given by Eq. (18), every term of the variance carries a factor of 1/log d or 1/d, so the variance shrinks as d grows. □
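A sketch of the projection-based SVD under the sparse basic matrix of Eq. (18) as reconstructed above; the 1/√(log d) scaling is chosen here so that E(BB^⊤) = I, and both function names are illustrative rather than from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparse_basic_matrix(l, d, rng):
    """l x d basic matrix B (the Eq. (18) form assumed here): entries
    are +/- 1/sqrt(log d) with probability (log d)/(2d) each and 0
    otherwise, so E[B B^T] = I and only about l log d entries are
    nonzero.  No orthonormalization step is performed."""
    p = np.log(d) / (2 * d)
    u = rng.random((l, d))
    B = np.zeros((l, d))
    B[u < p] = 1.0 / np.sqrt(np.log(d))
    B[(u >= p) & (u < 2 * p)] = -1.0 / np.sqrt(np.log(d))
    return B

def randomized_cluster_svd(S, d, rng):
    """Project S onto B one row at a time (so the full S never needs
    to be materialized in a real implementation), then take the SVD
    of the small l x d matrix S' = S B and return R = U Sigma^{1/2}."""
    l = S.shape[0]
    B = sparse_basic_matrix(l, d, rng)
    S_proj = np.vstack([S[i] @ B for i in range(l)])   # s'_i = s_i B
    U, sigma, _ = np.linalg.svd(S_proj, full_matrices=False)
    return U * np.sqrt(sigma)

l, d = 500, 32                      # l >= d, so projection applies
S = rng.standard_normal((l, l))
S = (S + S.T) / 2                   # symmetric stand-in for the similarity matrix
R = randomized_cluster_svd(S, d, rng)
print(R.shape)  # (500, 32)
```

The SVD is now computed on an l × d matrix instead of l × l, which is where the speedup over the naive and orthonormalized randomized variants comes from.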

Representation Learning Algorithm
Algorithm 3 gives a full description of our algorithm. It first identifies the clusters using the IncMod method (line 1). If the number of clusters, l, is smaller than the number of dimensions, d, it computes the clusters' representation vectors with Algorithm 1 (lines 2-3); otherwise, it computes them with Algorithm 2 (lines 4-5). It then computes the representation vectors of the nodes from the obtained representation vectors of the clusters (lines 6-7). Since Algorithm 3 performs graph clustering only once and factorizes only the small l × l similarity matrix, its computational and memory costs remain low.
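Putting the pieces together, Algorithm 3 can be sketched end to end. The clusters are passed in as an input here because the paper obtains them with the IncMod method, which is not reimplemented; this is a simplified stand-in under the definitions assumed in the previous subsections, not the actual GC-NRL implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def gcnrl_sketch(A, clusters, d, rng):
    """End-to-end sketch of Algorithm 3: cluster similarity matrix S,
    SVD for cluster vectors, sparse expansion when l < d, and the
    node-level aggregation over the row-normalized adjacency."""
    vol = A.sum()
    l = len(clusters)
    p = np.array([A[C, :].sum() / vol for C in clusters])
    S = np.array([[A[np.ix_(Ci, Cj)].sum() / vol - p[i] * p[j]
                   for j, Cj in enumerate(clusters)]
                  for i, Ci in enumerate(clusters)])
    U, sigma, _ = np.linalg.svd(S)
    R = U * np.sqrt(sigma)                       # cluster vectors, length l
    if l < d:                                    # expand with sparse E
        q = np.log(l) / (2 * l)
        u = rng.random((l, d))
        E = np.where(u < q, np.sqrt(l / np.log(l)),
                     np.where(u < 2 * q, -np.sqrt(l / np.log(l)), 0.0))
        R = R @ E
    else:                                        # keep the first d dimensions
        R = R[:, :d]
    W = A / A.sum(axis=1, keepdims=True)         # row-normalized adjacency
    X = np.zeros((A.shape[0], d))
    for i, C in enumerate(clusters):
        X += np.outer(W[:, list(C)].sum(axis=1), R[i])
    return X

# Toy usage on the two-cluster example graph from Fig. 1.
A = np.zeros((7, 7))
c1, c2 = [0, 1, 2], [3, 4, 5, 6]
for C in (c1, c2):
    for v in C:
        for w in C:
            if v != w:
                A[v, w] = 1.0
A[2, 3] = A[3, 2] = 1.0
X = gcnrl_sketch(A, [c1, c2], d=8, rng=rng)
print(X.shape)  # (7, 8)
```

Graph clustering happens once, the factorized matrix is only l × l, and the per-node work is a sparse weighted sum, which is why the overall cost stays low.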

Experimental Evaluation
This section compares our approach with the previous approaches: FastRP [5], REFINE [32], RandNE [31], FREDE [29], LightNE [23], LouvainNE [2], NetMF [25], NetSMF [24], and truss2vec [19]. As shown in Table 2, we used five real-world graphs: CoCit (CC), com-DBLP (DBLP), YouTube (YT), com-LiveJournal (LJ), and com-Friendster (FS). For NetMF, we set the target rank of eigendecomposition to 1,024 as in [25]. For NetMF and NetSMF, we set the number of negative samples to 20, as shown in [18]. We set the window size used in NetMF, NetSMF, FastRP, REFINE, RandNE, and LightNE to ten, following [25]. We set the number of nodes from which we compute personalized PageRank to 1,000 for FREDE. For RandNE and FastRP, we set the weights used in the high-order proximity matrices to one, the same as in [25]. For REFINE and LightNE, we set the number of diffusion steps to two, following the previous paper [32]. For LouvainNE, we set the damping parameter to 0.01 following [2]. For truss2vec, we set the number of hops to two, the length of random walks to 80, the number of random walks starting at each node to ten, and the attenuation factor to 0.6, following [19]. We used the same programming language, C++, to implement all approaches examined. We conducted the experiments on a Linux server with an Intel Xeon Platinum 8280 CPU (2.70 GHz) and 1.5 TB of memory.

Network Representation Learning Time
We evaluated the network representation learning time of each approach. Figure 2 plots the processing time to compute the representation vectors from the given graphs. This experiment set the number of dimensions to d = 128. For DBLP, YT, LJ, and FS, we omit the results of NetMF since it failed to compute the representation vectors due to the lack of memory space. As shown in Fig. 2, our approach offers higher efficiency than the previous approaches; it is up to 5.5, 43.7, 54.6, 77.3, 99.9, 111.5, 188.6, 601.0, and 293,071.0 times faster than LouvainNE, FastRP, NetSMF, REFINE, RandNE, LightNE, truss2vec, FREDE, and NetMF, respectively. NetMF incurs a high computational cost to apply eigendecomposition to the proximity matrix since the matrix has O(n^2) nonzero elements. To reduce the computation cost of NetMF, NetSMF uses the path-sampling approach; however, the path-sampling approach needs a large number of random walks, m, to assure the approximation quality of the similarity matrix. RandNE and REFINE incur high computation costs to obtain the orthogonal matrix used in the iterative projection procedure. LightNE also incurs high computation costs to perform orthonormalization for the basic matrix used in SVD. FastRP incurs high computation costs since it recursively performs expensive matrix computations to obtain the representation vectors. FREDE needs a high computational cost to compute personalized PageRank and SVD iteratively. The computation cost of LouvainNE is high since the Louvain method is iteratively performed to obtain the hierarchical structure. If h is the number of hops used to extract the neighbors of nodes, truss2vec needs O(n(m/n)^h) time to obtain high-order structural information of nodes; this increases exponentially with the number of hops, so truss2vec requires high computation costs. On the other hand, the proposed approach factorizes the small l × l similarity matrix and performs graph clustering only once to efficiently generate the representation vectors.

Multi-label Node Classification
This experiment performed node classification. We used the one-vs-rest logistic regression model implemented in LIBLINEAR. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. We adopted the assumption made in DeepWalk: the number of labels for each node in the test data is given [22]. Table 3 shows the Micro-F1 and Macro-F1 scores, where we set the training ratio to 5%. We set the number of dimensions to d = 128.
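The top-k label-assignment protocol described above can be sketched as follows; the score matrix stands in for the output of the one-vs-rest classifier, and the F1 computation is a minimal NumPy version rather than the actual LIBLINEAR pipeline.

```python
import numpy as np

def topk_multilabel_predict(scores, k_per_node):
    """Sketch of the DeepWalk evaluation protocol: the one-vs-rest
    model yields a score ranking per node, and each test node is
    assigned its top-k labels, where k is that node's true number
    of labels (assumed given)."""
    n, L = scores.shape
    pred = np.zeros((n, L), dtype=bool)
    for v in range(n):
        top = np.argsort(scores[v])[::-1][:k_per_node[v]]
        pred[v, top] = True
    return pred

def micro_f1(y_true, y_pred):
    """Micro-F1 pools true/false positives over all (node, label) pairs."""
    tp = np.logical_and(y_true, y_pred).sum()
    fp = np.logical_and(~y_true, y_pred).sum()
    fn = np.logical_and(y_true, ~y_pred).sum()
    return 2 * tp / (2 * tp + fp + fn)

# Toy usage: 3 nodes, 4 labels.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]], dtype=bool)
scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.2, 0.7, 0.1, 0.3],
                   [0.6, 0.5, 0.4, 0.7]])
pred = topk_multilabel_predict(scores, y_true.sum(axis=1))
print(micro_f1(y_true, pred))  # 1.0 here: the rankings match the truth
```

Macro-F1 is computed analogously, averaging the per-label F1 scores instead of pooling the counts.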
Table 3 indicates that our approach yields higher Micro-F1 and Macro-F1 scores than the previous approaches. This is because, as described in Sect. 4.1, we exploit the structural similarity matrix to capture the relationships between clusters and compute the representation vectors of the clusters by factorizing their similarity matrix. NetMF applies the elementwise matrix logarithm to the proximity matrix; however, this harms the quality of the representations by cutting off small nonzero elements. Although NetSMF improves the efficiency of NetMF, the path-sampling approach used in NetSMF yields a sparse similarity matrix in which only m node pairs can have nonzero elements in the n × n matrix; therefore, NetSMF has difficulty representing the similarities between nodes effectively. Even though the base matrices used by FastRP, REFINE, and RandNE would be orthogonal, they do not accurately capture the structural property of nodes since the obtained representation vectors are no longer orthogonal after the iterative projection procedure. Although FREDE uses personalized PageRank, it fails to capture the structural property of nodes. Since the path-sampling approach used in LightNE yields a sparse proximity matrix in which at most m node pairs can have nonzero elements, it has difficulty representing the proximities between nodes effectively. Since LouvainNE does not exploit the relationships between clusters, it separates nodes independently in the representation space according to the clusters. Due to the small-world property, most nodes can be reached from each other in a small number of hops [7]; therefore, neighboring nodes have similar structural similarities in truss2vec, and it fails to capture the relationships between nodes effectively.

Dimensionality Expansion
We expand the dimensionality of the representation vectors of the clusters if l < d holds. We evaluated Micro-F1 and Macro-F1 scores to show the effectiveness of this approach by performing an ablation study on the CoCit (CC) dataset, where we set d = 256. Note that the number of clusters obtained from this dataset by the IncMod method is 131, so l < d holds. Table 4 shows the results, where "W/O expansion" represents the approach that does not expand the dimensionality by projecting the representation vectors. We set the training ratio to 5%.
Table 4 shows that dimensionality expansion improves the Micro-F1 and Macro-F1 scores by 2.5 and 3.0%, respectively. If l < d holds, since the representation vector of each cluster is shorter than the representation vectors of the nodes, the comparative approach has difficulty exploiting the representation vectors effectively. On the other hand, we compute representation matrix R as R = UΣ^{1/2}E, where E is the expansion matrix. Note that, letting U' = UΣ^{1/2} and u'_i be the ith column vector of U', we have (u'_i)^⊤u'_j = 0 if i ≠ j. Similarly, as shown in Lemma 1, the column vectors of R would be orthogonal to each other. This indicates that the property of the columns of U' is preserved in R even after dimensionality expansion. Therefore, we can improve accuracy by expanding the dimensionality.

SVD Computation
If l ≥ d, we compute the SVD of rank d on the similarity matrix between clusters. Our approach obtains the basic matrix without orthonormalization by exploiting a sparse matrix to compute the SVD efficiently. In this experiment, we evaluated the processing time of the SVD by setting d = 256. Figure 3 shows the result, where "SVD" represents the approach that naively computes the SVD, and "Randomized" is the existing efficient approach that computes the basic matrix from a randomized matrix by performing orthonormalization [11]. In this experiment, we used the com-DBLP (DBLP), YouTube (YT), com-LiveJournal (LJ), and com-Friendster (FS) datasets, since the numbers of clusters in these datasets are 1245, 3391, 3079, and 14249, respectively; thus, d ≤ l holds for these datasets. Table 5 shows the Micro-F1 and Macro-F1 scores of each approach. In this experiment, we set the training ratio to 5%.
Figure 3 and Table 5 show that our approach can reduce the computation time without sacrificing accuracy; it is up to 5.4 and 830.8 times faster than the randomized approach and the naive SVD, respectively. The naive SVD requires greater computation time than the other approaches since it takes O(l³) time to compute the representation vectors of the clusters by factorizing the l × l similarity matrix. To improve the efficiency of the SVD, the randomized approach computes the SVD on a smaller l × d matrix after projecting the similarity matrix using a basic matrix of size l × d. However, it needs O(l²d + ld²) time to obtain the basic matrix by performing orthonormalization. Instead, the proposed approach efficiently computes the basic matrix using a sparse matrix, as shown in Eq. (12). Specifically, our approach obtains the basic matrix in O(l log d) time and computes the SVD on a smaller l × d matrix, S′, as shown in Algorithm 2. Besides, the proposed approach can effectively approximate the similarity matrix, as shown in Lemmas 4 and 5. Therefore, our approach can efficiently and accurately compute the SVD of the similarity matrix.
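The cost argument above can be made concrete with a small sketch. The following assumes a sparse projection matrix B with about log d signed entries per column and the projection S′ = SB, in the spirit of Algorithm 2; the similarity matrix, sizes, and the exact construction of B are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
l, d = 500, 32   # l >= d: many clusters, low target dimensionality

# Stand-in for the l x l similarity matrix between clusters.
M = rng.standard_normal((l, l))
S = M @ M.T

# Sparse projection matrix B (l x d): each column holds only ~log d
# random +-1 entries, so it is obtained without the O(l^2 d + l d^2)
# orthonormalization a dense randomized basis would need.
nnz = max(1, int(np.log2(d)))   # 5 nonzeros per column here
B = np.zeros((l, d))
for j in range(d):
    rows = rng.choice(l, size=nnz, replace=False)
    B[rows, j] = rng.choice([-1.0, 1.0], size=nnz)

# SVD on the smaller l x d matrix S' costs O(l d^2) instead of the
# O(l^3) of a full SVD on the l x l similarity matrix.
S_small = S @ B
U, sigma, _ = np.linalg.svd(S_small, full_matrices=False)
R = U * np.sqrt(sigma)          # rank-d cluster representations
```

With a sparse B the projection touches only l·log d entries, which is where the O(l log d) basic-matrix cost in the text comes from.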

Graph Clustering Approach
As described in Sect. 4.1, the proposed approach uses the IncMod method [26] as the graph clustering approach to compute the representation vectors of clusters. In this experiment, we used the Louvain method [3] instead of the IncMod method as the graph clustering approach. Figure 4 shows the network representation learning time, and Table 6 shows the Micro-F1 and Macro-F1 scores; "Louvain" is the result of the Louvain method-based approach. In this experiment, we set d = 128 and the training ratio to 5%.
As shown in Fig. 4, the proposed approach is up to 19.0 times faster than the Louvain method-based approach. This is because the IncMod method is faster than the Louvain method. On the other hand, in terms of accuracy, the Micro-F1 and Macro-F1 scores of the Louvain method-based approach are not substantially different from those of the proposed approach. This is because the IncMod and Louvain methods are both modularity-based graph clustering approaches and yield almost the same clustering results [26]. The results of this experiment indicate that we can improve efficiency by using the IncMod method without sacrificing accuracy.

Conclusions
This paper addressed the problem of improving the efficiency and accuracy of network representation learning. We perform graph clustering just once and factorize the similarity matrix between clusters to capture the structural property of the graph. Experiments show that our approach is more efficient than existing approaches while achieving greater accuracy.

Data Availability
The datasets used are available from the corresponding author on reasonable request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
(10) S = [ 3.55 −3.55 ; −3.55 3.55 ]
… the clusters to compute the representation vectors of the nodes if l < d.

Fig. 1 Example of cluster similarity computation

Lemma 3
Algorithm 1 takes O(d log l + l³) time and O(ld) space for computing representation matrix R.

Proof It would take O(d log l) time to compute E. It takes O(l²) time to compute S. Besides, it needs O(l³) time to compute the SVD of S. It requires O(l² log l) time to compute R = UΣ^{1/2}E since each e_i has log l nonzero elements. It needs O(l²), O(d log l), and O(ld) space to hold S, E, and R, respectively. Therefore, Algorithm 1 needs O(d log l + l³) time and O(ld) space. □

…), we have …, which completes the proof. □

Lemma 5 indicates that V(S[i, j]) would be small as the dimensionality d of the representations increases. Therefore, Algorithm 2 can effectively compute the representation vectors as d increases. The computational and memory costs of Algorithm 2 are as follows:

Lemma 6
Algorithm 2 needs O(l² + ld²) time and O(ld) space for computing representation matrix R.

Proof Since each b_i would have log d nonzero elements, it would take O(l log d) time to compute B. It needs O(l²) time to compute S. Since each column of B has log d nonzero elements, it needs O(l log d) time to compute S′ = SB. It takes O(ld²) time to compute the SVD of S′ and O(ld) time to compute R = UΣ^{1/2}. Besides, it needs O(l) space to hold s_i and O(ld) space to hold B, S′, and R. As a result, Algorithm 2 takes O(l² + ld²) time and O(ld) space. □

7: compute x_v = ∑_{i=1}^{l} ∑_{w∈C_i} W[v, w] r_{c_i};

Theorem 1
Our approach takes O(m + d log l + l³) time and O(nd + m) space if l < d holds. Otherwise, it requires O(m + l² + ld²) time and O(nd + m) space.

Proof The IncMod method needs O(m) time and O(m) space [26]. If l < d, as shown in Lemma 3, it takes O(d log l + l³) time and O(ld) space to compute the representation vectors of the clusters. Otherwise, it needs O(l² + ld²) time and O(ld) space, as shown in Lemma 6. It needs O(m) time to compute the representation vectors of the nodes and O(nd) space to hold the representation vectors. As a result, our approach needs O(m + d log l + l³) time and O(nd + m) space if l < d holds; otherwise, it takes O(m + l² + ld²) time and O(nd + m) space. □
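The O(m) node-representation step, x_v = ∑_i ∑_{w∈C_i} W[v, w] r_{c_i}, amounts to one pass over the edges, adding the representation vector of each neighbor's cluster weighted by the normalized edge weight. The following is a toy sketch; the graph, cluster assignment, and cluster vectors are invented for illustration:

```python
import numpy as np

d = 4
cluster_of = [0, 0, 1, 1]                  # node -> cluster id (hypothetical)
r = np.array([[1.0, 0.0, 0.0, 0.0],        # representation vector of cluster 0
              [0.0, 1.0, 0.0, 0.0]])       # representation vector of cluster 1

# W[v][w]: row-normalized edge weights of a toy 4-node graph.
W = {0: {1: 0.5, 2: 0.5},
     1: {0: 1.0},
     2: {0: 0.5, 3: 0.5},
     3: {2: 1.0}}

# x_v = sum over neighbors w of W[v, w] * r_{cluster_of[w]}.
# Each of the m edges is touched once, so this step is O(m) overall.
x = np.zeros((4, d))
for v, nbrs in W.items():
    for w, weight in nbrs.items():
        x[v] += weight * r[cluster_of[w]]

# Node 0 has one neighbor in each cluster, so x_0 = 0.5*r_0 + 0.5*r_1.
```

Because nodes in a cluster share structural properties, each node's vector is a weighted mixture of a few cluster vectors rather than the result of an n × n factorization.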

Table 1
Definitions of main symbols:
A — n × n adjacency matrix of the graph
W — n × n row-normalized adjacency matrix of the graph
— Set of nodes
— Set of edges

Table 2
Characteristics

Table 4
Results of the expansion approach