Dynamical SimRank search on time-varying networks
Abstract
SimRank is an appealing pairwise similarity measure based on graph structure. It iteratively follows the intuition that two nodes are assessed as similar if they are pointed to by similar nodes. Many real graphs are large, and their links are constantly subject to minor changes. In this article, we study the efficient dynamical computation of all-pairs SimRank on time-varying graphs. Existing methods for dynamical SimRank computation [e.g., LTSF (Shao et al. in PVLDB 8(8):838–849, 2015) and READS (Zhang et al. in PVLDB 10(5):601–612, 2017)] mainly focus on top-k search with respect to a given query. For all-pairs dynamical SimRank search, Li et al.’s approach (Li et al. in EDBT, 2010) was proposed. It first factorizes the graph via a singular value decomposition (SVD) and then incrementally maintains this factorization in response to link updates, at the expense of exactness. As a result, all pairs of SimRank scores are updated approximately, yielding \(O({r}^{4}n^2)\) time and \(O({r}^{2}n^2)\) memory in a graph with n nodes, where r is the target rank of the low-rank SVD. Our solution to the dynamical computation of SimRank comprises five ingredients: (1) We first consider edge updates that do not accompany new node insertions. We show that the SimRank update \({\varvec{\Delta }}\mathbf{S}\) in response to every link update is expressible as a rank-one Sylvester matrix equation. This yields an incremental method requiring \(O(Kn^2)\) time and \(O(n^2)\) memory in the worst case to update all \(n^2\) pairs of similarities for K iterations. (2) To speed up the computation further, we propose a lossless pruning strategy that captures the “affected areas” of \({\varvec{\Delta }}\mathbf{S}\) to eliminate unnecessary retrieval.
This reduces the time of the incremental SimRank to \(O(K(m+{\textsf {AFF}}))\), where m is the number of edges in the old graph, and \({\textsf {AFF}} \ (\le n^2)\) is the size of the “affected areas” in \({\varvec{\Delta }}\mathbf{S}\); in practice, \({\textsf {AFF}} \ll n^2\). (3) We also consider edge updates that accompany node insertions, and categorize them into three cases, according to which end of the inserted edge is a new node. For each case, we devise an efficient incremental algorithm that can support new node insertions and accurately update the affected SimRank scores. (4) We next study batch updates for dynamical SimRank computation, and design an efficient batch incremental method that handles “similar sink edges” simultaneously and eliminates redundant edge updates. (5) To achieve linear memory, we devise a memory-efficient strategy that dynamically updates all pairs of SimRank scores column by column in just \(O(Kn+m)\) memory, without the need to store all \(n^2\) pairs of old SimRank scores. Experimental studies on various datasets demonstrate that our solution substantially outperforms the existing incremental SimRank methods and is faster and more memory-efficient than its competitors on million-scale graphs.
Keywords
Similarity search · SimRank computation · Dynamical networks · Optimization
1 Introduction
Problem
(Incremental SimRank Computation)
Given: an old digraph G, old similarities in G, link changes \(\Delta G\) ^{1} to G, and a damping factor \(C \in (0,1)\).
Retrieve: the changes to the old similarities.
Our research for the above SimRank problem is motivated by the following real application:
Example 1
(Decentralized large-scale SimRank retrieval) Consider the Web graph G in Fig. 1. There are \(n=14\) nodes (web pages) in G, and each edge is a hyperlink. To evaluate the SimRank scores of all \(({n} \times n)\) pairs of Web pages in G, existing all-pairs SimRank algorithms need to iteratively compute the SimRank matrix \(\mathbf {S}\) of size \(({n} \times n)\) in a centralized way (on a single machine). In contrast, our incremental approach can significantly improve the computational efficiency for all pairs of SimRank scores by retrieving \(\mathbf {S}\) in a decentralized way, as follows:
We first employ a graph partitioning algorithm (e.g., METIS^{2}) that decomposes the large graph G into several small blocks such that the number of edges with endpoints in different blocks is minimized. In this example, we partition G into 3 blocks, \(G_1 \cup G_2 \cup G_3\), along with 2 edges \(\{(f,c),(f,k)\}\) across the blocks, as depicted in the first row of Fig. 1.
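As a minimal illustration of this partitioning step, the sketch below counts the edges whose endpoints fall in different blocks. The block assignment and the within-block edges are hypothetical placeholders; only the two cross edges \(\{(f,c),(f,k)\}\) come from the example above.

```python
def cross_block_edges(edges, block_of):
    """Return the edges whose two endpoints lie in different blocks."""
    return [(u, v) for (u, v) in edges if block_of[u] != block_of[v]]

# Hypothetical block assignment consistent with Example 1: c is in G_1,
# f is in G_2, and k is in G_3; a and g are illustrative extra nodes.
block_of = {'a': 1, 'c': 1, 'f': 2, 'g': 3, 'k': 3}
edges = [('a', 'c'),               # stays inside G_1 (illustrative)
         ('g', 'k'),               # stays inside G_3 (illustrative)
         ('f', 'c'), ('f', 'k')]   # the 2 cross-block edges from the text

print(cross_block_edges(edges, block_of))  # [('f', 'c'), ('f', 'k')]
```

A partitioner such as METIS chooses `block_of` so that this count is minimized.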
It is worth mentioning that this way of retrieving \(\mathbf {S}\) is far more efficient than directly computing \(\mathbf {S}\) over G via a batch algorithm. There are two reasons:
Secondly, after graph partitioning, there are few edges across components. A small \(\Delta G\) generally leads to a sparse \({\varvec{\Delta }}\mathbf{S}\), so \({\varvec{\Delta }}\mathbf{S}\) can be stored in a sparse format. In addition, our incremental SimRank method greatly accelerates the computation of \({\varvec{\Delta }}\mathbf{S}\).
Hence, combined with graph partitioning, our incremental SimRank research significantly enhances the computational efficiency of SimRank on large graphs in a decentralized fashion. \(\square \)
Despite its usefulness, existing work on incremental SimRank computation is rather limited. To the best of our knowledge, there is a relative paucity of work [10, 13, 20, 25] on incremental SimRank problems. Shao et al. [20] proposed a novel two-stage random-walk sampling scheme, named TSF, which supports top-k SimRank search over dynamic graphs. In the preprocessing stage, TSF samples \(R_g\) one-way graphs that serve as an index for the querying stage. At query time, for each one-way graph, \(R_q\) new random walks of node u are sampled. However, the dynamic SimRank problems studied in [20] and in this work are different: this work focuses on retrieving all \(n^2\) pairs of SimRank scores, which requires \(O(K(m+{\textsf {AFF}}))\) time to compute the entire matrix \(\mathbf {S}\) in a deterministic style. In Sect. 7, we propose a memory-efficient version of our incremental method that updates all pairs of similarities in a column-by-column fashion within only \(O(Kn+m)\) memory. In comparison, Shao et al. [20] focus on top-k dynamic SimRank search w.r.t. a given query u. Their method incrementally retrieves only the \(k \ (\le n)\) nodes with the highest SimRank scores in a single column \(\mathbf {S}_{\star ,u}\), which requires \(O(K^2 R_q R_g)\) average query time^{3} to retrieve \(\mathbf {S}_{\star ,u}\), along with \(O(n \log k)\) time to return the top-k results from \(\mathbf {S}_{\star ,u}\). Recently, Jiang et al. [10] pointed out that the probabilistic error guarantee of Shao et al.’s method is based on the assumption that no cycle in the given graph has a length shorter than K, and they proposed READS, a new efficient indexing scheme that improves the precision and indexing space for dynamic SimRank search. The querying time of READS is O(rn) to retrieve one column \(\mathbf {S}_{\star ,u}\), where r is the number of sets of random walks. Hence, TSF and READS are highly efficient for top-k single-source SimRank search.
Moreover, the optimization methods in this work are based on a rank-one Sylvester matrix equation characterizing changes to the entire SimRank matrix \(\mathbf {S}\) for all-pairs dynamical search, which is fundamentally different from the methods of [10, 20] that incrementally maintain one-way graphs (or SA forests). It is important to note that, for large-scale graphs, our incremental methods need not memorize all \(n^2\) pairs of old SimRank scores. Instead, \(\mathbf {S}\) can be dynamically updated column by column in \(O(Kn+m)\) memory. For updating each column of \(\mathbf {S}\), our experiments in Sect. 8 verify that our memory-efficient incremental method is scalable on large real graphs while running 4–7x faster than the dynamical TSF [20] per edge update, due to the high cost in [20] of merging one-way graphs’ log buffers for TSF indexing.
Among the existing studies [10, 13, 20] on dynamical SimRank retrieval, the problem setting of Li et al. [13] on all-pairs dynamic search is exactly the same as ours: the goal is to retrieve the changes \({\varvec{\Delta }}\mathbf{S}\) to the all-pairs SimRank scores \(\mathbf {S}\), given the old graph G and link changes \(\Delta G\) to G. To address this problem, the central idea of [13] is to first factorize the backward transition matrix \(\mathbf {Q}\) of the original graph into \(\mathbf {U} \cdot {\varvec{\Sigma }} \cdot {\mathbf {V}}^\mathrm{T}\) via a singular value decomposition (SVD), and then incrementally estimate the updated matrices \(\mathbf {U}\), \({\varvec{\Sigma }}\), \({\mathbf {V}}^\mathrm{T}\) for link changes, at the expense of exactness. Consequently, updating all pairs of similarities entails \(O({r}^{4}n^2)\) time and \(O({r}^{2}n^2)\) memory without guaranteed accuracy, where \(r \ (\le n)\) is the target rank of the low-rank SVD approximation.^{4} This method is efficient for graphs where r is extremely small, e.g., a star graph \((r=1)\). In general, however, r is not always negligibly small.
(Please refer to “Appendix A” [32] for a discussion in detail, and “Appendix C” [32] for an example.)
1.1 Main contributions

We first focus on unit edge updates that do not accompany new node insertions. By characterizing the SimRank update matrix \({\varvec{\Delta }}\mathbf{S}\) w.r.t. every link update as a rank-one Sylvester matrix equation, we devise a fast incremental SimRank algorithm that entails \(O(Kn^2)\) time in the worst case to update \(n^2\) pairs of similarities for K iterations (Sect. 3).

To speed up the computation further, we also propose an effective pruning strategy that captures the “affected areas” of \({\varvec{\Delta }}{} \mathbf{S}\) to discard unnecessary retrieval (e.g., the grey cells in Fig. 2), without loss of accuracy. This reduces the time of incremental SimRank to \(O(K(m+{\textsf {AFF}}))\), where \({\textsf {AFF}} \ (\le n^2)\) is the size of “affected areas” in \({\varvec{\Delta }}{} \mathbf{S}\), and in practice, \({\textsf {AFF}} \ll n^2\) (Sect. 4).

We also consider edge updates that accompany new node insertions, and distinguish them into three categories, according to which end of the inserted edge is a new node. For each case, we devise an efficient incremental SimRank algorithm that supports new node insertions and accurately updates the affected SimRank scores (Sect. 5).

We next investigate the batch updates of dynamical SimRank computation. Instead of dealing with each edge update one by one, we devise an efficient algorithm that can handle a sequence of edge insertions and deletions simultaneously, by merging “similar sink edges” and minimizing unnecessary updates (Sect. 6).

To achieve linear memory efficiency, we also express \({\varvec{\Delta }}\mathbf{S}\) as the sum of many rank-one tensor products, and devise a memory-efficient technique that updates all-pairs SimRank scores in a column-by-column style in \(O(Kn+m)\) memory, without loss of exactness (Sect. 7).

We conduct extensive experiments on real and synthetic datasets to demonstrate that our algorithm (a) is consistently faster than the existing incremental methods, from several times to over one order of magnitude; (b) is faster than its batch counterparts, especially when link updates are small; (c) for batch updates, runs faster than repeated unit-update algorithms; (d) entails linear memory and scales well on billion-edge graphs for all-pairs SimRank updates; (e) is faster than LTSF and uses less memory than LTSF; (f) entails more time in Cases (C0) and (C2) than in Cases (C1) and (C3) among the four edge types, with Case (C3) running the fastest (Sect. 8).
This article is a substantial extension of our previous work [25]. We have made the following new contributions: (1) In Sect. 5, we study three types of edge updates that accompany new node insertions. This solidly extends [25] and Li et al.’s incremental method [13], whose edge updates disallow node changes. (2) In Sect. 6, we also investigate batch updates for dynamic SimRank computation, and devise an efficient algorithm that handles “similar sink edges” simultaneously and further discards unnecessary unit updates. (3) In Sect. 7, we propose a memory-efficient strategy that significantly reduces the memory from \(O(n^2)\) to \(O(Kn+m)\) for incrementally updating all pairs of SimRank scores on million-scale graphs, without compromising running time or accuracy. (4) In Sect. 8, we conduct additional experiments on real and synthetic datasets to verify the high scalability and fast computation time of our memory-efficient methods, as compared with the LTSF method. (5) In Sect. 9, we update the related work section by incorporating state-of-the-art SimRank research.
2 SimRank background
In this section, we give a broad overview of SimRank. Intuitively, the central theme behind SimRank is that “two nodes are considered similar if their incoming neighbors are themselves similar.” Based on this idea, two widely used SimRank models have emerged: (1) Li et al.’s model (e.g., [6, 8, 13, 18, 26, 28, 30]) and (2) Jeh and Widom’s model (e.g., [4, 9, 11, 16, 20, 29]). Throughout this article, our focus is on Li et al.’s SimRank model, also known as CoSimRank in [18], since the recent work [18] by Rothe and Schütze has shown that CoSimRank is more accurate than Jeh and Widom’s SimRank model in real applications such as bilingual lexicon extraction. (Please refer to Remark 1 for detailed explanations.)
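Sections 2.1 and 2.2 formalize the two models. As a toy illustration, the sketch below iterates the matrix form of Li et al.’s model, assuming it reads \(\mathbf {S} = C \cdot \mathbf {Q}\mathbf {S}{\mathbf {Q}}^\mathrm{T} + (1-C)\cdot \mathbf {I}_n\) with the backward transition matrix \(\mathbf {Q}\) of Table 1; the toy graph and the constants are placeholders.

```python
import numpy as np

def li_simrank(Q, C=0.6, K=10):
    """K iterations of S <- C * Q S Q^T + (1 - C) * I (assumed matrix
    form of Li et al.'s model; Q is the backward transition matrix,
    i.e., Q[a, b] = 1/d_a if edge (b, a) exists)."""
    n = Q.shape[0]
    S = (1 - C) * np.eye(n)
    for _ in range(K):
        S = C * (Q @ S @ Q.T) + (1 - C) * np.eye(n)
    return S

# Toy graph: node 0 points to both 1 and 2, so Q[1, 0] = Q[2, 0] = 1.
Q = np.array([[0., 0., 0.],
              [1., 0., 0.],
              [1., 0., 0.]])
S = li_simrank(Q)
print(S[1, 2])  # nodes 1 and 2 share in-neighbor 0: approx C*(1-C) = 0.24
```

The iteration converges geometrically since \(C<1\), which is why a fixed iteration number K appears throughout the complexity bounds in this article.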
2.1 Li et al.’s SimRank model
2.2 Jeh and Widom’s SimRank model
Remark 1
The recent work by Kusumoto et al. [11] has shown that \({{\mathbf {S}}}\) and \({{\mathbf {S}'}}\) do not produce the same results. Recently, Yu and McCann [28] have shown the subtle difference between the two SimRank models from a semantic perspective, and also justified that Li et al.’s SimRank \({{\mathbf {S}}}\) can capture more pairs of self-intersecting paths that are neglected by Jeh and Widom’s SimRank \({{\mathbf {S}'}}\). The recent work [18] by Rothe and Schütze has further demonstrated that, in real applications such as bilingual lexicon extraction, the ranking of CoSimRank \({{{\tilde{\mathbf{S}}}}}\) (i.e., the ranking of Li et al.’s SimRank \({{\mathbf {S}}}\)) is more accurate than that of Jeh and Widom’s SimRank \({{\mathbf {S}'}}\) (see [18, Table 4]).
Despite the high precision of Li et al.’s SimRank model, the existing incremental approach of Li et al. [13] for updating SimRank does not always obtain the correct solution \(\mathbf {S}\) to Eq. (1). (Please refer to “Appendix A” [32] for theoretical explanations).
Symbol and description
Symbol  Description 

n  Number of nodes in old graph G 
m  Number of edges in old graph G 
\({d_i}\)  In-degree of node i in old graph G 
d  Average in-degree of graph G 
C  Damping factor (\(0<C<1\)) 
K  Iteration number 
\(\mathbf {e}_i\)  \(n \times 1\) unit vector with a 1 in the ith entry and 0s elsewhere 
\(\mathbf {Q} / \tilde{\mathbf {Q}}\)  Old/new (backward) transition matrix 
\(\mathbf {S} / \tilde{\mathbf {S}}\)  Old/new SimRank matrix 
\(\mathbf {I}_n\)  \(n \times n\) identity matrix 
\({\mathbf {X}}^\mathrm{T}\)  Transpose of matrix \(\mathbf {X}\) 
\({[\mathbf {X}]}_{i,\star }\)  ith row of matrix \(\mathbf {X}\) 
\({[\mathbf {X}]}_{\star ,j}\)  jth column of matrix \(\mathbf {X}\) 
\({[\mathbf {X}]}_{i,j}\)  (i, j)th entry of matrix \(\mathbf {X}\) 
3 Edge update without node insertions
In this section, we consider an edge update that does not accompany new node insertions, i.e., the insertion of a new edge (i, j) into \(G=(V,E)\) with \(i \in V\) and \(j \in V\). In this case, the new SimRank matrix \({\tilde{\mathbf{S}}}\) and the old one \(\mathbf {S}\) are of the same size. As such, it makes sense to define the SimRank change as \({\varvec{\Delta }}\mathbf{S} = {\tilde{\mathbf{S}}} - \mathbf {S}\).
Below we first introduce the big picture of our main idea and then present rigorous justifications and proofs.
3.1 The main idea
3.2 Describing \(\mathbf {u}, \mathbf {v},\mathbf {w}\) in Eqs. (4) and (6)
To obtain \( \mathbf {u}\) and \(\mathbf {v}\) in Eq. (4) at a low cost, we have the following theorem.
Theorem 1
(Please refer to “Appendix B.1” [32] for the proof of Theorem 1, and “Appendix C.2” [32] for an example.)
Theorem 1 suggests that the change \({\varvec{\Delta }}\mathbf{Q}\) is an \(n\times n\) rank-one matrix, which can be obtained in only constant time from \(d_j\) and \({{[\mathbf {Q}]}_{j,\star }^\mathrm{T}}\). In light of this, we next describe \(\mathbf {w}\) in Eq. (6) in terms of the old \(\mathbf {Q}\) and \(\mathbf {S}\) such that Eq. (6) is a rank-one Sylvester equation.
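Theorem 1’s exact statement is deferred to the appendix. One plausible concrete form consistent with the text (\({\varvec{\Delta }}\mathbf{Q}\) built from \(d_j\) and row \({[\mathbf {Q}]}_{j,\star }\) alone) is sketched below for the insertion of edge (i, j): the new row j of \(\mathbf {Q}\) averages over \(d_j+1\) in-neighbors instead of \(d_j\), and the difference is a single outer product.

```python
import numpy as np

def delta_Q_insert(Q, i, j, d_j):
    """Sketch of a rank-one change to the backward transition matrix Q
    when edge (i, j) is inserted: new row j = (d_j * Q[j,:] + e_i) / (d_j + 1),
    so Delta-Q = e_j * (e_i - Q[j,:])^T / (d_j + 1), a rank-one matrix."""
    n = Q.shape[0]
    e_i = np.zeros(n); e_i[i] = 1.0
    e_j = np.zeros(n); e_j[j] = 1.0
    v = (e_i - Q[j, :]) / (d_j + 1)   # row-update vector
    return np.outer(e_j, v)           # rank-one: only row j changes

# Check against direct renormalization on a toy graph:
# old in-neighbor set of j = 2 is {0} (d_2 = 1); insert edge (1, 2).
Q = np.zeros((3, 3)); Q[2, 0] = 1.0
Q_new = Q + delta_Q_insert(Q, i=1, j=2, d_j=1)
print(Q_new[2])  # row 2 becomes [0.5, 0.5, 0.0]
```

Only the entries of one row move, which is what makes the constant-time claim of Theorem 1 plausible when \(\mathbf {Q}\) is stored sparsely.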
Theorem 2
(ii) Utilizing the solution \(\mathbf {M}\) to Eq. (6), the SimRank update matrix \({\varvec{\Delta }}{} \mathbf{S}\) can be represented by Eq. (5). \(\square \)
(The proof of Theorem 2 is in “Appendix B.2.” [32])
Theorem 2 provides an elegant expression of \(\mathbf {w}\) in Eq. (6). To be precise, given \(\mathbf {Q}\) and \(\mathbf {S}\) in the old graph G, and an edge (i, j) inserted into G, one can first find \(\mathbf {u}\) and \(\mathbf {v}\) via Theorem 1, and then resort to Theorem 2 to compute \(\mathbf {w}\) from \(\mathbf {u},\mathbf {v},\mathbf {Q},\mathbf {S}\). Owing to the existence of the vector \(\mathbf {w}\), it is guaranteed that the Sylvester equation (6) is rank-one. Hence, our aforementioned method can be employed to iteratively compute \(\mathbf {M}\) in Eq. (8), requiring no matrix–matrix multiplications.
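Eq. (8) is not repeated here, but the key point, that a rank-one right-hand side \(\mathbf {u}\mathbf {w}^\mathrm{T}\) lets \(\mathbf {M}\) be accumulated from matrix–vector products only, can be sketched as follows under a truncated-series reading, assuming \(\mathbf {M}=\sum _{k=0}^{K} C^k \tilde{\mathbf{Q}}^k \mathbf {u}\mathbf {w}^\mathrm{T}(\tilde{\mathbf{Q}}^\mathrm{T})^k\):

```python
import numpy as np

def rank_one_sylvester(Qt, u, w, C=0.6, K=40):
    """Accumulate M = sum_{k=0..K} C^k (Qt^k u)(Qt^k w)^T, the truncated
    solution of the rank-one Sylvester equation M = C * Qt M Qt^T + u w^T.
    Only matrix-vector products appear; each term is a rank-one outer
    product, so no matrix-matrix multiplication is ever needed."""
    a, b = u.astype(float), w.astype(float)
    M = np.outer(a, b)
    coef = 1.0
    for _ in range(K):
        a, b = Qt @ a, Qt @ b   # both factors advance one step
        coef *= C
        M += coef * np.outer(a, b)
    return M

# Sanity check: M should (approximately) satisfy the Sylvester equation.
Qt = np.array([[0., .5, .5],
               [1., 0., 0.],
               [0., 0., 0.]])
u = np.array([1., 0., 0.]); w = np.array([0., 1., 0.])
M = rank_one_sylvester(Qt, u, w)
residual = M - (0.6 * Qt @ M @ Qt.T + np.outer(u, w))
print(np.abs(residual).max())  # tiny: only the tail beyond K is missing
```

With sparse \(\tilde{\mathbf{Q}}\), each of the 2K matrix–vector products costs O(m), which is where the O(K(m + AFF)) shape of the later bounds comes from.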
3.3 Characterizing \({\varvec{\Delta }}{} \mathbf{S}\)
Leveraging Theorems 1 and 2, we next characterize the SimRank change \({\varvec{\Delta }}{} \mathbf{S}\).
Theorem 3
 (i)when \({{d}_{j}}=0\),$$\begin{aligned} \varvec{\gamma } = \mathbf {Q}\cdot {{[\mathbf {S}]}_{\star ,i}}+\tfrac{1}{2}{{[\mathbf {S}]}_{i,i}}\cdot {{\mathbf {e}}_{j}} \end{aligned}$$(14)
 (ii)when \({{d}_{j}}>0\),
(The proof of Theorem 3 is in “Appendix B.2.” [32])
Theorem 3 provides an efficient method to compute the incremental SimRank matrix \({\varvec{\Delta }}\mathbf{S}\) by utilizing the previous information of \(\mathbf {Q}\) and \(\mathbf {S}\), as opposed to [13], which requires maintaining the incremental SVD.
3.4 Deleting an edge \((i,j)_{i \in V, \ j \in V}\) from \(G=(V,E)\)
For an edge deletion, we next propose a technique, analogous to Theorem 3, that can efficiently update SimRank scores.
Theorem 4
3.5 IncuSR algorithm
We present our efficient incremental approach, denoted IncuSR (in “Appendix D.1” [32]), which supports edge insertions that do not accompany new node insertions. The complexity of IncuSR is bounded by \(O(Kn^2)\) time and \(O(n^2)\) memory^{5} in the worst case for updating all \(n^2\) pairs of similarities.
(Please refer to “Appendix D.1” [32] for a detailed description of IncuSR, and “Appendix C.3” [32] for an example.)
4 Pruning unnecessary node pairs in \({\varvec{\Delta }}{} \mathbf{S}\)
After the SimRank update matrix \({\varvec{\Delta }}{} \mathbf{S}\) has been characterized as a rankone Sylvester equation, pruning techniques can further skip node pairs with unchanged SimRanks in \({\varvec{\Delta }}{} \mathbf{S}\) (called “unaffected areas”).
4.1 Affected areas in \({\varvec{\Delta }}{} \mathbf{S}\)
We next reinterpret the series \(\mathbf {M}\) in Theorem 3, aiming to identify the “affected areas” in \({\varvec{\Delta }}\mathbf{S}\). Due to space limitations, we mainly focus on the edge insertion case of \(d_j>0\); the other cases yield similar results.
Such paths are the concatenation of four types of subpaths (as depicted above) associated with four matrices, respectively, \({{[{{{{\tilde{\mathbf{Q}}}}}^{k}}]}_{a,j}}, {{[\mathbf {S}]}_{i,\star }}, {{\mathbf {Q}}^\mathrm{T}},{{[{{({{{{\tilde{\mathbf{Q}}}}}^\mathrm{T}})}^{k}}]}_{\blacktriangle ,b}} \), plus the inserted edge \(j \Leftarrow i\). When such entire concatenated paths exist in the new graph, they must be accounted for when assessing the new SimRank \({[\tilde{\mathbf {S}}]}_{a,b}\) in response to the edge insertion (i, j), because our reinterpretation of SimRank indicates that SimRank counts all symmetric in-link paths, and the entire concatenated paths are themselves symmetric in-link paths.
Indeed, when edge (i, j) is inserted, only these three kinds of paths contribute extra terms to \(\mathbf {M}\) (and therefore to \({\varvec{\Delta }}\mathbf{S}\)). As incremental updates in SimRank merely tally these paths, node pairs without such paths can be safely pruned. In other words, for the pruned node pairs, the three kinds of paths make “zero contributions” to the changes in \(\mathbf {M}\) in response to the edge insertion. Thus, after pruning, the remaining node pairs in G constitute the “affected areas” of \(\mathbf {M}\).
We next identify “affected areas” of \(\mathbf {M}\), by pruning redundant node pairs in G, based on the following.
Theorem 5
(Please refer to “Appendix B.5” [32] for the proof and intuition of Theorem 5, and “Appendix C.4” [32] for an example.)
Theorem 5 provides a pruning strategy to iteratively eliminate node pairs with a priori zero values in \(\mathbf {M}_k\) (thus in \({\varvec{\Delta }}{} \mathbf{S}\)). Hence, by Theorem 5, when edge (i, j) is updated, we just need to consider node pairs in \(({{{\mathcal {A}}}_{k}}\times {{{\mathcal {B}}}_{k}}) \cup ({{{\mathcal {A}}}_{0}}\times {{{\mathcal {B}}}_{0}})\) for incrementally updating \({\varvec{\Delta }}{} \mathbf{S}\).
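Theorem 5’s precise statement is in the appendix. Purely as a hypothetical illustration of how node sets \({{{\mathcal {A}}}_{k}}, {{{\mathcal {B}}}_{k}}\) might be materialized, one can grow per-iteration frontiers from the endpoints of the updated edge along out-edges (matching the path interpretation above, where \([\tilde{\mathbf{Q}}^{k}]_{a,j} \ne 0\) only if a length-k path from j to a exists), and only touch pairs inside the frontier products:

```python
def k_step_frontiers(out_adj, seeds, K):
    """Frontier sets F_0..F_K, where F_k holds the nodes reachable from
    the seeds by a path of length exactly k. This is one plausible way
    to realize per-iteration 'affected' node sets: pairs outside
    F_k x G_k (for suitable seed sets) need not be touched at step k."""
    frontiers = [set(seeds)]
    for _ in range(K):
        nxt = set()
        for u in frontiers[-1]:
            nxt.update(out_adj.get(u, ()))
        frontiers.append(nxt)
    return frontiers

# Toy chain 0 -> 1 -> 2: the frontier advances one hop per iteration.
out_adj = {0: [1], 1: [2], 2: []}
print(k_step_frontiers(out_adj, {0}, 2))  # [{0}, {1}, {2}]
```

On sparse graphs these frontiers are typically far smaller than V, which is the intuition behind AFF ≪ n² in practice.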
4.2 IncSR algorithm with pruning
Based on Theorem 5, we provide a complete incremental algorithm, referred to as IncSR, by incorporating our pruning strategy into IncuSR. The total time of IncSR is \(O(K(m+{\textsf {AFF}}))\) for K iterations, where \({\textsf {AFF}}:= \mathrm{avg}_{k \in [0,K]} ( |{{\mathcal {A}}}_{k}| \cdot |{{\mathcal {B}}}_{k}|)\), with \(\mathcal{A}_k, \mathcal{B}_k\) in Eq. (22), is the average size of the “affected areas” in \(\mathbf {M}_k\) over K iterations.
(Please refer to “Appendix D.2” [32] for IncSR algorithm description and its complexity analysis.)
5 Edge update with node insertions
Remark 2
 (a)
In case (C1), new node \(j \notin V\) is indexed by \((n+1)\);
 (b)
In case (C2), new node \(i \notin V\) is indexed by \((n+1)\);
 (c)
In case (C3), new nodes \(i \notin V\) and \(j \notin V\) are indexed by \((n+1)\) and \((n+2)\), respectively.
5.1 Inserting an edge (i, j) with \(i \in V\) and \(j \notin V\)
In this case, the inserted new edge (i, j) accompanies the insertion of a new node j. Thus, the size of the new SimRank matrix \({\tilde{\mathbf{S}}}\) differs from that of the old \(\mathbf {{S}} \). As a result, we cannot simply evaluate the changes to \(\mathbf {{S}} \) as \({\tilde{\mathbf{S}}} - \mathbf {S}\), as we did in Sect. 3.
Theorem 6
Proof
Example 2
Consider the citation digraph G in Fig. 3. If the new edge (i, p) with new node p is inserted to G, the new \({\tilde{\mathbf{S}}}\) can be updated from the old \(\mathbf {{S}}\) as follows:
5.2 Inserting an edge (i, j) with \(i \notin V\) and \(j \in V\)
We now focus on case (C2), the insertion of an edge (i, j) with \(i \notin V\) and \(j \in V\). Similar to case (C1), the new edge accompanies the insertion of a new node i. Hence, the difference \({\tilde{\mathbf{S}}} - \mathbf {S}\) is not well defined.
However, in this case, the dynamic computation of SimRank is far more complicated than that of the case (C1), in that such an edge insertion not only increases the dimension of the old transition matrix \(\mathbf {{Q}}\) by one, but also changes several original elements of \(\mathbf {{Q}}\), which may recursively influence SimRank similarities. Specifically, the following theorem shows, in the case (C2), how \(\mathbf {{Q}}\) changes with the insertion of an edge \((i,j)_{i \notin V, j \in V}\).
Theorem 7
Proof
 (i)All nonzeros in \({{[\mathbf {Q}]}_{j,\star }}\) are updated from \(\tfrac{1}{d_j}\) to \(\tfrac{1}{d_j+1}\):$$\begin{aligned} {{[{\hat{\mathbf{Q}}}]}_{j,\star }} = \tfrac{{{d}_{j}}}{{{d}_{j}}+1} {{[\mathbf {Q}]}_{j,\star }} = {{[\mathbf {Q}]}_{j,\star }}  \tfrac{1}{{{d}_{j}}+1} {{[\mathbf {Q}]}_{j,\star }}. \end{aligned}$$(26)
 (ii)
Theorem 8
Proof
Theorem 8 implies that, in case (C2), after a new edge (i, j) is inserted, the new SimRank matrix \({\tilde{\mathbf{S}}}\) takes an elegant diagonal block structure: the upper-left block of \({\tilde{\mathbf{S}}}\) is perturbed by \({\varvec{\Delta }}\tilde{\mathbf{S}}_{\mathbf{11}}\), which is the solution to the rank-two Sylvester equation (30); the lower-right block of \({\tilde{\mathbf{S}}}\) is a constant \((1-C)\). This structure of \({\tilde{\mathbf{S}}}\) suggests that the inserted edge \((i,j)_{i \notin V, j \in V}\) has a recursive impact only on the SimRank scores of pairs \((x,y) \in V \times V\), and no impact on pairs \((x,y) \in (V \times \{i\}) \cup (\{i\} \times V)\). Thus, our incremental way of computing the new \({\tilde{\mathbf{S}}}\) focuses on the efficiency of obtaining \({\varvec{\Delta }}\tilde{\mathbf{S}}_{\mathbf{11}}\) from Eq. (30). Fortunately, we notice that \({\varvec{\Delta }}\tilde{\mathbf{S}}_{\mathbf{11}}\) satisfies a rank-two Sylvester equation whose algebraic structure is similar to that of \({\varvec{\Delta }}\mathbf{S}\) in Eqs. (5) and (6) (in Sect. 3). Hence, our previous techniques for computing \({\varvec{\Delta }}\mathbf{S}\) from Eqs. (5) and (6) can be analogously applied to compute \({\varvec{\Delta }}\tilde{\mathbf{S}}_{\mathbf{11}}\) in Eq. (30), thereby eliminating costly matrix–matrix multiplications, as will be illustrated in Algorithm 2.
One disadvantage of Theorem 8 is that, to obtain the auxiliary vector \(\mathbf {z}\) for evaluating \({\tilde{\mathbf{S}}}\), one has to memorize the entire old matrix \(\mathbf {S}\) in Eq. (28). In fact, we can rearrange the terms of the SimRank equation (1) to characterize \(\mathbf {Q} \mathbf {S} {{[\mathbf {Q}]}_{j,\star }^\mathrm{T}}\) in terms of only one vector \([\mathbf {S}]_{\star ,j}\), thereby avoiding memorizing the entire \(\mathbf {S}\), as shown below.
Theorem 9
Proof
For an edge insertion of case (C2), Theorem 9 gives an efficient method to compute the update matrix \({\varvec{\Delta }}{{{{\tilde{\mathbf{S}}}}}_{\mathbf {11}}}\). We note that the form of \({\varvec{\Delta }}{{{{\tilde{\mathbf{S}}}}}_{\mathbf {11}}}\) in Eq. (34) is similar to that of \({\varvec{\Delta }}{{{{\tilde{\mathbf{S}}}}}}\) in Eq. (13). Thus, similar to Theorem 3, the following method can be applied to compute \(\mathbf {M}\) so as to avoid matrix–matrix multiplications.
Example 3
Consider the citation digraph G in Fig. 4. If the new edge (p, j) with new node p is inserted to G, the new \({\tilde{\mathbf{S}}}\) can be incrementally derived from the old \(\mathbf {S}\) as follows:
5.3 Inserting an edge (i, j) with \(i \notin V\) and \(j \notin V\)
We next focus on case (C3), the insertion of an edge (i, j) with \(i \notin V\) and \(j \notin V\). Without loss of generality, we assume that nodes i and j are indexed by \(n+1\) and \(n+2\), respectively. In this case, the inserted edge (i, j) accompanies the insertion of two new nodes, which form another independent component in the new graph.
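Under the assumed matrix form \(\mathbf {S} = C\,\mathbf {Q}\mathbf {S}{\mathbf {Q}}^\mathrm{T} + (1-C)\,\mathbf {I}\), the similarities inside this new two-node component can be read off directly (node i has no in-neighbors; node j’s only in-neighbor is i):
$$\begin{aligned} {[{\tilde{\mathbf{S}}}]}_{i,i}&= 1-C, \\ {[{\tilde{\mathbf{S}}}]}_{j,j}&= C\,{[{\tilde{\mathbf{S}}}]}_{i,i} + (1-C) = (1-C)(1+C), \\ {[{\tilde{\mathbf{S}}}]}_{i,j}&= {[{\tilde{\mathbf{S}}}]}_{j,i} = 0, \end{aligned}$$
while all old entries of \(\mathbf {S}\) are unaffected, since the new component is disconnected from the old graph.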
6 Batch updates
The straightforward approach to this problem is to update each edge of \(\Delta G\) one by one, running a unit update algorithm \(|\Delta G|\) times. However, this would produce many unnecessary intermediate results and redundant updates that may cancel each other out.
Example 4
We first introduce the notion of “similar sink edges.”
Definition 1
Two distinct edges (a, c) and (b, c) are called “similar sink edges” w.r.t. node c if they have a common end node c that both a and b point to. \(\square \)

(C0) \(i_1 \in V, \ i_2 \in V, \ \ldots , i_{\delta } \in V\) and \(j \in V\);

(C1) \(i_1 \in V, \ i_2 \in V, \ \ldots , i_{\delta } \in V\) and \(j \notin V\);

(C2) \(i_1 \notin V, \ i_2 \notin V, \ \ldots , i_{\delta } \notin V\) and \(j \in V\);

(C3) \(i_1 \notin V, \ i_2 \notin V, \ \ldots , i_{\delta } \notin V\) and \(j \notin V\).
Example 5
Example 6
Recall from Example 4 the sequence of edge updates \(\Delta G\) to the graph \(G=(V,E)\) in Fig. 5. We want to compute the new SimRank scores in \(G \oplus \Delta G\).
First, we use a hashing method to obtain the net update \(\Delta G_{\mathrm{net}}\) from \(\Delta G\), as shown in Example 4.
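This hashing step can be sketched as follows; it collapses a signed update sequence into its net effect, so that an insertion and a deletion of the same edge cancel out. The particular update sequence below is made up for the demo and is not the exact \(\Delta G\) of Example 4.

```python
from collections import Counter

def net_updates(delta_g):
    """Hash signed edge updates (u, v, '+'/'-') to their net effect:
    equal numbers of insertions and deletions of an edge cancel out."""
    count = Counter()
    for (u, v, op) in delta_g:
        count[(u, v)] += 1 if op == '+' else -1
    inserts = [e for e, c in count.items() if c > 0]
    deletes = [e for e, c in count.items() if c < 0]
    return inserts, deletes

# Illustrative sequence: (q, i) is inserted then deleted, so it vanishes.
delta_g = [('q', 'i', '+'), ('q', 'i', '-'), ('j', 'i', '+'), ('f', 'b', '-')]
print(net_updates(delta_g))  # ([('j', 'i')], [('f', 'b')])
```

The net insertions can then be grouped by their common sink node to form the “similar sink edges” processed together in this section.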
Node pairs  \(\textsf {sim}_\textsf {old}\) in G  \((f,b,-),(k,i,+)\)  \((q,i,+),(r,f,+)\)  \((j,i,+)\)  \((p,f,+)\) 

(a, b)  0.0745  0.0809  0.0809  0.0809  0.0809 
(a, i)  0  0  0  0.0340  0.0340 
(b, i)  0  0  0  0.0340  0.0340 
(f, i)  0.2464  0.2464  0.1232  0.1032  0.0516 
(f, j)  0.2064  0.2064  0.2064  0.2064  0.1032 
(g, h)  0.128  0.128  0.128  0.128  0.128 
(g, k)  0.128  0.128  0.128  0.128  0.128 
(h, k)  0.288  0.288  0.288  0.288  0.288 
(i, j)  0.3104  0.3104  0.1552  0.1552  0.1552 
(l, m)  0.16  0.16  0.16  0.16  0.16 
(l, n)  0.16  0.16  0.16  0.16  0.16 
(m, n)  0.16  0.16  0.16  0.16  0.16 
7 Memory efficiency
Lines of IncuSR (in “Appendix D.1” [32]) that need to retrieve elements from the old \(\mathbf {S}\) (highlighted in red)
Line  Description  Required elements from old \(\mathbf {S}\) 

3  ith column of \(\mathbf {S}\)  
4  (i, i) and (j, j)th elements of \(\mathbf {S}\)  
6  (i, i)th element of \(\mathbf {S}\)  
9  jth column of \(\mathbf {S}\)  
15  All elements of old \({\mathbf {S}}\) and new \(\tilde{\mathbf {S}}\) 
Lines of IncuSR (in “Appendix D.1” [32]) that need to store \(\mathbf {M}_k\) (highlighted in red)
Line  Description  Storage of \(\mathbf {M}_k\) 

10  All elements of \(\mathbf {M}_0\)  
14  All elements of \(\mathbf {M}_k \quad (\forall k)\)  
15  All elements of \(\mathbf {M}_K\) 
In this section, we propose a novel scalable method based on Algorithms 1–4 for dynamical SimRank search, which updates all pairs of SimRank scores column by column using only \(O(Kn+m)\) memory, with no need to store all \(n^2\) pairs of old SimRank scores in memory, and with no loss of accuracy.
Let us first analyze the \(O(n^2)\) memory requirement of Algorithms 1–4 in Sects. 3–5. We notice that two factors dominate the original \(O(n^2)\) memory: (1) the storage of the entire \(n \times n\) old SimRank matrix \(\mathbf {S}\), and (2) the computation of \(\mathbf {M}_k\) from an outer product. For example, in IncuSR (in “Appendix D.1” [32]), Lines 3, 4, 6, 9, and 15 need to retrieve elements from the old \(\mathbf {S}\) (see Table 3); Lines 10, 14, and 15 need to store the \( n \times n\) entries of matrix \(\mathbf {M}_k\) (see Table 4). Indeed, the storage of \(\mathbf {S}\) and \(\mathbf {M}_k\) is the main obstacle to the scalability of our in-memory algorithms on large graphs, resulting in \(O(n^2)\) memory space. Apart from these lines, the memory required for the remaining steps of IncuSR is O(m), dominated by (a) the storage of the sparse matrix \(\mathbf {Q}\) and (b) sparse matrix–vector products.
7.1 Avoid storing \(n \times n\) elements of old \(\mathbf {S}\)
7.2 Compute \({[\mathbf {M}_K]}_{\star , x}\) and \({[\mathbf {M}_K]}_{x, \star }\) in linear memory
The main advantage of our method is that, throughout the entire updating process, we need not store the \(n \times n\) entries of \(\mathbf {M}_k\) and \(\mathbf {S}\), thereby significantly reducing the memory usage from \(O(n^2)\) to \(O(Kn+m)\). Besides the insertion case (C0), our memory-efficient methods are applicable to the other insertion cases in Sect. 5. The complete algorithm, denoted IncSRAllP, is described in Algorithm 5. IncSRAllP is a memory-efficient version of Algorithms 1–4. It includes a procedure PartialSim that computes two columns of the old \(\mathbf {S}\) on demand in linear memory, rather than storing all \(n^2\) pairs of the old \(\mathbf {S}\) in memory. In response to each edge update (i, j), once the two old columns \(\mathbf {S}_{\star ,i}\) and \(\mathbf {S}_{\star ,j}\) are computed via PartialSim for updating the xth column \([{\varvec{\Delta }}\mathbf{S}]_{\star ,x}\), they can be memorized in only O(n) memory and reused later to compute another, yth, column \([{\varvec{\Delta }}\mathbf{S}]_{\star ,y}\) in response to the same edge update (i, j).
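The PartialSim procedure itself is given in Algorithm 5. The sketch below shows the underlying idea in the simplest setting, assuming the truncated-series form \(\mathbf {S}=(1-C)\sum _{k=0}^{K} C^k \mathbf {Q}^k ({\mathbf {Q}}^\mathrm{T})^k\) of Li et al.’s model: one column of \(\mathbf {S}\) is obtained from K+1 stored vectors (O(Kn) memory) plus sparse matrix–vector products (O(m) each), never from the full \(n \times n\) matrix.

```python
import numpy as np

def simrank_column(Q, j, C=0.6, K=10):
    """Column S[:, j] of S = (1-C) * sum_{k=0..K} C^k Q^k (Q^T)^k
    (assumed truncated-series form), using only K+1 length-n vectors."""
    n = Q.shape[0]
    v = [np.eye(n)[:, j]]              # v[k] = (Q^T)^k e_j, k = 0..K
    for _ in range(K):
        v.append(Q.T @ v[-1])
    r = v[K]                           # Horner-style backward pass:
    for k in range(K - 1, -1, -1):     # r = sum_k C^k Q^k v[k]
        r = v[k] + C * (Q @ r)
    return (1 - C) * r

# Demo on a toy graph; only matrix-vector products and O(Kn) vectors.
Q = np.array([[0., 0., 0.],
              [1., 0., 0.],
              [.5, .5, 0.]])
print(simrank_column(Q, j=0))
```

Updating \(\mathbf {S}\) column by column then only ever holds a handful of such vectors, which matches the \(O(Kn+m)\) bound stated above.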
It is worth mentioning that IncSRAllP can also be combined with our batch updating method in Sect. 6. This speeds up the dynamical update of SimRank further, with \(O(n(\max _{t=1}^{B}\delta _t) + m + Kn)\) memory. Here \(O(n\delta _t)\) memory is needed to store \(\delta _t\) columns of \(\mathbf {S}\) when \([\mathbf {S}]_{\star ,I}\) is required for processing the tth block.
8 Experimental evaluation
Table 5: Description of real-world datasets

Datasets                  |V|           |E|             # of pairs to be assessed          Description
------------------------------------------------------------------------------------------------------
Small
 DBLP  (DBLP)             13,634        93,560          185,885,956    (= |V|^2)           DBLP citation network
 CitH  (cit-HepPh)        34,546        421,578         1,193,426,116  (= |V|^2)           High Energy Physics citation network
Medium
 YouTu (YouTube)          178,470       953,534         1,784,700,000  (= 10^4 |V|)        Social network of YouTube videos
 WebB  (web-BerkStan)     685,230       7,600,595       6,852,300,000  (= 10^4 |V|)        Web graph of Berkeley and Stanford
 WebG  (web-Google)       916,428       5,105,039       9,164,280,000  (= 10^4 |V|)        Web graph from Google
Large
 CitP  (cit-Patents)      3,774,768     16,518,948      3,774,768,000  (= 10^3 |V|)        Citation network among US patents
 SocL  (soc-LiveJournal)  4,847,571     68,993,773      4,847,571,000  (= 10^3 |V|)        LiveJournal online social network
 UK05  (uk-2005)          39,459,925    936,364,282     39,459,925,000 (= 10^3 |V|)        Web graph from 2005 crawl of .uk domain
 IT04  (it-2004)          41,291,594    1,150,725,436   41,291,594,000 (= 10^3 |V|)        Web graph from 2004 crawl of .it domain
8.1 Experimental settings
Datasets. We adopt both real and synthetic datasets. The real datasets include small-scale (DBLP and CitH), medium-scale (YouTu, WebB, and WebG), and large-scale graphs (CitP, SocL, UK05, and IT04). Table 5 summarizes the description of these datasets.
(Please refer to “Appendix E” [32] for details.)
To generate synthetic graphs and updates, we adopt the GraphGen\(^{6}\) generation engine. The graphs are controlled by (a) the number of nodes \(|V|\), and (b) the number of edges \(|E|\). We produce a sequence of graphs that follow the linkage generation model [7]. To control graph updates, we use two parameters simulating real evolution: (a) the update type (edge/node insertion or deletion), and (b) the size of the update set \(\Delta G\).
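As a rough stand-in for this setup (the GraphGen engine and the linkage model [7] are not reproduced here; the generator below is a simple uniform-random sketch with illustrative names and an assumed insertion ratio), the update-generation protocol looks like:

```python
import random

def make_graph(n, m, seed=0):
    # Toy stand-in for a graph generator: m distinct random directed edges.
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:
            edges.add((u, v))
    return edges

def make_updates(edges, n, k, insert_ratio=0.8, seed=1):
    # Delta G: a sequence of k edge insertions/deletions over the old graph.
    rng = random.Random(seed)
    updates, pool = [], list(edges)
    for _ in range(k):
        if rng.random() < insert_ratio:
            while True:  # draw a fresh edge not already present
                u, v = rng.randrange(n), rng.randrange(n)
                if u != v and (u, v) not in edges:
                    updates.append(('ins', u, v))
                    break
        else:            # delete an existing edge
            updates.append(('del', *pool[rng.randrange(len(pool))]))
    return updates
```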
Algorithms. We implement the following algorithms: (a) IncSVD, the SVD-based link-update algorithm [13]; (b) IncuSR, our incremental method without pruning; (c) Batch, the batch SimRank method via fine-grained memorization [24]; (d) IncSR, our incremental method with pruning power but without support for node insertions; (e) IncSRAll, our complete enhanced version of IncSR that allows node insertions by incorporating IncuSR-C1, IncuSR-C2, and IncuSR-C3; (f) IncbSR, our batch incremental-update version of IncSR; (g) IncSRAllP, our memory-efficient version of IncSRAll that dynamically computes the SimRank matrix column by column without storing all pairs of old similarities; (h) LTSF, the log-based implementation of the existing competitor TSF [20], which supports dynamic SimRank updates for top-k querying.
Parameters. We set the damping factor \(C=0.6\), as used in [9]. By default, the total number of iterations is set to \(K=15\) to guarantee accuracy \({C}^{K} \le 0.0005\) [16]. On CitH and YouTu, we set \(K=10\); on large graphs (CitP, SocL, UK05, and IT04), we set \(K=5\). The target rank r for IncSVD is a speed–accuracy trade-off; we set \(r=5\) in our time evaluation since, as shown in the experiments of [13], the highest speedup is achieved when \(r=5\). In our exactness evaluation, we shall tune this value. For the LTSF algorithm, we set the number of one-way graphs \(R_g = 100\) and the number of samples at query time \(R_q=20\), as suggested in [20].
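The choice of K follows directly from the accuracy bound \(C^K \le \epsilon \) [16]; a quick check (the helper name is ours):

```python
import math

def iterations_needed(C, eps):
    # Smallest K with C^K <= eps, per the accuracy estimate of [16].
    return math.ceil(math.log(eps) / math.log(C))

# With C = 0.6 and eps = 0.0005: 0.6^15 ~ 0.00047 <= 0.0005, so K = 15.
```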
All the algorithms are implemented in Visual C++ and MATLAB. For smallscale graphs, we use a machine with an Intel Core 2.80GHz CPU and 8GB RAM. For medium and largescale graphs, we use a processor with Intel Core i76700 3.40GHz CPU and 64GB RAM.
8.2 Experimental results
8.2.1 Time efficiency of IncSR and IncuSR
We first evaluate the computational time of IncSR and IncuSR against IncSVD and Batch on real datasets.
Figure 8 shows the target rank r required for Li et al.'s lossless SVD approach w.r.t. the edge changes \(\Delta E\) on DBLP and CitH. The y-axis is \(\frac{r}{n} \times 100\%\). On each dataset, when increasing \(\Delta E\) from 6K to 18K, we see that \(\frac{r}{n}\) reaches 95% on DBLP (resp. 80% on CitH). Thus, r is not negligibly smaller than n in real graphs. Due to the quartic time w.r.t. r, IncSVD may be too slow in practice to achieve high accuracy.
On synthetic data, we fix \(|V|=79{,}483\) and vary \(|E|\) from 485K to 560K (resp. from 560K to 485K) in 15K increments (resp. decrements). The results are shown in Fig. 9. We can see that, when 6.4% of the edges are inserted, IncSR runs 8.4x faster than IncSVD, 4.7x faster than Batch, and 2.7x faster than IncuSR. When 8.8% of the edges are deleted, IncSR outperforms IncSVD by 10.4x, Batch by 5.5x, and IncuSR by 2.9x. This is consistent with our complexity analysis of IncSR and IncuSR.
8.2.2 Effectiveness of pruning
8.2.3 Time efficiency of IncSRAll and IncbSR
We next compare the computational time of IncSRAll with IncSVD and Batch on DBLP, CitH, and YouTu. For each dataset, we increase \(|E|\) by \(\Delta E\) edges that may accompany new node insertions. Note that IncSR cannot handle such incremental updates, since \({\varvec{\Delta }}{} \mathbf{S}\) is not well defined in these situations. To enable IncSVD to handle new node insertions, we view the newly inserted nodes as singleton nodes in the old graph G. Figure 12 depicts the results. We observe the following: (1) On every dataset, IncSRAll runs substantially faster than IncSVD and Batch when \(\Delta E\) is small. For example, when \(\Delta E=6\)K on CitH, IncSRAll (186s) is 30.6x faster than IncSVD (5692s) and 15.1x faster than Batch (2809s). The reason is that IncSRAll integrates the merits of IncSR with IncuSR-C1, IncuSR-C2, and IncuSR-C3 to dynamically update SimRank scores in a rank-one style, with no need for costly matrix–matrix multiplications. Moreover, the complete framework of IncSRAll allows it to support link updates that accompany new node insertions. (2) As \(\Delta E\) grows larger on each dataset, the time of IncSVD increases significantly faster than that of IncSRAll. This larger increase is due to the SVD tensor products used by IncSVD. In contrast, IncSRAll can effectively reuse the old SimRank scores to compute the changes, even if such changes accompany new node insertions.
Figure 13 compares the computational time of IncbSR with that of IncSRAll. From the results, we notice that, on each graph, IncbSR is consistently faster than IncSRAll. The last column "(%)" denotes the percentage improvement of IncbSR in speedup. On each dataset, the speedup of IncbSR becomes more apparent as \(\Delta E\) grows. For example, on DBLP, the improvement of IncbSR over IncSRAll is 8.8% when \(|E|=75\)K, and 14.0% when \(|E|=83\)K. On CitH (resp. YouTu), the highest speedup of IncbSR over IncSRAll is 20.7% at \(|E|=419\)K (resp. 16.4% at \(|E|=901\)K). This is because a larger \(\Delta E\) increases the number of newly inserted edges that share an endpoint; hence, more edges can be handled simultaneously by IncbSR, highlighting its efficiency advantage over IncSRAll.
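The batching opportunity hinges on grouping the inserted edges. A simplified sketch that groups newly inserted edges by their shared sink node (the precise "similar sink edges" criterion of Sect. 6 is richer than this; the function name is ours):

```python
from collections import defaultdict

def group_by_sink(inserted_edges):
    # Group edges (i, j) by their sink j, so one batched update can
    # serve all edges in a group instead of processing them one by one.
    groups = defaultdict(list)
    for i, j in inserted_edges:
        groups[j].append(i)
    return dict(groups)
```

For example, the insertions (1, 5) and (2, 5) share the sink 5 and would be processed together, while (3, 7) forms its own group.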
8.2.4 Total memory usage
Figure 14 evaluates the total memory usage of IncSRAll and IncbSR against IncSVD on real datasets. Note that the total memory usage includes the storage of the old SimRanks required for all-pairs dynamical evaluation. For IncSRAll, we test three versions: (a) we first switch off both "pruning" and "column-wise partitioning," denoted as "No Optimization"; (b) we next turn on "pruning" only; and (c) we finally turn on both. For IncSVD, we also increase the target rank r beyond its default value of 5 to see how the memory space is affected by r.
8.2.5 Exactness
We next evaluate the exactness of IncSRAll, IncbSR, and IncSVD on real datasets. We leverage the NDCG metric [13] to assess the top-100 most similar pairs. We adopt the results of the batch algorithm [6] on each dataset as the \(\text {NDCG}_{100}\) baselines, due to its exactness. For IncSRAll, we evaluate its two enhanced versions: "with column-wise partitioning" and "with pruning"; for IncSVD, we tune its target rank r from 5 to 25.
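For reference, \(\text {NDCG}_{k}\) over a ranked list can be computed as follows (a generic sketch with linear gains; the exact gain/discount variant used in [13] may differ):

```python
import math

def dcg(rels):
    # Discounted cumulative gain with linear gains: rel_i / log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg_at_k(ranked_rels, k):
    # ranked_rels: ground-truth relevance of items, in the order in which
    # the algorithm under test ranked them. NDCG = DCG / ideal DCG.
    ideal = sorted(ranked_rels, reverse=True)
    return dcg(ranked_rels[:k]) / dcg(ideal[:k])
```

A perfect ranking yields \(\text {NDCG}_k = 1\), which is why the exact batch results serve as the baseline.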
Figure 15 depicts the results, showing the following. (1) On each dataset, the \(\text {NDCG}_{100}\)s of IncSRAll and IncbSR are 1, which is better than that of IncSVD (\(<0.62\)). This agrees with our observation that IncSVD may lose eigen-information in real graphs. In contrast, IncSRAll and IncbSR guarantee exactness. (2) The \(\text {NDCG}_{100}\)s of the two versions of IncSRAll are exactly the same, implying that both our pruning and column-wise partitioning methods are lossless while achieving high speedup.
8.2.6 Scalability on large graphs
To evaluate the scalability of our incremental techniques, we run IncSRAllP, the memory-efficient version of IncSR, on six real graphs (WebB, WebG, CitP, SocL, UK05, and IT04), and compare its performance with LTSF. Both IncSRAllP and LTSF can compute any single column \(\mathbf {S}_{\star , u}\) of \(\mathbf {S}\) without memorizing all \(n^2\) pairs of the old \(\mathbf {S}\). To choose the query node u, we randomly pick 10,000 queries from each medium-sized graph (WebB and WebG), and 1000 queries from each large-sized graph (CitP, SocL, UK05, and IT04). To ensure that the selected queries cover a broad range of possible queries, for each dataset, we first sort all nodes in V in descending order of their importance, measured by PageRank (PR), and then split all nodes into 10 buckets: nodes with \(\text {PR} \in [0.9, 1]\) fall into the first bucket, nodes with \(\text {PR} \in [0.8, 0.9)\) into the second, and so on. For every medium-sized (resp. large-sized) graph, we randomly select 1000 (resp. 100) queries from each bucket, so that u covers a wide variety of query types. To generate dynamical updates, we follow the settings in [20]: we randomly choose 1000 edges and consider 80% of them as insertions and 20% as deletions.
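The bucketed query-selection protocol can be sketched as follows (we assume the PageRank scores are normalized to [0, 1] before bucketing, which the bucket boundaries above imply; the function name is ours):

```python
import random

def sample_queries(pagerank, per_bucket, seed=0):
    # Split nodes into 10 buckets by normalized PageRank, then sample
    # uniformly from each bucket so queries span all importance levels.
    rng = random.Random(seed)
    lo, hi = min(pagerank.values()), max(pagerank.values())
    buckets = [[] for _ in range(10)]
    for v, pr in pagerank.items():
        norm = (pr - lo) / (hi - lo) if hi > lo else 0.0
        buckets[min(9, int(norm * 10))].append(v)
    queries = []
    for b in buckets:
        rng.shuffle(b)
        queries.extend(b[:per_bucket])
    return queries
```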
8.2.7 Precision
8.2.8 Memory of IncSRAllP
Figure 19 evaluates the memory usage of IncSRAllP and LTSF over six real datasets. We observe that both algorithms scale well on large graphs. On WebB, IT04, and UK05, the memory space of IncSRAllP is almost the same as that of LTSF; on WebG, CitP, and SocL, the memory usage of IncSRAllP is 5–8x less than that of LTSF. This is because, unlike LTSF, which needs to load a one-way graph into memory, IncSRAllP only needs to prepare the vectors \(\varvec{\xi }_k, \varvec{\eta }_k\), the old \(\mathbf {S}_{\star ,i}\), and the old \(\mathbf {S}_{\star ,j}\) to assess the changes to each column of \(\mathbf {S}\) in response to an edge update (i, j). The memory space of these auxiliary vectors can sometimes be comparable to the size of the one-way graph and sometimes much smaller; in either case, it is linear in n, as we do not need \(n^2\) space to store the entire old \(\mathbf {S}\). Note that the old \(\mathbf {S}_{\star ,j}\) and \(\mathbf {S}_{\star ,i}\) can be computed on demand in only linear memory by our partial-pairs SimRank approach [27]. Moreover, we see that, as the scale of the real datasets grows, the memory space of IncSRAllP increases only linearly, highlighting its scalability on large graphs.
Figure 20 further depicts the average memory usage of IncSRAllP for each case of edge insertion. We randomly pick 1000 edges \(\{(i,j)\}\) for insertion updates on each dataset, where nodes i and j each have probability 1/2 of being chosen from the old vertex set V. The average memory space of IncSRAllP for each case is reported in Fig. 20. We see that, on each dataset, the memory required for Cases (C0), (C1), and (C2) is similar, whereas the memory space of Case (C3) is much smaller than that of the other cases. The reason is that, for Cases (C0), (C1), and (C2), IncSRAllP needs linear memory to store some auxiliary vectors (e.g., \(\varvec{\xi }_k, \varvec{\eta }_k\), \(\mathbf {y}\), the old \(\mathbf {S}_{\star ,i}\), and the old \(\mathbf {S}_{\star ,j}\)) for updating SimRank scores, whereas for Case (C3), no auxiliary vectors are required for precomputation, thus saving much memory space.
9 Related work
9.1 Incremental SimRank
Li et al. [13] devised an interesting matrix representation of SimRank and were the first to propose an SVD-based method for incrementally updating all-pairs SimRanks, which requires \(O(r^4n^2)\) time and \(O({r}^{2}n^2)\) memory. However, their incremental techniques are inherently inexact, with no guaranteed accuracy.
Recently, Shao et al. [20] provided an excellent exposition of a two-stage random sampling framework, TSF, for top-k SimRank dynamic search w.r.t. a query u. In the preprocessing stage, they sampled a collection of one-way graphs to index random walks in a scalable manner. In the query stage, they retrieved similar nodes by pruning unqualified nodes based on the connectivity of the one-way graphs. To retrieve the top-k nodes with the highest SimRank scores in a single column \(\mathbf {S}_{\star ,u}\), [20] requires \(O(K^2 R_q R_g)\) average query time to retrieve \(\mathbf {S}_{\star ,u}\), plus \(O(n \log k)\) time to return the top-k results from \(\mathbf {S}_{\star ,u}\). The recent work of Jiang et al. [10] has argued that, to retrieve \(\mathbf {S}_{\star ,u}\), the querying time of [20] is \(O(K n R_q R_g)\); the factor n is due to the time to traverse the reversed one-way graph, and in the worst case, all n nodes are visited. Moreover, Jiang et al. [10] observed that the probabilistic error guarantee of Shao et al.'s method relies on the assumption that no cycle in the given graph has a length shorter than K, and they proposed READS, a new efficient indexing scheme that improves the precision and indexing space for dynamic SimRank search. The query time of READS is O(rn) to retrieve one column \(\mathbf {S}_{\star ,u}\), where r is the number of sets of random walks. Hence, TSF and READS are highly efficient for top-k single-source SimRank search. In comparison, our dynamical method focuses on all \(n^2\) pairs of SimRank scores, computed in \(O(K(m+{\textsf {AFF}}))\) time. The optimization methods in this work are based on a rank-one Sylvester matrix equation characterizing the changes to all \(n^2\) pairs of SimRank scores, which is fundamentally different from the methods of [10, 20] that maintain one-way graphs (or SA forests).
It is important to note that, for large-scale graphs, our incremental methods do not need to memorize all \(n^2\) pairs of old SimRank scores, and can dynamically update \(\mathbf {S}\) column by column in only \(O(Kn+m)\) memory. For updating each column of \(\mathbf {S}\), our experiments in Sect. 8 verify that our memory-efficient incremental method is scalable on large real graphs while running 4–7 times faster than the dynamical TSF [20] per edge update, owing to the high cost that [20] incurs in merging the one-way graph's log buffers for TSF indexing.
There has also been a body of work on incremental computation of other graph-based relevance measures. Bahmani et al. [1] utilized the Monte Carlo method to incrementally compute Personalized PageRank. Desikan et al. [3] proposed an excellent incremental PageRank algorithm for node updates; their central idea revolves around the first-order Markov chain. Sarma et al. [19] presented an excellent exposition of randomly sampling short random walks and merging them together to estimate PageRank on graph streams.
9.2 Batch SimRank
In comparison with incremental algorithms, batch SimRank computation has been well studied on static graphs.
For deterministic methods, Jeh and Widom [9] were the first to propose an iterative paradigm for SimRank, entailing \(O(Kd^2n^2)\) time for K iterations, where d is the average in-degree. Later, Lizorkin et al. [16] utilized partial-sums memorization to speed up SimRank computation to \(O(Kdn^2)\). Yu et al. [24] further improved SimRank computation to \(O(Kd'n^2)\) time (with \(d' \le d\)) via a fine-grained memorization that shares the common parts among different partial sums. Fujiwara et al. [6] exploited the matrix form of [13] to find the top-k similar nodes in O(n) time w.r.t. a given query node. All these methods require \(O(n^2)\) memory to output all pairs of SimRanks. Recently, Kusumoto et al. [11] proposed a linearized method that requires only O(dn) memory and \(O(K^2 d n^2)\) time to compute all pairs of SimRanks. The recent work of [27] proposes an efficient aggregate method for computing partial pairs of SimRank scores. The main ideas of partial-pairs SimRank search are also incorporated into the incremental model of our work, achieving linear memory to update all \(n^2\) pairs of similarities.
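For concreteness, the naive iterative paradigm of [9] can be sketched as follows (deliberately unoptimized, hence the \(O(Kd^2n^2)\) cost; the memorization techniques of [16, 24] accelerate the innermost sum):

```python
# Naive Jeh-Widom SimRank iteration: S(a,b) = C/(|I(a)||I(b)|) *
# sum over in-neighbor pairs, with S(a,a) = 1 fixed at every step.
def simrank(in_nbrs, C=0.6, K=15):
    n = len(in_nbrs)
    S = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(K):
        T = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
        for a in range(n):
            for b in range(n):
                if a != b and in_nbrs[a] and in_nbrs[b]:
                    s = sum(S[i][j] for i in in_nbrs[a] for j in in_nbrs[b])
                    T[a][b] = C * s / (len(in_nbrs[a]) * len(in_nbrs[b]))
        S = T
    return S
```

For instance, with two nodes 2 and 3 each pointed to by nodes 0 and 1, the iteration gives \(s(2,3) = C \cdot 2/4 = 0.3\) for \(C = 0.6\).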
For parallel SimRank computing, Li et al. [15] proposed a highly parallelizable algorithm, called CloudWalker, for largescale SimRank search on Spark with ten machines. Their method consists of offline and online phases. For offline processing, an indexing vector is derived by solving a linear system in parallel. For online querying, similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space.
The recent work of Zhang et al. [33] conducted extensive experiments and discussed in depth many existing SimRank algorithms in a unified environment using different metrics, encompassing efficiency, effectiveness, robustness, and scalability. Their empirical study of 10 algorithms from 2002 to 2015 shows that, despite many recent research efforts, the running time and precision of known algorithms still leave much room for improvement. This work takes a further step toward this goal.
Fogaras and Rácz [5] proposed PSimRank, which estimates a single-pair SimRank s(a, b) in linear time from the probability that two random surfers, starting from a and b, will finally meet at a node. Li et al. [14] harnessed random walks to compute the local SimRank for a single node pair. Later, Lee et al. [12] employed the Monte Carlo method to find the top-k SimRank node pairs. Tao et al. [22] proposed an excellent two-stage approach for the top-k SimRank-based similarity join.
Recently, Tian and Xiao [23] proposed SLING, an efficient index structure for static SimRank computation. SLING requires \(O(n/\epsilon )\) space and \(O(m/\epsilon +n\log \tfrac{n}{\delta } /\epsilon )\) precomputation time, and answers any single-pair (resp. single-source) query in \(O(1/\epsilon )\) (resp. \(O(n/\epsilon )\)) time.
10 Conclusions
In this article, we study the problem of incrementally updating SimRank scores on time-varying graphs. Our complete scheme, IncSRAll, consists of five ingredients: (1) For edge updates that do not accompany new node insertions, we characterize the SimRank update matrix \({\varvec{\Delta }}{} \mathbf{S}\) via a rank-one Sylvester equation. Based on this, a novel efficient algorithm is devised, which reduces the incremental computation of SimRank from \(O(r^4n^2)\) to \(O(Kn^2)\) for each link update. (2) To eliminate unnecessary SimRank updates further, we also devise an effective pruning strategy that improves the incremental computation of SimRank to \(O(K(m+{\textsf {AFF}}))\), where \({\textsf {AFF}}\ (\ll n^2)\) is the size of the "affected areas" in the SimRank update matrix. (3) For edge updates that accompany new node insertions, we consider three insertion cases, according to which end of the inserted edge is a new node. For each case, we devise an efficient incremental SimRank algorithm that can support new node insertions and accurately update the affected similarities. (4) For batch updates, we also propose efficient batch incremental methods that can handle "similar sink edges" simultaneously and eliminate redundant edge updates. (5) To optimize the memory for all-pairs SimRank updates, we also devise a column-wise memory-efficient technique that significantly reduces the storage from \(O(n^2)\) to \(O(Kn+m)\), without the need to memorize \(n^2\) pairs of SimRank scores.
Our experimental evaluations on real and synthetic datasets demonstrate that (a) our incremental scheme is consistently 5–10 times faster than Li et al.'s SVD-based method; (b) our pruning strategy can speed up the incremental SimRank further by 3–6 times; (c) the batch update algorithm enables an extra 5–15% speedup, with just a little compromise in memory; (d) our memory-efficient incremental method is scalable on billion-edge graphs; for every edge update, its computational time can be 4–7 times faster than LTSF and its memory space can be 5–8 times less than that of LTSF; (e) for different cases of edge updates, Cases (C0) and (C2) entail more time than Case (C1), and Case (C3) runs the fastest.
Footnotes
 1.
\(\Delta G\) consists of a sequence of edges to be inserted/deleted.
 2.
 3.
 4.
According to [13], using our notation, \(r \le \text {rank}({\varvec{\Sigma }} + \mathbf {U}^\mathrm{T} \cdot {\varvec{\Delta }}{} \mathbf{Q} \cdot \mathbf {V})\), where \({\varvec{\Delta }}{} \mathbf{Q}\) is the changes to \(\mathbf {Q}\) for link updates.
 5.
In the next sections, we shall substantially reduce its time and memory complexity further.
 6.
References
1. Bahmani, B., Chowdhury, A., Goel, A.: Fast incremental and personalized PageRank. PVLDB 4(3), 173–184 (2010)
2. Berkhin, P.: Survey: a survey on PageRank computing. Internet Math. 2, 73–120 (2005)
3. Desikan, P.K., Pathak, N., Srivastava, J., Kumar, V.: Incremental PageRank computation on evolving graphs. In: WWW, pp. 1094–1095 (2005)
4. Fogaras, D., Rácz, B.: Scaling link-based similarity search. In: WWW, pp. 641–650 (2005)
5. Fogaras, D., Rácz, B.: Practical algorithms and lower bounds for similarity search in massive graphs. IEEE Trans. Knowl. Data Eng. 19, 585–598 (2007)
6. Fujiwara, Y., Nakatsuji, M., Shiokawa, H., Onizuka, M.: Efficient search algorithm for SimRank. In: ICDE, pp. 589–600 (2013)
7. Garg, S., Gupta, T., Carlsson, N., Mahanti, A.: Evolution of an online social aggregation network: an empirical study. In: Internet Measurement Conference, pp. 315–321 (2009)
8. He, G., Feng, H., Li, C., Chen, H.: Parallel SimRank computation on large graphs with iterative aggregation. In: KDD, pp. 543–552 (2010)
9. Jeh, G., Widom, J.: SimRank: a measure of structural-context similarity. In: KDD, pp. 538–543 (2002)
10. Jiang, M., Fu, A.W., Wong, R.C., Wang, K.: READS: a random walk approach for efficient and accurate dynamic SimRank. PVLDB 10(9), 937–948 (2017)
11. Kusumoto, M., Maehara, T., Kawarabayashi, K.: Scalable similarity search for SimRank. In: SIGMOD, pp. 325–336 (2014)
12. Lee, P., Lakshmanan, L.V., Yu, J.X.: On top-\(k\) structural similarity search. In: ICDE, pp. 774–785 (2012)
13. Li, C., Han, J., He, G., Jin, X., Sun, Y., Yu, Y., Wu, T.: Fast computation of SimRank for static and dynamic information networks. In: EDBT, pp. 465–476 (2010)
14. Li, P., Liu, H., Yu, J.X., He, J., Du, X.: Fast single-pair SimRank computation. In: SDM, pp. 571–582 (2010)
15. Li, Z., Fang, Y., Liu, Q., Cheng, J., Cheng, R., Lui, J.C.S.: Walking in the cloud: parallel SimRank at scale. PVLDB 9(1), 24–35 (2015)
16. Lizorkin, D., Velikhov, P., Grinev, M.N., Turdakov, D.: Accuracy estimate and optimization techniques for SimRank computation. PVLDB 1, 422–433 (2008)
17. Ntoulas, A., Cho, J., Olston, C.: What's new on the web? The evolution of the web from a search engine perspective. In: WWW, pp. 1–12 (2004)
18. Rothe, S., Schütze, H.: CoSimRank: a flexible & efficient graph-theoretic similarity measure. In: ACL, pp. 1392–1402 (2014)
19. Sarma, A.D., Gollapudi, S., Panigrahy, R.: Estimating PageRank on graph streams. J. ACM 58, 13 (2011)
20. Shao, Y., Cui, B., Chen, L., Liu, M., Xie, X.: An efficient similarity search framework for SimRank over large dynamic graphs. PVLDB 8(8), 838–849 (2015)
21. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: PathSim: meta path-based top-\(k\) similarity search in heterogeneous information networks. PVLDB 4, 992–1003 (2011)
22. Tao, W., Yu, M., Li, G.: Efficient top-\(k\) SimRank-based similarity join. PVLDB 8(3), 317–328 (2014)
23. Tian, B., Xiao, X.: SLING: a near-optimal index structure for SimRank. In: SIGMOD, pp. 1859–1874 (2016)
24. Yu, W., Lin, X., Zhang, W.: Towards efficient SimRank computation on large networks. In: ICDE, pp. 601–612 (2013)
25. Yu, W., Lin, X., Zhang, W.: Fast incremental SimRank on link-evolving graphs. In: ICDE, pp. 304–315 (2014)
26. Yu, W., McCann, J.A.: Sig-SR: SimRank search over singular graphs. In: SIGIR, pp. 859–862 (2014)
27. Yu, W., McCann, J.A.: Efficient partial-pairs SimRank search for large networks. PVLDB 8(5), 569–580 (2015)
28. Yu, W., McCann, J.A.: High quality graph-based similarity retrieval. In: SIGIR, pp. 83–92 (2015)
29. Yu, W., Lin, X., Zhang, W., McCann, J.A.: Fast all-pairs SimRank assessment on large graphs and bipartite domains. IEEE Trans. Knowl. Data Eng. 27, 1810–1823 (2015)
30. Yu, W., McCann, J.A.: Gauging correct relative rankings for similarity search. In: CIKM, pp. 1791–1794 (2015)
31. Yu, W., McCann, J.A.: Random walk with restart over dynamic graphs. In: ICDM, pp. 589–598 (2016)
32. Yu, W., Lin, X., Zhang, W., McCann, J.A.: Dynamical SimRank search on time-varying networks. Technical report, arXiv:1711.00121 (2017)
33. Zhang, Z., Shao, Y., Cui, B., Zhang, C.: An experimental evaluation of SimRank-based similarity search algorithms. PVLDB 10(5), 601–612 (2017)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.