Scheduled approximation for Personalized PageRank with Utility-based Hub Selection

Abstract

As Personalized PageRank has been widely leveraged for ranking on a graph, the efficient computation of the Personalized PageRank Vector (PPV) becomes a prominent issue. In this paper, we propose FastPPV, an approximate PPV computation algorithm that is incremental and accuracy-aware. Our approach hinges on a novel paradigm of scheduled approximation: the computation is partitioned and scheduled for processing in an “organized” way, such that we can gradually improve our PPV estimation in an incremental manner and quantify the accuracy of our approximation at query time. Guided by this principle, we develop an efficient hub-based realization, where we adopt the metric of hub length to partition and schedule random walk tours so that the approximation error reduces exponentially over iterations. In addition, as tours are segmented by hubs, the shared substructures between different tours (around the same hub) can be reused to speed up query processing both within and across iterations. Given the key roles played by the hubs, we further investigate the problem of hub selection. In particular, we develop a conceptual model to select hubs based on two desirable properties of hubs, namely sharing and discriminating, and present several strategies to realize the conceptual model. Finally, we evaluate FastPPV over two real-world graphs, and show that it not only significantly outperforms two state-of-the-art baselines in both the online and offline phases, but also scales well on larger graphs. In particular, we are able to achieve near-constant time online query processing irrespective of graph size.

Notes

  1. http://www.dmoz.org/.

  2. http://www.informatik.uni-trier.de/~ley/db/.

  3. http://snap.stanford.edu/data/.

  4. Except the experiment on query distribution-aware hub selection, which will be discussed in Sect. 7.5.2.

Author information

Corresponding author

Correspondence to Fanwei Zhu.

Additional information

This material is based upon work partially supported by NSF Grant IIS 1018723, the research grant for the Human-centered Cyber-physical Systems Programme at the Advanced Digital Sciences Center of the University of Illinois at Urbana-Champaign, the Agency for Science, Technology and Research of Singapore, and Zhejiang Provincial Natural Science Foundation of China (Grant No. LQ14F020002). Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Appendix: Proof of theorems

Theorem 2 After iteration-\(k\), the L1 error \(\varphi ^{(k)}\) as defined in Eq. 5 satisfies the following bound:

$$\begin{aligned} \varphi ^{(k)} \le (1-\alpha )^{k+2} \end{aligned}$$

Proof

First, by Eqs. 6 and 3, we have the following:

$$\begin{aligned} \varphi ^{(k)}&= 1 - \sum _p \hat{\mathbf{r}}_{q}^{(k)}(p) \\&=1-\sum _{i=0}^k \sum _{t \in T^i} R(t). \end{aligned}$$
(26)

Second, by Definition 1, \(\forall t \in T^k, \mathcal {L}_h(t) = k\), and \(\forall t, \mathcal {L}_h(t) < \mathcal {L}(t)\), where \(\mathcal {L}(t)\) is the natural length of \(t\) (i.e., the number of edges in \(t\)). Thus, if \(\mathcal {L}(t) \le k + 1\), then \(\mathcal {L}_h(t) \le k\), implying \(\cup _{i=0}^k T^i \supseteq \cup _{i=0}^{k+1} S^i\), where \(S^i \triangleq \{t:\mathcal {L}(t) = i\}\). Since \(R(t) \ge 0\) for every tour \(t\), summing over the larger set cannot give a smaller total. Hence, the following holds:

$$\begin{aligned} \sum _{i=0}^k \sum _{t \in T^i} R(t) \ge \sum _{i=0}^{k+1} \sum _{t \in S^i} R(t). \end{aligned}$$
(27)

Third, we claim that

$$\begin{aligned} \sum _{t \in S^i} R(t)=(1-\alpha )^{i}\alpha , \end{aligned}$$
(28)

which can be shown by induction. The base case \(i=0\) is clearly true. In the induction step, suppose it is true for \(i=\ell \). All tours with length \(\ell +1\) must be extended from a tour of length \(\ell \) by one step. Consider a particular tour \(t'\) of length \(\ell \). The total reachability of all tours of length \(\ell +1\) that are extended from \(t'\) is \(R(t')(1-\alpha )\) based on Eq. 2. Hence, \(\sum _{t \in S^{\ell +1}} R(t)=\sum _{t' \in S^{\ell }} R(t')(1-\alpha ) = (1-\alpha )^\ell \alpha (1-\alpha ) = (1-\alpha )^{\ell +1} \alpha \), which proves the claim.
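The claim in Eq. 28 can be checked numerically on a small graph. The sketch below is illustrative only: it assumes that Eq. 2 (not reproduced in this appendix) gives \(R(t)\) as \(\alpha (1-\alpha )^{\mathcal {L}(t)}\) times the product of \(1/\text{out-degree}\) over the nodes the tour leaves, and that every node has at least one out-edge. The function names, the toy graph, and the value of \(\alpha \) are hypothetical.

```python
from fractions import Fraction


def tours(out_edges, q, length):
    """Yield every tour (as a tuple of nodes) that starts at q and has `length` edges."""
    if length == 0:
        yield (q,)
        return
    for prefix in tours(out_edges, q, length - 1):
        for nxt in out_edges[prefix[-1]]:
            yield prefix + (nxt,)


def reachability(out_edges, tour, alpha):
    """Assumed form of R(t): alpha * (1 - alpha)^L(t) * product of 1/outdeg(v)."""
    r = alpha * (1 - alpha) ** (len(tour) - 1)
    for v in tour[:-1]:  # every node the tour leaves
        r /= len(out_edges[v])
    return r


def level_mass(out_edges, q, alpha, length):
    """Sum of R(t) over S^length, the tours from q with exactly `length` edges."""
    return sum(reachability(out_edges, t, alpha) for t in tours(out_edges, q, length))


if __name__ == "__main__":
    alpha = Fraction(3, 20)  # illustrative teleport probability, not from the paper
    graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b", "c"]}  # hypothetical toy graph
    for i in range(5):
        # Compare the enumerated sum with the closed form (1 - alpha)^i * alpha.
        print(i, level_mass(graph, "a", alpha, i) == (1 - alpha) ** i * alpha)
```

Exact rational arithmetic is used so that the comparison with \((1-\alpha )^{i}\alpha \) is an equality test rather than a floating-point tolerance check.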

Finally, combining these results (Eqs. 26, 27 and 28), we can derive that

$$\begin{aligned} \varphi ^{(k)}&= 1-\sum _{i=0}^k \sum _{t \in T^i} R(t)\\&\le 1 - \sum _{i=0}^{k+1} \sum _{t \in S^i} R(t)\\&= 1 - \sum _{i=0}^{k+1}(1-\alpha )^{i}\alpha , \end{aligned}$$

which simplifies to \(\varphi ^{(k)} \le (1-\alpha )^{k+2}\). \(\square \)
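For completeness, the simplification in the last step is the closed form of a finite geometric series:

$$\begin{aligned} \sum _{i=0}^{k+1}(1-\alpha )^{i}\alpha&= \alpha \cdot \frac{1-(1-\alpha )^{k+2}}{1-(1-\alpha )} = 1-(1-\alpha )^{k+2}, \\ 1 - \sum _{i=0}^{k+1}(1-\alpha )^{i}\alpha&= (1-\alpha )^{k+2}. \end{aligned}$$

As an illustration only, taking \(\alpha = 0.15\) (a value assumed here, not stated in this appendix) and \(k = 2\) gives a bound of \(0.85^{4} \approx 0.522\).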

Theorem 5 Let \(|V_{C_i}|\) be the number of nodes in community \(C_i\). Then \(I(T_{C_i}) \approx |V_{C_i}|\).

Proof

We derive this computation of \(I(T_{C_i})\) step by step as follows:

$$\begin{aligned} I(T_{C_i})&\overset{1}{=}\sum _{t\in T_{C_i};\,\mathcal {L}(t)\le k_i}R(t) \\&\overset{2}{=}\sum _{\mathcal {L}(t)=1}^{k_i}\ \sum _{t\in T_{C_i}} \left( \frac{1}{d_i}\right) ^{\mathcal {L}(t)}\cdot \alpha \cdot (1-\alpha )^{\mathcal {L}(t)} \\&\overset{3}{=}\sum _{\mathcal {L}(t)=1}^{k_i} |V_{C_i}|\cdot d_i^{\mathcal {L}(t)} \cdot \left( \frac{1}{d_i}\right) ^{\mathcal {L}(t)} \cdot \alpha \cdot (1-\alpha )^{\mathcal {L}(t)} \\&\overset{4}{=}|V_{C_i}|\cdot \alpha \cdot \sum _{\mathcal {L}(t)=1}^{k_i} (1-\alpha )^{\mathcal {L}(t)} \\&\overset{5}{\approx }|V_{C_i}| \end{aligned}$$

First, in step 1, we define the importance of \(T_{C_i}\) as the overall importance of all tours of length at most \(k_i\). Here, we apply an upper bound \(k_i\) on the tour length to exclude tours of infinite length in case \(T_{C_i}\) is cyclic; if \(T_{C_i}\) is acyclic, \(k_i\) simply equals the length of the longest tour in it. Next, in step 2, we group these tours by their length \(\mathcal {L}(t)\) so that we can calculate the importance of the tours of each length (from \(1\) to \(k_i\)) according to the P-inverse distance definition. Subsequently, we approximate the number of tours of each length \(\mathcal {L}(t)\) using the average out-degree \(d_i\). Specifically, for an arbitrary node \(q \in T_{C_i}\), there are \(d_i\) length-\(1\) tours starting at \(q\) in \(T_{C_i}\); each of \(q\)’s out-neighbors in turn has \(d_i\) out-neighbors, constituting \(d_i^2\) length-\(2\) tours from \(q\). In general, there are \(d_i^{\mathcal {L}(t)}\) length-\(\mathcal {L}(t)\) tours starting from an arbitrary node \(q\), and thus in \(T_{C_i}\), which contains \(|V_{C_i}|\) nodes, the total number of length-\(\mathcal {L}(t)\) tours is \(|V_{C_i}|\cdot d_i^{\mathcal {L}(t)}\). In step 3, we therefore rewrite the overall importance as the number of length-\(\mathcal {L}(t)\) tours multiplied by the importance of each such tour. In step 4, we cancel the common factors and obtain \(I(T_{C_i})=|V_{C_i}|\cdot \alpha \cdot \sum _{\mathcal {L}(t)=1}^{k_i} (1-\alpha )^{\mathcal {L}(t)}\). Since \(1-\alpha \) is smaller than \(1\), we can always find an \({ x}'\) such that \((1-\alpha )^{\mathcal {L}(t)} \approx 0\) for all \(\mathcal {L}(t)>{ x}'\). Thus, we finally obtain \(I(T_{C_i}) \approx |V_{C_i}|\) in step 5. \(\square \)
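Steps 3 and 4 of this derivation can be evaluated directly. The following is a minimal sketch under the proof's own simplification that every node has exactly \(d_i\) out-neighbors inside the community; the function name and all numeric values are hypothetical. Under that assumption the factor multiplying \(|V_{C_i}|\) has the closed form \((1-\alpha )\bigl (1-(1-\alpha )^{k_i}\bigr )\), so the sketch reproduces \(I(T_{C_i}) \approx |V_{C_i}|\) up to a constant that tends to \(1-\alpha \) as \(k_i\) grows and that does not depend on the community.

```python
from fractions import Fraction


def community_importance(num_nodes, out_degree, alpha, k):
    """Evaluate steps 3 and 4 of the derivation exactly for a uniform out-degree."""
    total = Fraction(0)
    for length in range(1, k + 1):
        # Step 3: |V| * d^L tours of this length under the uniform-degree assumption.
        num_tours = num_nodes * out_degree ** length
        # Importance of one such tour: (1/d)^L * alpha * (1 - alpha)^L.
        per_tour = Fraction(1, out_degree) ** length * alpha * (1 - alpha) ** length
        total += num_tours * per_tour
    return total


if __name__ == "__main__":
    alpha = Fraction(3, 20)  # illustrative value, not from the paper
    for k in (1, 5, 20, 50):
        # Ratio I(T_C) / |V_C| for a hypothetical community of 1000 nodes.
        ratio = community_importance(1000, 4, alpha, k) / 1000
        print(k, float(ratio))
```

The out-degree cancels exactly in each term, which is why the result depends only on the node count, \(\alpha \), and \(k_i\).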

About this article

Cite this article

Zhu, F., Fang, Y., Chang, K.CC. et al. Scheduled approximation for Personalized PageRank with Utility-based Hub Selection. The VLDB Journal 24, 655–679 (2015). https://doi.org/10.1007/s00778-014-0376-8

  • DOI: https://doi.org/10.1007/s00778-014-0376-8

Keywords

  • Personalized PageRank
  • Scheduled approximation
  • Accuracy-aware
  • Incremental enhancement
  • Hub selection