ExactSim: benchmarking single-source SimRank algorithms with high-precision ground truths

Abstract

SimRank is a popular measure of node-to-node similarity based on graph topology. In recent years, single-source and top-k SimRank queries have received increasing attention due to their applications in web mining, social network analysis, and spam detection. However, a fundamental obstacle in studying SimRank has been the lack of ground truths. The only exact algorithm, the Power Method, is computationally infeasible on graphs with more than \(10^6\) nodes. Consequently, no existing work has evaluated the actual accuracy of single-source and top-k SimRank algorithms on large real-world graphs. In this paper, we present ExactSim, the first algorithm that computes exact single-source and top-k SimRank results on large graphs. With high probability, the algorithm produces ground truths with precision up to 7 decimal places. Using the ground truths computed by ExactSim, we present the first experimental study of the accuracy/cost trade-offs of existing approximate SimRank algorithms on large real-world and synthetic graphs. Finally, we use the ground truths to explore various properties of SimRank distributions on large graphs.
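To make the scalability obstacle concrete, the following is a minimal sketch of the classic all-pairs SimRank iteration that follows directly from the SimRank definition of Jeh and Widom. It is an illustration only, not necessarily the exact Power Method formulation benchmarked in the paper. The function name `simrank_power_method`, the decay factor `c = 0.6`, and the iteration count are assumptions chosen for the example. The dense n-by-n score table is the reason this approach breaks down near \(10^6\) nodes, where it would need on the order of \(10^{12}\) entries.

```python
def simrank_power_method(n, edges, c=0.6, iterations=10):
    """Naive all-pairs SimRank on a directed graph with nodes 0..n-1.

    Hypothetical helper for illustration; c and iterations are assumed
    defaults, not values taken from the paper.
    """
    # Collect the in-neighbors of every node.
    in_nbrs = [[] for _ in range(n)]
    for src, dst in edges:
        in_nbrs[dst].append(src)

    # Start from the identity: each node is fully similar to itself.
    sim = [[1.0 if u == v else 0.0 for v in range(n)] for u in range(n)]

    for _ in range(iterations):
        new = [[0.0] * n for _ in range(n)]
        for u in range(n):
            new[u][u] = 1.0
            for v in range(u + 1, n):
                # By definition the score is 0 if either node has no in-neighbors.
                if not in_nbrs[u] or not in_nbrs[v]:
                    continue
                # Average similarity over all pairs of in-neighbors, scaled by c.
                total = sum(sim[a][b] for a in in_nbrs[u] for b in in_nbrs[v])
                score = c * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
                new[u][v] = new[v][u] = score
        sim = new
    return sim


# Tiny example: node 0 points to nodes 1 and 2, so 1 and 2 share their
# only in-neighbor and their similarity equals the decay factor c.
scores = simrank_power_method(3, [(0, 1), (0, 2)])
print(scores[1][2])
```

A single-source query for node u corresponds to one row of this table, which is why computing that row without materializing the full table is the interesting problem.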

Notes

  1. http://snap.stanford.edu/data

  2. http://law.di.unimi.it/datasets.php

  3. http://konect.cc/categories/Hyperlink/

Acknowledgements

Zhewei Wei was supported by the National Natural Science Foundation of China (NSFC) No. 61972401 and No. 61932001, by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China under Grant 18XNLG21, and by Alibaba Group through the Alibaba Innovative Research Program. The work was partially done at the Beijing Key Laboratory of Big Data Management and Analysis Methods, MOE Key Lab DEKE, Renmin University of China, and Pazhou Lab, Guangzhou, 510330, China. Hanzhi Wang was supported by the Outstanding Innovative Talents Cultivation Funded Programs 2020 of Renmin University of China. Ye Yuan was supported by NSFC No. 61932004 and No. 61622202 and by FRFCU No. N181605012. Ji-Rong Wen was supported by NSFC No. 61832017 and by the Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098. Xiaoyong Du was supported by NSFC No. U1711261.

Author information

Corresponding author

Correspondence to Zhewei Wei.

Cite this article

Wang, H., Wei, Z., Liu, Y. et al. ExactSim: benchmarking single-source SimRank algorithms with high-precision ground truths. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00672-7

Keywords

  • SimRank
  • Single-source
  • Exact computation
  • Ground truths
  • Power-law
  • Benchmarks