The VLDB Journal, Volume 19, Issue 1, pp 45–66

Accuracy estimate and optimization techniques for SimRank computation

  • Dmitry Lizorkin
  • Pavel Velikhov
  • Maxim Grinev
  • Denis Turdakov
Special Issue Paper

Abstract

Measuring the similarity between objects is useful in many areas of computer science, including information retrieval. SimRank is a simple and intuitive similarity measure based on a graph-theoretic model. SimRank is typically computed iteratively, in the spirit of PageRank. However, existing work on SimRank provides no accuracy estimate for the iterative computation and has discouraging time complexity. In this paper, we present a technique for estimating the accuracy of computing SimRank iteratively. This technique gives the number of iterations required to achieve a desired accuracy. We also present optimization techniques that improve the computational complexity of the iterative algorithm from O(n^4) in the worst case to min(O(nl), O(n^3 / log_2 n)), where n denotes the number of objects and l denotes the number of object-to-object relationships. We also introduce a threshold-sieving heuristic, together with its accuracy estimate, that further improves the efficiency of the method. As a practical illustration of our techniques, we computed SimRank scores on a subset of the English Wikipedia corpus, consisting of the complete set of articles and category links.
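As a rough illustration of the iterative computation the abstract refers to, the sketch below implements the naive SimRank iteration (the unoptimized baseline, not the optimized algorithm of the paper) and a helper that picks an iteration count for a target accuracy. The decay factor `C`, the `in_neighbors` dictionary layout, and both function names are assumptions made for this example. The helper further assumes an error bound of the form s(a,b) − s_k(a,b) ≤ C^(k+1), which the abstract itself does not state.

```python
import math


def simrank(in_neighbors, C=0.8, iterations=5):
    """Naive iterative SimRank.

    in_neighbors maps every node to the list of nodes that link to it
    (hypothetical input layout chosen for this sketch).
    """
    nodes = list(in_neighbors)
    # s_0(a, b) = 1 if a == b, else 0
    scores = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        updated = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    updated[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # A node without in-neighbors is similar only to itself.
                    updated[a][b] = 0.0
                    continue
                # Average similarity over all pairs of in-neighbors, scaled by C.
                total = sum(scores[i][j] for i in ia for j in ib)
                updated[a][b] = C * total / (len(ia) * len(ib))
        scores = updated
    return scores


def iterations_for_accuracy(eps, C=0.8):
    """Smallest k with C**(k + 1) <= eps, under the assumed error bound."""
    return max(0, math.ceil(math.log(eps) / math.log(C)) - 1)


# Tiny example: node A links to both B and C.
graph = {"A": [], "B": ["A"], "C": ["A"]}
k = iterations_for_accuracy(0.001, C=0.8)
result = simrank(graph, C=0.8, iterations=k)
print(k, result["B"]["C"])
```

The doubly nested loop over node pairs, with an inner sum over pairs of in-neighbors, is what makes this baseline expensive on dense graphs.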

Keywords

Similarity measure, Graph theory, SimRank, Algorithm, Computational complexity

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Dmitry Lizorkin (1)
  • Pavel Velikhov (1)
  • Maxim Grinev (1)
  • Denis Turdakov (1)
  1. Institute for System Programming of the Russian Academy of Sciences, Moscow, Russia
