Accuracy estimate and optimization techniques for SimRank computation

Lizorkin, Dmitry; Velikhov, Pavel; Grinev, Maxim; Turdakov, Denis

doi:10.1007/s00778-009-0168-8

Special Issue Paper
Published: 06 October 2009

The VLDB Journal, volume 19, pages 45–66 (2010)

Abstract

A measure of similarity between objects is a useful tool in many areas of computer science, including information retrieval. SimRank is a simple and intuitive measure of this kind, based on a graph-theoretic model. SimRank is typically computed iteratively, in the spirit of PageRank. However, existing work on SimRank lacks an accuracy estimate for the iterative computation and has discouraging time complexity. In this paper, we present a technique to estimate the accuracy of computing SimRank iteratively. This technique provides a way to find the number of iterations required to achieve a desired accuracy when computing SimRank. We also present optimization techniques that improve the computational complexity of the iterative algorithm from O(n⁴) in the worst case to min(O(nl), O(n³/log₂ n)), with n denoting the number of objects and l denoting the number of object-to-object relationships. We also introduce a threshold-sieving heuristic, together with its accuracy estimate, that further improves the efficiency of the method. As a practical illustration of our techniques, we computed SimRank scores on a subset of the English Wikipedia corpus, consisting of the complete set of articles and category links.
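To make the iterative computation concrete, the following is a minimal Python sketch of the unoptimized iteration, following the standard SimRank recurrence of Jeh and Widom: two distinct objects are similar to the extent that their in-neighbors are similar, scaled by a decay factor C. It is the naive baseline, not the optimized algorithm of this paper. The function names, the dictionary-based graph representation, and the default C = 0.8 are illustrative assumptions. The helper `iterations_for_accuracy` additionally assumes an error bound of the form s(a,b) − s_k(a,b) ≤ C^(k+1); the abstract does not state the bound, so treat that formula as an assumption to be checked against the full text.

```python
def iterations_for_accuracy(eps, c=0.8):
    """Smallest k such that c**(k + 1) <= eps.

    Assumes the error after k iterations is bounded by c**(k + 1);
    this bound is an assumption, not quoted from the abstract.
    """
    if not (0.0 < eps < 1.0 and 0.0 < c < 1.0):
        raise ValueError("eps and c must both lie strictly between 0 and 1")
    k = 0
    while c ** (k + 1) > eps:
        k += 1
    return k


def simrank_naive(in_neighbors, c=0.8, iterations=5):
    """Naive iterative SimRank over a directed graph.

    in_neighbors maps every node to the list of nodes pointing at it;
    every node that appears in a list must also be a key.
    Each iteration sums over all in-neighbor pairs of all node pairs,
    which is what leads to the O(n^4) worst case on dense graphs.
    """
    nodes = list(in_neighbors)
    # s_0(a, b) = 1 if a == b, else 0
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0  # every object is fully similar to itself
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[a][b] = 0.0  # no in-neighbors: similarity defined as 0
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical three-node graph: u links to both a and b.
graph = {"u": [], "a": ["u"], "b": ["u"]}
k = iterations_for_accuracy(0.001, c=0.8)
scores = simrank_naive(graph, c=0.8, iterations=k)
print(k, scores["a"]["b"])
```

In this toy graph, a and b share their only in-neighbor, so their score equals C after the first iteration and stays there. On larger graphs the nested sums dominate the running time, which is the cost the paper's optimizations target.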

Author information

Authors and Affiliations

Institute for System Programming of the Russian Academy of Sciences, B. Kommunisticheskaya avenue, 25, 109004, Moscow, Russia
Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev & Denis Turdakov

Corresponding author

Correspondence to Dmitry Lizorkin.

About this article

Cite this article

Lizorkin, D., Velikhov, P., Grinev, M. et al. Accuracy estimate and optimization techniques for SimRank computation. The VLDB Journal 19, 45–66 (2010). https://doi.org/10.1007/s00778-009-0168-8

Received: 12 January 2009
Revised: 16 July 2009
Accepted: 31 August 2009
Published: 06 October 2009
Issue Date: February 2010
DOI: https://doi.org/10.1007/s00778-009-0168-8
