Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3243)

Abstract

Deciding which kind of visit accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. It is known that breadth-first visits work well, as they tend to discover pages with high PageRank early on in the crawl. Indeed, this visit order is much better than depth-first, which is in turn even worse than a random visit; nevertheless, breadth-first can be outperformed by an omniscient visit that chooses, at every step, the node of highest PageRank in the frontier.

This paper discusses a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields relative ranks that agree with the ones the nodes have in the complete graph; ranks are compared using Kendall’s τ.
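The measure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph, the damping factor 0.85, the uniform treatment of dangling nodes, the choice of the induced subgraph as "the graph obtained after a partial visit", and the simple quadratic τ (tied pairs counted as neither concordant nor discordant) are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def pagerank(graph, alpha=0.85, iters=100):
    """Power iteration on a dict node -> list of successors.

    Assumption: the rank of dangling nodes is redistributed uniformly.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        nxt = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = alpha * rank[v] / len(graph[v])
                for w in graph[v]:
                    nxt[w] += share
        rank = nxt
    return rank


def bfs_order(graph, seed):
    """Order in which a breadth-first visit starting at seed reaches the nodes."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def induced_subgraph(graph, visited):
    """Keep only the visited nodes and the arcs between them."""
    keep = set(visited)
    return {v: [w for w in graph[v] if w in keep] for v in visited}


def kendall_tau(x, y):
    """(concordant - discordant) / number of pairs; quadratic, for small inputs."""
    conc = disc = 0
    for (a, b), (c, d) in combinations(zip(x, y), 2):
        s = (a - c) * (b - d)
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (conc - disc) / pairs


def tau_after_visit(graph, order, k):
    """Compare PageRank on the first k visited nodes with PageRank on the full graph."""
    visited = order[:k]
    partial = pagerank(induced_subgraph(graph, visited))
    full = pagerank(graph)
    return kendall_tau([partial[v] for v in visited], [full[v] for v in visited])


# Hypothetical toy graph, used only to exercise the functions.
toy_graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [0], 4: [5], 5: [0]}
order = bfs_order(toy_graph, 0)
for k in range(2, len(order) + 1):
    print(k, tau_after_visit(toy_graph, order, k))
```

Replacing `bfs_order` with another visit order (depth-first, random, or highest-PageRank-first over the frontier) and plotting τ against the fraction of visited nodes would give curves of the kind the abstract refers to, under the assumptions stated above.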

We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality-first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.

This work has been partially supported by MIUR COFIN “Linguaggi formali e automi: metodi, modelli e applicazioni” and by a “Finanziamento per grandi e mega attrezzature scientifiche” of the Università degli Studi di Milano.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Boldi, P., Santini, M., Vigna, S. (2004). Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. In: Leonardi, S. (eds) Algorithms and Models for the Web-Graph. WAW 2004. Lecture Notes in Computer Science, vol 3243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30216-2_14

  • DOI: https://doi.org/10.1007/978-3-540-30216-2_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23427-2

  • Online ISBN: 978-3-540-30216-2

  • eBook Packages: Springer Book Archive