Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations

  • Paolo Boldi
  • Massimo Santini
  • Sebastiano Vigna
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3243)

Abstract

Deciding which kind of visit accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. It is known that breadth-first visits work well, as they tend to discover pages with high PageRank early in the crawl. Indeed, this visit order is much better than depth-first, which is in turn even worse than a random visit; nevertheless, breadth-first can be outperformed by an omniscient visit that chooses, at every step, the node of highest PageRank in the frontier.
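The visit orders mentioned above can be sketched in a few lines. This is only an illustrative sketch, not the authors' crawler: the graph is assumed to be a dictionary mapping each node to its list of successors, and `score` is a hypothetical dictionary holding PageRank values of the complete graph, assumed to be known in advance (which is what makes the last visit "omniscient").

```python
import heapq
from collections import deque


def breadth_first(graph, seed):
    """Visit nodes in order of discovery (FIFO frontier)."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def depth_first(graph, seed):
    """Visit nodes using a LIFO frontier."""
    seen, order, stack = set(), [], [seed]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        # Reverse so the first successor is expanded first.
        stack.extend(reversed(graph[v]))
    return order


def omniscient(graph, seed, score):
    """Always expand the frontier node with the highest score.

    `score` stands in for the PageRank of the complete graph
    (an assumption of this sketch); ties are broken by node order.
    """
    seen, order = {seed}, []
    frontier = [(-score[seed], seed)]  # max-heap via negated scores
    while frontier:
        _, v = heapq.heappop(frontier)
        order.append(v)
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                heapq.heappush(frontier, (-score[w], w))
    return order
```

On a toy graph the three strategies already produce different orders, which is what makes comparing them meaningful.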

This paper discusses a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields relative ranks that agree with the ones the nodes have in the complete graph; ranks are compared using Kendall’s τ.
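The measure described above can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper: the graph is a dictionary from nodes to successor lists, PageRank is computed by plain power iteration with damping factor 0.85 and uniform redistribution of dangling mass, the partial visit is breadth-first, and Kendall's τ is computed in its tie-corrected τ-b form by a quadratic loop over pairs (adequate only for small graphs). All function names are hypothetical.

```python
from collections import deque
from itertools import combinations
from math import sqrt


def pagerank(graph, alpha=0.85, iterations=100):
    """Power iteration; dangling nodes spread their rank uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1.0 - alpha) / n + alpha * dangling / n
        nxt = {v: base for v in nodes}
        for v in nodes:
            if graph[v]:
                share = alpha * rank[v] / len(graph[v])
                for w in graph[v]:
                    nxt[w] += share
        rank = nxt
    return rank


def bfs_visit(graph, seed, limit):
    """Return the first `limit` nodes reached by a breadth-first visit."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue and len(order) < limit:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def induced_subgraph(graph, nodes):
    """Keep only the arcs between visited nodes."""
    keep = set(nodes)
    return {v: [w for w in graph[v] if w in keep] for v in nodes}


def kendall_tau(x, y):
    """Kendall's tau-b over two equally long score lists (O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx == 0 or dy == 0:
            continue
        if (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    total = len(x) * (len(x) - 1) // 2
    denom = sqrt((total - ties_x) * (total - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def tau_after_visit(graph, seed, limit):
    """Agreement between subgraph PageRank and complete-graph PageRank."""
    visited = bfs_visit(graph, seed, limit)
    local = pagerank(induced_subgraph(graph, visited))
    full = pagerank(graph)
    return kendall_tau([full[v] for v in visited],
                       [local[v] for v in visited])
```

Calling `tau_after_visit` for increasing values of `limit` gives one point per crawl size; replacing `bfs_visit` with another visit order would allow strategies to be compared in the same way.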

We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality-first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Paolo Boldi (1)
  • Massimo Santini (2)
  • Sebastiano Vigna (1)
  1. Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Milano, Italy
  2. Università di Modena e Reggio Emilia, Reggio Emilia, Italy
