Abstract
Deciding which kind of visit accumulates high-quality pages more quickly is one of the most frequently debated issues in the design of web crawlers. It is known that breadth-first visits work well, as they tend to discover pages with high PageRank early in the crawl. Indeed, this visit order is much better than depth-first, which is in turn even worse than a random visit; nevertheless, breadth-first can be outperformed by an omniscient visit that chooses, at every step, the node of highest PageRank in the frontier.
This paper discusses a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields relative ranks that agree with the ones the nodes have in the complete graph; ranks are compared using Kendall’s τ.
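The measure described above can be illustrated with a minimal sketch. It is not the authors' experimental code: the toy graph, the damping factor 0.85, the uniform treatment of dangling nodes, the naive quadratic τ (the tau-a variant, where ties count as neither concordant nor discordant), and all function names are assumptions made for illustration only. The sketch performs a partial breadth-first visit, computes PageRank on the subgraph induced by the visited nodes, and compares the resulting ranks with those the same nodes have in the complete graph.

```python
from collections import deque


def pagerank(graph, alpha=0.85, iters=100):
    """Power iteration; graph maps each node to a list of successors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by dangling nodes is redistributed uniformly (an assumption).
        dangling = sum(rank[v] for v in nodes if not graph[v])
        nxt = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                nxt[w] += alpha * rank[v] / len(graph[v])
        rank = nxt
    return rank


def bfs_visit(graph, seed, limit):
    """Return the first `limit` nodes discovered by a breadth-first visit."""
    seen, seen_set, queue = [seed], {seed}, deque([seed])
    while queue and len(seen) < limit:
        v = queue.popleft()
        for w in graph[v]:
            if w not in seen_set and len(seen) < limit:
                seen_set.add(w)
                seen.append(w)
                queue.append(w)
    return seen


def induced(graph, nodes):
    """Subgraph induced by `nodes`: arcs leaving the visited set are dropped."""
    keep = set(nodes)
    return {v: [w for w in graph[v] if w in keep] for v in nodes}


def kendall_tau(x, y):
    """Naive O(n^2) tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2 * s / (n * (n - 1))


def tau_after_visit(graph, seed, limit):
    """Agreement between subgraph ranks and full-graph ranks on visited nodes."""
    full = pagerank(graph)
    visited = bfs_visit(graph, seed, limit)
    partial = pagerank(induced(graph, visited))
    return kendall_tau([full[v] for v in visited],
                       [partial[v] for v in visited])


# Hypothetical toy graph, chosen only to make the sketch runnable.
graph = {0: [1, 2], 1: [2, 3], 2: [0], 3: [4], 4: [2, 5], 5: [0]}

for limit in range(3, len(graph) + 1):
    print(limit, round(tau_after_visit(graph, 0, limit), 3))
```

Replacing `bfs_visit` with another visit order (for instance, one that always extracts the frontier node of highest full-graph PageRank) lets the same comparison be repeated for other strategies. The quadratic τ is adequate only for toy inputs; on large graphs an O(n log n) method would be needed.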
We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality-first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.
This work has been partially supported by MIUR COFIN “Linguaggi formali e automi: metodi, modelli e applicazioni” and by a “Finanziamento per grandi e mega attrezzature scientifiche” of the Università degli Studi di Milano.
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boldi, P., Santini, M., Vigna, S. (2004). Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. In: Leonardi, S. (eds) Algorithms and Models for the Web-Graph. WAW 2004. Lecture Notes in Computer Science, vol 3243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30216-2_14
DOI: https://doi.org/10.1007/978-3-540-30216-2_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23427-2
Online ISBN: 978-3-540-30216-2
eBook Packages: Springer Book Archive