Abstract
Previous scalability experiments found that early precision improves as collection size increases. However, that result assumed that a collection's documents are all sampled with uniform probability from the same population. We contrast this with a large breadth-first web crawl, an important scenario in real-world Web search, where the early documents have quite different characteristics from the later documents. Having observed that NDCG@100 (measured over a set of reference queries) begins to plateau in the initial stages of the crawl, we investigate a number of possible reasons for this behaviour. These include the web pages themselves, the metric used to measure retrieval effectiveness, and the set of relevance judgements used.
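For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@k for a single query. The abstract does not state which gain and discount functions the paper uses, so the choices here (gain equal to the relevance grade, discount of 1/log2(rank + 1), unjudged documents scored as 0) are assumptions, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative only.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions.

    Assumption: discount is 1 / log2(rank + 1), with rank starting at 1.
    """
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: Sequence[float],
              judged_gains: Sequence[float],
              k: int) -> float:
    """NDCG@k for one query.

    ranked_gains: relevance grades of the returned documents, in system
                  order (assumption: unjudged documents are given 0).
    judged_gains: relevance grades of all judged documents for the query,
                  in any order; used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    if ideal == 0.0:
        # No relevant judged documents: define the score as 0.
        return 0.0
    return dcg_at_k(ranked_gains, k) / ideal


if __name__ == "__main__":
    # Hypothetical grades for one query, purely for illustration.
    system_order = [1, 3, 0, 2]
    all_judged = [3, 2, 1, 0]
    print(ndcg_at_k(system_order, all_judged, k=100))
```

Averaging `ndcg_at_k(..., k=100)` over the reference queries gives a figure comparable in kind to the NDCG@100 discussed above. Whether the ideal ranking is built from all judged documents or only from those already crawled changes the normalisation, which is one way the judgement set can affect the measured curve.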
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fetterly, D., Craswell, N., Vinay, V. (2009). Measuring the Search Effectiveness of a Breadth-First Crawl. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds) Advances in Information Retrieval. ECIR 2009. Lecture Notes in Computer Science, vol 5478. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00958-7_35
DOI: https://doi.org/10.1007/978-3-642-00958-7_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00957-0
Online ISBN: 978-3-642-00958-7
eBook Packages: Computer Science (R0)