Abstract
To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes — such as in images, advertisements, and headers — are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
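As an illustration of the kind of content-based measure the abstract describes, the following minimal sketch counts how many words differ between two fetched versions of a page and flags the page for recrawling when the count exceeds twenty. The abstract does not specify how changed words are counted, so the tokenisation, the bag-of-words comparison, and the names `words_changed` and `should_recrawl` are assumptions made for illustration, not the authors' implementation.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters or digits after lowercasing.
WORD_RE = re.compile(r"[a-z0-9]+")


def word_counts(text: str) -> Counter:
    """Return a bag-of-words (word -> occurrence count) for the text."""
    return Counter(WORD_RE.findall(text.lower()))


def words_changed(old_text: str, new_text: str) -> int:
    """Count word occurrences added plus word occurrences removed.

    Assumption: word order is ignored, so reordering text counts as no change.
    """
    old, new = word_counts(old_text), word_counts(new_text)
    removed = sum((old - new).values())  # occurrences only in the old version
    added = sum((new - old).values())    # occurrences only in the new version
    return added + removed


def should_recrawl(old_text: str, new_text: str, threshold: int = 20) -> bool:
    """Flag a document as changed when more than `threshold` words differ."""
    return words_changed(old_text, new_text) > threshold
```

A crawler could apply `should_recrawl` to the stored copy and a freshly fetched copy of a page. Under this sketch, small edits such as a changed date or counter stay below the threshold, while a substantially rewritten page exceeds it.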
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ali, H., Williams, H.E. (2003). What’s Changed? Measuring Document Change in Web Crawling for Search Engines. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_3
DOI: https://doi.org/10.1007/978-3-540-39984-1_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive