What’s Changed? Measuring Document Change in Web Crawling for Search Engines

Ali, Halil; Williams, Hugh E.

doi:10.1007/978-3-540-39984-1_3

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes — such as in images, advertisements, and headers — are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
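
As a rough illustration of the content-based idea described above, the following sketch counts how many words differ between two versions of a document and flags the document for recrawling when more than twenty words have changed. The abstract does not specify how changed words are counted, so the bag-of-words difference used here, the tokenizer, and the names `word_counts`, `changed_word_count`, and `should_recrawl` are assumptions made for illustration only, not the authors' implementation.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters or digits.
# The paper's actual tokenization is not given in the abstract.
WORD_RE = re.compile(r"[a-z0-9]+")


def word_counts(text):
    """Return a multiset (bag) of the words in text, case-folded."""
    return Counter(WORD_RE.findall(text.lower()))


def changed_word_count(old_text, new_text):
    """Count words removed from the old version plus words added in the new one.

    Assumption: change is measured as a bag-of-words difference, so
    reordering words without adding or removing any counts as no change.
    """
    old_counts = word_counts(old_text)
    new_counts = word_counts(new_text)
    removed = old_counts - new_counts  # Counter subtraction keeps positive counts only
    added = new_counts - old_counts
    return sum(removed.values()) + sum(added.values())


def should_recrawl(old_text, new_text, threshold=20):
    """Flag a document for refresh when more than `threshold` words changed."""
    return changed_word_count(old_text, new_text) > threshold
```

Under these assumptions, a small edit such as replacing one word leaves the document below the threshold, while a new paragraph of more than twenty words triggers a refresh. A real crawler would presumably also strip markup, advertisements, and other non-searchable content before comparing versions.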

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, 3001, Australia
Halil Ali & Hugh E. Williams

Editor information

Editors and Affiliations

Department of Computing Science, University of Alberta, Canada
Mario A. Nascimento
Universidade Federal do Amazonas, Manaus, AM, Brasil
Edleno S. de Moura
INESC-ID/IST, R. Alves Redol 9, 1000, Lisboa, Portugal
Arlindo L. Oliveira

About this paper

Cite this paper

Ali, H., Williams, H.E. (2003). What’s Changed? Measuring Document Change in Web Crawling for Search Engines. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_3

DOI: https://doi.org/10.1007/978-3-540-39984-1_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive
