What’s Changed? Measuring Document Change in Web Crawling for Search Engines

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes — such as in images, advertisements, and headers — are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
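The content-based rule described in the abstract (refresh a document when more than twenty words change) can be illustrated with a minimal sketch. The abstract does not say how changed words are counted, so the details here are assumptions: the function names are hypothetical, tags are stripped with a crude regular expression, and the change count is the number of word occurrences added plus the number removed between two stored versions. Only the threshold of twenty comes from the abstract.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters or digits.
WORD_RE = re.compile(r"[a-z0-9]+")
# Assumption: markup is removed crudely, so changes inside tags
# (image sources, attributes) are not counted as word changes.
TAG_RE = re.compile(r"<[^>]+>")


def word_counts(html: str) -> Counter:
    """Strip tags and count occurrences of each lowercase word."""
    text = TAG_RE.sub(" ", html)
    return Counter(WORD_RE.findall(text.lower()))


def words_changed(old_html: str, new_html: str) -> int:
    """Number of word occurrences added plus removed between versions."""
    old, new = word_counts(old_html), word_counts(new_html)
    removed = old - new  # Counter subtraction keeps only positive counts
    added = new - old
    return sum(removed.values()) + sum(added.values())


def should_refresh(old_html: str, new_html: str, threshold: int = 20) -> bool:
    """Refresh only when more than `threshold` words have changed."""
    return words_changed(old_html, new_html) > threshold


if __name__ == "__main__":
    old = "<p>the quick brown fox</p>"
    new = "<p>the quick red fox jumps</p>"
    print(words_changed(old, new), should_refresh(old, new))
```

Under these assumptions, a page whose only difference is an image URL or a tag attribute yields a count of zero and is not recrawled, which matches the abstract's point that such changes are unlikely to affect query results.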

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ali, H., Williams, H.E. (2003). What’s Changed? Measuring Document Change in Web Crawling for Search Engines. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_3

  • DOI: https://doi.org/10.1007/978-3-540-39984-1_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20177-9

  • Online ISBN: 978-3-540-39984-1

  • eBook Packages: Springer Book Archive
