Adaptive Clustering-Based Change Prediction for Refreshing Web Repository

  • Bundit Manaskasemsak
  • Petchpoom Pumjang
  • Arnon Rungsawang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9155)


Resource constraints, such as time and network bandwidth, prevent modern search engine providers from keeping their local databases completely synchronized with the Web. In this paper, we propose an adaptive clustering-based change prediction approach to refreshing the local web repository. Specifically, we first group the existing web pages in the current repository into web clusters based on their similar change characteristics. We then sample and examine some pages in each cluster to estimate their change patterns. Selected clusters of web pages with higher change probability are later downloaded to update the current repository. Finally, the effectiveness of the current download cycle is examined; an auxiliary (non-downloaded), reward (correct change prediction), or penalty (wrong change prediction) score is assigned to each web page. This score is later used to reinforce the subsequent web clustering and change prediction processes. To evaluate the performance of the proposed approach, we run extensive experiments on snapshots of a real Web dataset of about 282,000 distinct URLs belonging to more than 12,500 websites. The results clearly show that the proposed approach outperforms the existing state-of-the-art clustering-based web crawling policy in that it provides a fresher local web repository with limited resources.
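
The refresh cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the clusters are assumed to be given already, the change estimate is simply the fraction of sampled pages that changed, whole clusters are downloaded greedily within a page budget, and the score values, the function names (`estimate_cluster_change`, `refresh_cycle`), and the `has_changed` callback are all hypothetical choices made for illustration. The cost of sampling is also ignored for simplicity.

```python
import random

# Hypothetical score values; the abstract names the three score types
# but does not state how they are computed.
AUXILIARY, REWARD, PENALTY = 0.0, 1.0, -1.0


def estimate_cluster_change(clusters, has_changed, sample_size, rng):
    """Estimate each cluster's change probability from a random sample of its pages."""
    estimates = {}
    for cluster_id, pages in clusters.items():
        if not pages:
            estimates[cluster_id] = 0.0
            continue
        sample = rng.sample(pages, min(sample_size, len(pages)))
        changed = sum(1 for page in sample if has_changed(page))
        estimates[cluster_id] = changed / len(sample)
    return estimates


def refresh_cycle(clusters, has_changed, sample_size, budget, rng):
    """Run one download cycle and score every page afterwards.

    Clusters are visited in order of decreasing estimated change probability,
    and a cluster is downloaded only if all of its pages fit in the remaining
    budget (an assumption of this sketch).
    """
    estimates = estimate_cluster_change(clusters, has_changed, sample_size, rng)
    ranked = sorted(clusters, key=lambda cid: estimates[cid], reverse=True)

    scores = {}
    remaining = budget
    for cluster_id in ranked:
        pages = clusters[cluster_id]
        if len(pages) <= remaining:
            remaining -= len(pages)
            for page in pages:
                # Downloaded: the prediction "this page changed" is checked.
                scores[page] = REWARD if has_changed(page) else PENALTY
        else:
            for page in pages:
                # Not downloaded: nothing can be verified in this cycle.
                scores[page] = AUXILIARY
    return estimates, scores
```

In a fuller system the returned scores would feed into the next round of clustering and prediction, which is the adaptive part of the approach; how that feedback is applied is not specified in the abstract and is left out here.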


Keywords: Web change prediction; Refresh policy; Web crawler; Search engine; Sampling; Clustering; Adaptive learning




Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Bundit Manaskasemsak (1), corresponding author
  • Petchpoom Pumjang (1)
  • Arnon Rungsawang (1)

  1. Massive Information and Knowledge Engineering Laboratory, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand
