Adaptive Clustering-Based Change Prediction for Refreshing Web Repository
Resource constraints, such as time and network bandwidth, prevent modern search engine providers from keeping their local databases completely synchronized with the Web. In this paper, we propose an adaptive clustering-based change prediction approach to refresh the local web repository. Specifically, we first group the web pages in the current repository into clusters based on the similarity of their change characteristics. We then sample and examine some pages in each cluster to estimate the cluster's change pattern. Clusters with a higher change probability are then selected, and their pages are downloaded to update the current repository. Finally, the effectiveness of the current download cycle is examined, and each web page is assigned one of three scores: auxiliary (not downloaded), reward (correct change prediction), or penalty (wrong change prediction). This score is later used to reinforce the subsequent web clustering and change prediction processes. To evaluate the performance of the proposed approach, we run extensive experiments on snapshots of a real Web dataset of about 282,000 distinct URLs belonging to more than 12,500 websites. The results show that the proposed approach outperforms the existing state-of-the-art clustering-based web crawling policy in that it provides a fresher local web repository with limited resources.
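The download cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the single change-rate feature, the fixed-width binning used in place of a real clustering algorithm, the numeric score values, and all function names (`cluster_pages`, `estimate_change_probability`, `refresh_cycle`) are assumptions made for illustration. For simplicity, the sketch also does not charge the sampled pages against the download budget.

```python
import random
from collections import defaultdict

# Hypothetical score values: the abstract names the three score types
# but does not give their magnitudes.
AUXILIARY, REWARD, PENALTY = 0.0, 1.0, -1.0


def cluster_pages(features, n_bins=4):
    """Group pages by binning one change-rate feature in [0, 1].

    Fixed-width binning is a stand-in for a real clustering algorithm.
    """
    clusters = defaultdict(list)
    for url, rate in features.items():
        bin_id = min(int(rate * n_bins), n_bins - 1)
        clusters[bin_id].append(url)
    return dict(clusters)


def estimate_change_probability(pages, has_changed, sample_size, rng):
    """Sample pages from one cluster and return the fraction that changed."""
    sample = rng.sample(pages, min(sample_size, len(pages)))
    return sum(1 for url in sample if has_changed(url)) / len(sample)


def refresh_cycle(features, has_changed, budget, sample_size=2, seed=0):
    """Run one cycle: cluster, sample, select, download, and score."""
    rng = random.Random(seed)
    clusters = cluster_pages(features)
    estimates = {
        cid: estimate_change_probability(pages, has_changed, sample_size, rng)
        for cid, pages in clusters.items()
    }

    # Download clusters in order of estimated change probability
    # until the budget (number of pages) is exhausted.
    downloaded = []
    for cid in sorted(clusters, key=lambda c: estimates[c], reverse=True):
        for url in clusters[cid]:
            if len(downloaded) >= budget:
                break
            downloaded.append(url)

    # Score every page: auxiliary if skipped, reward if the download
    # found a change, penalty if the download was wasted.
    downloaded_set = set(downloaded)
    scores = {}
    for url in features:
        if url not in downloaded_set:
            scores[url] = AUXILIARY
        elif has_changed(url):
            scores[url] = REWARD
        else:
            scores[url] = PENALTY
    return downloaded, scores
```

In a fuller version, the returned scores would be fed back as an additional clustering feature in the next cycle, which is the adaptive step the abstract refers to; how that feedback is weighted is not specified here.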
Keywords: Web change prediction, Refresh policy, Web crawler, Search engine, Sampling, Clustering, Adaptive learning