Algorithm of Webpage Update Detection Based on Body Text
In Internet information recycling, and especially in resource-download applications, we need to judge whether a webpage has been updated so that we can decide whether its resources need to be downloaded. In this paper we put forward a webpage update detection algorithm based on the webpage's body text. The algorithm extracts Chinese text features from the body text and analyzes them to judge whether a webpage needs to be updated. The results show that the method achieves a high detection rate and fast processing speed.
Keywords: Body text; Update detection; Update degree; Detection rate; Resource download
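The abstract does not give the feature extraction or the decision rule, so the following is only a minimal sketch of the general idea under stated assumptions: character bigrams stand in for the Chinese text features (the paper's actual features are not specified here), the "update degree" is taken to be the Jaccard distance between the old and new feature sets, and the threshold of 0.2 is arbitrary. The names `char_bigrams`, `update_degree`, and `is_updated` are hypothetical.

```python
from __future__ import annotations


def char_bigrams(text: str) -> set[str]:
    """Return the set of character bigrams in text, ignoring whitespace.

    Assumption: bigrams are a stand-in for the paper's Chinese text
    features, which the abstract does not describe.
    """
    chars = "".join(text.split())
    return {chars[i:i + 2] for i in range(len(chars) - 1)}


def update_degree(old_text: str, new_text: str) -> float:
    """Jaccard distance between the two feature sets.

    0.0 means the feature sets are identical, 1.0 means they share nothing.
    """
    old, new = char_bigrams(old_text), char_bigrams(new_text)
    union = old | new
    if not union:
        # Both texts are empty or too short to yield features.
        return 0.0
    return 1.0 - len(old & new) / len(union)


def is_updated(old_text: str, new_text: str, threshold: float = 0.2) -> bool:
    """Treat the page as updated when the update degree exceeds the threshold.

    Assumption: the threshold value is illustrative, not from the paper.
    """
    return update_degree(old_text, new_text) > threshold
```

In use, `old_text` would be the body text stored from the previous crawl and `new_text` the body text just extracted; a page whose update degree stays at or below the threshold would be skipped rather than downloaded.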
We thank the sponsors of 2009BAH40B04, CNGI-09-03-15, and NCET-09-0708.