Algorithm of Webpage Update Detection Based on Body Text

Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 206)


In Internet information recycling, especially in resource-download applications, we need to judge whether a webpage has been updated so that we can decide whether a resource needs to be downloaded again. In this paper we put forward a webpage update detection algorithm based on the webpage's body text. The algorithm extracts Chinese text features from the body text and judges whether the webpage has been updated by analyzing those features. The results show that this method has a high detection rate and a fast processing speed.
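The idea of comparing body-text features between two fetches of a page can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the feature extraction, so the sketch assumes character bigrams as a stand-in for Chinese text features, uses all visible text as a crude substitute for true body-text extraction, and defines a hypothetical "update degree" as the Jaccard distance between the two feature sets. The function names and the 0.1 threshold are illustrative assumptions.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def body_text(html):
    """Return the page's visible text with all whitespace removed.

    Assumption: all visible text approximates the body text; a real
    system would first strip navigation, ads, and other page chrome.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join("".join(parser.parts).split())


def bigrams(text):
    """Character bigrams, a simple segmentation-free feature for Chinese."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def update_degree(old_html, new_html):
    """Hypothetical update degree: Jaccard distance of bigram sets, in [0, 1]."""
    old, new = bigrams(body_text(old_html)), bigrams(body_text(new_html))
    if not old and not new:
        return 0.0
    return 1.0 - len(old & new) / len(old | new)


def is_updated(old_html, new_html, threshold=0.1):
    """Treat the page as updated when the update degree exceeds the threshold."""
    return update_degree(old_html, new_html) > threshold
```

With this sketch, two versions of a page that differ only in markup or scripts yield an update degree of 0, while pages whose body text shares no bigrams yield 1. A word-segmentation step, as studied in the Chinese segmentation literature, could replace the bigram features without changing the rest of the comparison.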


Body text; Update detection; Update degree; Detection rate; Resource download



We thank the sponsors of projects 2009BAH40B04, CNGI-09-03-15 and NCET-09-0708.



Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  1. Communication University of China, Beijing, China
