Web News Pages Extraction Method Based on DOM and Decision Tree

  • Zhizhao Chen
  • Jian Cheng Lv
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 124)


Content extraction from Web news pages is a basic task for many web applications and must be solved well. This paper presents a new method for extracting the content of Web news pages. The method first parses the HTML code in a simple and convenient way that does not rely on a third-party toolkit, turning the HTML structure into a DOM (Document Object Model) tree that is easier to operate on. On this basis, it selects candidate sub-trees that may contain the main content of the page. For these candidates, which are Element nodes of the DOM tree, four specific attributes defined in this paper are obtained, and a decision tree is trained on these attributes. Identifying the news-body sub-tree among the many sub-trees of a page can then be treated as a classification problem, with the trained decision tree used for prediction.
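The pipeline described above (parse without a third-party toolkit, pick candidate sub-trees, compute attributes per Element node) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the four attributes, so the features below (`depth`, `text_length`, `link_ratio`, `punctuation`), the `CANDIDATE_TAGS` set, and all class and function names are assumptions chosen for illustration only.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they are added to the tree but never entered.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Hypothetical choice of container tags treated as sub-tree candidates.
CANDIDATE_TAGS = {"div", "td", "article", "section"}


class Node:
    """Element node of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []   # child Node objects
        self.text = []       # text fragments directly inside this element
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def subtree_text(node):
    """All text under node, excluding script and style content."""
    if node.tag in ("script", "style"):
        return ""
    parts = list(node.text) + [subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def link_text_length(node):
    """Number of characters of text located inside <a> elements."""
    if node.tag == "a":
        return len(subtree_text(node))
    return sum(link_text_length(c) for c in node.children)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def features(node):
    """Four illustrative attributes (assumed, not the paper's definitions)."""
    text = subtree_text(node)
    return {
        "depth": node.depth,                                   # spatial
        "text_length": len(text),                              # content
        "link_ratio": link_text_length(node) / len(text) if text else 0.0,
        "punctuation": sum(text.count(ch) for ch in ",.;:!?"),
    }


def candidates(html):
    """Parse html and return the Element nodes treated as candidates."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [n for n in iter_nodes(builder.root) if n.tag in CANDIDATE_TAGS]


PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a></div>
<div id="story"><p>The council met on Monday, and it approved the budget.</p>
<p>Work starts in May.</p></div>
</body></html>
"""

rows = [features(n) for n in candidates(PAGE)]
```

Each entry of `rows` is one feature vector per candidate sub-tree. Paired with a manual label (news body or not), such vectors could serve as training examples for any decision-tree learner, which would then classify the candidates of unseen pages.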


Decision Tree, Spatial Feature, News Article, Element Node, Content Feature
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Zhizhao Chen (1)
  • Jian Cheng Lv (1)
  1. School of Computer Science, Sichuan University, Chengdu, China
