Web News Pages Extraction Method Based on DOM and Decision Tree
Content extraction from Web news pages is a basic task underlying many Web applications and needs to be solved well. This paper presents a new method for extracting the content of Web news pages. The method first parses the HTML code in a simple and convenient way that does not rely on a third-party toolkit, turning the HTML structure into a more easily manipulated DOM (Document Object Model) tree. On this basis, it selects the candidate sub-trees that may contain the main content of the page. For these candidates, which are Element nodes of the DOM tree, four specific attributes defined in this paper are obtained, and a decision tree is trained on these attributes. Identifying the news body sub-tree among the many sub-trees in a page can thus be regarded as a classification problem, in which a well-trained decision tree is learned and then used for prediction.
Keywords: Decision Tree, Spatial Feature, News Article, Element Node, Content Feature
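The first steps outlined in the abstract (parsing HTML into a DOM-like tree without a third-party toolkit, selecting candidate sub-trees, and computing per-node attributes) can be sketched roughly as follows. This is only an illustration using Python's standard `html.parser`. The abstract does not name the paper's four attributes, so the features here (`text_len`, `link_ratio`, `paragraphs`, `depth`), the `CANDIDATE_TAGS` set, and all class and function names are assumptions chosen for the example, not the authors' definitions.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Hypothetical choice of container tags that may hold the news body.
CANDIDATE_TAGS = {"div", "td", "article", "section"}
SKIP_TEXT_TAGS = {"script", "style"}


class Node:
    """A minimal Element node: tag, parent, children, own text, depth."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree using only the standard library."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TEXT_TAGS:
            self.current.text += data.strip()


def iter_nodes(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def text_length(node, inside_link=False, links_only=False):
    """Characters of text under node; links_only counts only text inside <a>."""
    inside_link = inside_link or node.tag == "a"
    own = len(node.text) if (inside_link or not links_only) else 0
    return own + sum(text_length(c, inside_link, links_only)
                     for c in node.children)


def features(node):
    """Illustrative attributes for one candidate sub-tree (assumed, not the paper's)."""
    total = text_length(node)
    link = text_length(node, links_only=True)
    return {
        "text_len": total,
        "link_ratio": link / total if total else 0.0,
        "paragraphs": sum(1 for n in iter_nodes(node) if n.tag == "p"),
        "depth": node.depth,
    }


def candidate_features(html):
    """Parse html and return (node, features) for every candidate sub-tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [(n, features(n)) for n in iter_nodes(builder.root)
            if n.tag in CANDIDATE_TAGS]


PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/world">World</a></div>
<div id="story"><p>First paragraph of the story.</p><p>Second paragraph.</p></div>
</body></html>
"""

for node, feats in candidate_features(PAGE):
    print(node.tag, feats)
```

On the sample page, the navigation block consists entirely of link text while the story block contains paragraphs and no links, which is the kind of contrast a classifier can exploit. With labeled pages, feature dictionaries like these could be passed to any decision tree learner so that choosing the news body sub-tree becomes a classification over candidates.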