Web News Pages Extraction Method Based on DOM and Decision Tree

Chen, Zhizhao; Lv, Jian Cheng

doi:10.1007/978-3-642-25781-0_23

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 124)

Abstract

Content extraction from Web news pages is a basic task for many web applications and needs to be solved well. This paper presents a new method for extracting the content of Web news pages. The method first parses the HTML code in a simple and convenient way that does not rely on a third-party toolkit, turning the HTML structure into a DOM (Document Object Model) tree that is easier to operate on. On this basis, it selects the candidate subtrees that may contain the main content of the page. Because each candidate is an Element node of the DOM tree, four specific attributes defined in this paper are computed for it, and a decision tree is trained on these attributes. Identifying the news body subtree among the many subtrees of a page can thus be regarded as a classification problem, handled by learning and prediction with a well-trained decision tree.
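
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the four attributes, so the ones computed here (text length, link-text ratio, paragraph count, text per tag) are assumed stand-ins. The parser is Python's standard-library `html.parser` rather than the paper's own parser, the "decision tree" is reduced to a depth-1 stump chosen by Gini impurity, and `PAGE` with its labels is a made-up toy example. All class and function names are hypothetical.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}   # elements with no end tag
SKIP = {"script", "style"}                            # not visible content
CANDIDATE_TAGS = {"div", "td", "article", "section"}  # assumed container tags

class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child elements
        self.text = []      # text chunks directly inside this element

class TreeBuilder(HTMLParser):
    """Build a simple element tree using only the standard library."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.text.append(data.strip())

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def text_len(node):
    return sum(len(t) for n in walk(node) if n.tag not in SKIP for t in n.text)

def features(node):
    """Four illustrative attributes (assumptions, not the paper's definitions)."""
    nodes = list(walk(node))
    total = text_len(node)
    linked = sum(text_len(n) for n in nodes if n.tag == "a")
    paragraphs = sum(1 for n in nodes if n.tag == "p")
    return [total, linked / max(total, 1), paragraphs, total / len(nodes)]

def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def train_stump(X, y):
    """Depth-1 decision tree: the (feature, threshold) with lowest weighted Gini."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                # Each side predicts its majority label.
                best = (score, f, t,
                        round(sum(left) / len(left)), round(sum(right) / len(right)))
    if best is None:
        raise ValueError("no usable split in the training data")
    return best[1:]

def predict(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

# Toy page and hand-assigned labels (1 = news body), for illustration only.
PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a> <a href="/sport">Sport</a></div>
<div id="story"><p>The city council approved the new transit plan on Tuesday.</p>
<p>Construction is expected to begin next spring, officials said.</p></div>
<div id="footer"><a href="/about">About us</a> <a href="/contact">Contact</a></div>
</body></html>
"""
builder = TreeBuilder()
builder.feed(PAGE)
candidates = [n for n in walk(builder.root) if n.tag in CANDIDATE_TAGS]
X = [features(n) for n in candidates]
y = [0, 1, 0]
stump = train_stump(X, y)
for node, row in zip(candidates, X):
    print(node.attrs.get("id"), predict(stump, row))
```

A real system would train on labelled subtrees from many pages and use a full decision-tree learner; the stump only shows how per-subtree attributes turn body detection into a classification step.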

Author information

Authors and Affiliations

School of Computer Science, Sichuan University, Chengdu, China
Zhizhao Chen & Jian Cheng Lv

Editor information

Editors and Affiliations

College of Communication Engineering, Jilin University, Room 313, Building No.1, Changchun, Nanhu Avenue 5372, 130012, Jilin, China, People’s Republic
Zhihong Qian
Department of Electrical Engineering, The University of Mississippi, Anderson Hall 314, 38677, Mississippi, Mississippi, USA
Lei Cao
Department of Electrical and Computer Eng., Naval Postgraduate School, Rm. 452 Spanagel Bldg. 232, Dyer Road 833, 93943-5121, Monterey, California, USA
Weilian Su
Faculty of Computing, London Metropolitan University, Holloway Road 166-220, N7 8DB, London, United Kingdom
Tingkai Wang
College of Software, Changchun University of Science and Tech., Changchun, 130022, Jilin, China, People’s Republic
Huamin Yang

About this chapter

Cite this chapter

Chen, Z., Lv, J.C. (2012). Web News Pages Extraction Method Based on DOM and Decision Tree. In: Qian, Z., Cao, L., Su, W., Wang, T., Yang, H. (eds) Recent Advances in Computer Science and Information Engineering. Lecture Notes in Electrical Engineering, vol 124. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25781-0_23

DOI: https://doi.org/10.1007/978-3-642-25781-0_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25780-3
Online ISBN: 978-3-642-25781-0
eBook Packages: Engineering, Engineering (R0)
