A hybrid approach for content extraction with text density and visual importance of DOM nodes

Song, Dandan; Sun, Fei; Liao, Lejian

doi:10.1007/s10115-013-0687-x

Regular Paper
Published: 26 September 2013

Volume 42, pages 75–96 (2015)

Knowledge and Information Systems

Dandan Song¹,
Fei Sun¹ &
Lejian Liao¹

Abstract

Additional content in web pages, such as navigation panels, advertisements, copyrights and disclaimer notices, is typically not related to the main subject and may hamper the performance of Web data mining. Such content is traditionally treated as noise and needs to be removed properly. To achieve this, two intuitive and crucial kinds of information, the textual information and the visual information of web pages, are considered in this paper. Accordingly, Text Density and Visual Importance are defined for the Document Object Model (DOM) nodes of a web page, and a content extraction method based on these measured values is proposed. It is a fast, accurate and general method for extracting content from diverse web pages. Because it operates on DOM nodes, the original structure of the web page can be preserved. The method is evaluated on the CleanEval benchmark and on randomly selected pages from well-known Web sites, covering various web domains and styles, and these evaluations demonstrate its effectiveness. The average F1-scores with our method were 8.7% higher than the best scores among several alternative methods.
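
The abstract does not give the formula for Text Density, so the following is only an illustrative sketch under a stated assumption: density is taken here as the number of text characters in a DOM subtree divided by the number of tags in that subtree. The names `Node`, `DensityParser`, `build_tree` and `text_density` are hypothetical and are not taken from the paper. Visual Importance is not covered, because it would require rendered layout information.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM element with the counts needed for the density ratio."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.chars = 0  # text characters (own text first, whole subtree after aggregate)
        self.tags = 0   # number of descendant tags (filled in by aggregate)


class DensityParser(HTMLParser):
    """Builds a simple element tree and records text length per element."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Script and style bodies are not visible text, so they are skipped.
        if self.current.tag in ("script", "style"):
            return
        self.current.chars += len(data.strip())


def aggregate(node):
    """Propagate character and tag counts from the leaves up to the root."""
    for child in node.children:
        aggregate(child)
        node.chars += child.chars
        node.tags += child.tags + 1


def text_density(node):
    """ASSUMED definition: characters per tag in the subtree (not the paper's formula)."""
    return node.chars / max(node.tags, 1)


def build_tree(html):
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    aggregate(parser.root)
    return parser.root


if __name__ == "__main__":
    page = ("<body><div><a>Home</a><a>News</a><a>About</a></div>"
            "<div><p>A long paragraph of article text about the main subject.</p></div></body>")
    body = build_tree(page).children[0]
    for block in body.children:
        print(block.tag, block.chars, block.tags, round(text_density(block), 2))
```

Under this assumed ratio, a link-heavy navigation block scores low because its few characters are spread over many tags, while a paragraph of running text scores high. A real extractor would still need a thresholding or selection step, which the abstract does not describe.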

Acknowledgments

This work is funded by the National Program on Key Basic Research Project (973 Program, Grant No. 2013CB329605), Natural Science Foundation of China (NSFC, Grant Nos. 60873237 and 61003168), Natural Science Foundation of Beijing (Grant No. 4092037), Outstanding Young Teacher Foundation and Basic Research Foundation of Beijing Institute of Technology, and partially supported by Beijing Key Discipline Program.

Author information

Fei Sun
Present address: Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China

Authors and Affiliations

Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Application, Beijing Lab of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, 100081, Beijing, China
Dandan Song, Fei Sun & Lejian Liao

Corresponding author

Correspondence to Lejian Liao.

About this article

Cite this article

Song, D., Sun, F. & Liao, L. A hybrid approach for content extraction with text density and visual importance of DOM nodes. Knowl Inf Syst 42, 75–96 (2015). https://doi.org/10.1007/s10115-013-0687-x

Received: 22 December 2012
Revised: 27 August 2013
Accepted: 14 September 2013
Published: 26 September 2013
Issue Date: January 2015
DOI: https://doi.org/10.1007/s10115-013-0687-x
