Abstract
Additional content in web pages, such as navigation panels, advertisements, copyright and disclaimer notices, is typically unrelated to the main subject and may hamper the performance of Web data mining. Such content is traditionally treated as noise and needs to be removed properly. To achieve this, two intuitive and crucial kinds of information, the textual information and the visual information of web pages, are considered in this paper. Accordingly, Text Density and Visual Importance are defined for the Document Object Model (DOM) nodes of a web page, and a content extraction method based on these measured values is proposed. It is a fast, accurate and general method for extracting content from diverse web pages. Because it operates on DOM nodes, the original structure of the web page can be preserved. The method is evaluated on the CleanEval benchmark and on randomly selected pages from well-known Web sites covering various web domains and styles, and the results demonstrate its effectiveness. The average F1-scores of our method were 8.7% higher than the best scores among several alternative methods.
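As an illustration of the F1-score used for comparison, the sketch below computes precision, recall and F1 between an extracted text and a gold-standard text. It assumes a simple bag-of-tokens overlap on whitespace-separated words; this is a simplification for illustration and not necessarily the scoring procedure used in the paper or by the CleanEval benchmark. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(extracted: str, gold: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from a bag-of-tokens overlap.

    Assumption: tokens are whitespace-separated words and word order
    is ignored. Real benchmarks may align the texts differently.
    """
    ext = Counter(extracted.split())
    ref = Counter(gold.split())
    # Multiset intersection: each token counts at most as often as it
    # appears in both the extracted text and the gold standard.
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(ext.values())  # share of extracted tokens that are correct
    recall = overlap / sum(ref.values())     # share of gold tokens that were recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: two navigation words leak into the extraction
# and one gold word is missed.
p, r, f = token_f1("main article text Home Login", "main article text here")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this simplified measure, leftover noise such as navigation text lowers precision, while main content that is dropped lowers recall; F1 is the harmonic mean of the two.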
Acknowledgments
This work is funded by the National Program on Key Basic Research Project (973 Program, Grant No. 2013CB329605), Natural Science Foundation of China (NSFC, Grant Nos. 60873237 and 61003168), Natural Science Foundation of Beijing (Grant No. 4092037), Outstanding Young Teacher Foundation and Basic Research Foundation of Beijing Institute of Technology, and partially supported by Beijing Key Discipline Program.
Cite this article
Song, D., Sun, F. & Liao, L. A hybrid approach for content extraction with text density and visual importance of DOM nodes. Knowl Inf Syst 42, 75–96 (2015). https://doi.org/10.1007/s10115-013-0687-x
DOI: https://doi.org/10.1007/s10115-013-0687-x