A hybrid approach for content extraction with text density and visual importance of DOM nodes

  • Regular Paper
Knowledge and Information Systems

Abstract

Additional content in web pages, such as navigation panels, advertisements, copyright and disclaimer notices, is typically unrelated to the main subject and may hamper the performance of Web data mining. Such content is traditionally treated as noise and needs to be removed properly. To achieve this, this paper considers two intuitive and crucial kinds of information: the textual information and the visual information of web pages. Accordingly, Text Density and Visual Importance are defined for the Document Object Model (DOM) nodes of a web page, and a content extraction method based on these measures is proposed. The method is fast, accurate and general, extracting content from diverse web pages. Because it operates on DOM nodes, the original structure of the web page can be preserved. The method is evaluated on the CleanEval benchmark and on randomly selected pages from well-known Web sites covering various web domains and styles. The average F1-scores of the method were 8.7% higher than the best scores among several alternative methods.
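
To make the idea of a per-node text measure concrete, here is a minimal sketch in Python. It is not the paper's definition of Text Density, which this page does not give. It assumes a simple stand-in: the number of text characters in a node's subtree divided by the number of descendant tags. The names `Node`, `DensityParser`, `aggregate` and `text_density` are hypothetical. Visual Importance is not modelled, since it would need rendered layout information.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node used only for this illustration."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.chars = 0  # text characters (own text first, subtree total after aggregate)
        self.tags = 0   # number of descendant elements (filled in by aggregate)


class DensityParser(HTMLParser):
    """Builds a simplified element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.chars += len(data.strip())


def aggregate(node):
    """Accumulate character and tag counts bottom-up over the subtree."""
    for child in node.children:
        aggregate(child)
        node.chars += child.chars
        node.tags += child.tags + 1


def text_density(node):
    """Assumed stand-in measure: subtree characters per descendant tag."""
    return node.chars / max(node.tags, 1)


def parse(html):
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    aggregate(parser.root)
    return parser.root


def walk(node):
    """Yield every element below node in document order."""
    for child in node.children:
        yield child
        yield from walk(child)


if __name__ == "__main__":
    page = ('<div><ul><li><a href="#">Home</a></li>'
            '<li><a href="#">News</a></li></ul>'
            '<p>A long paragraph of main content text here.</p></div>')
    for n in walk(parse(page)):
        print(n.tag, n.chars, n.tags, text_density(n))
```

On the sample page, the link-heavy `ul` block scores far lower than the `p` block, which is the kind of contrast a density-style measure relies on to separate navigation from main content. A real extractor would also need to skip `script` and `style` text and to choose a threshold, neither of which is attempted here.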

Acknowledgments

This work is funded by the National Program on Key Basic Research Project (973 Program, Grant No. 2013CB329605), Natural Science Foundation of China (NSFC, Grant Nos. 60873237 and 61003168), Natural Science Foundation of Beijing (Grant No. 4092037), Outstanding Young Teacher Foundation and Basic Research Foundation of Beijing Institute of Technology, and partially supported by Beijing Key Discipline Program.

Author information

Corresponding author

Correspondence to Lejian Liao.

About this article

Cite this article

Song, D., Sun, F. & Liao, L. A hybrid approach for content extraction with text density and visual importance of DOM nodes. Knowl Inf Syst 42, 75–96 (2015). https://doi.org/10.1007/s10115-013-0687-x

  • DOI: https://doi.org/10.1007/s10115-013-0687-x
