Relevance-based content extraction of HTML documents

Wu, Qi; Chen, Xing-shu; Zhu, Kai; Wang, Chun-hui

doi:10.1007/s11771-012-1226-8

Published: 01 July 2012

Journal of Central South University, Volume 19, pages 1921–1926 (2012)

Qi Wu (吴麒)¹, Xing-shu Chen (陈兴蜀)¹, Kai Zhu (朱锴)¹ & Chun-hui Wang (王春晖)¹

Abstract

Content extraction from HTML pages underpins web page clustering and information retrieval, so cluttered information must be eliminated and page content extracted accurately. A novel and accurate method for extracting the content of HTML pages is proposed. First, the HTML page is parsed into a DOM object and IDs are generated for all leaf nodes. Second, a score is calculated for each leaf node and then adjusted according to the node's relationship with its neighbors. Finally, information blocks are identified according to their definition, and a universal classification algorithm is used to identify the content blocks. Experimental results show that the algorithm extracts content effectively and accurately, with recall and precision of 96.5% and 93.8%, respectively.
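
The three stages named in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, the definition of an information block, or the classifier, so everything below is an assumption made for illustration: the path-style leaf IDs, the word-count score with a penalty for link text, the neighbor weight of 0.25, and the fixed threshold that stands in for the classification step. The names `LeafCollector`, `raw_scores`, `adjust_scores`, and `extract_content` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not page content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class LeafCollector(HTMLParser):
    """Stage 1: walk the markup and record each text leaf with a path-style ID."""

    def __init__(self):
        super().__init__()
        self.path = []     # child index of every open element, root first
        self.counts = [0]  # children seen so far at each depth
        self.tags = []     # names of the open elements
        self.leaves = []   # (leaf_id, text, inside_link)

    def handle_starttag(self, tag, attrs):
        index = self.counts[-1]
        self.counts[-1] += 1
        if tag in VOID:
            return
        self.path.append(index)
        self.counts.append(0)
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return
        # Pop until the matching element, which tolerates unclosed tags.
        while self.tags:
            top = self.tags.pop()
            self.path.pop()
            self.counts.pop()
            if top == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or SKIP & set(self.tags):
            return
        index = self.counts[-1]
        self.counts[-1] += 1
        leaf_id = ".".join(str(i) for i in self.path + [index])
        self.leaves.append((leaf_id, text, "a" in self.tags))


def raw_scores(leaves):
    """Stage 2a (assumed score): word count, negated for text inside links."""
    return [(-1 if is_link else 1) * len(text.split())
            for _, text, is_link in leaves]


def adjust_scores(scores, weight=0.25):
    """Stage 2b (assumed rule): blend each score with the mean of its neighbors."""
    adjusted = []
    for i, score in enumerate(scores):
        neighbors = scores[max(0, i - 1):i] + scores[i + 1:i + 2]
        mean = sum(neighbors) / len(neighbors) if neighbors else 0.0
        adjusted.append((1 - weight) * score + weight * mean)
    return adjusted


def extract_content(html, threshold=3.0):
    """Stage 3 (stand-in): group consecutive high-scoring leaves into blocks."""
    collector = LeafCollector()
    collector.feed(html)
    scores = adjust_scores(raw_scores(collector.leaves))
    blocks, current = [], []
    for (_, text, _), score in zip(collector.leaves, scores):
        if score >= threshold:
            current.append(text)
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks
```

Calling `extract_content(page_source)` on a page whose body sits between a link-heavy navigation bar and a link-heavy footer should return the body paragraphs as one block, because the link leaves score below the threshold even after neighbor adjustment. A fixed threshold is the simplest possible stand-in; a clustering or learned classifier over the adjusted scores would be closer in spirit to the "universal classification algorithm" the abstract mentions.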

Author information

Authors and Affiliations

Network and Trusted Computing Institute, College of Computer Science, Sichuan University, Chengdu, 610065, China
Qi Wu (吴麒), Xing-shu Chen (陈兴蜀), Kai Zhu (朱锴) & Chun-hui Wang (王春晖)

Corresponding author

Correspondence to Xing-shu Chen (陈兴蜀).

Additional information

Foundation item: Project (2012BAH18B05) supported by the Supporting Program of Ministry of Science and Technology of China.

About this article

Cite this article

Wu, Q., Chen, Xs., Zhu, K. et al. Relevance-based content extraction of HTML documents. J. Cent. South Univ. 19, 1921–1926 (2012). https://doi.org/10.1007/s11771-012-1226-8

Received: 13 May 2011
Accepted: 13 July 2011
Published: 01 July 2012
Issue Date: July 2012
DOI: https://doi.org/10.1007/s11771-012-1226-8
