Identifying Content Blocks from Web Documents

Debnath, Sandip; Mitra, Prasenjit; Giles, C. Lee

doi:10.1007/11425274_30

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3488)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

Intelligent information processing systems, such as digital libraries and search engines, index web-pages according to their informative content. However, web-pages also contain much non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative “primary content blocks” from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the “primary content blocks” based on their features. Neither algorithm requires any supervised learning, yet both identify the “primary content blocks” with high precision and recall. On several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.

References

[1] Ambite, J.L., Ashish, N., Barish, G., Knoblock, C.A., Minton, S., Modi, P.J., Muslea, I., Philpot, A., Tejada, S.: Ariadne: a system for constructing mediators for Internet sources. In: SIGMOD, pp. 561–563 (1998)
[2] Atzeni, P., Mecca, G., Merialdo, P.: Semistructured and structured data in the web: Going back and forth. In: Workshop on Management of Semistructured Data (1997)
[3] Bar-Yossef, Z., Rajagopalan, S.: Template detection via data mining and its applications. In: Proceedings of WWW 2002, pp. 580–591 (2002)
[4] Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., Widom, J.: The TSIMMIS project: integration of heterogeneous information sources. In: Proceedings of the 10th Meeting of the Information Processing Society of Japan, pp. 7–18 (1994)
[5] Chidlovskii, B., Ragetli, J., de Rijke, M.: Wrapper generation via grammar induction. In: Lopez de Mantaras, R., Plaza, E. (eds.) ECML 2000. LNCS (LNAI), vol. 1810, pp. 96–108. Springer, Heidelberg (2000)
[6] Cohen, W.W.: A web-based information system that reasons with structured collections of text. In: Sycara, K.P., Wooldridge, M. (eds.) Proceedings of the 2nd International Conference on Autonomous Agents (Agents 1998), pp. 9–13, 400–407. ACM Press, New York (1998)
[7] Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: Towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 109–118 (2001)
[8] Debnath, S., Mitra, P., Giles, C.L.: Automatic extraction of informative blocks from webpages. In: The upcoming proceedings of the Special Track on Web Technologies and Applications in the ACM Symposium of Applied Computing (2005)
[9] Hsu, C.: Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. In: AAAI 1998 Workshop on AI and Information Integration, pp. 66–73. AAAI Press, Menlo Park (1998)
[10] Kirk, T., Levy, A.Y., Sagiv, Y., Srivastava, D.: The information manifold. In: Proceedings of the AAAI Spring Symposium: Information Gathering from Heterogeneous Distributed Environments, pp. 85–91 (1995)
[11] Kushmerick, N.: Wrapper induction: Efficiency and expressiveness. Artificial Intelligence 118(1-2), 15–68 (2000)
[12] Kushmerick, N., Weld, D.S., Doorenbos, R.B.: Wrapper induction for information extraction. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 729–737 (1997)
[13] Levy, A.Y., Srivastava, D., Kirk, T.: Data model and query evaluation in global information systems. Journal of Intelligent Information Systems - Special Issue on Networked Information Discovery and Retrieval 5(2), 121–143 (1995)
[14] Lin, S.-H., Ho, J.-M.: Discovering informative content blocks from web documents. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 588–593 (2002)
[15] Liu, B., Zhao, K., Yi, L.: Eliminating noisy information in web pages for data mining. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 296–305 (2003)
[16] Muslea, I., Minton, S., Knoblock, C.A.: Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems 4(1/2), 93–114 (2001)
[17] Yi, L., Liu, B., Li, X.: Visualizing web site comparisons. In: Proceedings of the eleventh international conference on World Wide Web, pp. 693–703 (2002)

Author information

Authors and Affiliations

Department of Computer Science and Engineering,
Sandip Debnath, Prasenjit Mitra & C. Lee Giles
School of Information Sciences and Technology, Penn State University, University Park, PA, 16802, USA
Prasenjit Mitra & C. Lee Giles

Editor information

Editors and Affiliations

LIRIS - UFR d’Informatique, Université Claude Bernard Lyon 1, 43, boulevard du 11 novembre 1918, 69622, Villeurbanne, France
Mohand-Said Hacid
Department of Computer Science, State University of New York, 12222, Albany, NY, USA
Neil V. Murray
Department of Computer Science, University of North Carolina, 28223, Charlotte, NC, USA
Zbigniew W. Raś
Shimane University, 89-1 Enya-cho Izumo, 6938501, Shimane, Japan
Shusaku Tsumoto

Cite this paper

Debnath, S., Mitra, P., Giles, C.L. (2005). Identifying Content Blocks from Web Documents. In: Hacid, MS., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds) Foundations of Intelligent Systems. ISMIS 2005. Lecture Notes in Computer Science, vol 3488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11425274_30

DOI: https://doi.org/10.1007/11425274_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25878-0
Online ISBN: 978-3-540-31949-8
eBook Packages: Computer Science, Computer Science (R0)
