Identifying Content Blocks from Web Documents

  • Conference paper
Foundations of Intelligent Systems (ISMIS 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3488)

Abstract

Intelligent information processing systems, such as digital libraries or search engines, index web-pages according to their informative content. However, web-pages also contain a good deal of non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative “primary content blocks” from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the “primary content blocks” based on their features. Neither algorithm requires supervised learning, yet both identify the “primary content blocks” with high precision and recall. Operating on several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.
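
For readers unfamiliar with the two quality measures named in the abstract, the following is a minimal sketch of how precision and recall could be computed for one page when blocks are compared as sets. It is not the authors' evaluation code: the function name, the block identifiers, and the example page are all hypothetical, and how blocks are segmented and matched is an assumption the abstract does not specify.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of block identifiers.

    predicted: blocks an algorithm labelled as primary content (hypothetical input)
    actual:    blocks a human labelled as primary content (hypothetical input)
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: fraction of the predicted blocks that are truly primary content.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of the truly primary blocks that were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page with made-up block identifiers.
predicted_blocks = {"headline", "article-body", "sidebar"}
actual_blocks = {"headline", "article-body", "byline"}
precision, recall = precision_recall(predicted_blocks, actual_blocks)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case two of the three predicted blocks are correct and two of the three true blocks are found, so both measures come out at two thirds.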

References

  1. Ambite, J.L., Ashish, N., Barish, G., Knoblock, C.A., Minton, S., Modi, P.J., Muslea, I., Philpot, A., Tejada, S.: Ariadne: a system for constructing mediators for Internet sources. In: SIGMOD, pp. 561–563 (1998)

  2. Atzeni, P., Mecca, G., Merialdo, P.: Semistructured and structured data in the web: Going back and forth. In: Workshop on Management of Semistructured Data (1997)

  3. Bar-Yossef, Z., Rajagopalan, S.: Template detection via data mining and its applications. In: Proceedings of WWW 2002, pp. 580–591 (2002)

  4. Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., Widom, J.: The TSIMMIS project: integration of heterogeneous information sources. In: Proceedings of the 10th Meeting of the Information Processing Society of Japan, pp. 7–18 (1994)

  5. Chidlovskii, B., Ragetli, J., de Rijke, M.: Wrapper generation via grammar induction. In: Lopez de Mantaras, R., Plaza, E. (eds.) ECML 2000. LNCS (LNAI), vol. 1810, pp. 96–108. Springer, Heidelberg (2000)

  6. Cohen, W.W.: A web-based information system that reasons with structured collections of text. In: Sycara, K.P., Wooldridge, M. (eds.) Proceedings of the 2nd International Conference on Autonomous Agents (Agents 1998), pp. 9–13, 400–407. ACM Press, New York (1998)

  7. Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: Towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 109–118 (2001)

  8. Debnath, S., Mitra, P., Giles, C.L.: Automatic extraction of informative blocks from webpages. In: The upcoming proceedings of the Special Track on Web Technologies and Applications in the ACM Symposium on Applied Computing (2005)

  9. Hsu, C.: Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. In: AAAI 1998 Workshop on AI and Information Integration, pp. 66–73. AAAI Press, Menlo Park (1998)

  10. Kirk, T., Levy, A.Y., Sagiv, Y., Srivastava, D.: The information manifold. In: Proceedings of the AAAI Spring Symposium: Information Gathering from Heterogeneous Distributed Environments, pp. 85–91 (1995)

  11. Kushmerick, N.: Wrapper induction: Efficiency and expressiveness. Artificial Intelligence 118(1-2), 15–68 (2000)

  12. Kushmerick, N., Weld, D.S., Doorenbos, R.B.: Wrapper induction for information extraction. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 729–737 (1997)

  13. Levy, A.Y., Srivastava, D., Kirk, T.: Data model and query evaluation in global information systems. Journal of Intelligent Information Systems - Special Issue on Networked Information Discovery and Retrieval 5(2), 121–143 (1995)

  14. Lin, S.-H., Ho, J.-M.: Discovering informative content blocks from web documents. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 588–593 (2002)

  15. Liu, B., Zhao, K., Yi, L.: Eliminating noisy information in web pages for data mining. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 296–305 (2003)

  16. Muslea, I., Minton, S., Knoblock, C.A.: Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems 4(1/2), 93–114 (2001)

  17. Yi, L., Liu, B., Li, X.: Visualizing web site comparisons. In: Proceedings of the eleventh international conference on World Wide Web, pp. 693–703 (2002)

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Debnath, S., Mitra, P., Giles, C.L. (2005). Identifying Content Blocks from Web Documents. In: Hacid, M.-S., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds) Foundations of Intelligent Systems. ISMIS 2005. Lecture Notes in Computer Science, vol 3488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11425274_30

  • DOI: https://doi.org/10.1007/11425274_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25878-0

  • Online ISBN: 978-3-540-31949-8

  • eBook Packages: Computer Science, Computer Science (R0)
