Abstract
Intelligent information processing systems, such as digital libraries or search engines, index web-pages according to their informative content. However, web-pages also contain non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is important to separate the informative “primary content blocks” from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the “primary content blocks” based on their features. Neither algorithm requires supervised learning, yet both identify the “primary content blocks” with high precision and recall. On several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.
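The abstract reports results in terms of precision and recall. As a minimal sketch of how these two measures could be computed for block identification, assume each block on a page has an identifier and that a hand-labelled set of primary content blocks is available. The function name, the block identifiers, and the example data below are all hypothetical and are not taken from the paper; the sketch does not implement FeatureExtractor or K-FeatureExtractor.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted block ids.

    predicted: block ids an algorithm marked as primary content (assumed input)
    relevant:  block ids labelled as primary content by a human (assumed input)
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Precision: fraction of predicted blocks that are truly primary content.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of truly primary blocks that were predicted.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page: the algorithm wrongly keeps the sidebar and misses the byline.
predicted_blocks = {"article-body", "headline", "sidebar"}
primary_blocks = {"article-body", "headline", "byline"}

precision, recall = precision_recall(predicted_blocks, primary_blocks)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up example two of the three predicted blocks are correct and two of the three true primary blocks are found, so both measures come out at two thirds.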
References
Ambite, J.L., Ashish, N., Barish, G., Knoblock, C.A., Minton, S., Modi, P.J., Muslea, I., Philpot, A., Tejada, S.: Ariadne: a system for constructing mediators for Internet sources. In: SIGMOD, pp. 561–563 (1998)
Atzeni, P., Mecca, G., Merialdo, P.: Semistructured and structured data in the web: Going back and forth. In: Workshop on Management of Semistructured Data (1997)
Bar-Yossef, Z., Rajagopalan, S.: Template detection via data mining and its applications. In: Proceedings of WWW 2002, pp. 580–591 (2002)
Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., Widom, J.: The TSIMMIS project: integration of heterogeneous information sources. In: Proceedings of the 10th meeting of the Information Processing Society of Japan, pp. 7–18 (1994)
Chidlovskii, B., Ragetli, J., de Rijke, M.: Wrapper generation via grammar induction. In: Lopez de Mantaras, R., Plaza, E. (eds.) ECML 2000. LNCS (LNAI), vol. 1810, pp. 96–108. Springer, Heidelberg (2000)
Cohen, W.W.: A web-based information system that reasons with structured collections of text. In: Sycara, K.P., Wooldridge, M. (eds.) Proceedings of the 2nd International Conference on Autonomous Agents (Agents 1998), pp. 9–13, 400–407. ACM Press, New York (1998)
Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: Towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 109–118 (2001)
Debnath, S., Mitra, P., Giles, C.L.: Automatic extraction of informative blocks from webpages. In: The upcoming proceedings of the Special Track on Web Technologies and Applications in the ACM Symposium of Applied Computing (2005)
Hsu, C.: Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. In: AAAI 1998 Workshop on AI and Information Integration, pp. 66–73. AAAI Press, Menlo Park (1998)
Kirk, T., Levy, A.Y., Sagiv, Y., Srivastava, D.: The information manifold. In: Proceedings of the AAAI Spring Symposium: Information Gathering from Heterogeneous Distributed Environments, pp. 85–91 (1995)
Kushmerick, N.: Wrapper induction: Efficiency and expressiveness. Artificial Intelligence 118(1-2), 15–68 (2000)
Kushmerick, N., Weld, D.S., Doorenbos, R.B.: Wrapper induction for information extraction. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 729–737 (1997)
Levy, A.Y., Srivastava, D., Kirk, T.: Data model and query evaluation in global information systems. Journal of Intelligent Information Systems - Special Issue on Networked Information Discovery and Retrieval 5(2), 121–143 (1995)
Lin, S.-H., Ho, J.-M.: Discovering informative content blocks from web documents. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 588–593 (2002)
Liu, B., Zhao, K., Yi, L.: Eliminating noisy information in web pages for data mining. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 296–305 (2003)
Muslea, I., Minton, S., Knoblock, C.A.: Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems 4(1/2), 93–114 (2001)
Yi, L., Liu, B., Li, X.: Visualizing web site comparisons. In: Proceedings of the eleventh international conference on World Wide Web, pp. 693–703 (2002)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Debnath, S., Mitra, P., Giles, C.L. (2005). Identifying Content Blocks from Web Documents. In: Hacid, MS., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds) Foundations of Intelligent Systems. ISMIS 2005. Lecture Notes in Computer Science, vol 3488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11425274_30
DOI: https://doi.org/10.1007/11425274_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25878-0
Online ISBN: 978-3-540-31949-8
eBook Packages: Computer Science, Computer Science (R0)