Applying Pattern Mining to Web Information Extraction

Chang, Chia-Hui; Lui, Shao-Chen; Wu, Yen-Chin

doi:10.1007/3-540-45357-1_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2035)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel approach to IE based on repeated pattern mining and multiple pattern alignment. Repeated patterns are discovered through a data structure called a PAT tree. In addition, incomplete patterns are further revised by pattern alignment to cover all pattern instances. This new approach to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules achieve 97 percent extraction over fourteen popular search engines.
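
To make the idea of repeated pattern mining concrete, the following is a minimal sketch, not the authors' implementation. It assumes the page has already been turned into a list of tokens (tags and a generic TEXT placeholder) and uses brute-force n-gram counting in place of a PAT tree; the function name `repeated_patterns` and the thresholds `min_len` and `min_count` are hypothetical. A pattern is kept only if no one-token extension of it repeats just as often.

```python
from collections import defaultdict


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Return {pattern: start positions} for maximal repeated token n-grams.

    Illustrative stand-in for PAT-tree-based discovery: every n-gram is
    counted directly, which is quadratic in the number of tokens.
    """
    occurrences = defaultdict(list)
    n = len(tokens)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            occurrences[tuple(tokens[start:start + length])].append(start)

    # Keep only n-grams that occur at least min_count times.
    repeats = {p: pos for p, pos in occurrences.items() if len(pos) >= min_count}

    def is_maximal(pattern, positions):
        # A pattern is dropped when a repeat one token longer (extended on
        # the left or on the right) occurs exactly as often.
        for other, other_positions in repeats.items():
            if len(other) != len(pattern) + 1:
                continue
            if len(other_positions) != len(positions):
                continue
            if other[1:] == pattern or other[:-1] == pattern:
                return False
        return True

    return {p: pos for p, pos in repeats.items() if is_maximal(p, pos)}


if __name__ == "__main__":
    # Hypothetical tokenized result page: three records inside a list.
    record = ["<li>", "<a>", "TEXT", "</a>", "</li>"]
    page = ["<ul>"] + record * 3 + ["</ul>"]
    for pattern, positions in repeated_patterns(page).items():
        print(positions, " ".join(pattern))
```

On the toy page above, the five-token record pattern is reported at positions 1, 6, and 11, while rotated variants of it are filtered out as non-maximal. The alignment step that the abstract describes for incomplete patterns is not covered by this sketch.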

References

Chien, L.F. 1997. PAT-tree-based keyword extraction for Chinese information retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–58.
Doorenbos, R.B.; Etzioni, O.; and Weld, D.S. 1997. A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of the First International Conference on Autonomous Agents, pp. 39–48, New York, NY. ACM Press.
Embley, D.; Jiang, Y.; and Ng, Y.-K. 1999. Record-boundary discovery in Web documents. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pp. 467–478, Philadelphia, Pennsylvania.
Gonnet, G.H.; Baeza-Yates, R.A.; and Snider, T. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms. Prentice Hall.
Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge.
Hsu, C.-N. and Dung, M.-T. 1998. Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems 23(8):521–538.
Knoblock, A. et al., ed. 1998. Proc. 1998 Workshop on AI and Information Integration. Menlo Park, California: AAAI Press.
Kurtz, S. and Schleiermacher, C. 1999. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15(5):426–427.
Kushmerick, N.; Weld, D.; and Doorenbos, R. 1997. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI).
Muslea, I.; Minton, S.; and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the 3rd International Conference on Autonomous Agents (Agents’99), Seattle, WA.
Muslea, I. 1999. Extraction patterns for information extraction tasks: a survey. In Proceedings of AAAI’99: Workshop on Machine Learning for Information Extraction.

Author information

Authors and Affiliations

Dept. of Computer Science and Information Engineering, National Central University, Chung-Li, 320, Taiwan
Chia-Hui Chang, Shao-Chen Lui & Yen-Chin Wu

Editor information

Editors and Affiliations

Dept. of Computer Science and Information Systems, The University of Hong Kong, Pokfulam, Hong Kong China
David Cheung
CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia
Graham J. Williams
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong China
Qing Li

About this paper

Cite this paper

Chang, CH., Lui, SC., Wu, YC. (2001). Applying Pattern Mining to Web Information Extraction. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_4

DOI: https://doi.org/10.1007/3-540-45357-1_4
Published: 11 April 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4
eBook Packages: Springer Book Archive
