Knowledge Discovery from Semistructured Texts

Sakamoto, Hiroshi; Arimura, Hiroki; Arikawa, Setsuo

doi:10.1007/3-540-45884-0_45

Chapter
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2281)

Abstract

This paper surveys our recent results on knowledge discovery from semistructured texts, which contain heterogeneous structures represented by labeled trees. The aim of our study is to extract useful information from documents on the Web. First, we present theoretical results on learning rewriting rules between labeled trees. Second, we apply our method to learning HTML trees in the framework of wrapper induction. We also evaluate our algorithms on real-world HTML documents and present the results.

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, Hakozaki 6-10-1, Higashi-ku, 812-8581, Fukuoka-shi, Japan
Hiroshi Sakamoto, Hiroki Arimura & Setsuo Arikawa

Editor information

Editors and Affiliations

Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Setsuo Arikawa & Ayumi Shinohara

About this chapter

Cite this chapter

Sakamoto, H., Arimura, H., Arikawa, S. (2002). Knowledge Discovery from Semistructured Texts. In: Arikawa, S., Shinohara, A. (eds) Progress in Discovery Science. Lecture Notes in Computer Science, vol 2281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45884-0_45

DOI: https://doi.org/10.1007/3-540-45884-0_45
Published: 14 March 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43338-5
Online ISBN: 978-3-540-45884-5
eBook Packages: Springer Book Archive
