Knowledge Discovery from Semistructured Texts

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2281)

Abstract

This paper surveys our recent results on knowledge discovery from semistructured texts, which contain heterogeneous structures represented by labeled trees. The aim of our study is to extract useful information from documents on the Web. First, we present theoretical results on learning rewriting rules between labeled trees. Second, we apply our method to learning HTML trees in the framework of wrapper induction. We also evaluate our algorithms on real-world HTML documents and present the results.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Sakamoto, H., Arimura, H., Arikawa, S. (2002). Knowledge Discovery from Semistructured Texts. In: Arikawa, S., Shinohara, A. (eds) Progress in Discovery Science. Lecture Notes in Computer Science, vol 2281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45884-0_45

  • DOI: https://doi.org/10.1007/3-540-45884-0_45

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43338-5

  • Online ISBN: 978-3-540-45884-5

  • eBook Packages: Springer Book Archive
