Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Knoblock, Craig A.; Lerman, Kristina; Minton, Steven; Muslea, Ion

doi:10.1007/978-3-7908-1772-0_17

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 111)

Abstract

A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
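
As a rough illustration of the pipeline the abstract describes (extract fields from a page, emit XML, verify that the extracted data still looks right), here is a minimal sketch. It is not the authors' system: the rules are written by hand rather than learned, and the names `RULES`, `EXPECTED`, `extract`, `to_xml`, and `verify`, the sample page, and the format patterns are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hand-written rules: each field is the text found between
# a start landmark and an end landmark. A learning system would induce
# rules like these from labeled example pages.
RULES = {
    "name": ("<b>", "</b>"),
    "phone": ("Phone: ", "<br>"),
}

# Hypothetical format patterns used to check that extracted values
# still look like the data the wrapper is supposed to return.
EXPECTED = {
    "name": re.compile(r"[A-Z][\w' ]+"),
    "phone": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
}


def extract(page, rules):
    """Apply each landmark rule to the page; a field is None if a landmark is missing."""
    record = {}
    for field, (start, end) in rules.items():
        i = page.find(start)
        if i < 0:
            record[field] = None
            continue
        i += len(start)
        j = page.find(end, i)
        record[field] = page[i:j].strip() if j >= 0 else None
    return record


def to_xml(record, tag="record"):
    """Serialize an extracted record as a flat XML element."""
    root = ET.Element(tag)
    for field, value in record.items():
        ET.SubElement(root, field).text = value or ""
    return ET.tostring(root, encoding="unicode")


def verify(record, expected):
    """Return the fields whose values are missing or do not match the expected format."""
    return [
        field
        for field, pattern in expected.items()
        if record.get(field) is None or not pattern.fullmatch(record[field])
    ]


page = "<html><b>Joe's Diner</b><br>Phone: (310) 555-0100<br></html>"
record = extract(page, RULES)
print(to_xml(record))
print(verify(record, EXPECTED))  # an empty list means every field passed

# If the site changes its layout, the landmarks no longer match and
# verification flags the affected fields.
changed = "<html><strong>Joe's Diner</strong><br>Tel: (310) 555-0100<br></html>"
print(verify(extract(changed, RULES), EXPECTED))
```

In this sketch a non-empty result from `verify` is the signal that the wrapper needs repair; adapting automatically to the changed layout, as the abstract describes, would require relearning the rules and is not shown here.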

Author information

Authors and Affiliations

University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
Craig A. Knoblock, Kristina Lerman & Ion Muslea
Fetch Technologies, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
Craig A. Knoblock & Steven Minton

Editor information

Editors and Affiliations

Institute of Computer Science, Technical University of Lodz, ul. Sterlinga 16/18, 90-217, Lodz, Poland
Piotr S. Szczepaniak
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warsaw, Poland
Piotr S. Szczepaniak & Janusz Kacprzyk
Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo, 28660, Madrid, Spain
Javier Segovia
Computer Science Division, Department of Electrical Engineering and Computer Sciences, University of California, 94720-1776, Berkeley, CA, USA
Lotfi A. Zadeh

About this chapter

Cite this chapter

Knoblock, C.A., Lerman, K., Minton, S., Muslea, I. (2003). Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (eds) Intelligent Exploration of the Web. Studies in Fuzziness and Soft Computing, vol 111. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-1772-0_17

DOI: https://doi.org/10.1007/978-3-7908-1772-0_17
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-2519-0
Online ISBN: 978-3-7908-1772-0
eBook Packages: Springer Book Archive