Abstract
A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
© 2000 IEEE. Reprinted, with permission, from IEEE Data Engineering Bulletin, 23(4), December, 2000.
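To make the idea of a wrapper in the abstract concrete, the following is a minimal sketch and not the authors' system. It assumes extraction rules can be expressed as a pair of start and end landmarks per field; the rules here are hand-written rather than learned, and the page fragment, field names, and the phone-number check used for verification are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the markup and field names are invented for illustration.
PAGE = "<html><b>Name:</b> Joe's Diner<br><b>Phone:</b> (310) 555-0147<br></html>"

# Assumed rule format: each field is delimited by a start and an end landmark.
# These rules are fixed by hand; a learning system would induce them from examples.
RULES = {
    "name": ("<b>Name:</b> ", "<br>"),
    "phone": ("<b>Phone:</b> ", "<br>"),
}


def extract(page, rules):
    """Apply landmark rules to a page; a field is None when its landmarks are missing."""
    record = {}
    for field, (start, end) in rules.items():
        begin = page.find(start)
        if begin == -1:
            record[field] = None
            continue
        begin += len(start)
        stop = page.find(end, begin)
        record[field] = page[begin:stop] if stop != -1 else None
    return record


def to_xml(record, tag="record"):
    """Serialize an extracted record as a flat XML element."""
    root = ET.Element(tag)
    for field, value in record.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical verification check: the phone field must match a fixed pattern.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def verify(record):
    """Return True when the record still looks like what the wrapper is meant to extract."""
    phone = record.get("phone")
    return (
        bool(record.get("name"))
        and phone is not None
        and PHONE_PATTERN.fullmatch(phone) is not None
    )


record = extract(PAGE, RULES)
print(to_xml(record))
print(verify(record))
```

If the site changes its layout so that a landmark no longer appears, `extract` returns `None` for that field and `verify` reports failure. In this sketch, that failure is the signal that the wrapper needs to be repaired or relearned.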
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Knoblock, C.A., Lerman, K., Minton, S., Muslea, I. (2003). Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (eds) Intelligent Exploration of the Web. Studies in Fuzziness and Soft Computing, vol 111. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-1772-0_17
DOI: https://doi.org/10.1007/978-3-7908-1772-0_17
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-2519-0
Online ISBN: 978-3-7908-1772-0
eBook Packages: Springer Book Archive