Extracting Web Data Using Instance-Based Learning

Zhai, Yanhong; Liu, Bing

doi:10.1007/s11280-007-0022-0

Published: 02 March 2007

Volume 10, pages 113–132, (2007)

World Wide Web

Abstract

This paper studies structured data extraction from Web pages. Existing approaches to data extraction include wrapper induction and automated methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance to be extracted with labeled instances. The key advantage of our method is that it does not require an initial set of labeled pages to learn extraction rules, as wrapper induction does. Instead, the algorithm can start extraction from a single labeled instance, and a new instance needs labeling only when it cannot be extracted. This avoids unnecessary page labeling and solves a major problem with inductive learning (or wrapper induction), namely that the set of labeled instances may not be representative of all other instances. The instance-based approach is natural because structured data on the Web usually follow fixed templates: pages of the same template can usually be extracted based on a single page instance of that template. A novel technique is proposed to match a new instance with a manually labeled instance and, in the process, to extract the required data items from the new instance. The technique is also efficient. Experimental results based on 1,200 pages from 24 diverse Web sites demonstrate the effectiveness of the method, which also significantly outperforms existing state-of-the-art systems.
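
The labeling workflow described in the abstract (start from one labeled instance, and request a new label only when extraction fails) can be illustrated with a minimal sketch. This is not the matching technique proposed in the paper, which the abstract does not detail. It assumes, purely for illustration, that a page is a list of tokens and that a labeled instance is matched by the exact tokens surrounding the labeled item. All names here (`LabeledInstance`, `extract`, `run`, `label_first_price`) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Tokens = List[str]


class LabeledInstance:
    """A manually labeled page, reduced to the context around the labeled item.

    Assumption for this sketch: the k tokens before and after the labeled
    span tokens[start:end] are enough to locate the item in similar pages.
    """

    def __init__(self, tokens: Tokens, start: int, end: int, k: int = 2):
        self.prefix = tokens[max(0, start - k):start]
        self.suffix = tokens[end:end + k]


def extract(instance: LabeledInstance, tokens: Tokens) -> Optional[Tokens]:
    """Return the tokens between the stored prefix and suffix, or None."""
    p, s = instance.prefix, instance.suffix
    for i in range(len(tokens) - len(p) + 1):
        if tokens[i:i + len(p)] != p:
            continue
        begin = i + len(p)
        # Look for the first occurrence of the suffix after a non-empty item.
        for j in range(begin + 1, len(tokens) - len(s) + 1):
            if tokens[j:j + len(s)] == s:
                return tokens[begin:j]
    return None


def run(pages: List[Tokens],
        ask_user: Callable[[Tokens], Tuple[int, int]]) -> Tuple[List[Tokens], int]:
    """Extract from each page; ask for a label only when no stored instance matches."""
    store: List[LabeledInstance] = []
    results: List[Tokens] = []
    labels_requested = 0
    for tokens in pages:
        item = None
        for inst in store:
            item = extract(inst, tokens)
            if item is not None:
                break
        if item is None:
            # Extraction failed: this page needs manual labeling.
            start, end = ask_user(tokens)
            store.append(LabeledInstance(tokens, start, end))
            item = tokens[start:end]
            labels_requested += 1
        results.append(item)
    return results, labels_requested


def label_first_price(tokens: Tokens) -> Tuple[int, int]:
    """Stand-in for a human labeler: marks the first token starting with '$'."""
    i = next(n for n, t in enumerate(tokens) if t.startswith("$"))
    return i, i + 1
```

With four toy pages drawn from two templates, this sketch requests a label for the first page of each template and extracts the other two pages from the stored instances, which mirrors the "label only on failure" behavior the abstract describes.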

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL, 60607, USA
Yanhong Zhai & Bing Liu

Corresponding author

Correspondence to Bing Liu.

About this article

Cite this article

Zhai, Y., Liu, B. Extracting Web Data Using Instance-Based Learning. World Wide Web 10, 113–132 (2007). https://doi.org/10.1007/s11280-007-0022-0

Received: 01 May 2006
Revised: 14 December 2006
Accepted: 11 January 2007
Published: 02 March 2007
Issue Date: June 2007
DOI: https://doi.org/10.1007/s11280-007-0022-0
