Extracting Web Data Using Instance-Based Learning

World Wide Web

Abstract

This paper studies structured data extraction from Web pages. Existing approaches to data extraction fall into two groups: wrapper induction and automated methods. We propose an instance-based learning method that performs extraction by comparing each new instance to be extracted with labeled instances. Its key advantage is that, unlike wrapper induction, it does not require an initial set of labeled pages from which to learn extraction rules. Instead, the algorithm can start extracting from a single labeled instance, and a new instance needs labeling only when it cannot be extracted. This avoids unnecessary page labeling and solves a major problem with inductive learning (or wrapper induction), namely that the set of labeled instances may not be representative of all other instances. The instance-based approach is natural because structured data on the Web usually follow fixed templates, so pages that share a template can usually be extracted based on a single page instance of that template. We propose a novel technique that matches a new instance with a manually labeled instance and, in the process, extracts the required data items from the new instance. The technique is also very efficient. Experimental results on 1,200 pages from 24 diverse Web sites demonstrate the effectiveness of the method, which also significantly outperforms existing state-of-the-art systems.
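
The label-on-demand workflow described in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the paper's matching technique, so the code below swaps in a much simpler stand-in: each labeled data item is remembered by the few tokens that precede and follow it, and a new page is extracted by locating those same token contexts. All names here (`tokenize`, `InstanceStore`, `extract_all`, `ask_user`, the context width `k`) are hypothetical and chosen only for illustration.

```python
import re


def tokenize(html):
    """Split a page into HTML tags and whitespace-delimited words."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def find_seq(tokens, seq, start=0):
    """Return the first index >= start where seq occurs in tokens, else -1."""
    n = len(seq)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == seq:
            return i
    return -1


class InstanceStore:
    """Holds labeled instances as per-field (prefix, suffix) token contexts.

    This is an assumed, simplified representation, not the paper's.
    """

    def __init__(self, k=2):
        self.k = k            # number of context tokens kept on each side
        self.instances = []   # one dict per labeled page: field -> (prefix, suffix)

    def add_labeled(self, html, labels):
        """Store a labeled page; labels maps field name -> item text on the page."""
        tokens = tokenize(html)
        contexts = {}
        for field, value in labels.items():
            item = tokenize(value)
            i = find_seq(tokens, item)
            if i < 0:
                raise ValueError(f"labeled value for {field!r} not found in page")
            prefix = tokens[max(0, i - self.k):i]
            suffix = tokens[i + len(item):i + len(item) + self.k]
            contexts[field] = (prefix, suffix)
        self.instances.append(contexts)

    def extract(self, html):
        """Try each stored instance; return extracted fields, or None on failure."""
        tokens = tokenize(html)
        for contexts in self.instances:
            result = {}
            for field, (prefix, suffix) in contexts.items():
                p = find_seq(tokens, prefix)
                if p < 0:
                    break
                begin = p + len(prefix)
                end = find_seq(tokens, suffix, begin) if suffix else len(tokens)
                if end < 0:
                    break
                result[field] = " ".join(tokens[begin:end])
            else:
                return result  # every field was located with this instance
        return None


def extract_all(pages, store, ask_user):
    """Extract every page, asking for a label only when extraction fails."""
    results = []
    for html in pages:
        data = store.extract(html)
        if data is None:
            labels = ask_user(html)          # manual labeling, on demand only
            store.add_labeled(html, labels)
            data = dict(labels)
        results.append(data)
    return results
```

With an empty store, the first page triggers one labeling request; later pages that share its template are extracted without further labeling, and only a page from an unseen template triggers another request. A real system would need a far more tolerant matcher than exact token contexts, which is where the paper's contribution lies.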

Author information

Corresponding author

Correspondence to Bing Liu.

About this article

Cite this article

Zhai, Y., Liu, B. Extracting Web Data Using Instance-Based Learning. World Wide Web 10, 113–132 (2007). https://doi.org/10.1007/s11280-007-0022-0

  • DOI: https://doi.org/10.1007/s11280-007-0022-0
