Wrapper Maintenance for Web-Data Extraction Based on Pages Features

  • Shunxian Zhou
  • Yaping Lin
  • Jingpu Wang
  • Xiaolin Yang
Part of the Advances in Soft Computing book series (AINSC, volume 35)

Abstract

Extracting data from Web pages with wrappers is a fundamental problem that arises in a wide variety of practical applications. Web-data extraction raises two main issues: wrapper generation and wrapper maintenance. In this paper, we propose a novel approach to automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved, such as text pattern features, annotations, and hyperlinks. Our approach uses these preserved features to locate the desired values in the changed pages and repairs the wrappers accordingly. Experiments on several real-world Web sites show that the proposed approach can maintain wrappers effectively and extract the desired data with high accuracy.
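
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes a very simple wrapper, a pair of left and right delimiter strings, and two preserved features: a text pattern for the value (a regular expression) and an annotation (a label such as "Price:" that still precedes the value after the page changes). All function names, the sample pages, and the delimiter-based wrapper model are hypothetical and chosen only for illustration.

```python
import re

def extract(page, wrapper):
    """Apply a (left, right) delimiter wrapper; return the value or None."""
    left, right = wrapper
    start = page.find(left)
    if start < 0:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end >= 0 else None

def repair(page, pattern, annotation, context=4):
    """Rebuild a wrapper from preserved features (illustrative assumption).

    The value is relocated as the first match of `pattern` that follows
    the `annotation` text. The new left delimiter runs from the annotation
    to the value; the new right delimiter is the `context` characters
    that follow the value.
    """
    anchor = page.find(annotation)
    if anchor < 0:
        return None
    match = re.compile(pattern).search(page, anchor + len(annotation))
    if match is None:
        return None
    left = page[anchor:match.start()]
    right = page[match.end():match.end() + context]
    if not right:
        return None  # value at end of page: no usable right delimiter
    return (left, right)

PRICE = r"\$\d+\.\d{2}"  # preserved text pattern for the desired value

old_page = "<tr><td>Price:</td><td><b>$19.99</b></td></tr>"
new_page = '<div><span class="label">Price:</span> <em>$21.50</em></div>'

old_wrapper = ("<td><b>", "</b>")
repaired = repair(new_page, PRICE, "Price:")
```

Here the old wrapper stops matching once the table layout is replaced, while the label and the price format survive the change, so `repair` can derive new delimiters from the changed page and `extract` can be reused with them on other pages that share the new layout. A realistic system would combine several features and verify candidates over many sample pages; this sketch relies on a single label and a single pattern.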

Copyright information

© Springer 2006

Authors and Affiliations

  • Shunxian Zhou (1)
  • Yaping Lin (1)
  • Jingpu Wang (2)
  • Xiaolin Yang (2)
  1. College of Software, University of Hunan, Changsha, China
  2. College of Computer and Communication, University of Hunan, Changsha, China
