Wrapper Maintenance for Web-Data Extraction Based on Pages Features
Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. There are two main issues relevant to Web-data extraction, namely wrapper generation and wrapper maintenance. In this paper, we propose a novel approach to automatic wrapper maintenance. It is based on the observation that despite various page changes, many important features of the pages are preserved, such as text pattern features, annotations, and hyperlinks. Our approach uses these preserved features to identify the locations of the desired values in the changed pages, and repairs wrappers correspondingly. Experiments over several real-world Web sites show that the proposed automatic approach can effectively maintain wrappers to extract desired data with high accuracy.
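The idea of relocating a value through preserved features can be illustrated with a small sketch. It is not the paper's algorithm, which the abstract does not spell out; it only assumes that a value keeps its text pattern (here a price-like regular expression) and its annotation (a nearby label such as "Price:") while the surrounding markup changes. The names `TextCollector`, `text_nodes`, `relocate`, and the sample pages are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tags without a closing tag; they are kept out of the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextCollector(HTMLParser):
    """Collect non-empty text nodes together with their tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def relocate(html, value_pattern, annotation):
    """Find the value by its preserved features, ignoring page layout.

    Returns (path, text) for the first text node that fully matches
    value_pattern and directly follows a node containing the annotation,
    or None if no such node exists.
    """
    nodes = text_nodes(html)
    for (_, prev_text), (path, text) in zip(nodes, nodes[1:]):
        if annotation in prev_text and re.fullmatch(value_pattern, text):
            return path, text
    return None


# Assumed example: the site moved from a table layout to div/span markup.
old_page = "<table><tr><td>Price:</td><td>$12.50</td></tr></table>"
new_page = (
    "<div><span>Price:</span><b>$13.99</b></div>"
    "<div><span>Shipping:</span><b>$4.00</b></div>"
)

PRICE = r"\$\d+\.\d{2}"  # text pattern feature of the desired value

old_hit = relocate(old_page, PRICE, "Price:")
new_hit = relocate(new_page, PRICE, "Price:")
print(old_hit)
print(new_hit)
```

In this sketch, the path returned for the changed page is what a repaired wrapper would extract from afterwards. Hyperlink features, which the abstract also mentions, could be handled the same way by recording `href` attributes in `handle_starttag`.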