Abstract
A substantial subset of web data follows some kind of underlying structure. To let software programs gain full benefit from these “semi-structured” web sources, wrapper programs are built to provide a “machine-readable” view over them. A significant problem with wrappers is that, since web sources are autonomous, they may undergo changes that invalidate the current wrapper, so automatic maintenance is an important research issue. Web wrappers must perform two kinds of tasks: automatically navigating through websites and automatically extracting structured data from HTML pages. While several previous works have addressed the automatic maintenance of the components performing the data extraction task, the problem of automatically maintaining the required web navigation sequences remains, to the best of our knowledge, unaddressed. In this paper we propose and experimentally validate a set of novel heuristics and algorithms to fill this gap.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Raposo, J., Álvarez, M., Losada, J., Pan, A. (2006). Maintaining Web Navigation Flows for Wrappers. In: Lee, J., Shim, J., Lee, Sg., Bussler, C., Shim, S. (eds) Data Engineering Issues in E-Commerce and Services. DEECS 2006. Lecture Notes in Computer Science, vol 4055. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780397_9
DOI: https://doi.org/10.1007/11780397_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35440-6
Online ISBN: 978-3-540-35441-3
eBook Packages: Computer Science, Computer Science (R0)