Maintaining Web Navigation Flows for Wrappers

  • Juan Raposo
  • Manuel Álvarez
  • José Losada
  • Alberto Pan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4055)

Abstract

A substantial portion of web data follows some kind of underlying structure. To let software programs take full advantage of these “semi-structured” web sources, wrapper programs are built to provide a “machine-readable” view over them. A significant problem with wrappers is that, because web sources are autonomous, they may undergo changes that invalidate the current wrapper, so automatic maintenance is an important research issue. Web wrappers must perform two kinds of tasks: automatically navigating through websites and automatically extracting structured data from HTML pages. While several previous works have addressed the automatic maintenance of the components that perform data extraction, the problem of automatically maintaining the required web navigation sequences remains, to the best of our knowledge, unaddressed. In this paper we propose and experimentally validate a set of novel heuristics and algorithms to fill this gap.

Keywords

Form Field, Visual Distance, Query Form, Searchable Attribute, Detail Page
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Juan Raposo (1)
  • Manuel Álvarez (1)
  • José Losada (2)
  • Alberto Pan (1)

  1. Department of Information and Communications Technologies, University of A Coruña, Coruña, Spain
  2. Denodo Technologies Inc., Real 22, 3º, A Coruña, Spain
