Maintaining Web Navigation Flows for Wrappers

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4055)

Abstract

A substantial portion of web data follows some kind of underlying structure. To let software programs benefit fully from these “semi-structured” web sources, wrapper programs are built to provide a “machine-readable” view over them. A significant problem with wrappers is that web sources are autonomous and may change in ways that invalidate the current wrapper, so automatic maintenance is an important research issue. Web wrappers must perform two kinds of tasks: automatically navigating through websites and automatically extracting structured data from HTML pages. While several previous works have addressed the automatic maintenance of the components that perform data extraction, the problem of automatically maintaining the required web navigation sequences remains, to the best of our knowledge, unaddressed. In this paper we propose and experimentally validate a set of novel heuristics and algorithms to fill this gap.
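
To make the two wrapper tasks concrete, the following toy sketch separates a navigation step (following a link by its anchor text) from an extraction step (reading table cells), and shows how a site redesign can invalidate the navigation sequence while the extraction logic would still work. It is an illustration only and not the heuristics or algorithms proposed in the paper: the in-memory sites `SITE_V1` and `SITE_V2`, the page paths, the anchor texts, and all function and class names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML. No real website is modeled.
SITE_V1 = {
    "/": '<a href="/search">Search books</a>',
    "/search": "<table><tr><td>Dune</td><td>9.99</td></tr></table>",
}
# The same hypothetical site after a redesign: anchor text and target changed.
SITE_V2 = {
    "/": '<a href="/find">Find books</a>',
    "/find": "<table><tr><td>Dune</td><td>9.99</td></tr></table>",
}


class LinkFinder(HTMLParser):
    """Collects (anchor text, href) pairs from a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


class CellCollector(HTMLParser):
    """Collects table rows as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True

    def handle_data(self, data):
        if self._in_td and self.rows:
            self.rows[-1].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False


def navigate(site, start, link_texts):
    """Navigation task: follow links by exact anchor text, starting at `start`."""
    path = start
    for text in link_texts:
        finder = LinkFinder()
        finder.feed(site[path])
        targets = [href for label, href in finder.links if label == text]
        if not targets:
            # The recorded navigation step no longer matches the page.
            raise LookupError(f"navigation step {text!r} not found on {path}")
        path = targets[0]
    return path


def extract(html):
    """Extraction task: turn the result page into structured rows."""
    collector = CellCollector()
    collector.feed(html)
    return collector.rows


def run_wrapper(site):
    """Run the recorded navigation sequence, then extract from the page reached."""
    page = navigate(site, "/", ["Search books"])
    return extract(site[page])
```

In this sketch, `run_wrapper(SITE_V1)` reaches the result page and returns its rows, while `run_wrapper(SITE_V2)` raises `LookupError` because the recorded anchor text no longer exists. That failure of the navigation sequence, independent of the extraction rules, is the kind of breakage the abstract refers to.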

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Raposo, J., Álvarez, M., Losada, J., Pan, A. (2006). Maintaining Web Navigation Flows for Wrappers. In: Lee, J., Shim, J., Lee, Sg., Bussler, C., Shim, S. (eds) Data Engineering Issues in E-Commerce and Services. DEECS 2006. Lecture Notes in Computer Science, vol 4055. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780397_9

  • DOI: https://doi.org/10.1007/11780397_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35440-6

  • Online ISBN: 978-3-540-35441-3

  • eBook Packages: Computer Science, Computer Science (R0)