Automatic Wrapper Adaptation by Tree Edit Distance Matching

  • Conference paper
Combinations of Intelligent Methods and Applications

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 8)

Abstract

The amount of information distributed through the Web keeps growing, and for this reason several techniques for extracting Web data have been proposed in recent years. Extraction tasks are often performed by so-called wrappers: procedures that extract information from Web pages, e.g. by implementing logic-based techniques. Many application fields today require highly robust wrappers, so as not to compromise information assets or the reliability of the extracted data.

Unfortunately, a wrapper may fail to extract data from a Web page if the page's structure changes, sometimes even slightly. Such failures call for new techniques that automatically adapt the wrapper to the new structure of the page. In this work we present a novel approach to automatic wrapper adaptation based on measuring the similarity of trees through improved tree edit distance matching techniques.
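To make the idea of tree similarity concrete, the following is a minimal sketch of the classic simple tree matching algorithm, a well-known top-down relative of tree edit distance. It is an illustration only and not the improved technique the abstract refers to, which is not described here. The `Node` class, the `similarity` normalisation (matching size divided by the mean of the two tree sizes), and the toy page trees are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an HTML/DOM element: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum top-down matching of a and b."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i subtrees of a and the first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 counts the matched roots


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Assumed normalisation: matching size over the mean size of the two trees."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


# Toy example: the "new" page gains one <span> inside the <div>.
old_page = Node("html", [Node("body", [
    Node("div", [Node("p"), Node("p")]),
    Node("table"),
])])
new_page = Node("html", [Node("body", [
    Node("div", [Node("p"), Node("span"), Node("p")]),
    Node("table"),
])])

print(simple_tree_matching(old_page, new_page))  # 6
print(round(similarity(old_page, new_page), 3))
```

In this toy case every node of the old page still finds a counterpart in the new one, so the score stays close to 1. A wrapper adaptation step could plausibly use such a score to locate, in the changed page, the subtree most similar to the one the wrapper originally targeted.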

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferrara, E., Baumgartner, R. (2011). Automatic Wrapper Adaptation by Tree Edit Distance Matching. In: Hatzilygeroudis, I., Prentzas, J. (eds) Combinations of Intelligent Methods and Applications. Smart Innovation, Systems and Technologies, vol 8. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19618-8_3

  • DOI: https://doi.org/10.1007/978-3-642-19618-8_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19617-1

  • Online ISBN: 978-3-642-19618-8

  • eBook Packages: Engineering (R0)
