Abstract
The volume of information distributed through the Web keeps growing, and for this reason several techniques for extracting Web data have been proposed in recent years. Extraction tasks are often performed by so-called wrappers: procedures that extract information from Web pages, e.g. by implementing logic-based techniques. Many application fields today require wrappers to be highly robust, so as not to compromise information assets or the reliability of the extracted data.
Unfortunately, a wrapper may fail to extract data from a Web page if the page's structure changes, sometimes even slightly. When such a failure occurs, new techniques are needed to adapt the wrapper automatically to the new structure of the page. In this work we present a novel approach to automatic wrapper adaptation, based on measuring the similarity of trees through improved tree edit distance matching techniques.
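To give a rough idea of what measuring the similarity of two page trees can look like, the sketch below implements the classic Simple Tree Matching algorithm on a toy tree type. It is not the improved variant this chapter proposes, which the abstract does not detail. The names `Node`, `simple_tree_matching`, `tree_size` and `similarity` are illustrative, and normalizing the score by the average tree size is an assumption, one of several possible choices.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of the largest order-preserving top-down matching of a and b."""
    if a.tag != b.tag:
        return 0  # different roots: nothing below them is matched
    m, n = len(a.children), len(b.children)
    # best[i][j] = best matching between the first i children of a
    # and the first j children of b (dynamic programming, as in LCS).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1] + pair)
    return best[m][n] + 1  # +1 counts the matched roots


def tree_size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(tree_size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# A page before and after a small structural change (a <span> was inserted).
old_page = Node("html", [Node("body", [Node("div", [Node("p"), Node("p")])])])
new_page = Node("html", [Node("body", [Node("div", [Node("p"), Node("span"), Node("p")])])])
print(simple_tree_matching(old_page, new_page), round(similarity(old_page, new_page), 3))
```

A wrapper adaptation step could, under these assumptions, compare the subtree the wrapper used to select against candidate subtrees in the changed page and pick the candidate with the highest score, provided it exceeds some threshold.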
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferrara, E., Baumgartner, R. (2011). Automatic Wrapper Adaptation by Tree Edit Distance Matching. In: Hatzilygeroudis, I., Prentzas, J. (eds) Combinations of Intelligent Methods and Applications. Smart Innovation, Systems and Technologies, vol 8. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19618-8_3
DOI: https://doi.org/10.1007/978-3-642-19618-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19617-1
Online ISBN: 978-3-642-19618-8
eBook Packages: Engineering, Engineering (R0)