Abstract
This paper studies automatic extraction of structured data from Web pages. Each of such pages may contain several groups of structured data records. Existing automatic methods still have several limitations. In this paper, we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree and matches subtrees in the process using a tree edit distance method and visual cues. After the process ends, data records are found and data items in them are aligned and extracted. The method can extract data from both flat and nested data records. Experimental evaluation shows that the method performs the extraction task accurately.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Arasu, A., Garcia-Molina, H.: Extracting Structured Data from Web Pages. In: SIGMOD 2003 (2003)
Buttler, D., Liu, L., Pu, C.: A fully automated extraction system for the World Wide Web. In: IEEE ICDCS-21 (2001)
Chang, C., Lui, S.-L.: IEPAD: Information extraction based on pattern discovery. In: WWW10 (2001)
Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: towards automatic data extraction from large web sites. In: VLDB 2001 (2001)
Kushmerick, N.: Wrapper induction: efficiency and expressiveness. Artificial Intelligence 118, 15–68 (2000)
Lerman, K., Getoor, L., Minton, S., Knoblock, C.: Using the Structure of Web Sites for Automatic Segmentation of Tables. In: SIGMOD 2004 (2004)
Liu, B., Grossman, R., Zhai, Y.: Mining data records from Web pages. In: KDD 2003 (2003)
Muslea, I., Minton, S., Knoblock, C.: A hierarchical approach to wrapper induction. In: Agents 1999 (1999)
Pinto, D., McCallum, A., Wei, X., Bruce, W.: Table extraction using conditional random fields. In: SIGIR 2003 (2003)
Reis, D., Golgher, P., Silva, A., Laender, A.: Automatic Web news extraction using tree edit distance. In: WWW 2004 (2004)
Wang, J.-Y., Lochovsky, F.: Data extraction and label assignment for Web databases. In: WWW 2003 (2003)
Yang, W.: Identifying syntactic differences between two programs. Softw. Pract. Exper. 21(7), 739–755 (1991)
Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: WWW 2005 (2005)
Zhai, Y., Liu, B.: Extracting Web data using instance-based learning. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, J.-Y., Sheng, Q.Z. (eds.) WISE 2005. LNCS, vol. 3806, pp. 318–331. Springer, Heidelberg (2005)
Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully automatic wrapper generation for search engines. In: WWW 2005 (2005)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, B., Zhai, Y. (2005). NET – A System for Extracting Web Data from Flat and Nested Data Records. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_39
Download citation
DOI: https://doi.org/10.1007/11581062_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer ScienceComputer Science (R0)