NET – A System for Extracting Web Data from Flat and Nested Data Records

Liu, Bing; Zhai, Yanhong

doi:10.1007/11581062_39

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

This paper studies the automatic extraction of structured data from Web pages. Each such page may contain several groups of structured data records. Existing automatic methods still have several limitations, so we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree, matching subtrees along the way using a tree edit distance method and visual cues. When the traversal ends, the data records have been found, and the data items in them are aligned and extracted. The method can extract data from both flat and nested data records. Experimental evaluation shows that the method performs the extraction task accurately.
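
The traversal-and-matching idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the algorithm's details, so the sketch assumes a hand-built tag tree, omits visual cues entirely, and swaps in a simple top-down tree matching (a restricted form of tree edit distance) as the similarity measure. The names `Node`, `simple_tree_match`, `similarity`, `find_repeated_siblings`, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A hypothetical tag-tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_match(a: Node, b: Node) -> int:
    """Size of a maximum top-down, order-preserving matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_match(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalized match score in [0, 1]; 1.0 means identical tag structure."""
    return 2 * simple_tree_match(a, b) / (size(a) + size(b))


def find_repeated_siblings(root: Node, threshold: float = 0.7) -> List[Tuple[str, int]]:
    """Post-order traversal that reports runs of adjacent similar children.

    Children are processed before their parent, so repeated structures
    nested deeper in the tree are reported before the ones enclosing them.
    Each result is (parent tag, number of similar adjacent children).
    """
    found: List[Tuple[str, int]] = []

    def visit(node: Node) -> None:
        for child in node.children:
            visit(child)
        run: List[Node] = []
        for child in node.children:
            if run and similarity(run[-1], child) >= threshold:
                run.append(child)
            else:
                if len(run) >= 2:
                    found.append((node.tag, len(run)))
                run = [child]
        if len(run) >= 2:
            found.append((node.tag, len(run)))

    visit(root)
    return found


def row() -> Node:
    """A toy record: one table row with two cells."""
    return Node("tr", [Node("td"), Node("td")])


page = Node("body", [Node("h1"), Node("table", [row(), row(), row()])])
regions = find_repeated_siblings(page)
```

On this toy page the three `tr` rows under `table` are reported as one run of similar siblings. The sketch also reports the repeated `td` cells inside each row; a real system would need further criteria, such as the visual cues the abstract mentions, to decide which repeated structures are data records and which are data items within them.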

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL, 60607, USA
Bing Liu & Yanhong Zhai

Editor information

Editors and Affiliations

Texas State University, San Marcos, TX
Anne H. H. Ngu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
University of Vienna, Vienna, Austria
Erich J. Neuhold
IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York, 10598, USA
Jen-Yao Chung
School of Computer Science and Engineering, University of New South Wales, NSW 2052, Sydney, Australia
Quan Z. Sheng

About this paper

Cite this paper

Liu, B., Zhai, Y. (2005). NET – A System for Extracting Web Data from Flat and Nested Data Records. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_39

DOI: https://doi.org/10.1007/11581062_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer Science, Computer Science (R0)
