NET – A System for Extracting Web Data from Flat and Nested Data Records

  • Conference paper
Web Information Systems Engineering – WISE 2005 (WISE 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Abstract

This paper studies the automatic extraction of structured data from Web pages. Each such page may contain several groups of structured data records. Existing automatic methods still have several limitations. In this paper, we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree, matching subtrees along the way using a tree edit distance method and visual cues. When the traversal ends, the data records have been found, and the data items in them are aligned and extracted. The method can extract data from both flat and nested data records. Experimental evaluation shows that the method performs the extraction task accurately.
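
The traversal-and-matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which tree edit distance variant is used, so the sketch substitutes simple tree matching (a restricted top-down matching computed by dynamic programming) as a stand-in. It ignores visual cues, data item alignment, and the special handling of nested records. The names `Node`, `simple_tree_match`, `similarity`, `find_record_groups`, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag-tree node (hypothetical structure; only the tag name is kept)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_match(a: Node, b: Node) -> int:
    """Count matched node pairs in a maximum top-down matching of a and b.

    Stand-in for the tree edit distance method named in the abstract.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best match count using the first i children of a
    # and the first j children of b, preserving sibling order.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalized match score in [0, 1]; 1.0 means identical tag structure."""
    return 2 * simple_tree_match(a, b) / (size(a) + size(b))


def find_record_groups(root: Node, threshold: float = 0.8) -> List[List[Node]]:
    """Post-order traversal that groups runs of adjacent, similar siblings."""
    groups: List[List[Node]] = []

    def visit(node: Node) -> None:
        for child in node.children:
            visit(child)  # descendants are processed before their parent
        run: List[Node] = []
        for child in node.children:
            if run and similarity(run[-1], child) >= threshold:
                run.append(child)
            else:
                if len(run) > 1:
                    groups.append(run)
                run = [child]
        if len(run) > 1:
            groups.append(run)

    visit(root)
    return groups
```

Because the traversal is post-order, groups found deeper in the tree are recorded before groups of their ancestors. For a table of three structurally identical rows, the last group returned is the run of three `tr` subtrees, preceded by the groups of sibling cells found inside each row.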




© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, B., Zhai, Y. (2005). NET – A System for Extracting Web Data from Flat and Nested Data Records. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_39

  • DOI: https://doi.org/10.1007/11581062_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30017-5

  • Online ISBN: 978-3-540-32286-3

  • eBook Packages: Computer Science, Computer Science (R0)
