Automatic Data Extraction from Data-Rich Web Pages

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA, volume 3453)

Abstract

Extracting data from web pages using wrappers is a fundamental problem that arises in a wide variety of applications of great practical interest. In this paper, we propose a novel technique for differentiating the roles of data items in web pages, which is one of the key problems in our automatic extraction approach. The problem is resolved at several levels (semantic blocks, sections, and data items), and several approaches are proposed to effectively identify the mapping between data items that have the same role. Extensive experiments on real web sites show that the proposed technique helps extract the desired data with high accuracy in most cases.

Keywords

  • Data Item
  • Schema Element
  • Extraction Rule
  • Syntactic Feature
  • Sample Page

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hu, D., Meng, X. (2005). Automatic Data Extraction from Data-Rich Web Pages. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_75

  • DOI: https://doi.org/10.1007/11408079_75

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25334-1

  • Online ISBN: 978-3-540-32005-0

  • eBook Packages: Computer Science, Computer Science (R0)
