WDEE: Web Data Extraction by Example

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3453)

Abstract

Web data extraction systems in use today transform semi-structured Web documents and deliver structured documents to end users. Some systems provide a visual interface with which users generate the extraction rules. For end users, however, the visual appearance of the Web documents is lost during the transformation. In this paper, we propose an approach that allows a user to query extracted documents without knowledge of a formal query language. We bridge the gap between the visual appearance of Web documents and the structured documents extracted from them by providing a QBE-like (Query by Example) interface named Wdee. The principal component of our method is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Wdee generates tree skeletons based on schema information, and a user executes queries by entering conditions in the skeletons. By maintaining the mapping between the schemata of Web documents and those of extracted documents, a visual example can be presented to end users. With the example, Wdee allows a user to construct tree skeletons in a manner that resembles browsing Web pages.
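The idea of querying through a tree skeleton can be illustrated with a minimal Python sketch. It is not Wdee's algorithm, which the abstract does not specify. It assumes that an extracted document is a labelled tree, that a skeleton has one node per distinct child label, and that a user-entered condition is a predicate on a node's text. The names `Node`, `Skeleton`, `build_skeleton`, `matches`, and the sample book records are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """A node of an extracted (structured) document. Hypothetical model."""
    label: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


@dataclass
class Skeleton:
    """A skeleton node: a label plus an optional user-entered condition."""
    label: str
    condition: Optional[Callable[[str], bool]] = None
    children: list[Skeleton] = field(default_factory=list)


def build_skeleton(doc: Node) -> Skeleton:
    """Derive a skeleton from one document, keeping one child per distinct label.

    Assumption: a single sample document stands in for the schema.
    """
    skel = Skeleton(doc.label)
    seen = set()
    for child in doc.children:
        if child.label not in seen:
            seen.add(child.label)
            skel.children.append(build_skeleton(child))
    return skel


def has_condition(skel: Skeleton) -> bool:
    """True if this skeleton node or any descendant carries a condition."""
    return skel.condition is not None or any(has_condition(c) for c in skel.children)


def matches(doc: Node, skel: Skeleton) -> bool:
    """Check a document against a skeleton; unconstrained branches are ignored."""
    if doc.label != skel.label:
        return False
    if skel.condition is not None and not skel.condition(doc.text):
        return False
    return all(
        any(matches(c, s) for c in doc.children)
        for s in skel.children
        if has_condition(s)
    )


def book(title: str, price: str) -> Node:
    """Build a made-up extracted record for the example."""
    return Node("book", children=[Node("title", title), Node("price", price)])


records = [book("Data on the Web", "25"), book("Foundations of Databases", "40")]

# The user "fills in" the price slot of the skeleton, QBE-style.
skeleton = build_skeleton(records[0])
price_slot = next(c for c in skeleton.children if c.label == "price")
price_slot.condition = lambda text: float(text) < 30

titles = [r.children[0].text for r in records if matches(r, skeleton)]
print(titles)
```

Branches of the skeleton that the user leaves empty are skipped during matching, mirroring how blank columns in a QBE table impose no constraint.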

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, Z., Ng, W.K. (2005). WDEE: Web Data Extraction by Example. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_32

  • DOI: https://doi.org/10.1007/11408079_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25334-1

  • Online ISBN: 978-3-540-32005-0

  • eBook Packages: Computer Science, Computer Science (R0)
