SEDE: A Schema Explorer and Data Extractor for HTML Web Pages

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 236)

Abstract

We present an approach for automatically exploring relation schemas and extracting data from HTML pages. By abstracting the DOM tree constructed from an HTML page into a set of generalized lists, the approach automatically generates a relation schema for storing the data extracted from the page. Based on this approach, we have developed a software system named SEDE (Schema Explorer and Data Extractor for HTML pages), which reduces the workload of extracting and storing data objects found in HTML pages. This paper mainly introduces SEDE.
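The abstract gives only the outline of the method, so the following is a minimal sketch of the general idea rather than SEDE's actual algorithm. It assumes that a "generalized list" can be modelled as a nested Python list of the form `[tag, child, child, ...]`, that a data region is the first node whose element children all share one structural shape, and that each text leaf of a record maps to one column. All function names, the column names `col1`, `col2`, ... and the table name `extracted` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are recorded but not descended into.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _TreeBuilder(HTMLParser):
    """Build a nested-list view of the DOM: [tag, child, ...]; text leaves are str."""

    def __init__(self):
        super().__init__()
        self.root = ["#root"]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag]
        self.stack[-1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].append(text)


def to_generalized_list(html):
    """Parse an HTML string into the nested-list representation."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def shape(node):
    """Structural signature of a node: tags are kept, text is collapsed to '#text'."""
    if isinstance(node, str):
        return "#text"
    return (node[0],) + tuple(shape(child) for child in node[1:])


def leaves(node):
    """Collect the text leaves under a node, in document order."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node[1:]:
        out.extend(leaves(child))
    return out


def find_records(node, min_repeat=2):
    """Return rows from the first node whose element children all share one shape."""
    if isinstance(node, str):
        return None
    kids = [child for child in node[1:] if isinstance(child, list)]
    if len(kids) >= min_repeat and len({shape(k) for k in kids}) == 1:
        rows = [leaves(k) for k in kids]
        if rows[0]:
            return rows
    for k in kids:
        found = find_records(k, min_repeat)
        if found:
            return found
    return None


def infer_schema(rows, table="extracted"):
    """Derive placeholder column names and a CREATE TABLE statement from the rows."""
    cols = [f"col{i + 1}" for i in range(len(rows[0]))]
    ddl = f"CREATE TABLE {table} (" + ", ".join(f"{c} TEXT" for c in cols) + ")"
    return cols, ddl


if __name__ == "__main__":
    page = "<ul><li><b>Ann</b><i>30</i></li><li><b>Bob</b><i>25</i></li></ul>"
    rows = find_records(to_generalized_list(page))
    print(rows)
    print(infer_schema(rows)[1])
```

Requiring identical shapes is a deliberately strict simplification. Real pages contain records with optional fields, which is why the extraction literature relies on tree matching or alignment instead of exact equality.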

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deng, X. (2011). SEDE: A Schema Explorer and Data Extractor for HTML Web Pages. In: Zhu, M. (eds) Information and Management Engineering. ICCIC 2011. Communications in Computer and Information Science, vol 236. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24097-3_5

  • DOI: https://doi.org/10.1007/978-3-642-24097-3_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24096-6

  • Online ISBN: 978-3-642-24097-3

  • eBook Packages: Computer Science, Computer Science (R0)
