SEDE: A Schema Explorer and Data Extractor for HTML Web Pages

Deng, Xubin

doi:10.1007/978-3-642-24097-3_5

Xubin Deng

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 236)

Abstract

We present an approach for automatically exploring relation schemas and extracting data from HTML pages. By abstracting the DOM tree constructed from an HTML page into a set of generalized lists, this approach automatically generates a relation schema for storing the data extracted from the page. Based on this approach, we have developed a software system named SEDE (Schema Explorer and Data Extractor for HTML pages), which reduces the workload of extracting and storing data objects within HTML pages. This paper mainly introduces SEDE.
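
The abstract does not spell out how the generalized lists are built, so the following is only an illustrative sketch of the general idea, not SEDE's actual algorithm. It assumes that sibling DOM subtrees with an identical tag structure form one list of records, and that each leaf position in a record becomes one column. All names (`TreeBuilder`, `shape`, `find_record_lists`, `to_relation`), the generated column names, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree; assumes well-formed, properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def shape(node):
    """Structural signature: the tag plus the shapes of all children."""
    return (node.tag, tuple(shape(c) for c in node.children))


def leaf_texts(node):
    """Texts of the leaves of a subtree, in document order."""
    if not node.children:
        return [node.text]
    return [t for c in node.children for t in leaf_texts(c)]


def find_record_lists(node, min_records=2):
    """Yield groups of sibling subtrees that share one shape (candidate records)."""
    groups = {}
    for child in node.children:
        groups.setdefault(shape(child), []).append(child)
    record_ids = set()
    for sig, members in groups.items():
        # sig[1] is empty for bare leaves, which carry no record structure.
        if len(members) >= min_records and sig[1]:
            record_ids.update(id(m) for m in members)
            yield members
    for child in node.children:
        if id(child) not in record_ids:  # do not search inside records
            yield from find_record_lists(child, min_records)


def to_relation(records, name):
    """Derive a CREATE TABLE statement and the rows for one record list."""
    rows = [tuple(leaf_texts(r)) for r in records]
    columns = [f"col_{i + 1}" for i in range(len(rows[0]))]
    ddl = f"CREATE TABLE {name} ({', '.join(c + ' TEXT' for c in columns)});"
    return ddl, rows


# Made-up sample page, for illustration only.
PAGE = """
<html><body>
<table>
<tr><td>apple</td><td>1.20</td></tr>
<tr><td>pear</td><td>0.80</td></tr>
</table>
</body></html>
"""

builder = TreeBuilder()
builder.feed(PAGE)
for i, records in enumerate(find_record_lists(builder.root)):
    ddl, rows = to_relation(records, f"t{i}")
    print(ddl)
    print(rows)
```

Grouping by exact structural shape is the simplest possible choice and breaks as soon as records differ slightly (for example, an optional cell). A real system would need a tolerant comparison, such as a tree edit distance or partial tree alignment.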

Author information

Authors and Affiliations

School of Information, Zhejiang University of Finance & Economics, Hangzhou, 310018, China
Xubin Deng

Editor information

Editors and Affiliations

Nanchang University, 235 Nanjing Donglu, 330047, Nanchang, China
Min Zhu

About this paper

Cite this paper

Deng, X. (2011). SEDE: A Schema Explorer and Data Extractor for HTML Web Pages. In: Zhu, M. (eds) Information and Management Engineering. ICCIC 2011. Communications in Computer and Information Science, vol 236. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24097-3_5

DOI: https://doi.org/10.1007/978-3-642-24097-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24096-6
Online ISBN: 978-3-642-24097-3
eBook Packages: Computer Science, Computer Science (R0)
