Abstract
We present an approach for automatically exploring relation schemas and extracting data from HTML pages. By abstracting the DOM tree constructed from an HTML page into a set of generalized lists, the approach automatically generates a relation schema for storing the data extracted from the page. Based on this approach, we have developed a software system named SEDE (Schema Explorer and Data Extractor for HTML pages), which reduces the workload of extracting and storing data objects found in HTML pages. This paper mainly introduces SEDE.
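The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: a DOM tree is represented as nested lists of the form [tag, child, child, ...], and sibling subtrees that share the same tag structure are treated as rows of one relation. All names here (to_generalized_list, shape, leaves, extract_relations, the col1, col2 column labels) are hypothetical and are not taken from SEDE.

```python
from html.parser import HTMLParser


class GeneralizedListBuilder(HTMLParser):
    """Builds a nested-list view of the DOM: [tag, child, child, ...]."""

    # Elements that never have a closing tag (assumed subset, for illustration).
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = ["#document"]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag]
        self.stack[-1].append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].append(text)


def to_generalized_list(html):
    """Parse an HTML string into the nested-list representation."""
    builder = GeneralizedListBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def shape(node):
    """Structural signature of a node: tags kept, text replaced by '#text'."""
    if isinstance(node, str):
        return "#text"
    return (node[0],) + tuple(shape(child) for child in node[1:])


def leaves(node):
    """Text values of a subtree, in document order."""
    if isinstance(node, str):
        return [node]
    values = []
    for child in node[1:]:
        values.extend(leaves(child))
    return values


def extract_relations(node, found=None):
    """Treat two or more siblings with the same shape as rows of one relation."""
    if found is None:
        found = []
    if isinstance(node, str):
        return found
    groups = {}
    for child in node[1:]:
        if not isinstance(child, str):
            groups.setdefault(shape(child), []).append(child)
    for members in groups.values():
        rows = [leaves(m) for m in members]
        if len(members) >= 2 and rows[0]:
            # Assumed naming scheme: one generic column per text leaf.
            columns = [f"col{i + 1}" for i in range(len(rows[0]))]
            found.append({"columns": columns, "rows": rows})
        else:
            for m in members:
                extract_relations(m, found)
    return found
```

Calling extract_relations(to_generalized_list(page)) on a page containing a plain two-column table would yield one relation whose rows are the table rows. Exact shape matching is a deliberate simplification; the tree-alignment and tree-edit-distance techniques in the literature tolerate rows that differ slightly in structure.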
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Deng, X. (2011). SEDE: A Schema Explorer and Data Extractor for HTML Web Pages. In: Zhu, M. (eds) Information and Management Engineering. ICCIC 2011. Communications in Computer and Information Science, vol 236. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24097-3_5
DOI: https://doi.org/10.1007/978-3-642-24097-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24096-6
Online ISBN: 978-3-642-24097-3
eBook Packages: Computer Science, Computer Science (R0)