Abstract
We present a new approach to automatically converting HTML documents into XML documents. The approach first captures the inter-block nested structure and then the intra-block nested structure of an HTML document, where blocks include headings, lists, paragraphs, and tables, by exploiting both the formatting information and the structural information implied by HTML tags.
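To illustrate what capturing nested structure among blocks can mean in practice, the following is a minimal sketch and not the algorithm of the paper, whose details are not given in this abstract. It assumes that heading levels (h1 to h6) alone determine nesting, handles only headings and paragraphs, and uses hypothetical names (`BlockCollector`, `html_to_xml`) and hypothetical output element names (`document`, `section`, `para`).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: nesting depth is decided purely by heading level.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for heading and paragraph blocks only."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None   # tag of the block currently being read, if any
        self._text = []    # text fragments seen inside that block

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.blocks.append((tag, "".join(self._text).strip()))
            self._tag = None


def html_to_xml(html):
    """Nest paragraphs under the most recent heading, and headings by level."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()

    root = ET.Element("document")
    stack = [(0, root)]  # (heading level, XML element); the root has level 0
    for tag, text in collector.blocks:
        if tag in HEADINGS:
            level = HEADINGS[tag]
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section", title=text)
            stack.append((level, section))
        else:
            ET.SubElement(stack[-1][1], "para").text = text
    return root


sample = "<h1>Intro</h1><p>First.</p><h2>Detail</h2><p>Second.</p><h1>End</h1>"
print(ET.tostring(html_to_xml(sample), encoding="unicode"))
```

In this sketch the `h2` section ends up nested inside the preceding `h1` section, and each paragraph is attached to the innermost open section. Lists, tables, and formatting cues such as bold or font size, which the abstract names as inputs to the approach, are deliberately left out.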
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, S., Liu, M., Ling, T.W., Peng, Z. (2004). Automatic HTML to XML Conversion. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_78
DOI: https://doi.org/10.1007/978-3-540-27772-9_78
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive