Automatic HTML to XML Conversion

Li, Shijun; Liu, Mengchi; Ling, Tok Wang; Peng, Zhiyong

doi:10.1007/978-3-540-27772-9_78

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3129)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

We present a new approach to automatically converting HTML documents into XML documents. The approach first captures the inter-block nested structure and then the intra-block nested structure of the blocks in an HTML document (headings, lists, paragraphs, and tables), exploiting both the formatting information and the structural information implied by HTML tags.
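
To make the idea of inter-block nesting concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes that heading levels (h1 to h6) alone decide how blocks nest, handles only headings and paragraphs, and uses hypothetical output element names (`document`, `section`, `para`) chosen purely for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Map heading tags to their nesting depth: h1 -> 1, ..., h6 -> 6.
HEADING_LEVEL = {f"h{i}": i for i in range(1, 7)}


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for heading and paragraph blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVEL or tag == "p":
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.blocks.append((tag, "".join(self._text).strip()))
            self._tag = None


def nest_blocks(blocks):
    """Build a nested XML tree from a flat block sequence.

    Assumption (not from the paper): a heading opens a section that
    contains every following block until a heading of the same or a
    higher rank appears.
    """
    root = ET.Element("document")   # hypothetical element name
    stack = [(0, root)]             # (heading level, open element)
    for tag, text in blocks:
        level = HEADING_LEVEL.get(tag)
        if level is None:
            # A paragraph belongs to the innermost open section.
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        # Close sections that are at the same depth or deeper.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", title=text)
        stack.append((level, section))
    return root


def html_to_xml(html):
    """Convert an HTML fragment to a nested XML string."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return ET.tostring(nest_blocks(collector.blocks), encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<h1>A</h1><p>x</p><h2>B</h2><p>y</p><h1>C</h1>"))
```

In this sketch the `h2` block becomes a child section of the preceding `h1` section, while the second `h1` starts a new top-level section. Capturing intra-block structure (for example, nested lists or table hierarchies) and using formatting cues such as font size or emphasis would need further rules that are not shown here.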

Author information

Authors and Affiliations

School of Computer, Wuhan University, Wuhan, China, 430072
Shijun Li
School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada, K1S 5B6
Mengchi Liu
School of Computer, National University of Singapore, Lower Kent Ridge Road, Singapore, 119260
Tok Wang Ling
State Key Lab of Software Engineering, Wuhan University, Wuhan, China, 430072
Zhiyong Peng

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Shenyang Liaoning, Northeastern University, 110004, China
Guoren Wang
Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng

About this paper

Cite this paper

Li, S., Liu, M., Ling, T.W., Peng, Z. (2004). Automatic HTML to XML Conversion. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_78

DOI: https://doi.org/10.1007/978-3-540-27772-9_78
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive
