An XML Approach to Semantically Extract Data from HTML Tables

  • Jixue Liu
  • Zhuoyun Ao
  • Ho-Hyun Park
  • Yongfeng Chen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3588)

Abstract

Data-intensive information is often published on the internet as HTML tables. Extracting the information that is of interest to users is time-consuming, especially when a large number of web pages must be accessed. To automate information extraction, this paper proposes an XML-based way of semantically analyzing HTML tables for the data of interest. It first introduces a mini language in XML syntax for specifying ontologies that represent the data of interest. It then defines algorithms that parse HTML tables into a specially defined type of XML tree. The XML trees are then compared with the ontologies to semantically analyze and locate the parts of a table or of nested tables that hold the interesting data. Finally, the interesting data, once identified, is output as XML documents.
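The pipeline outlined in the abstract can be illustrated with a rough sketch. The abstract does not give the paper's ontology language, tree type, or matching algorithms, so everything below is an assumption made for illustration: the `<ontology>`/`<concept>` element names, the `keywords` and `pos` attributes, the `TableToTree` and `extract` names, and the simple rule of matching header cells against keywords. It also assumes well-formed HTML in which the first row of a table holds the headers.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical ontology format (not the paper's mini language).
ONTOLOGY = """
<ontology name="car">
  <concept name="make" keywords="make,brand"/>
  <concept name="price" keywords="price,cost"/>
</ontology>
"""

# Sample page: a layout table that nests the table holding the data.
HTML = """
<table><tr><td>Menu</td><td>
  <table>
    <tr><th>Brand</th><th>Colour</th><th>Price</th></tr>
    <tr><td>Toyota</td><td>red</td><td>9000</td></tr>
    <tr><td>Ford</td><td>blue</td><td>7500</td></tr>
  </table>
</td></tr></table>
"""


class TableToTree(HTMLParser):
    """Parse HTML into an XML tree of table/row/cell nodes (nesting allowed)."""

    NAMES = {"table": "table", "tr": "row", "td": "cell", "th": "cell"}

    def __init__(self):
        super().__init__()
        self.root = ET.Element("doc")
        self.stack = [self.root]  # chain of currently open nodes

    def handle_starttag(self, tag, attrs):
        if tag in self.NAMES:
            parent = self.stack[-1]
            node = ET.SubElement(parent, self.NAMES[tag])
            # 1-based position among same-named siblings (assumed attribute)
            node.set("pos", str(len(parent.findall(self.NAMES[tag]))))
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes every table/tr/td/th tag is explicitly closed.
        if tag in self.NAMES and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        top = self.stack[-1]
        if top.tag == "cell" and data.strip():
            top.text = ((top.text or "") + " " + data.strip()).strip()


def extract(html, ontology_xml):
    """Return an XML element holding the rows whose columns match the ontology."""
    parser = TableToTree()
    parser.feed(html)
    ontology = ET.fromstring(ontology_xml)
    keyword_to_concept = {
        kw.strip().lower(): concept.get("name")
        for concept in ontology.findall("concept")
        for kw in concept.get("keywords").split(",")
    }
    result = ET.Element(ontology.get("name") + "s")
    for table in parser.root.iter("table"):  # visits nested tables too
        rows = table.findall("row")
        if not rows:
            continue
        # Map column position -> concept name using the header row.
        columns = {}
        for cell in rows[0].findall("cell"):
            concept = keyword_to_concept.get((cell.text or "").lower())
            if concept:
                columns[cell.get("pos")] = concept
        if not columns:
            continue  # this table holds no data of interest
        for row in rows[1:]:
            record = ET.SubElement(result, ontology.get("name"))
            for cell in row.findall("cell"):
                if cell.get("pos") in columns:
                    ET.SubElement(record, columns[cell.get("pos")]).text = cell.text
    return result


print(ET.tostring(extract(HTML, ONTOLOGY), encoding="unicode"))
```

In this sketch the outer layout table is skipped because none of its header cells match an ontology keyword, while the nested table is selected and its unmatched "Colour" column is dropped from the output.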

Keywords

Cell Node; Mapping Tree; Content Tree; Interesting Data; Position Number
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jixue Liu (1)
  • Zhuoyun Ao (1)
  • Ho-Hyun Park (2)
  • Yongfeng Chen (3)
  1. School of Computer and Information Science, University of South Australia
  2. School of Electrical and Electronics Engineering, Chung-Ang University
  3. Faculty of Management, Xian University of Architecture and Technology
