Combining Multiple Sources of Evidence in Web Information Extraction

  • Martin Labský
  • Vojtěch Svátek
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4994)


Extraction of meaningful content from collections of web pages with unknown structure is a challenging task, which can only be successfully accomplished by exploiting multiple heterogeneous resources. In the Ex information extraction tool, so-called extraction ontologies are used by human designers to specify the domain semantics, to manually provide extraction evidence, as well as to define extraction subtasks to be carried out via trainable classifiers. Elements of an extraction ontology can be endowed with probability estimates, which are used for selection and ranking of attribute and instance candidates to be extracted. At the same time, HTML formatting regularities are locally exploited.


Keywords (added by machine, not by the authors): Domain Ontology; Context Pattern; Instance Candidate; Extraction Ontology; Extract Product Feature





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Martin Labský (1)
  • Vojtěch Svátek (1)

  1. Department of Information and Knowledge Engineering, University of Economics, Praha 3, Czech Republic
