iDocument: Using Ontologies for Extracting and Annotating Information from Unstructured Text

  • Benjamin Adrian
  • Jörn Hees
  • Ludger van Elst
  • Andreas Dengel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5803)


Due to the huge amount of text data on the WWW, annotating unstructured text with semantic markup is a crucial topic in Semantic Web research. This work formally analyzes how domain ontologies are incorporated into information extraction tasks in iDocument. Ontology-based information extraction exploits domain ontologies, which hold formalized and structured domain knowledge, to extract domain-relevant information from unannotated and unstructured text. iDocument provides a pipeline architecture, an extraction template interface, and the ability to exchange domain ontologies when performing information extraction tasks. This work outlines iDocument’s ontology-based architecture, the use of SPARQL queries as extraction templates, and an evaluation of iDocument in an automatic document annotation scenario.
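To make the idea of a query acting as an extraction template more concrete, the following is a minimal sketch in plain Python. It is not iDocument's API: the toy ontology, the `ex:` names, and the functions `recognize_instances`, `match`, and `extract` are all assumptions made for illustration, and a list of triple patterns stands in for the WHERE clause of a real SPARQL query.

```python
# Hypothetical toy domain ontology as (subject, predicate, object) triples.
ONTOLOGY = {
    ("ex:Adrian", "rdf:type", "ex:Person"),
    ("ex:Adrian", "rdfs:label", "Benjamin Adrian"),
    ("ex:Adrian", "ex:worksFor", "ex:DFKI"),
    ("ex:Hees", "rdf:type", "ex:Person"),
    ("ex:Hees", "rdfs:label", "Jörn Hees"),
    ("ex:DFKI", "rdf:type", "ex:Organization"),
    ("ex:DFKI", "rdfs:label", "DFKI"),
}

# Illustrative extraction template: a basic graph pattern in the spirit of
#   SELECT ?person ?org
#   WHERE { ?person rdf:type ex:Person . ?person ex:worksFor ?org }
TEMPLATE = [
    ("?person", "rdf:type", "ex:Person"),
    ("?person", "ex:worksFor", "?org"),
]


def recognize_instances(text, triples):
    """Return ontology instances whose rdfs:label occurs literally in the text."""
    return {s for (s, p, o) in triples if p == "rdfs:label" and o in text}


def match(template, triples, bindings=None):
    """Yield variable bindings that satisfy every triple pattern in the template."""
    bindings = bindings or {}
    if not template:
        yield dict(bindings)
        return
    head, rest = template[0], template[1:]
    for triple in triples:
        candidate = dict(bindings)
        consistent = True
        for pattern_term, value in zip(head, triple):
            if pattern_term.startswith("?"):
                # A variable binds on first use and must agree afterwards.
                if candidate.setdefault(pattern_term, value) != value:
                    consistent = False
                    break
            elif pattern_term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, triples, candidate)


def extract(text, template, triples):
    """Keep only template matches whose bound instances were all found in the text."""
    found = recognize_instances(text, triples)
    return [b for b in match(template, triples) if all(v in found for v in b.values())]


if __name__ == "__main__":
    print(extract("Benjamin Adrian works at DFKI in Kaiserslautern.", TEMPLATE, ONTOLOGY))
```

In this sketch, exchanging the domain ontology amounts to passing a different set of triples, and changing the extraction task amounts to passing a different template, while the matching code stays the same. Label lookup by substring is a deliberate simplification of instance recognition.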


Keywords: Information Extraction; Domain Ontology; SPARQL Query; Unstructured Text; Symbol Recognition





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Benjamin Adrian (1)
  • Jörn Hees (2)
  • Ludger van Elst (1)
  • Andreas Dengel (1, 2)
  1. Knowledge Management Department, DFKI, Kaiserslautern, Germany
  2. CS Department, University of Kaiserslautern, Kaiserslautern, Germany
