Information Extraction

  • Claire Nédellec
  • Adeline Nazarenko
  • Robert Bossy
Part of the International Handbooks on Information Systems book series (INFOSYS)


Information Extraction (IE) provides intelligent access to document contents by automatically extracting the information relevant to a given task. This chapter focuses on how ontologies can be exploited to interpret textual document content for IE purposes. It surveys the state of the art in IE systems, viewing IE as a knowledge-based NLP process. It reviews the NLP steps necessary for IE tasks: named entity recognition, term analysis, semantic typing, and the identification of specific relations. It stresses the importance of ontological knowledge for performing each step and presents corpus-based methods for acquiring the required knowledge.

This chapter shows that IE is an ontology-based activity and argues that future effort in IE should focus on formalizing and reinforcing the relation between text extraction and the ontology model. The discussion gives the authors’ insights on the integration of ontological knowledge into IE systems from both a formal and a pragmatic point of view.

Examples in this chapter are taken from IE tasks for biology since this domain attracts a large community of IE specialists and provides a large number of ontological resources.


Keywords: Information Extraction; Training Corpus; Extraction Rule; Semantic Unit; Lexical Knowledge



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Claire Nédellec (1)
  • Adeline Nazarenko (2)
  • Robert Bossy (1)

  1. INRA, Paris, France
  2. Université Paris-Nord, Paris, France
