Concepticons vs. lexicons: An architecture for multilingual information extraction

  • Robert Gaizauskas
  • Kevin Humphreys
  • Saliha Azzam
  • Yorick Wilks
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1299)


Given an information extraction (IE) system that performs an extraction task against texts in one language, it is natural to consider how to modify the system to perform the same task against texts in a different language. More generally, there may be a requirement to do the extraction task against texts in an arbitrary number of different languages and to present results to a user who has no knowledge of the source language from which the information has been extracted. To minimise the language-specific alterations that need to be made in extending the system to a new language, it is important to separate the task-specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language-dependent lexical knowledge the system requires, which unavoidably must be extended for each new language. In this paper we describe how the architecture of the LaSIE system, an IE system designed to do monolingual extraction from English texts, has been modified to support a clean separation between conceptual and lexical information. This separation allows hard-to-acquire, domain-specific conceptual knowledge to be represented only once, and hence to be reused in extracting information from texts in multiple languages, while standard lexical resources can be used to extend language coverage. Preliminary experiments with extending the system to French are described.
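
The separation described above can be illustrated with a minimal sketch. It is not the LaSIE implementation: the class `Concepticon`, the table `LEXICONS`, the function `concepts_in`, and all concept names and word lists are hypothetical, chosen only to show one shared concept hierarchy being reached through per-language lexicons.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concepticon:
    """Hypothetical language-independent concept hierarchy (child -> parent)."""
    parents: Dict[str, Optional[str]] = field(default_factory=dict)

    def add(self, concept: str, parent: Optional[str] = None) -> None:
        self.parents[concept] = parent

    def is_a(self, concept: Optional[str], ancestor: str) -> bool:
        # Walk up the hierarchy until the ancestor or the root is reached.
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parents.get(concept)
        return False


# Conceptual knowledge is stated once (illustrative concepts only).
concepticon = Concepticon()
concepticon.add("entity")
concepticon.add("organisation", "entity")
concepticon.add("event")
concepticon.add("hire_event", "event")

# Lexical knowledge is stated per language: surface word -> concept.
# Supporting a new language means adding a lexicon, not new concepts.
LEXICONS: Dict[str, Dict[str, str]] = {
    "en": {"company": "organisation", "hires": "hire_event"},
    "fr": {"société": "organisation", "embauche": "hire_event"},
}


def concepts_in(text: str, lang: str) -> List[str]:
    """Map each known word of `text` to its concept via the lexicon for `lang`."""
    lexicon = LEXICONS[lang]
    return [lexicon[w] for w in text.lower().split() if w in lexicon]


english = concepts_in("The company hires staff", "en")
french = concepts_in("La société embauche", "fr")
```

Under these assumptions, the English and French sentences map to the same concept sequence, so any task-specific rule written over concepts (for example, one that tests `concepticon.is_a(c, "event")`) applies unchanged to both languages.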


Machine Translation, Information Extraction, Target Language, Word Sense, English Text





Copyright information

© Springer-Verlag Berlin Heidelberg 1997

Authors and Affiliations

  • Robert Gaizauskas
    • 1
  • Kevin Humphreys
    • 1
  • Saliha Azzam
    • 1
  • Yorick Wilks
    • 1
  1. Department of Computer Science, University of Sheffield, UK
