Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology

  • Fatiha Sadat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7614)


Multilingual linguistic resources are usually constructed from parallel corpora, but because such corpora are available only for selected text domains and language pairs, other resources are being explored as well. This article explores the idea of using multilingual web-based encyclopaedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach that extracts terms and their translations from different types of Wikipedia link information and data. The next step uses linguistics-based information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations of the combined statistics-based and linguistics-based approaches were carried out on different language pairs involving Japanese, French and English. These evaluations showed a real improvement and good quality of the extracted term candidates, which can be used to build or enrich multilingual anthologies and dictionaries, or to feed a cross-language information retrieval system with expansion terms related to the source query.
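
The two steps named in the abstract (candidate extraction from link information, then linguistic filtering and re-ranking) can be illustrated with a minimal Python sketch. It is not the author's implementation: the in-memory `ARTICLES` structure, the evidence weights (2 for an interlanguage link, 1 for a redirect), the toy `POS` lexicon and the allowed tag set are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy data standing in for Wikipedia link information:
# each article records its interlanguage links and its redirect titles.
ARTICLES = {
    "en": {
        "Information retrieval": {
            "langlinks": {"fr": "Recherche d'information"},
            "redirects": ["IR system"],
        },
    },
    "fr": {
        "Recherche d'information": {
            "langlinks": {"en": "Information retrieval"},
            "redirects": ["Recherche documentaire", "Rechercher"],
        },
    },
}

# Hypothetical part-of-speech lexicon for the target language (assumption).
POS = {
    "recherche": "NOUN",
    "d'information": "NOUN",
    "documentaire": "ADJ",
    "rechercher": "VERB",
}


def extract_candidates(articles, src, tgt, term):
    """Collect target-language candidates for `term` with simple evidence counts."""
    scores = Counter()
    page = articles[src].get(term)
    if page is None:
        return scores
    target_title = page["langlinks"].get(tgt)
    if target_title is None:
        return scores
    # Assumed weighting: the interlanguage link is the strongest evidence.
    scores[target_title] += 2
    # Redirects of the linked page are weaker, synonym-like evidence.
    for redirect in articles[tgt].get(target_title, {}).get("redirects", []):
        scores[redirect] += 1
    return scores


def filter_and_rank(scores, pos_lexicon, allowed=("NOUN", "ADJ")):
    """Keep candidates whose tokens all carry an allowed tag, then sort by score."""
    kept = []
    for candidate, score in scores.items():
        tags = [pos_lexicon.get(token.lower()) for token in candidate.split()]
        if all(tag in allowed for tag in tags):
            kept.append((candidate, score))
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    found = extract_candidates(ARTICLES, "en", "fr", "Information retrieval")
    print(filter_and_rank(found, POS))
```

In this sketch the verb-only redirect is filtered out and the remaining candidates are ordered by their evidence score. A real system would replace the toy lexicon with a morphological analyser or tagger for the target language.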


terminology; comparable corpora; translation; Cross-Language Information Retrieval; linguistics-based information





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Fatiha Sadat (1)
  1. Université du Québec à Montréal, Montréal, Canada
