Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology

  • Conference paper
Advances in Natural Language Processing (JapTAL 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7614)

Abstract

Multilingual linguistic resources are usually constructed from parallel corpora, but because such corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article explores the idea of using multilingual web-based encyclopaedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach that extracts terms and their translations from different types of Wikipedia link information and data. The next step uses linguistics-based information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations of the combined statistics-based and linguistics-based approaches were carried out on different language pairs involving Japanese, French and English. These evaluations showed a real improvement and good quality of the extracted term candidates, which can be used to build or enrich multilingual ontologies and dictionaries, or to feed a cross-language information retrieval system with expansion terms related to the source query.
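
As a rough illustration of extracting translation candidates from Wikipedia link information, the following is a minimal sketch, not the method of the paper. It assumes two link types, interlanguage links and redirect pages, which the abstract does not name explicitly. The data, the names `INTERLANGUAGE_LINKS` and `REDIRECTS`, and the function `extract_candidates` are all hypothetical, and the linguistic re-ranking step is not modelled.

```python
from collections import defaultdict

# Toy stand-in for Wikipedia link data; every title and link here is illustrative.
# Keys are (language, article title); values map a target language to a title.
INTERLANGUAGE_LINKS = {
    ("en", "Machine translation"): {"fr": "Traduction automatique", "ja": "機械翻訳"},
    ("en", "Information retrieval"): {"fr": "Recherche d'information"},
}

# Assumed second link type: redirects from alternative titles to a canonical article.
REDIRECTS = {
    ("en", "Automatic translation"): "Machine translation",
    ("en", "MT"): "Machine translation",
}


def extract_candidates(source_lang, target_lang):
    """Return {source term: set of target-language translation candidates}."""
    candidates = defaultdict(set)

    # Step 1: an interlanguage link pairs a source title with a target title.
    for (lang, title), links in INTERLANGUAGE_LINKS.items():
        if lang == source_lang and target_lang in links:
            candidates[title].add(links[target_lang])

    # Step 2: a redirect title inherits the candidates of its canonical article.
    for (lang, alias), canonical in REDIRECTS.items():
        if lang == source_lang and canonical in candidates:
            candidates[alias] |= candidates[canonical]

    return dict(candidates)


if __name__ == "__main__":
    for term, translations in sorted(extract_candidates("en", "fr").items()):
        print(term, "->", sorted(translations))
```

In a fuller pipeline, the candidate sets returned here would then be scored statistically and filtered with linguistic information in the target language, as the abstract describes.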

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sadat, F. (2012). Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_9

  • DOI: https://doi.org/10.1007/978-3-642-33983-7_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33982-0

  • Online ISBN: 978-3-642-33983-7

  • eBook Packages: Computer Science, Computer Science (R0)
