Cross-Language Retrieval with Wikipedia

  • Péter Schönhofen
  • András Benczúr
  • István Bíró
  • Károly Csalogány
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5152)


We demonstrate a twofold use of Wikipedia for cross-lingual information retrieval. As our main contribution, we exploit Wikipedia hyperlinkage for query term disambiguation. We also use bilingual Wikipedia articles for dictionary extension. Our method is based on translation disambiguation; we combine the Wikipedia-based technique with a method based on bigram statistics of pairs formed by translations of different source-language terms.
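To make the idea of translation disambiguation concrete, the following is a minimal sketch, not the authors' implementation. It assumes each source-language query term has a small set of candidate translations, scores every pair of candidates coming from different source terms by a target-language bigram count, and adds a bonus when the corresponding Wikipedia articles link to each other. All data, the names `CANDIDATES`, `BIGRAM_COUNTS`, `WIKI_LINKS`, `pair_score`, `disambiguate`, and the additive `link_bonus` weighting are hypothetical and chosen only for illustration; the abstract does not specify how the two signals are combined.

```python
from itertools import product

# Hypothetical candidate translations per source-language query term.
CANDIDATES = {
    "Bank": ["bank", "bench"],
    "Konto": ["account", "bill"],
}

# Hypothetical target-language co-occurrence counts for unordered term pairs.
BIGRAM_COUNTS = {
    frozenset(["bank", "account"]): 120,
    frozenset(["bank", "bill"]): 15,
    frozenset(["bench", "account"]): 0,
    frozenset(["bench", "bill"]): 1,
}

# Hypothetical Wikipedia link graph: article title -> titles it links to.
WIKI_LINKS = {
    "bank": {"account", "money"},
    "account": {"bank"},
    "bench": {"furniture"},
    "bill": {"invoice"},
}


def linked(a, b):
    """True if either article links to the other in the toy link graph."""
    return b in WIKI_LINKS.get(a, set()) or a in WIKI_LINKS.get(b, set())


def pair_score(a, b, link_bonus=50.0):
    """Bigram count of the pair, plus an assumed bonus for a Wikipedia link."""
    score = BIGRAM_COUNTS.get(frozenset([a, b]), 0)
    if linked(a, b):
        score += link_bonus
    return score


def disambiguate(candidates):
    """Pick one translation per source term, maximizing the summed pair scores."""
    terms = list(candidates)
    best, best_score = None, float("-inf")
    # Exhaustive search over combinations; fine for short queries only.
    for combo in product(*(candidates[t] for t in terms)):
        score = sum(
            pair_score(combo[i], combo[j])
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(terms, best))


if __name__ == "__main__":
    print(disambiguate(CANDIDATES))
```

Only pairs formed by translations of different source terms are scored, which mirrors the abstract's description of the bigram statistics. The exhaustive search is a simplification; its cost grows exponentially with query length.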


Keywords: Machine Translation, Query Term, Source Language, Parallel Corpus, Term Pair
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Péter Schönhofen (1)
  • András Benczúr (1)
  • István Bíró (1)
  • Károly Csalogány (1)
  1. Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences
