Lessons Learnt from Experiments on the Ad Hoc Multilingual Test Collections at CLEF

  • Jacques Savoy
  • Martin Braschler
Chapter
Part of The Information Retrieval Series book series (INRE, volume 41)

Abstract

This chapter describes the lessons learnt from the ad hoc track at CLEF in the years 2000 to 2009. This contribution focuses on Information Retrieval (IR) for languages other than English (monolingual IR), as well as bilingual IR (also termed “cross-lingual”; the request is written in one language and the searched collection in another), and multilingual IR (the information items are written in many different languages). During these years, the ad hoc track mainly used newspaper test collections, covering more than 15 languages. The authors themselves designed, implemented and evaluated IR tools for all these languages during those CLEF campaigns. Based on our own experience and the lessons reported by other participants in these years, we are able to describe the most important challenges when designing an IR system for a new language. When dealing with bilingual IR, our experiments indicate that the critical point is the translation process. However, current online translation systems tend to offer rather effective translation from one language to another, especially when one of these languages is English. To address multilingual IR, different IR architectures are possible. For the simplest approach, based on query translation for individual language pairs, the crucial component is the merging of the intermediate bilingual results. When considering both document and query translation, the complexity of the whole system clearly represents a main issue.

Notes

Acknowledgement

The authors would like to thank the CLEF organizers for their efforts in developing the CLEF test collections.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Computer Science Department, University of Neuchâtel, Neuchâtel, Switzerland
  2. Institut für Angewandte Informationstechnologie, Zürich University of Applied Sciences ZHAW, Winterthur, Switzerland
