How many performance measures to evaluate information retrieval systems?

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Evaluating the effectiveness of information retrieval systems relies on a test collection composed of a set of documents, a set of test queries that are run against it and, for each query, the list of the documents that are relevant to it. This evaluation framework also includes performance measures that make it possible to assess the impact of a change in the search parameters. The program trec_eval computes a large number of measures, some of which, such as mean average precision or recall-precision curves, are more widely used than others. The motivation of our work is to compare all of these measures and to help the user choose a small number of them when evaluating different information retrieval systems. In this paper, we present the study we carried out through a massive data analysis of TREC results. Relationships between the 130 measures computed by trec_eval for individual queries are investigated, and we show that they can be grouped into homogeneous clusters.
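
To make the kind of analysis described in the abstract concrete, the sketch below (not taken from the paper, and using synthetic data in place of actual trec_eval output) shows one way to group measures by their behaviour across queries: build a queries-by-measures score matrix, correlate the measures pairwise, and cut a hierarchical clustering of the resulting distances into a few groups. The array shapes, the measure_names labels and the choice of five clusters are illustrative assumptions, not values or methods taken from the study.

```python
# Minimal sketch: cluster evaluation measures by their per-query behaviour.
# Assumes `scores` is a (n_queries, n_measures) array, e.g. parsed from
# per-query trec_eval output; here it is filled with synthetic data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_queries, n_measures = 50, 130                   # illustrative sizes only
scores = rng.random((n_queries, n_measures))
measure_names = [f"measure_{i}" for i in range(n_measures)]

# Pairwise correlation between measures across queries: highly correlated
# measures carry largely redundant information.
corr = np.corrcoef(scores, rowvar=False)          # (n_measures, n_measures)

# Turn correlation into a distance and cluster hierarchically.
dist = 1.0 - corr
condensed = dist[np.triu_indices(n_measures, k=1)]  # upper triangle for scipy
tree = linkage(condensed, method="average")

# Cut the dendrogram into a small number of groups; a single representative
# measure per group could then stand in for the others.
labels = fcluster(tree, t=5, criterion="maxclust")
for cluster_id in sorted(set(labels)):
    members = [m for m, c in zip(measure_names, labels) if c == cluster_id]
    print(f"cluster {cluster_id}: {members[:5]} ...")
```

Selecting one representative measure per cluster is exactly the practical question the paper addresses: how few measures suffice to evaluate information retrieval systems.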

Author information

Correspondence to Josiane Mothe.

About this article

Cite this article

Baccini, A., Déjean, S., Lafage, L. et al. How many performance measures to evaluate information retrieval systems?. Knowl Inf Syst 30, 693–713 (2012). https://doi.org/10.1007/s10115-011-0391-7
