Abstract
The effectiveness of information retrieval systems is evaluated using a test collection comprising a set of documents, a set of test queries and, for each query, the list of relevant documents. This evaluation framework also includes performance measures that make it possible to assess the impact of a change in search parameters. The program trec_eval computes a large number of measures, some of which, such as mean average precision or recall-precision curves, are more widely used than others. The motivation of our work is to compare all these measures and to help the user choose a small number of them when evaluating different information retrieval systems. In this paper, we present a study based on a massive data analysis of TREC results. We investigate the relationships between the 130 measures computed by trec_eval for individual queries and show that they can be grouped into homogeneous clusters.
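The kind of analysis the abstract describes can be sketched as follows: treat each performance measure as a variable observed over many (system, query) runs, compute pairwise correlations, and cluster the measures hierarchically. This is an illustrative sketch only, not the authors' code; the measure names and the synthetic scores are hypothetical placeholders standing in for real trec_eval output.

```python
# Sketch: grouping IR performance measures into correlation-based clusters.
# All data here is synthetic; in the paper, rows would be per-query trec_eval
# results and columns the 130 measures it reports.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

n_runs = 200
base = rng.random(n_runs)  # shared latent "system quality" signal
measures = {
    "map":        base + 0.05 * rng.random(n_runs),
    "R-prec":     base + 0.05 * rng.random(n_runs),
    "P@10":       base + 0.05 * rng.random(n_runs),
    "recall@100": rng.random(n_runs),  # deliberately unrelated to the others
}
names = list(measures)
X = np.column_stack([measures[m] for m in names])

# Distance between two measures: 1 - Pearson correlation of their scores.
dist = 1.0 - np.corrcoef(X, rowvar=False)
# linkage() expects the condensed upper-triangle distance vector.
condensed = dist[np.triu_indices_from(dist, k=1)]
labels = fcluster(linkage(condensed, method="average"),
                  t=0.5, criterion="distance")
clusters = dict(zip(names, labels))
```

With this construction, the three measures that track the shared signal fall into one cluster while the unrelated one is isolated; the paper's contribution is performing the analogous grouping on real TREC data so that one representative measure per cluster suffices.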
Cite this article
Baccini A, Déjean S, Lafage L et al (2012) How many performance measures to evaluate information retrieval systems? Knowl Inf Syst 30: 693–713. doi:10.1007/s10115-011-0391-7