Evaluating information retrieval system performance based on user preference

Zhou, Bing; Yao, Yiyu

doi:10.1007/s10844-009-0096-5

Published: 27 June 2009

Volume 34, pages 227–248 (2010)

Journal of Intelligent Information Systems

Bing Zhou¹ & Yiyu Yao¹

712 Accesses
37 Citations

Abstract

One of the challenges of modern information retrieval is to rank the most relevant documents at the top of a large system output. This calls for choosing proper methods to evaluate system performance. Traditional performance measures, such as precision and recall, are based on binary relevance judgments and are not appropriate for multi-grade relevance. The main objective of this paper is to propose a framework for system evaluation based on user preference of documents. It is shown that the notion of user preference is general and flexible for formally defining and interpreting multi-grade relevance. We review 12 evaluation methods and compare their similarities and differences. We find that the normalized distance performance measure is a good choice: it is sensitive to document rank order and gives higher credit to systems for their ability to retrieve highly relevant documents.
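
The abstract names the normalized distance performance measure (ndpm) without defining it. As a rough illustration only, the sketch below assumes the commonly cited formulation ndpm = (2C⁻ + Cᵘ) / (2C), where C is the number of document pairs on which the user has a strict preference, C⁻ counts pairs the system orders the opposite way, and Cᵘ counts pairs the system leaves tied. The function name `ndpm`, the dictionary inputs, and the example grades are hypothetical and are not taken from the article.

```python
from itertools import combinations


def ndpm(user_grades, system_scores):
    """Sketch of ndpm under the assumed formula; 0.0 is best, 1.0 is worst.

    user_grades:   dict mapping document id -> user relevance grade
    system_scores: dict mapping the same ids -> system retrieval score
    """
    contradictory = 0  # system orders the pair opposite to the user
    compatible = 0     # user has a preference, system ties the pair
    preferred = 0      # pairs on which the user has a strict preference

    for a, b in combinations(user_grades, 2):
        user_diff = user_grades[a] - user_grades[b]
        if user_diff == 0:
            continue  # user is indifferent, so the pair is not counted
        preferred += 1
        system_diff = system_scores[a] - system_scores[b]
        if system_diff == 0:
            compatible += 1
        elif (user_diff > 0) != (system_diff > 0):
            contradictory += 1

    if preferred == 0:
        raise ValueError("user ranking contains no strict preferences")
    return (2 * contradictory + compatible) / (2 * preferred)


# Hypothetical multi-grade judgments: d2 and d3 are equally relevant.
grades = {"d1": 3, "d2": 2, "d3": 2, "d4": 0}
agreeing = {"d1": 0.9, "d2": 0.7, "d3": 0.6, "d4": 0.1}
reversed_order = {"d1": 0.1, "d2": 0.6, "d3": 0.7, "d4": 0.9}
all_tied = {"d1": 0.5, "d2": 0.5, "d3": 0.5, "d4": 0.5}

print(ndpm(grades, agreeing))
print(ndpm(grades, reversed_order))
print(ndpm(grades, all_tied))
```

Because only pairs with a strict user preference are counted, the measure works directly on multi-grade judgments and needs no binary relevance threshold, which matches the contrast the abstract draws with precision and recall.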

Acknowledgements

The authors are grateful for financial support from NSERC Canada, for constructive comments from Professor Zbigniew W. Ras during the ISMIS 2008 conference in Toronto, and for valuable suggestions from the anonymous reviewers.

Author information

Authors and Affiliations

Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada, S4S 0A2
Bing Zhou & Yiyu Yao

Corresponding author

Correspondence to Bing Zhou.

About this article

Cite this article

Zhou, B., Yao, Y. Evaluating information retrieval system performance based on user preference. J Intell Inf Syst 34, 227–248 (2010). https://doi.org/10.1007/s10844-009-0096-5

Received: 13 September 2008
Revised: 15 June 2009
Accepted: 21 June 2009
Published: 27 June 2009
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10844-009-0096-5
