Increasing evaluation sensitivity to diversity
Abstract
Many queries have multiple interpretations; they are ambiguous or underspecified. This is especially true in the context of Web search. To account for this, much recent research has focused on creating systems that produce diverse ranked lists. In order to validate these systems, several new evaluation measures have been created to quantify diversity. Ideally, diversity evaluation measures would distinguish between systems by the amount of diversity in the ranked lists they produce. Unfortunately, diversity is also a function of the collection over which the system is run and a system’s performance at ad-hoc retrieval. A ranked list built from a collection that does not cover multiple subtopics cannot be diversified; neither can a ranked list that contains no relevant documents. To ensure that we are assessing systems by their diversity, we develop (1) a family of evaluation measures that take into account the diversity of the collection and (2) a meta-evaluation measure that explicitly controls for performance. We demonstrate experimentally that our new measures can achieve substantial improvements in sensitivity to diversity without reducing discriminative power.
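The diversity evaluation measures the abstract refers to reward ranked lists that cover many subtopics of a query while penalizing redundancy; α-nDCG (Clarke et al., SIGIR 2008) is the canonical example. The sketch below is a minimal illustration of that idea, not the measures proposed in this paper: documents are represented simply as sets of the subtopics they cover, the function names are illustrative, and the "ideal" ranking used for normalization is built greedily (computing the exact ideal is NP-hard).

```python
from math import log2

def alpha_dcg(ranking, alpha=0.5, depth=10):
    """Discounted cumulative gain with novelty decay.

    ranking: list of sets, each the subtopics one document covers.
    A subtopic seen r times already contributes (1 - alpha)**r,
    so repeated coverage of the same subtopic earns less and less.
    """
    seen = {}     # subtopic -> number of earlier documents covering it
    score = 0.0
    for k, subtopics in enumerate(ranking[:depth], start=1):
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
        score += gain / log2(k + 1)   # rank-based discount
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
    return score

def alpha_ndcg(ranking, pool, alpha=0.5, depth=10):
    """Normalize by a greedily built 'ideal' ranking over the judged pool.

    The true ideal ordering is NP-hard to compute; the greedy
    approximation used here is the standard practice.
    """
    remaining = list(pool)
    ideal, seen = [], {}
    for _ in range(min(depth, len(remaining))):
        best = max(remaining,
                   key=lambda d: sum((1 - alpha) ** seen.get(s, 0) for s in d))
        remaining.remove(best)
        ideal.append(best)
        for s in best:
            seen[s] = seen.get(s, 0) + 1
    denom = alpha_dcg(ideal, alpha, depth)
    return alpha_dcg(ranking, alpha, depth) / denom if denom else 0.0
```

With α = 0.5, a list that covers subtopics "a" and "b" in its first two positions scores higher than one that repeats "a" before reaching "b", which is exactly the sensitivity to diversity at issue. Note the two confounds the abstract identifies: if the pool covers only one subtopic, or the ranking retrieves no relevant documents at all, no ordering can score well, regardless of how good the diversification algorithm is.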
Keywords
Diversity · Evaluation · Diversity difficulty