Information Retrieval, Volume 16, Issue 4, pp 530–555

Increasing evaluation sensitivity to diversity

  • Peter B. Golbus
  • Javed A. Aslam
  • Charles L. A. Clarke
Special Issue: Search Intents and Diversification

Abstract

Many queries have multiple interpretations; they are ambiguous or underspecified. This is especially true in the context of Web search. To account for this, much recent research has focused on creating systems that produce diverse ranked lists. In order to validate these systems, several new evaluation measures have been created to quantify diversity. Ideally, diversity evaluation measures would distinguish between systems by the amount of diversity in the ranked lists they produce. Unfortunately, diversity is also a function of the collection over which the system is run and a system’s performance at ad-hoc retrieval. A ranked list built from a collection that does not cover multiple subtopics cannot be diversified; neither can a ranked list that contains no relevant documents. To ensure that we are assessing systems by their diversity, we develop (1) a family of evaluation measures that take into account the diversity of the collection and (2) a meta-evaluation measure that explicitly controls for performance. We demonstrate experimentally that our new measures can achieve substantial improvements in sensitivity to diversity without reducing discriminative power.
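To make the notion of a diversity evaluation measure concrete: measures in this family typically reward a ranked list for covering many subtopics of a query while discounting redundant coverage. Below is a minimal sketch of an α-DCG-style gain computation (one well-known diversity measure of this kind; the specific measures proposed in this paper are not defined in the abstract, and the representation of a ranking as per-document subtopic sets is an assumption for illustration):

```python
from math import log2

def alpha_dcg(ranking, alpha=0.5, depth=10):
    """Sketch of an alpha-DCG-style diversity score.

    `ranking` is a list of sets: element k holds the subtopics that the
    document at rank k is judged relevant to. Each additional document
    covering an already-seen subtopic contributes a gain discounted by
    (1 - alpha) per prior occurrence, so redundancy is penalized.
    """
    seen = {}    # subtopic -> number of earlier documents covering it
    score = 0.0
    for k, subtopics in enumerate(ranking[:depth]):
        # Novel subtopics contribute 1; repeats are geometrically discounted.
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
        # Standard log-based rank discount, as in DCG.
        score += gain / log2(k + 2)
    return score
```

Under this scoring, a list covering subtopics {A} then {B} outscores one covering {A} then {A} again, which is exactly the behavior a diversity-sensitive measure should exhibit; the paper's concern is that such scores also depend on how diversifiable the collection is and on plain ad hoc relevance.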

Keywords

Diversity · Evaluation · Diversity difficulty


Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Peter B. Golbus (1)
  • Javed A. Aslam (1)
  • Charles L. A. Clarke (2)

  1. College of Computer and Information Science, Northeastern University, Boston, USA
  2. School of Computer Science, University of Waterloo, Waterloo, Canada