Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems
In the field of information retrieval (IR), researchers and practitioners often need valid approaches for evaluating the performance of retrieval systems. The Cranfield experiment paradigm has been dominant for the in-vitro evaluation of IR systems. As an alternative to this paradigm, laboratory-based user studies have been widely used to evaluate interactive information retrieval (IIR) systems and, at the same time, to investigate users’ information searching behaviours. Major drawbacks of laboratory-based user studies for evaluating IIR systems include the high monetary and temporal costs of setting up and running the experiments, the lack of heterogeneity amongst the user population, and the limited scale of the experiments, which usually involve a relatively restricted set of users. In this article, we propose an alternative experimental methodology to laboratory-based user studies. Our novel methodology uses a crowdsourcing platform as a means of engaging study participants. Through crowdsourcing, it can capture user interactions and searching behaviours at a lower cost, with more data, and within a shorter period than traditional laboratory-based user studies, and can therefore be used to assess the performance of IIR systems. We show how our approach differs from traditional IIR experimental and evaluation procedures. We also present a use case study comparing crowdsourcing-based evaluation with laboratory-based evaluation of IIR systems, which can serve as a tutorial for setting up crowdsourcing-based IIR evaluations.
Keywords: Crowdsourcing evaluation, Interactive IR evaluation, Interactions
The authors are thankful to the anonymous reviewers for their suggestions.