Information Retrieval, Volume 16, Issue 2, pp 267–305

Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems

  • Guido Zuccon
  • Teerapong Leelanupab
  • Stewart Whiting
  • Emine Yilmaz
  • Joemon M. Jose
  • Leif Azzopardi

Abstract

In the field of information retrieval (IR), researchers and practitioners often need valid approaches for evaluating the performance of retrieval systems. The Cranfield experimental paradigm has been dominant for the in-vitro evaluation of IR systems. As an alternative to this paradigm, laboratory-based user studies have been widely used to evaluate interactive information retrieval (IIR) systems and, at the same time, to investigate users’ information searching behaviours. Major drawbacks of laboratory-based user studies for evaluating IIR systems include the high monetary and temporal costs of setting up and running the experiments, the lack of heterogeneity in the user population, and the limited scale of the experiments, which usually involve a relatively small set of users. In this article, we propose an alternative experimental methodology to laboratory-based user studies. Our novel methodology uses a crowdsourcing platform as a means of engaging study participants. Through crowdsourcing, it can capture user interactions and searching behaviours at a lower cost, with more data, and within a shorter period than traditional laboratory-based user studies, and can therefore be used to assess the performance of IIR systems. We show how our approach differs from traditional IIR experimental and evaluation procedures, and we present a use case study comparing crowdsourcing-based evaluation with laboratory-based evaluation of IIR systems, which can serve as a tutorial for setting up crowdsourcing-based IIR evaluations.
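As an illustration of the kind of instrumentation a crowdsourcing-based IIR study relies on, the sketch below shows a minimal client-side logger for recording query submissions and result clicks made by crowd workers during a search task. This is only a hypothetical example, not code from the paper: the logging endpoint, worker/task identifiers, and event fields are all assumptions introduced for illustration.

```typescript
// Minimal sketch of client-side interaction logging for a crowdsourced search task.
// All names (endpoint URL, worker/task IDs, event types, payload fields) are
// illustrative assumptions, not taken from the paper.

interface InteractionEvent {
  workerId: string;      // crowdsourcing platform worker identifier (hypothetical)
  taskId: string;        // HIT / task identifier (hypothetical)
  type: "query" | "click" | "taskComplete";
  payload: Record<string, unknown>;
  timestamp: number;     // milliseconds since epoch
}

class InteractionLogger {
  private buffer: InteractionEvent[] = [];

  constructor(
    private workerId: string,
    private taskId: string,
    private endpoint: string, // hypothetical experiment-server logging endpoint
  ) {}

  // Record one interaction event in the local buffer.
  log(type: InteractionEvent["type"], payload: Record<string, unknown>): void {
    this.buffer.push({
      workerId: this.workerId,
      taskId: this.taskId,
      type,
      payload,
      timestamp: Date.now(),
    });
  }

  // Send buffered events to the experiment server and clear the buffer.
  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const events = this.buffer.splice(0, this.buffer.length);
    await fetch(this.endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(events),
    });
  }
}

// Example usage: record a query submission and a result click, then flush.
const logger = new InteractionLogger("worker-123", "task-42", "/log");
logger.log("query", { query: "interactive retrieval evaluation" });
logger.log("click", { docId: "doc-17", rank: 3 });
void logger.flush();
```

Logs collected this way could then be aggregated server-side to reconstruct query sessions and searching behaviours across many workers, which is the kind of data the methodology is designed to capture at scale.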

Keywords

Crowdsourcing evaluation · Interactive IR evaluation · Interactions

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Guido Zuccon (1)
  • Teerapong Leelanupab (2)
  • Stewart Whiting (3)
  • Emine Yilmaz (4)
  • Joemon M. Jose (3)
  • Leif Azzopardi (3)

  1. Australian e-Health Research Centre, CSIRO, Brisbane, Australia
  2. King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand
  3. School of Computing Science, University of Glasgow, Glasgow, UK
  4. Microsoft Research, Cambridge, UK
