
Continuous Evaluation of Large-Scale Information Access Systems: A Case for Living Labs

Part of the book series: The Information Retrieval Series ((INRE,volume 41))

Abstract

A/B testing is increasingly being adopted for the evaluation of commercial information access systems with a large user base, since it allows the efficiency and effectiveness of such systems to be observed under real conditions. Unfortunately, unless university-based researchers collaborate closely with industry or develop their own infrastructure and user base, they cannot validate their ideas in live settings with real users. Without online testing opportunities open to the research community, academic researchers are unable to employ online evaluation at a larger scale; they receive no feedback on their ideas and cannot advance their research further. Businesses, in turn, miss the opportunity to achieve higher customer satisfaction through improved systems, and users miss the chance to benefit from an improved information access system. In this chapter, we introduce two evaluation initiatives at CLEF, NewsREEL and Living Labs for IR (LL4IR), that aim to address this growing “evaluation gap” between academia and industry. We explain the challenges and discuss our experiences organizing these living labs.




Author information

Correspondence to Frank Hopfgartner.



Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Hopfgartner, F. et al. (2019). Continuous Evaluation of Large-Scale Information Access Systems: A Case for Living Labs. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_21


  • DOI: https://doi.org/10.1007/978-3-030-22948-1_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-22947-4

  • Online ISBN: 978-3-030-22948-1

  • eBook Packages: Computer Science (R0)
