
1 Motivation

Reproducible findings are fundamental for scientific progress and validity. In 2016, a Nature survey [2] revealed that a lack of reproducibility affects nearly all scientific disciplines and can be considered a general concern. Non-reproducible results limit the trustworthiness of publications and hinder progress. Besides investigating various reasons for non-reproducibility, the study showed that scientists largely agree on the importance of the problem that has become known as the reproducibility crisis in recent years. Especially in the field of information retrieval (IR), new findings are established by empirical studies and experiments. Innovations are assumed to be valid if their results are superior to those of previous findings. Despite this intuitive but rather naive assumption, achieving reproducibility in IR is a many-faceted problem. For instance, the meta-evaluation by Armstrong et al. [1] reveals the illusory progress of ad-hoc retrieval performance over an entire decade, caused by comparisons to weak baselines. Ten years later, Yang et al. [16] report similar results as part of their meta-evaluation. The missing upward trend in retrieval performance can be traced back to non-reproducible findings: if the baselines of previous results are not reproducible, or only with considerable effort, the community does not use them adequately.

We see a gap between reproducibility efforts for offline evaluations on the one hand and online retrieval experiments that try to include real-world user interactions on the other. While several initiatives are trying to establish reproducible IR research for offline evaluations on standard test collections, there is little research effort concerning the reproducibility of online experiments. This dissertation project will be concerned with the reproducibility of online experiments in the field of information retrieval.

2 Related Work

Progress in information retrieval revolves around the evaluation of experimental results. This research project will focus specifically on two aspects of evaluation in IR: reproducible experiments and the living lab paradigm. This section gives a brief overview of these two evaluation branches.

As mentioned in the previous section, meta-evaluations of IR systems revealed limited progress over the years [1, 16]. In recent years, the IR community has tried to tackle this problem with several attempts concerned with reproducibility. These can be broadly categorized into attempts on a conceptual level and initiatives in the form of workshops, infrastructures, and frameworks. Conceptually, Ferro and Kelly [10] elaborate an implementation of the ACM Artifact Review and Badging policy for the field of information retrieval. The PRIMAD model [8] offers orientation regarding which components of an IR experiment may affect reproducibility or have to be considered when trying to reproduce the corresponding experiment. The Evaluation-as-a-Service (EaaS) paradigm [13] reverses the conventional evaluation approach of a shared task as applied at the TREC conference: instead of submitting only the results (runs), participants submit the complete retrieval system in a form that allows others to rerun it independently and produce the results. Workshops deal with reproducibility either retroactively or proactively. For example, the CENTRE workshop [9] challenges participants to reconstruct IR systems and their results, whereas the Open-Source IR Replicability Challenge (OSIRRC) [7] motivated participants to package their retrieval systems and corresponding software dependencies in advance to prepare them for appropriate reuse.
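To make the EaaS idea more tangible, the following minimal sketch shows what a hook interface for an encapsulated (e.g., dockerized) retrieval system could look like, so that others can rebuild the index and regenerate runs without knowledge of the system internals. The class and method names are illustrative assumptions and do not reproduce the actual OSIRRC or EaaS specifications.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class EncapsulatedRetrievalSystem(ABC):
    """Hypothetical hook interface for a packaged (e.g., dockerized) IR system.

    The idea: instead of submitting runs, participants submit an artifact that
    implements these hooks, so the experiment can be rerun by others.
    """

    @abstractmethod
    def index(self, corpus_path: str) -> None:
        """Build the index from a local copy of the test collection."""

    @abstractmethod
    def search(self, topics: Dict[str, str], k: int = 1000) -> Dict[str, List[str]]:
        """Return a ranked list of document ids for each topic id."""


class BM25Baseline(EncapsulatedRetrievalSystem):
    """Toy stand-in; a real package would wrap an existing retrieval engine here."""

    def index(self, corpus_path: str) -> None:
        print(f"indexing {corpus_path} ...")

    def search(self, topics, k=1000):
        return {topic_id: [] for topic_id in topics}


if __name__ == "__main__":
    system = BM25Baseline()
    system.index("/data/collection")
    run = system.search({"301": "international organized crime"}, k=10)
    print(run)
```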

Compared to offline ad-hoc retrieval, online search experiments are affected by non-deterministic variables, including user behavior, updated data collections, modifications of web interfaces, or traffic dependencies [11]. Balog et al. introduced the first living lab campaign in 2014 [3]. The infrastructure found application in several workshops and initiatives at the CLEF and TREC conferences from 2015 to 2017 [14]. Despite these elegant solutions for implementing living lab infrastructures, the aspect of reproducibility remained neglected, e.g., there was no specification of how the experiments could be archived for later use [12]. On the other hand, research efforts towards reproducible IR experiments have a strong focus on ad-hoc retrieval experiments and, at the time of writing, do not include any insights beyond offline environments.

3 Preliminary Work and Research Proposal

Preliminary Work. We participated in the CENTRE@CLEF2019 workshop dedicated to the replicability, reproducibility, and generalizability of ad-hoc retrieval experiments [5]. The workshop’s organizers challenged the participants to reconstruct the results of previous submissions to the CLEF, NTCIR, and TREC conferences. CENTRE defines replicability and reproducibility as reconstructing results with the same test collection as the original setup or with a different one, respectively. The results of our experimental setups showed that we can replicate the outcomes fairly well, whereas the reproduced outcomes are significantly lower. Having the reimplementation of an ad-hoc retrieval system at hand, we decided to contribute it to the OSIRRC@SIGIR2019 workshop [7]. The contributions resulted in a library of Docker images, to which we contributed the IRC-CENTRE2019 image [4]. Additionally, we introduced STELLA, a new interpretation of the living lab paradigm, at the OSIRRC workshop [6]. We propose to transfer the idea of encapsulating retrieval systems in Docker containers to the online search scenario. In order to underline the feasibility and benefits of this proposal, we aligned the components of the STELLA framework with the PRIMAD model.

Based on this preliminary work, we investigate the reproducibility of retrieval systems with the main focus on online environments. In the following, we present the research questions of this project.

RQ1 - How is the ACM terminology of repeatability, replicability, and reproducibility applied to online search experiments? While the ACM definitions can be implemented for offline ad-hoc experiments, an analogy for the online case is less obvious. Did previous online search experiments consider reproducibility? If so, is it possible to go a step further and align them with the PRIMAD model?

RQ2 - How can simulations of search sessions, based on user logs, help to identify key components of reproducible retrieval performance? To what extent do the identified components of an online experiment affect its reproducibility? Compared to offline experiments, the user of the search engine is a key component in the online case. User logs comprise implicit feedback and can thus be used to model the user component of an experiment. What influence do the user and other session-related components have on reproducibility?

RQ3 - What requirements must a living lab infrastructure meet in order to guarantee reproducible online search experiments? By identifying key components that affect the reproducibility of online search experiments, we gain insights into the requirements such an infrastructure has to fulfill. Which practical steps have to be considered when implementing a framework for reproducible retrieval experiments in production environments?

4 Methodology and Experiments

Addressing RQ1, we want to conduct a literature survey and evaluate how previous living lab approaches and online experiments paid attention to the topic of reproducibility. Since different terminologies exist, we use the ACM definitions of repeatability, replicability, and reproducibility as a starting point. As a result, we not only want to give an overview of how the existing literature addresses these concepts, but also provide an ontology inspired by the PRIMAD model [8]. While the ACM terminology is defined by the two experimental components of research team and setup, the PRIMAD model conceptualizes the experiment at a more granular level. More specifically, it pays attention to the platform, research goal, implementation, method, actor, and data. This point of view is mainly data-focused and applies well to offline ad-hoc experiments. However, it could be extended such that it also considers the actual user of a retrieval system.
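As a first, hedged sketch of such an extended ontology, the following data model enumerates the six PRIMAD components and adds a hypothetical user component for the online case; the field names are illustrative assumptions rather than a finished ontology.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PrimadComponents:
    """The six PRIMAD components of an (offline) IR experiment."""
    platform: str          # hardware and software environment
    research_goal: str     # the hypothesis or task under study
    implementation: str    # concrete system / source code version
    method: str            # retrieval approach and parameters
    actor: str             # the research team running the experiment
    data: str              # test collection, topics, relevance judgments


@dataclass
class UserComponent:
    """Hypothetical extension for online experiments: the user/session context."""
    session_duration: Optional[float] = None    # seconds
    task_type: Optional[str] = None             # e.g., known-item vs. exploratory
    interaction_logs: List[str] = field(default_factory=list)


@dataclass
class OnlineExperiment:
    """An online search experiment described by PRIMAD plus a user component."""
    primad: PrimadComponents
    user: UserComponent
```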

Regarding RQ2, we primarily focus on vertical search experiments. As a result, we want to provide insights concerning the key components that affect the reproducibility of online search experiments. Beforehand, the reusability of user logs is of particular interest, since reusable test collections are fundamental to offline retrieval experiments. Tan et al. [15] examine the reusability of user judgments that contributed to a relevance pool by performing a leave-one-out analysis. As a starting point, we propose to repeat this study with the user logs of another search engine. Assuming we have retrieved a fair amount of interaction logs that deliver relevance feedback in the form of clicks and other interactions [11], we systematically assess the influence of specific components. For instance, we can simulate sessions with different durations, tasks, or users. By comparing a diverse set of session constellations and the corresponding outcomes, we identify significant influences. Are specific components more important than others, or even crucial for successful reproduction? Furthermore, it is of interest to relate these findings to previous offline reproducibility efforts. Consider two rankers A and B that are compared in a conventional offline ad-hoc experiment. The retrieval effectiveness of A outperforms that of B, which is denoted as \(A \succ B\) and is confirmed to be reproducible. Under which circumstances and to what extent can \(A \succ B\) be reproduced in an online environment? Which components affect this reproducibility?
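The following sketch illustrates the kind of simulation we have in mind: it slices (synthetic stand-in) interaction logs into different session constellations and checks in which constellations a simple click-based signal still confirms \(A \succ B\). The log schema, the click-based measure, and the generated data are illustrative assumptions, not actual results.

```python
import random
from statistics import mean

# Synthetic stand-in for interaction logs; in the actual study these records
# would come from the logs of the participating search engine.
random.seed(42)
LOGS = [
    {
        "user": f"u{random.randint(1, 50)}",
        "task": random.choice(["navigational", "exploratory"]),
        "duration": random.uniform(5, 300),  # session duration in seconds
        "ranker": random.choice(["A", "B"]),
        "clicked_rank": random.choices(range(1, 11), weights=range(10, 0, -1))[0],
    }
    for _ in range(5000)
]


def reciprocal_rank(session):
    """Simple click-based effectiveness signal: 1 / rank of the first click."""
    return 1.0 / session["clicked_rank"]


def a_beats_b(sessions):
    """Check whether ranker A outperforms B on the sampled sessions."""
    scores = {"A": [], "B": []}
    for s in sessions:
        scores[s["ranker"]].append(reciprocal_rank(s))
    if not scores["A"] or not scores["B"]:
        return None
    return mean(scores["A"]) > mean(scores["B"])


# Compare different session constellations, e.g., by task type and session duration.
for task in ["navigational", "exploratory"]:
    for max_duration in [60, 300]:
        subset = [s for s in LOGS if s["task"] == task and s["duration"] <= max_duration]
        print(task, max_duration, "A > B:", a_beats_b(subset))
```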

Having identified major influences and key components, we address RQ3 by deriving requirements that have to be met by an adequate living lab infrastructure. On a functional level, the technical components of the infrastructure have to be specified. Quality requirements play an essential role as well: since experimental systems will be deployed in production environments, a certain degree of quality has to be guaranteed. Subpar retrieval performance and latencies caused by long query processing may affect user behavior and, at worst, damage the reputation of the participating sites. Furthermore, we have to consider general conditions such as the ethical and legal aspects of data logging. On an organizational level, it has to be specified which prerequisites an embedded search engine provider has to fulfill.
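To make the quality requirements more concrete, the sketch below shows one conceivable gate a living lab infrastructure could apply before routing real traffic to an experimental system: a latency budget on the 95th percentile of response times. The budget value and function names are illustrative assumptions, not part of an existing framework.

```python
import time
from statistics import quantiles

# Illustrative latency budget: experimental systems whose 95th-percentile
# response time (in seconds) exceeds this value would not receive live traffic.
P95_BUDGET_SECONDS = 0.5


def measure_latencies(search_fn, queries):
    """Send warm-up queries to a candidate system and record its response times."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def passes_quality_gate(search_fn, queries):
    """Admit the experimental system only if its p95 latency stays within the budget."""
    latencies = measure_latencies(search_fn, queries)
    p95 = quantiles(latencies, n=20)[18]  # 95th percentile cut point
    return p95 <= P95_BUDGET_SECONDS


if __name__ == "__main__":
    dummy_system = lambda q: time.sleep(0.01)  # stand-in for a candidate retrieval system
    print(passes_quality_gate(dummy_system, ["test query"] * 100))
```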