1 Introduction

Experimentation has become the most common research method in Library and Information Science (LIS) [15], and IR in particular has a dominant empirical tradition [30]. It is only in the last 10-15 years, however, that user studies, such as controlled laboratory studies with human participants, have become commonly accepted as part of the programme at the premier IR conference, ACM SIGIR. The acceptance of this kind of empirical contribution resulted from a growing movement within LIS, e.g., [17, 34], but also from the increased recognition that such studies provide value complementary to traditional Cranfield experiments [40]. Laboratory-based user studies make it possible to learn about aspects, such as interaction, that are difficult to study using Cranfield experiments alone [61]. They also provide insight into how behaviours differ across groups (e.g., experienced vs inexperienced users [4]) and contexts (e.g., varying topic familiarity [41]).

Despite now being commonly accepted in IR, user studies are often –sometimes unfairly– criticised for their sample size, regardless of whether they are representative of the population studied or provide sufficient statistical power [12]. There are many reasons for small sample sizes in such studies. Recruitment is challenging, particularly when the population of interest includes highly-paid individuals with little time (e.g., lawyers [44], engineers [20] or healthcare professionals [27]). Moreover, each participant takes considerable time and effort to process; with informed consent and debriefing, this can amount to several hours per participant. A further issue is that the cost of running multiple conditions often forces researchers, for pragmatic reasons, to reduce the number of conditions studied. There is therefore a need to reduce the cost of user studies, not only in IR but also in other fields where multiple user-related aspects are often studied [51, 65]. Even when power analysis is employed to estimate the necessary number of participants and individual experiments before a study takes place, there can still be many instances where more resources are used than are actually necessary.

In this work we explore one means of using recruited participants, their time, and ours more efficiently. If we can achieve this, it may lower the entry barrier to performing user studies in our field, or allow additional conditions to be tested with the same resources. The idea is as follows: rather than distributing the experimental conditions uniformly across participants or participant tasks, as is typically done, we formulate the distribution of conditions as an explore-exploit trade-off. We posit that, during the course of a study, experimenters incrementally gain information about which conditions are performing well, and that this information could and should be used to design adaptive user studies. We treat the selection of conditions in a user study as a Best Arm Identification (BAI) problem and explore methods to intelligently adapt the allocation of conditions while a user study is in progress. If successful, this approach would offer several advantages: reduced cost, because less time is spent on poorly-performing conditions and the study can be run with fewer resources; improved effectiveness, since [2] showed that uniformly allocating examples is a weak approach to correctly identifying the best performing model among a set of candidates; and a better user experience, because the information obtained during the study is used to eliminate poorly-performing conditions and participants are therefore less likely to be presented with inferior conditions.

While there is a history of adaptive trials in the testing of medical treatments (see the review below), user studies in IR and related disciplines do not presently utilise such approaches.

We explore here the value of a family of Multi-armed Bandit (MAB) algorithms –Best Arm Identification (BAI) algorithms– for increasing the efficiency of IR user studies. More specifically, our paper reviews how existing BAI methods can alleviate the cost of certain types of user studies. BAI methods attempt to identify the best arm at a given confidence level while consuming the minimum number of rounds. This framework provides a formal way to approach the problem of identifying the best experimental condition in a user study whilst minimising the number of participants and individual experiments required. Each experimental condition is modelled as an arm, and the BAI algorithms give us a formal and effective way to guide the selection of the best condition. Using freely-available data sets from three previously published IR user studies, as well as a series of simulations, we test the extent to which the costs incurred (i.e. the number of data points that must be collected) can be reduced.

2 Literature review

We review three bodies of related work. First, we report on methods for sampling users and determining appropriate sample sizes. Next, we review the use of adaptive trials, which have been used in medicine and other fields and for which we foresee benefit in IR. Lastly, we summarise MAB usage in A/B testing in our field, which is somewhat similar in concept and from which other types of user studies can draw inspiration.

2.1 Sampling approaches

One critical decision researchers must make when designing laboratory experiments with users is deciding how many participants to study. Most researchers who perform user studies are familiar with reviewer comments criticising sample size; reviewers use, sometimes incorrectly, sample size as a means to reject papers [12]. This phenomenon is known as the “sample size fallacy” and, although not often reported in the IR community, it has been described and empirically studied in other fields including HCI and medicine [6, 12, 29].

Acceptable sample size varies from field to field. In HCI, Nielsen controversially claimed that only 5 participants are needed for a qualitative usability study [56] and that 20 are sufficient for more quantitative studies as this “typically offers a reasonably tight confidence interval” [55]. The first claim in particular has been disputed by others in the same field [63, 71]. A recent systematic review of user studies at the CHI 2014 conference found sample sizes ranging from 1 to 916,000, with a mean sample size for in-person laboratory studies of 20 (SD = 12) [12]. In his tutorial for RecSys user studies, Knijnenburg [43] is sceptical about the utility of small samples, which tend to be underpowered and are thus highly likely to miss important differences that exist. He is also critical of studies being overpowered, i.e. those that use far more resources than necessary. In interactive IR, the determination of sample size is often based on heuristics and limited by practical constraints such as time, availability of participants and finances [41]. As a result, many studies are underpowered. Sakai [62] performed post-hoc power analyses on 840 SIGIR full papers and 215 TOIS papers published between 2006 and 2015. The analyses revealed that both highly overpowered and highly underpowered experiments are reported in the IR literature. While power analysis is recognised as a rigorous and defensible method of determining sample size, it is not without issue. One limitation is that it requires a pre-study understanding of effect size, which is often difficult or impossible to accurately estimate [25].

A second critical decision researchers must take relates to how participants are sourced. The difficulties in achieving appropriate sample sizes lead to sampling from participant pools that are not always representative of the target population. The use of convenience samples and the over-representation of undergraduate students have raised concerns about the external validity of experimental results in many fields [12, 22, 38, 47]. For HCI, Caine reports that 19% of the studies examined used college students as the sole participants. In political science, a review for the period 1990-2006 found that about a quarter of the reported experiments were based solely on student samples [57]. A further means by which sample sizes and sampling frames for convenience samples can be increased is to design a user study suitable for remote deployment [31, 42]. Despite the loss of environmental control and the reduced ability to observe participants, the evidence suggests that many behaviours do not change significantly when studies are performed remotely rather than in the lab [42]. The approach can be taken further by using crowd-sourcing platforms, such as Amazon Mechanical Turk [45, 72], which have become increasingly popular in IS and IR studies.

Regardless of where participants are sourced they are a precious resource and their participation should not be taken for granted. We posit that adaptive trials may be a means of maximising the benefit of participant effort irrespective of study type. The following subsection reviews how adaptive trials have been used in the past.

2.2 Adaptive trials

Although not applied in IR or related fields, the concept of adaptive trials has a long tradition in medicine [7, 16, 70, 75], where recruitment of study participants is even more challenging, as participants typically need to meet medical and geographical constraints. Moreover, randomly assigning patients to experimental conditions in clinical trials may have serious consequences. If researchers learn early in an experiment that a particular cancer treatment is more effective than the standard treatment, they may feel ethically obliged to switch control group participants to the experimental condition, as the outcome may be a matter of life and death. Several approaches have been proposed, including techniques that echo the considerations underlying the multi-armed bandit problem. These include play-the-winner strategies [75]; drop-the-losers designs, where treatments are dropped or added based on interim response data [8]; and Bayesian approaches, which choose the condition with the highest posterior probability and can include stopping rules to facilitate early termination of a trial or condition, if appropriate [16, 36]. The guiding idea behind these is the ethical one of not prolonging a trial longer than necessary, as an unduly prolonged trial may result in an excessive number of patients being given the less beneficial treatment. See [14, 64] for detailed, recent reviews.

Aziz and colleagues [5] have worked on MAB designs for dose-finding in clinical trials. Their goal was to find the optimal dosage in early stage clinical trials. They tested multiple variants of Thompson Sampling and found solutions that outperform state-of-the-art dose identification algorithms. In the context of drug discovery, Terayama and colleagues [66] showed that a BAI algorithm was useful for structure-based drug design. The BAI method proposed by these authors can optimally control the number of simulations required to predict binding structures of drug candidate molecules. This team of researchers has also worked on how to effectively employ the BAI framework to select protein-protein complex structures [67].

In IR, assigning study participants to weaker conditions obviously has less grave consequences; here the motivation for not wanting to prolong the experiment relates to cost and efficiency (we wish to run studies with fewer resources or study more conditions with the same resources). Although untried in laboratory user studies, certain MAB techniques have been applied in our field for online controlled experiments, which use live systems. We summarise such work in the next sub-section.

2.3 MAB in IR evaluation

A wide range of bandit-based models have been employed to support tasks in multiple domains and applications [28, 46, 60]. MABs have also been successfully employed in online IR experimentation. Online controlled experiments are now common when evaluating system effectiveness, particularly in industrial research contexts (e.g. [9, 58]). MAB algorithms have been used in this context to learn ranking strategies by minimising the total number of poor rankings displayed over time, a task which can be modelled conveniently as an explore-exploit trade-off [33, 53, 59]. The formulation of the problem and the type of MAB algorithms used vary. For example, Yue and Joachims [74] employed duelling bandits to learn from noisy, relative signals between two candidate rankers. Burtini et al. [11] surveyed MAB approaches useful for online experiment design. Note, however, that the approaches described above in relation to online evaluation differ from those in our context: they are typically k-armed problems aiming to minimise total regret. This notion of regret is important in online studies because real users are interacting with a live system and should not be penalised or potentially lost (e.g., by being shown inferior conditions). IR lab studies are typically different because participants test prototypes with simulated tasks provided by the experimenter and are thus neither penalised for poor outcomes nor really invested in the system’s performance.

In the context of building test collections for batch evaluation of ad hoc search, Losada and colleagues [48, 49] evaluated multiple bandit-based methods and concluded that a Bayesian approach performs best at adjudicating judgments in pooling-based evaluation. Given a query and multiple search systems that contributed to the pool, a bandit-based solution iteratively learned about the quality of the systems and dynamically adapted the judgment process (by selectively choosing the systems from which new relevance judgments are made). This low-cost solution for creating relevance assessments was adopted by the TREC 2017 Common Core track [35]. Although these bandit-based models represent an example of the effective use of MABs for reducing cost in IR, they are intrinsically different from the BAI methods explored here. Losada and colleagues were interested in maximising the cumulative sum of rewards (i.e. the number of relevant documents identified within the evaluation process) and, thus, worked with several k-armed algorithms oriented to this task. BAI algorithms, instead, are oriented to minimise a notion of regret (see Section 3) that depends only on the quality of the final arm, regardless of the rewards obtained along the way.

2.4 Contributions

The literature reviewed above has highlighted difficulties relating to recruitment and sample size in user studies and hinted that MABs may offer utility in such situations. Three forms of controlled study were mentioned (in-person studies, remotely-deployed studies and crowd-sourced studies), all of which differ from the online evaluations summarised – for which MABs have been used in IR – in that a live system is not used. Below we study the benefits MABs might offer for these kinds of study using publicly-available user study data sets and a series of simulations.

More concretely we make the following contributions:

  • We present the first investigation of the potential for BAI algorithms to reduce the cost of IR user studies.

  • We study the utility of common approaches on diverse data sets from the IR literature (spanning topics such as privacy, food search and recommendation), as well as synthetic data sets.

  • We demonstrate that significant savings can be made (in the best case, 72.4% fewer data points were required with no change in the study outcome).

  • We show that one algorithm, Hoeffding, offered consistent savings over both the real and simulated data sets.

  • We present findings on how the scale of the study influences the benefit of the approaches, demonstrating that advantages can be attained from as few as 90 data points per condition.

3 User studies as best arm identification problems

Let us consider the situation where researchers wish to evaluate different experimental conditions and need to identify the best performing one with respect to a single criterion. The researcher designs a user study (this could be in-person, remotely deployed or crowd-sourced) in which the conditions are tested by participants following either a between- or within-groups design. Each participant performs one condition at a time and, in doing so, either implicitly or explicitly provides a score (or their performance is evaluated using a given measure). The goal of the user study is to establish which experimental condition is most likely to offer the highest overall performance.

Many IR user studies fit the above description. We posit that the problem of identifying the best performing experimental condition among a set of competing conditions can be naturally cast as a Best Arm Identification problem in Multi-Armed Bandits. This is a forecasting task that can be solved in the context of MABs with independent arms (pulling one arm does not reveal any information about the other arms). Under this setting, multiple algorithms that implement some form of gap-based exploration have been developed [2]. Essentially, these explore the arms (i.e., the conditions) in order to reduce the uncertainty about the gaps between the rewards of the arms and, once there is sufficient confidence, output a recommended arm. Unlike standard MAB methods, where the goal is to maximise the cumulative rewards obtained, BAI methods are evaluated on the quality of the arm recommended at the end of exploration.

The general structure of a BAI problem is sketched in Algorithm 1; this is often referred to as the pure exploration problem [2]. The prediction is evaluated in terms of regret, which is the difference between the mean reward of the recommended arm and the mean reward of the optimal arm. BAI algorithms are also evaluated in terms of sample complexity, defined as the total number of rounds the algorithm performs before termination, which is clearly something we wish to minimise. Further details about complexity and BAI algorithms can be found in the work of Kaufmann and colleagues [26, 39], who have worked extensively on characterising the complexity of BAI algorithms.

[Algorithm 1: general structure of a Best Arm Identification (pure exploration) problem]
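To make the structure concrete, the following is a minimal Python sketch of the pure-exploration protocol outlined above. The names pure_exploration, allocate and recommend are our own illustrative choices rather than notation taken from Algorithm 1, and the uniform baseline at the end is the allocation strategy that the BAI algorithms described below aim to improve upon.

    import random

    def pure_exploration(conditions, allocate, recommend, n_rounds):
        """Generic pure-exploration (BAI) loop.

        conditions : dict mapping each condition name to a callable returning one reward
        allocate   : strategy-specific rule choosing which condition to sample next
        recommend  : strategy-specific rule returning the recommended condition at the end
        n_rounds   : total sampling budget (number of individual data points collected)
        """
        history = {c: [] for c in conditions}           # rewards observed per condition
        for t in range(1, n_rounds + 1):
            arm = allocate(history, t)                  # exploration phase
            history[arm].append(conditions[arm]())      # run one trial under that condition
        return recommend(history)                       # only this final choice is evaluated

    # Baseline: uniform random allocation plus an empirical-best recommendation.
    uniform = lambda history, t: random.choice(list(history))
    empirical_best = lambda history: max(
        history, key=lambda c: sum(history[c]) / max(len(history[c]), 1))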

In [2], the authors experimented with a number of simulated tasks and demonstrated that uniform arm allocation is substantially inferior to the alternatives. The results showed that the probability of error, defined as the probability of missing the optimal arm, is much smaller when the algorithm incorporates some form of bias towards the most effective arms. These experiments were performed under a wide range of conditions (different numbers of arms, varying difficulty –i.e. differences among the arms evaluated– and different numbers of rounds). These results inspired us to explore the role that BAI algorithms can play in optimising user studies. An intelligent selection of participant conditions may be beneficial, both in terms of cost (fewer rounds required to determine the optimal arm) and effectiveness (given the same budget, non-uniform alternatives have been shown to be more precise). In the following, we explain the main characteristics of several algorithms that can be employed to support this task.

3.1 Racing algorithms

Racing algorithms, initially proposed by Maron and Moore [50], attempt to identify the best arm at a given confidence level while consuming the minimal number of rounds. To meet this aim, they quickly discard poor arms and concentrate effort on differentiating between the most promising ones. In practice, the algorithm is derived from Hoeffding’s inequality [32], which bounds how far the sample mean of a series of independently drawn points can deviate from its expectation.

We model the conditions as arms and employ BAI to quickly concentrate on the best conditions. Given K, the number of conditions, and N, the maximum number of rounds allowed for the decision, a racing algorithm finishes either when the rounds are exhausted or when it can state that, with probability at least 1 − δ, it has found the best condition. More precisely, after any given number of plays (t < N) of a condition a, the following confidence interval is constructed for its mean reward:

$$ \left[ \mu_{a} - R \sqrt{\frac{\log(2 \cdot K \cdot N/\delta)}{2t}},\; \mu_{a} + R \sqrt{\frac{\log(2 \cdot K \cdot N/\delta)}{2t}} \right] $$
(1)

where μ_a is the mean reward obtained from the t plays and R is the range of the rewards obtained. In this way, each condition is associated with its estimated mean and Hoeffding’s formula sets a bound on its possible spread. The main idea of the racing algorithm is to continuously eliminate those conditions whose best possible reward (upper bound) is smaller than the worst possible reward (lower bound) of the best condition. As more rounds are run, the intervals become smaller, and the algorithm proceeds until it is left with a single condition or runs out of plays. At the end of the process, the algorithm returns the condition(s) whose reward rates could not be distinguished with the required confidence. Algorithm 2 sketches our implementation of the Racing Algorithm.

[Algorithm 2: Racing algorithm with Hoeffding bounds]
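A simplified Python sketch of this racing scheme, as we read it from the description above, is given below. It samples every surviving condition once per round, applies the Hoeffding interval of (1) and discards any condition whose upper bound falls below the best lower bound; the default values of delta and R are illustrative assumptions rather than settings taken from the original experiments.

    import math

    def hoeffding_radius(t, K, N, delta, R):
        """Half-width of the Hoeffding confidence interval in (1) after t plays."""
        return R * math.sqrt(math.log(2 * K * N / delta) / (2 * t))

    def hoeffding_race(conditions, N, delta=0.05, R=1.0):
        """Race the conditions until one survives or the per-condition budget N is spent.

        conditions: dict mapping a condition name to a callable returning one reward;
        R is the range of the rewards (1.0 for rewards in [0, 1]).
        """
        K = len(conditions)
        active = set(conditions)
        rewards = {c: [] for c in conditions}
        for t in range(1, N + 1):
            for c in list(active):                      # sample each surviving condition once
                rewards[c].append(conditions[c]())
            means = {c: sum(rewards[c]) / t for c in active}
            radius = hoeffding_radius(t, K, N, delta, R)
            best_lower = max(means[c] - radius for c in active)
            active = {c for c in active if means[c] + radius >= best_lower}
            if len(active) == 1:                        # a single winner remains
                break
        return max(active, key=lambda c: sum(rewards[c]) / len(rewards[c]))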

Alternative bounds to those set by Hoeffding’s inequality were proposed in [3]. The so-called empirical Bernstein bounds incorporate variance information in a principled manner and quickly become much tighter than Hoeffding’s bounds. The resulting confidence interval is:

$$ \left[ \mu_{a} - \left(\overline{\sigma}_{a}\sqrt{\frac{2 \cdot \log(3 \cdot K \cdot N/\delta)}{t}}+\frac{3 \cdot R \cdot \log(3 \cdot K \cdot N/\delta)}{t}\right),\; \mu_{a} + \left(\overline{\sigma}_{a}\sqrt{\frac{2 \cdot \log(3 \cdot K \cdot N/\delta)}{t}}+\frac{3 \cdot R \cdot \log(3 \cdot K \cdot N/\delta)}{t}\right) \right] $$
(2)

where \(\overline{\sigma}_{a}\) is the empirical standard deviation of the observed rewards. This bound leads to an alternative racing algorithm [52], a variant of Algorithm 2 in which the Hoeffding bounds (1) are replaced by the Bernstein bounds (2). This variant will be referred to as Bernstein’s Race.
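Only the half-width computation changes when switching from the Hoeffding race to Bernstein’s Race. A sketch of the Bernstein radius of (2), usable as a drop-in replacement for hoeffding_radius in the sketch above, might look as follows.

    import math
    import statistics

    def bernstein_radius(rewards_c, K, N, delta, R):
        """Half-width of the empirical Bernstein interval in (2) for one condition.

        rewards_c: rewards observed so far for that condition (t = len(rewards_c)).
        """
        t = len(rewards_c)
        sigma = statistics.pstdev(rewards_c)            # empirical standard deviation
        log_term = math.log(3 * K * N / delta)
        return sigma * math.sqrt(2 * log_term / t) + 3 * R * log_term / t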

3.2 Elimination algorithms

Even-Dar and colleagues [23, 24] proposed several Successive Elimination algorithms for the BAI problem, which repeatedly sample the arms and eliminate, in a principled manner, the arm with the lowest empirical reward. The resulting algorithm, illustrated in Algorithm 3, is guaranteed to select the optimal condition with probability at least 1 − δ. The number of steps taken (its sample complexity) is bounded (see [24], Theorem 3).

[Algorithm 3: Successive Elimination]
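The sketch below illustrates one common form of Successive Elimination under the same conventions as the racing sketch above. The confidence radius used here is a typical textbook choice and is not necessarily the exact constant used in Algorithm 3.

    import math

    def successive_elimination(conditions, delta=0.05, max_rounds=1000):
        """Sketch of a Successive Elimination scheme in the spirit of Even-Dar et al. [24]."""
        K = len(conditions)
        active = set(conditions)
        rewards = {c: [] for c in conditions}
        means = {}
        for t in range(1, max_rounds + 1):
            for c in list(active):                      # sample every surviving condition
                rewards[c].append(conditions[c]())
            means = {c: sum(rewards[c]) / t for c in active}
            radius = math.sqrt(math.log(4 * K * t * t / delta) / (2 * t))
            leader_mean = max(means.values())
            # Drop any condition whose mean is significantly below the current leader.
            active = {c for c in active if means[c] >= leader_mean - 2 * radius}
            if len(active) == 1:
                break
        return max(active, key=lambda c: means[c])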

A second algorithm, Median Elimination (ME), has a better dependence on the number of arms and improves the sample complexity bound by a logarithmic factor. To meet this aim, the algorithm discards the worst half of the arms on each round. Algorithm 4 depicts this method. This algorithm outputs an 𝜖-optimal condition, defined as one whose expected reward is within 𝜖 of the optimal reward.

[Algorithm 4: Median Elimination]
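A sketch of Median Elimination along the lines described above follows. The per-round sample sizes and the updates of 𝜖 and δ follow the schedule commonly reported for this algorithm, and for simplicity the sketch halves the arm set by rank rather than by an explicit median comparison; these details may differ from Algorithm 4.

    import math

    def median_elimination(conditions, epsilon=0.1, delta=0.05):
        """Sketch of Median Elimination: discard the worst half of the arms each round.

        Aims to return a condition whose expected reward is within epsilon of the best,
        with probability at least 1 - delta.
        """
        active = set(conditions)
        eps, dlt = epsilon / 4.0, delta / 2.0
        while len(active) > 1:
            n_samples = math.ceil((4.0 / eps ** 2) * math.log(3.0 / dlt))
            means = {}
            for c in active:
                samples = [conditions[c]() for _ in range(n_samples)]
                means[c] = sum(samples) / n_samples
            ranked = sorted(active, key=lambda c: means[c], reverse=True)
            active = set(ranked[: max(1, len(ranked) // 2)])   # keep the better half
            eps, dlt = 0.75 * eps, dlt / 2.0                   # tighten parameters each round
        return active.pop()

Note that the prescribed per-round sample sizes can be large; when replaying recorded study data (Section 4.1), a run simply stops once a condition has exhausted its available data points.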

3.3 LUCB algorithm

Kalyanakrishnan and colleagues [37] designed an algorithm named LUCB that has improved sample complexity. The algorithm is inspired by the Upper Confidence Bound (UCB) algorithm, which has been popularly employed for regret minimisation in standard MAB problems. Elimination algorithms find it difficult to ensure low sample complexity because sometimes they induce erroneous eliminations. LUCB, instead, maintains a separation between the stopping rule and the sampling strategy and never eliminates any competing arm. Such an approach guarantees a low expected sample complexity.

[Algorithm 5: LUCB]

The LUCB algorithm (Algorithm 5) proceeds as follows. First, the process is initialised by sampling each condition once. On each subsequent round, the algorithm identifies the best performing condition and estimates its lower confidence bound, and then finds the competing condition with the highest upper confidence bound (HUCB_Condition). The algorithm stops when the difference between the highest upper bound among the competing conditions and the lower bound of the best performing condition falls below 𝜖. If the algorithm does not stop, it samples the conditions BestCondition and HUCB_Condition and continues to the next round. The rationale is that these two conditions represent the frontier between the best performing condition and the rest, so sampling them is more informative than sampling any other condition.
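The following sketch mirrors this description. The exploration rate computed inside beta is one commonly used choice and may differ in its constants from the rate analysed by Kalyanakrishnan et al. [37].

    import math

    def lucb(conditions, epsilon=0.05, delta=0.05, max_rounds=10000):
        """Sketch of LUCB: sample only the leader and its strongest challenger each round."""
        rewards = {c: [conditions[c]()] for c in conditions}   # initialise with one pull each
        K = len(conditions)
        best = None
        for t in range(1, max_rounds + 1):
            means = {c: sum(r) / len(r) for c, r in rewards.items()}
            beta = lambda c: math.sqrt(
                math.log(5 * K * t ** 4 / (4 * delta)) / (2 * len(rewards[c])))
            best = max(means, key=lambda c: means[c])
            challenger = max((c for c in conditions if c != best),
                             key=lambda c: means[c] + beta(c))
            gap = (means[challenger] + beta(challenger)) - (means[best] - beta(best))
            if gap < epsilon:                                  # stopping rule satisfied
                return best
            rewards[best].append(conditions[best]())           # sample only the two
            rewards[challenger].append(conditions[challenger]())  # frontier conditions
        return best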

4 Data

We conducted experiments with both data obtained from real user studies and data obtained from simulations. In doing so, we are able to evaluate the performance of the algorithms under real-world conditions, as well as under varying levels of performance difference, which we can control exactly in the simulated data.

4.1 User studies

We chose to evaluate the algorithms on sets of real-world data from user studies described in recent publications related to Information Retrieval. The data sets were created by various authors on different IR-related topics, but all are freely available to download online. The studies differ considerably in their aims, conditions tested and methods to assess the quality of the conditions.

In line with our research aims, to be considered for our experiments a data set had to meet the following criteria: i) the data set must be sourced from an information retrieval experiment involving human users, ii) multiple conditions must be evaluated and compared (e.g., multiple search methods, interface designs, or summarisation strategies) and iii) it must be possible to identify a clearly defined dependent variable associated with each condition (e.g., clicks or ratings from human users).

To perform BAI experiments on the selected data sets, we iteratively assigned rewards to the conditions based on the users’ interactions.

  • The first data set, from a recent ACM CHIIR paper by Zimmerman et al. [76], was collected by means of a controlled in-person laboratory study. The experiment studied the search behaviour of users (n = 40) looking for health-related information and how this relates to privacy invasion. Four SERP variants were evaluated and the main aim was to determine the impact of these variants on good decision making and privacy protection. Performance was measured by the average number of privacy trackers encountered during searches. We modelled this user study as a 4-arm problem, where the arms (conditions) were control, nudge_filter, nudge_rank and nudge_stoplight. Every SERP produced by a given condition was assigned a non-binary reward (in [0,1]) based on the number of privacy trackers encountered (the fewer the better). We refer to this data set as Privacy.

  • The second data set, described in a recent ACM SIGIR paper by Elsweiler and colleagues [21], was collected by means of a remotely-deployed study performed as part of research on helping people to make healthier food choices. Two algorithms were tested, top10 and images, which used different features of online recipes to predict which of a given pair of recipes a user would most likely choose. These recipe pairs, of which there were 50, were chosen such that the two recipes were similar in terms of their constituent ingredients but had a large percentage difference in their fat content per 100g. Research shows that, given choices of otherwise similar food, people typically choose the fattier option. Participants (n = 136) were shown pairs of recipes and asked to choose which one they would like to cook and eat. The model receives a reward if the user chooses the recipe in the pair with less fat. We model this user study as a 2-arm problem where rewards are binary and refer to it as Nudge.

  • The other four data sets were collected using crowd-sourcing by Trattner and Jannach as part of research investigating the problem of similar item recommendation, a common feature of many websites which points users to other interesting objects given a currently inspected item [68]. This was investigated in two “quality and taste” domains (recipes and movies). The main task given to participants was to individually assess five similar item recommendations for a given reference item. The study asked two questions, each on a five-point Likert scale, for every recommendation: i) how similar the recommendation is to the reference item, and ii) how likely it is that they would try out the recommendation. The movies study had 12 recommendation strategies and the recipes study had 6. Given the data from these studies, we tested BAI algorithms i) to rapidly estimate the quality of the different strategies in terms of selecting similar items, and ii) to rapidly estimate the quality of the different strategies in terms of selecting items that users are likely to try. Each data point is a recommendation list presented to the user and the associated reward is the aggregation of the Likert responses on similarity or “likely to try”, respectively (the five responses are added and the sum is divided by the maximum possible score). We refer to these data sets by combining the domain (Movies or Recipes) and the question (Sim or Try); for example, the similarity question – question i) – for the movies data is named Movies-Sim.

Table 1 shows statistics for the six user study-derived data sets. Note that for each data set a performance metric is calculated for all conditions. As the studies differ, so does the metric reported. In the case of Nudge, performance is based on a binary reward, i.e. how often the condition led to the participant choosing the healthier of the two recipes. In Privacy, the reward is a normalised value (in [0,1]) in which a higher score reflects fewer trackers being accessed by participants. In all the experiments, the BAI algorithms were run on a random permutation of the available data points, and each BAI algorithm was run until the best condition was chosen or until some condition exhausted its maximum number of points. For example, if we allow a maximum of 100 points per condition then we have to stop once any condition has been tested 100 times (and recommend the condition with the highest performance so far). Observe that the BAI algorithms still make substantial savings in these cases because, unlike a full user study, they tend to quickly discard weak performing conditions, leading to savings in the overall effort.
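To illustrate how rewards of these different types could be derived from the raw observations, the snippet below sketches the three mappings described above. The function names and the tracker normalisation by an assumed maximum count are our own illustrative choices; only the Likert aggregation follows the exact description given in the text.

    def nudge_reward(chose_lower_fat_recipe):
        """Binary reward: 1 if the participant picked the lower-fat recipe in the pair."""
        return 1.0 if chose_lower_fat_recipe else 0.0

    def privacy_reward(n_trackers, max_trackers):
        """Normalised reward in [0, 1]: fewer privacy trackers encountered is better.
        Normalising by an assumed maximum tracker count is our own illustration."""
        return 1.0 - min(n_trackers, max_trackers) / max_trackers

    def likert_reward(responses, scale_max=5):
        """Aggregate the five Likert responses: sum divided by the maximum possible score."""
        return sum(responses) / (len(responses) * scale_max)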

Table 1 Statistics of user study data sets

This process was repeated 20 times (20 random sequences) and the results were averaged. The BAI algorithms are evaluated in terms of the percentage of savings (reduced effort with respect to the full user study) and the probability of error (the proportion of runs in which the BAI algorithm did not recommend the arm that had the highest performance in the full user study).
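As an illustration of this protocol, the sketch below replays a completed study through a BAI algorithm (such as the hoeffding_race sketch from Section 3.1) and reports the two evaluation measures. The data format (a dictionary of per-condition reward lists) and the function names are our own assumptions.

    import random

    def replay_study(data, bai_algorithm, n_repeats=20, seed=0):
        """Replay recorded user-study data through a BAI algorithm, as described above.

        data         : dict mapping each condition to the list of rewards recorded for it
        bai_algorithm: callable taking {condition: pull_function} and returning a condition
        Returns the mean fraction of effort saved and the probability of error over repeats.
        """
        rng = random.Random(seed)
        true_best = max(data, key=lambda c: sum(data[c]) / len(data[c]))
        full_cost = sum(len(v) for v in data.values())
        saved, errors = [], 0
        for _ in range(n_repeats):
            shuffled = {c: rng.sample(v, len(v)) for c, v in data.items()}  # random permutation
            used = {c: 0 for c in data}

            def pull(c):                      # replay the next recorded point for condition c
                value = shuffled[c][used[c]]
                used[c] += 1
                return value

            try:
                chosen = bai_algorithm({c: (lambda c=c: pull(c)) for c in data})
            except IndexError:                # a condition exhausted its recorded points;
                # fall back to the best performer observed so far
                chosen = max(used, key=lambda c: sum(shuffled[c][:used[c]]) / max(used[c], 1))
            saved.append(1.0 - sum(used.values()) / full_cost)
            errors += int(chosen != true_best)
        return sum(saved) / n_repeats, errors / n_repeats

For example, replay_study(data, lambda conds: hoeffding_race(conds, N=100)) would replay the recorded data through the Hoeffding race with a maximum of 100 points per condition.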

4.2 Simulated user study data

To further evaluate the BAI algorithms under different conditions, we performed additional experiments using sets of simulated data. Inspired by [2], we simulated K-arm problems where the conditions are modelled by probability distributions, with rewards obtained by sampling from the distribution associated with each condition. We generated 14 simulated data sets: 7 producing binary rewards (as in Nudge) and 7 producing non-binary rewards (as in the other real user studies). For all simulations, we generated K conditions, each parameterised by param_k. The best condition always had the first index, and we set its parameter (param_1) to 0.5 (the Bernoulli parameter or the mean of the Truncated Normal, respectively). We then generated the performance of the weaker conditions by varying the Bernoulli or Truncated Normal parameter as appropriate.

To test how different approaches function in diverse situations, we tailored the simulated experiments such that each experiment corresponds to a different pattern of performance differences between conditions. As in [2], conditions were either clustered into groups or distributed according to an arithmetic or geometric progression. In doing so, we can represent divergent levels of difficulty for the BAI algorithms (the closer the weaker conditions get to the performance of the best condition, the more difficult the task becomes). The following experiments represent diverse plausible scenarios for IR user studies:

  I. one group of weak conditions: K = 20, param_j = 0.3 for j = 2..20.

  II. two groups of weak conditions: K = 20, param_j = 0.33 for j = 2..6 and param_j = 0.27 for j = 7..20.

  III. geometric progression: K = 4, param_j = 0.5 − (0.47)^j for j = 2..4.

  IV. six conditions divided into three groups: K = 6, param_2 = 0.45, param_3 = param_4 = 0.35, param_5 = param_6 = 0.25.

  V. arithmetic progression: K = 15, param_j = 0.5 − 0.03 ⋅ j for j = 2..15.

  VI. two good conditions and a large group of weak conditions: K = 20, param_2 = 0.48 and param_j = 0.27 for j = 3..20.

  VII. three groups of bad conditions: K = 30, param_j = 0.45 for j = 2..6, param_j = 0.43 for j = 7..20 and param_j = 0.38 for j = 21..30.

These seven experimental designs, combined with the two alternative distributions (binary and non-binary), produce 14 different simulated scenarios. The number of samples produced from each condition was set to 1,000 for all simulated experiments. While 1,000 data points per condition is far from small, the real data sets described above show that this is not an implausible figure. Each BAI algorithm was run on a random permutation of the simulated data until either the best condition was found or some condition was exhausted. Each simulation was repeated 20 times and the results reported are averages over the 20 executions. The probability of error represents the proportion of cases where the BAI algorithm did not select the first condition.
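As an indication of how such data could be generated, the sketch below draws rewards for one of the scenarios (Experiment IV) in both the binary and non-binary variants. The standard deviation and the [0, 1] truncation range of the Truncated Normal are our own assumptions, since the exact values used in the simulations are not restated here.

    import random

    def make_condition(param, binary, sd=0.15, rng=random):
        """Return a callable that draws one simulated reward for a condition."""
        if binary:
            return lambda: 1.0 if rng.random() < param else 0.0   # Bernoulli(param)
        def draw():                                  # Normal(param, sd) truncated to [0, 1]
            while True:                              # simple rejection sampling
                x = rng.gauss(param, sd)
                if 0.0 <= x <= 1.0:
                    return x
        return draw

    # Experiment IV: six conditions in three groups, best condition first.
    params = [0.5, 0.45, 0.35, 0.35, 0.25, 0.25]
    binary_conditions = {f"cond_{i + 1}": make_condition(p, binary=True)
                         for i, p in enumerate(params)}
    nonbinary_conditions = {f"cond_{i + 1}": make_condition(p, binary=False)
                            for i, p in enumerate(params)}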

5 Results

5.1 User study data

The first result to report is that on the Privacy data set, none of the algorithms offered any improvement. In each case, all 320 data points were required and, as such, none of the algorithms stopped early. Large improvements were, however, found for the remaining real-world data sets. The results are summarised in Table 2, which reports the effort (the number of data points –pulls– required), the percentage of savings and the probability of generating an outcome different to that obtained with the full data set. Hoeffding, Bernstein and Successive Elimination are all promising methods for reducing the cost of user studies without resulting in unacceptably high error rates. Bernstein and Successive Elimination are the most conservative and, thus, save less –in some cases reducing the number of necessary trials by a little less than 2%. However, these two methods have the same probability of error as Hoeffding, suggesting that Hoeffding may be the most useful method overall. Observe that Hoeffding can produce up to a 15% reduction in cost whilst making almost no mistakes; only for the Movies-Try data does it sometimes identify a condition that is not optimal. However, note (Table 1) that in Movies-Try the difference between the best condition (svd) and the second best (all-all) is negligible and, thus, arguably, selecting the second best is not a major issue. Indeed, even after running the full user study there is some uncertainty about the “true” winner. Observe also that, in practice, the recommendation of the BAI algorithm can be complemented with proper statistics for the competing conditions (e.g., confidence intervals after running the study), allowing the experimenter to gain further insight into the difference between the chosen condition and its competitors.

Table 2 Results - real user studies data

LUCB1 is the algorithm that, overall, requires the least effort for all data sets except Nudge, offering up to ~79% improvement; however, it also errs an unacceptably large number of times for the Movies-Sim, Movies-Try and Recipes-Try data sets. This means that, although it has the greatest potential for savings, it also has by far the greatest risk of incorrectly identifying the best condition. Median Elimination generally performs well, saving considerable effort (between 25% and 72% savings) whilst maintaining a low error rate. In the Recipes-Try and Movies-Try data sets, however, its error rate is unacceptably high at 0.4 and 0.2 respectively.

5.2 Simulated data

Results for the simulated data sets are presented in Table 3. These generally align with those reported for the real-world data sets. Again, we find that LUCB1 offers large savings but is far too risky to be of use –in some cases erring more than half of the time and returning negligible (i.e. acceptable) error rates for only 3 of the 14 simulations. In contrast to the real-world data, where it tended to be somewhat error-prone, on the simulated data Median Elimination does not make any mistakes and is consistently able to reduce effort by around 35%.

Table 3 Results - simulated user studies

Other findings of note are that Successive Elimination does not make mistakes but offers little benefit. Confirming the positive results from the real-world data sets, neither Bernstein nor Hoeffding produces an outcome different from the full data set in any of the experiments (i.e. neither errs). Both, however, often offer substantial savings, with Hoeffding tending to offer larger savings more often, particularly in the binary case. A general observation is that the methods tend to save more in non-binary situations: in all of the experiments, the variant that produced non-binary rewards led to higher rates of savings. This is likely because non-binary rewards offer greater scope to distinguish among the competing conditions.

To understand what sizes of user study can benefit from BAI, we ran an experiment using Hoeffding while varying the maximum number of samples to be produced from each condition (i.e. the number of points per condition). We experimented with 10 to 1000 points per condition (in steps of 10), ran the simulation and recorded the point at which the BAI algorithm started producing savings with respect to the full user study. Hoeffding was chosen because it offers consistently good performance in both the real and simulated experiments and makes almost no errors; we wanted to establish the study size from which this algorithm starts to offer benefit. The results of these experiments are shown in Table 4. We tested the 14 simulated studies described above and, thus, can see how different effect sizes (modelled by the 14 configurations) behave with respect to the size of the user study (as the number of data points per condition directly determines the size of the full study). No figures are given for Experiment VII in Table 4 because Hoeffding offered no benefit at all in that scenario.

Table 4 Hoeffding races

The results show that the point at which Hoeffding begins to offer benefit varies by experiment, ranging from 90 data points per condition (Experiment V, binary and non-binary) to 380 (Experiment I, non-binary). Binary reward experiments tended to see benefit more quickly. Experiment V, where the weak arms become progressively worse, saw the earliest benefit, whereas Experiment I, which had a single group of weaker arms, saw the benefit arrive last. This makes sense because the worst performers of Experiment V have mean effectiveness scores (e.g., 0.05, 0.10, 0.15) that are substantially lower than the best arm’s performance (0.5); the BAI algorithm can quickly discard these low performers and, as a consequence, savings come earlier. The remaining experiments exhibited benefit after a comparable number of data points.

6 Discussion

We discuss our work in three sub-sections. First, we discuss what our findings mean with respect to the points outlined in the introduction and related work. Next, study limitations are discussed and, finally, we reflect on how the results may be utilised in practice.

6.1 Principal findings

The experimental results show that the kind of algorithms we have tested offer promise for substantially lowering the entry barrier to performing user studies. We have shown empirically that data points can be saved; on the Nudge data set, Median Elimination used 72.4% fewer data points without incurring any error in the results. In several other cases, savings of up to 38% were made whilst achieving the same results as if the full user study had been performed. These are considerable benefits which, depending on the study design, would translate into fewer participants being recruited, more conditions being studied or individual participants being asked to do less work. Such differences could potentially mean less reliance on convenience samples, more user studies being performed or less fatigued participants. Moreover, unlike power analysis, no pre-study estimate of effect size is needed.

The primary take-away from our results is that one of the racing algorithms, Hoeffding, holds the most promise. This algorithm offered consistent savings across both the real and simulated data sets and, as discussed above, only very rarely returned a result inconsistent with that of the full trial. Another important benefit of Hoeffding is that it requires only two input parameters (the significance level and the maximum number of rounds), whereas other algorithms, such as ME or LUCB1, also need 𝜖, whose setting might not be obvious.

We emphasise that if a researcher wishes to perform a user study whose aim is to determine which experimental condition performs best (and has a single metric with which to measure performance), then there is no clear disadvantage to applying our adaptive approach driven by the Hoeffding algorithm. The methods are simple to deploy and, even in cases where no gains are made – as in the Privacy data set – no costs are incurred. If, however, researchers wish to place a greater emphasis on recruitment savings (for example, when participants from the target population are extremely rare or expensive), then Median Elimination may be an option. This algorithm leads to savings that are typically larger than those achieved by Hoeffding. However, the researcher should be aware that in doing so the likelihood of attaining an incorrect result is increased.

The fact that no advantage was observed for the in-person lab study (Privacy) data set most likely results from that data set being too small to benefit. The earliest performance gain in the simulated experiments was observed from 90 data points per condition, which is beyond what the Privacy study collected. However, a between-groups design with 3 tasks per participant and n > 30 would in this case start leading to savings, and anyone who has performed user studies will testify that, after completing trials with 30 participants, savings are welcome. Given the number of data points required before benefits are seen, the results suggest that the approach is most useful for remotely-deployed and crowd-sourced studies. This setting would also be the easiest in which to build the algorithms into the process. It could be argued that the lower costs associated with recruiting and running these types of study make the savings less pertinent. A counter-argument is that, even in the case of crowd-sourcing, where costs are known to be particularly low, there are cases where the potential participant pool is very small, such as studies of users with particular demographics, language skills or impairments [13].

6.2 Limitations

A number of limitations of the presented work are worthy of discussion. An obvious one is that, despite the evidence provided for efficiency savings without error, we cannot offer a theoretical guarantee of a correct outcome under all circumstances. More evidence is required before the approach can become common practice, but BAI methods provide a principled way to estimate which condition is best. We note too that we treat the user study results as a gold standard (i.e. if the same result is achieved then we judge the result to be error free). In practice, both type I and type II errors can occur in the original analyses, which we cannot account for here. This is of course not the case for the synthetic data sets, where no such uncertainty exists because we produced the simulations and therefore know the true best condition.

We have only studied a single type of user study (where the strongest condition with respect to a single metric is being sought). While, as discussed above, such studies are common in our field, other studies may instead seek to investigate the effects of different conditions on multiple dependent variables. We plan to study how to adapt BAI algorithms to such multi-purpose settings. There may also be user studies whose designs do not fit well with the BAI framework but may still benefit from another sort of adaptive device. In such cases, other MAB methods might be considered. For example, we could explore the application of MAB algorithms that handle multi-objective rewards and are oriented to maximise the overall utility (e.g., [69]).

Another point, discussed extensively in the medical literature (e.g., [73]), is the potential cost of losing randomisation (conditions are no longer randomly assigned). This is worthy of consideration as randomisation is a means to reduce, for example, learning effects and effects relating to fatigue, which could not be studied in our experiments. Further work is necessary to analyse these in detail including using approaches such as Bayesian randomisation [73]. Whereas in medicine the ethical benefits and empirical evidence for the efficacy and reliability of the approach have won the debate, it is our position that it is important for the IR community to have the debate, regardless of the outcome.

6.3 Utility in practice

One question readers may have is how they might use these results in their own studies. In practice, applying the best arm identification algorithms would mean switching participants between conditions during experiments. In cases such as the evaluation of search algorithms this is not a problem, as it is not obvious to participants that anything has changed. For example, we could employ a BAI solution to quickly select the best retrieval method among a set of competing alternatives (e.g., multiple cluster-based methods [10, 18, 19]).

In the case of search user interfaces, this may be more problematic, since dramatic interface changes would be obvious to participants and noticing them may inherently alter their behaviour. This is something that researchers must consider when planning their studies.

To enable the changing of conditions, we will make code implementing the algorithms available so that experimenters can introduce them into their own pipelines. Furthermore, an online service could be developed to assist researchers in assigning conditions based on previous results. The setup would be similar in spirit to that used by NIST in the TREC 2017 Common Core track [1]. In order to generate relevance judgements, NIST utilised a MAB method [48, 49] that adaptively selected the documents to be judged by human assessors. The MAB algorithm was implemented on a server that received judgements from the assessors and returned the next suggested judgement. We could imagine setting up a similar service, where the experimenter defines the conditions and associated rewards while the algorithm drives the selection of conditions. In the case of a crowd-sourced study, the MAB algorithm could be built directly into the study code and could, therefore, after initial set-up, be left to run and minimise costs with no additional input or monitoring from the researcher [54].

7 Conclusions

By studying BAI algorithms using freely-available and synthetic data sets, we have presented a strong case for the utility of adaptive IR user studies. Whilst we do not wish to argue that existing approaches should be replaced, it is clear from our findings that, for the class of studies investigated, efficiency savings can be made that could lead to fewer wasted resources, more conditions being tested and less reliance on convenience samples. We hope to test these and other MAB approaches more thoroughly in the future with diverse study designs. More specifically, we want to study recent MAB proposals that lead to generalisable algorithms (e.g., the recent adaptation of the Sequential Halving algorithm that leverages variants of Thompson Sampling [5]) and see how they perform in comparison to the BAI solutions explored here. We encourage researchers who perform user studies to make their data available, where they are permitted to do so.