1 Introduction

In high recall retrieval tasks, the goal is to find all (or almost all) the documents which are relevant to a given information need, from an unlabelled set of documents (often called the pool P). Examples of these tasks are e-discovery, systematic reviews in empirical medicine, and online content moderation.

In these scenarios, the simplest strategy to guarantee a high recall target is to have in-domain human experts review the whole pool of documents, that is, label each document as either relevant or non-relevant. When working with large collections, however, this operation becomes incredibly expensive, both in terms of time and cost: indeed, the number of data items that can be annotated is usually limited by either the availability of the reviewers or the money one is willing to invest in the process (the annotation budget).

Given some time/cost budget, annotating a random sample of the data is a suboptimal approach: the reviewing process is usually aided by one of the many machine learning techniques which go under the name of “Active Learning” (AL, Dasgupta and Hsu 2008; Huang et al. 2014; Lewis and Gale 1994). AL methods adopt a human-in-the-loop model, in which the human expert labels items selected by an automatic classifier. The classifier is then updated, exploiting the additional knowledge coming from new labels, in an iterative process. In the high-recall scenarios mentioned earlier, the human-in-the-loop annotation workflow is usually referred to as Technology-Assisted Review (TAR, Cormack et al. 2010; Grossman and Cormack 2011; Kanoulas et al. 2019).

One of the most challenging issues in TAR applications is the so-called “when-to-stop” problem: that is, we need to choose when to stop the AL process in order to jointly minimize the annotation effort and satisfy the information need, e.g., a target recall value. Recently, the IR literature has proposed many stopping methods (Cormack and Grossman 2016; Li and Kanoulas 2020; Oard et al. 2018; Yang et al. 2021a): the when-to-stop issue is usually tackled by changing the sampling policy of the AL algorithm (see Sect. 2.1), by crafting task-specific heuristics, and/or by estimating the currently achieved recall. In this paper, we focus on the latter approach, and propose a new technique based on the Saerens–Latinne–Decaestecker (SLD) algorithm (Saerens et al. 2002), adapting it to the AL workflow typically leveraged in TAR processes.

The paper is structured as follows: Sect. 2 describes the related work; we then analyze the shortcomings of the SLD algorithm when used “as-is” in AL scenarios (Sect. 3), and propose our own method as a solution to this problem (Sect. 4). Experiments and results are discussed in Sects. 5 and 6. Section 7 concludes.

2 Background and related work

2.1 Active learning: relevance sampling

AL algorithms prioritize the annotation of certain data items over others. Two of the most well-known AL techniques, still used to this day, were presented in 1994 by Lewis and Gale (1994): Active Learning via Relevance Sampling (ALvRS) and Active Learning via Uncertainty Sampling (ALvUS).

Algorithm 2.1 shows the typical structure of an AL process with a stopping rule. Given a data pool of unlabelled documents P, the reviewer annotates a “seed” set of documents \(S\subset P\) that defines the initial training set L. An iterative procedure is then started, in which L is used to train a classifier \(\phi \), which is then exploited by an AL policy pol to select the next set of documents to be presented to the reviewer. In ALvRS, \(\phi \) is used to rank the documents in \((P\setminus L)\) in decreasing order of their posterior probability of relevance \(\Pr (\oplus {\vert }{\textbf{x}})\). The reviewer is asked to annotate the b documents for which \(\Pr (\oplus {\vert }{\textbf{x}})\) is highest (with b the batch size), which, once annotated, are added to the training set L. The classifier is retrained on the new training set and the process repeats until a stopping condition is met, e.g., the annotation budget is exhausted or a target recall is reached. ALvRS is most effective, and has mostly been used, when we are interested in finding all the items relevant to a given information need as quickly as possible.

[Algorithm 2.1: the generic AL process with a stopping rule]

The ALvUS policy is a variation of ALvRS, where documents are ranked in ascending order of \({\vert }\Pr (\oplus {\vert }{\textbf{x}})-0.5{\vert }\), i.e., we top-rank the documents which the classifier is most uncertain about. ALvUS can be useful when we want to build a high-quality training set to later train a machine learning model on it, and has not been employed as often as ALvRS in TAR applications. While many other AL techniques have been proposed over the years (Dasgupta and Hsu 2008; Huang et al. 2014; Konyushkova et al. 2017), in this work we focus especially on ALvRS, and in particular on its variant called Continuous Active Learning (CAL), proposed by Cormack and Grossman (2015), which is specifically tailored to TAR applications.
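As a minimal, self-contained sketch of this loop (with hypothetical helper names: `train_fn` plays the role of training \(\phi \) on the current L, `label_fn` plays the role of the human reviewer), the two policies differ only in the sort key:

```python
import random

def al_loop(pool, label_fn, train_fn, policy="ALvRS",
            batch_size=10, budget=100, seed_size=2):
    """Minimal sketch of an AL process with a budget-based stopping rule."""
    # seed set S, annotated by the reviewer, defines the initial training set L
    labelled = {d: label_fn(d) for d in random.sample(list(pool), seed_size)}
    while len(labelled) < budget:
        clf = train_fn(labelled)                      # phi: returns Pr(+|x)
        unlabelled = [d for d in pool if d not in labelled]
        if policy == "ALvRS":                         # highest Pr(+|x) first
            unlabelled.sort(key=lambda d: -clf(d))
        else:                                         # ALvUS: most uncertain first
            unlabelled.sort(key=lambda d: abs(clf(d) - 0.5))
        for d in unlabelled[:batch_size]:             # reviewer annotates the batch
            labelled[d] = label_fn(d)
    return labelled
```

Here the stopping condition is simply budget exhaustion; the stopping methods of Sect. 2.3 would replace that check.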

Both ALvRS and ALvUS policies suffer from what is called sampling bias (Dasgupta and Hsu 2008; Krishnan et al. 2021), i.e. the fact that, due to the document selection policy, the sample of annotated items L is not representative of P, nor of the unlabelled set U (see Sect. 3 for a more thorough explanation and analysis). In order to investigate how and when a classifier is affected by this bias, Esuli et al. (2022) have introduced a policy called Rand(pol). The Rand(pol) policy is an oracle policy, i.e., it requires knowing the true labels of all documents in the pool. It is thus a synthetic policy designed to investigate the issue of sampling bias of a given AL policy pol.

Rand(pol) observes the prevalences of labels in the L set produced by pol and produces its own \(L^{Rand}\) set which exhibits the same prevalences, but using a random document selection policy. The idea is that substituting the selection policy of pol with random sampling keeps the content of elements in \(L^{Rand}\) unbiased with respect to P, while using the same prevalences produced by pol. The comparison of results produced by pol and Rand(pol) is thus a way to understand whether and how much sampling bias is affecting the decisions of a pol-based classifier: being a controlled random sampling, the Rand(pol) policy should produce a “sampling bias free” dataset.
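Since Rand(pol) is an oracle policy, it can be sketched directly, assuming access to the true labels of the whole pool:

```python
import random

def rand_pol(pool, true_labels, L_pol):
    """Sketch of Rand(pol): a random sample of the pool with the same class
    prevalences as the L set produced by pol (oracle labels assumed)."""
    n_pos = sum(1 for d in L_pol if true_labels[d])
    n_neg = len(L_pol) - n_pos
    positives = [d for d in pool if true_labels[d]]
    negatives = [d for d in pool if not true_labels[d]]
    # same prevalences as L, but with an unbiased (random) selection
    return random.sample(positives, n_pos) + random.sample(negatives, n_neg)
```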

2.2 TAR tasks and workflows

TAR processes usually operate in a “needle in a haystack” scenario, i.e., the number of relevant items is just a tiny fraction of the whole collection of documents. Three of the main real-world TAR applications are: e-discovery, in the legal domain; the production of systematic reviews, in empirical medicine; and online content moderation. In this work, we focus on the first two tasks.

2.2.1 TAR for e-discovery

E-discovery is an important aspect of civil litigation in many (but not only) common law countries: in e-discovery, a large pool of documents P needs to be reviewed in order to find all items “responsive” (i.e., relevant) to the litigation matter. The documents labelled as responsive are “produced” by the defendant party, and disclosed to the other party in the civil litigation. However, the former party holds the right to keep some of these documents “hidden”, putting them in a private log only available to the jury: this is only allowed if the “logged” documents are deemed to contain “privileged” information (e.g., intellectual property). Different misclassification errors (e.g., producing a document which contains privileged information) can bring about different costs for the defendant party, based on the severity of the error committed. It is worth noticing, however, that only a few works really take e-discovery costs into consideration (Oard et al. 2018; Yang et al. 2021b).

In e-discovery, the review usually happens in two stages: (i) documents are first reviewed for responsiveness (i.e., relevance) by a team of junior reviewers; (ii) the documents judged as responsive are then passed on to a second team of senior reviewers (with an hourly rate several times higher than the junior team’s), who mainly re-review the documents for privilege. As can be inferred, annotating documents for privilege is usually a much more costly and delicate operation than annotating for responsiveness (Oard et al. 2018).

2.2.2 TAR for systematic reviews

In empirical medicine, a systematic review discusses (ideally) all the medical literature relevant to a given research question. The production of a systematic review is usually carried out by one or more physicians and can even span years (Michelson and Reuter 2019; Shemilt et al. 2016). A systematic review usually collects a large pool of documents P by issuing a boolean query to a (medical literature) search engine. Then, similarly to e-discovery, systematic reviews are conducted in two stages: (i) a first one, where doctors review document abstracts to determine their relevance, and (ii) a second one, where the documents which passed the first phase are reviewed in their entirety.

The production of systematic reviews has recently attracted the interest of the IR community (Callaghan and Müller-Hansen 2020; O’Mara-Eves et al. 2015; Lease et al. 2016; Wang et al. 2022), which has focused on several aspects of the process, from improving the formulation of the queries issued to search engines, to finding the optimal stopping criterion, thus reducing the annotation costs (and the time spent on a systematic review).

2.2.3 One-phase TAR and two-phase TAR workflows

TAR workflows can usually be divided in “one-phase” and “two-phase” approaches (Yang et al. 2021a), where:

  1. In one-phase TAR workflows, we assume that there is a single review process that is stopped when some condition is met. Relevance sampling is usually the preferred AL technique, since the aim is to annotate the highest number of relevant items in the shortest amount of time possible;

  2. In two-phase TAR workflows, a review team annotates a sample of the data pool, on which a classifier is trained and later used to help a second review team complete the process. The two review teams may, and usually do, have different per-document costs.

Note that the number of phases is not related to how many stages are required by the specific task: both approaches work with multi-stage review (see also Yang et al. (2021b) on when either of the two workflows might be preferred).

This paper focuses on one-phase TAR workflows, although the new method we propose might as well be used in two-phase workflows. This choice is in line with the recent literature, which has mostly focused on one-phase reviews (Cormack and Grossman 2016, 2020; Li and Kanoulas 2020; Yang et al. 2021a) [with the notable exception of Oard et al. (2018)].

2.3 Stopping methods for TAR

Finding a way to stop the AL process as soon as a target recall R is reached is a key aspect in lowering the cost of human review, the other being the use of an AL policy that efficiently selects relevant documents over non-relevant ones. The target recall is usually very high, in many cases equal or close to 100%, turning the task into a “total recall” task; other high target recalls, such as 80% or 90%, are also common in TAR applications (Li and Kanoulas 2020; Yang et al. 2021a, b).

The scope of this paper is limited to stopping algorithms that work inside an AL process without changing its sampling policy. We thus do not consider “interventional stopping rules”, e.g., Li and Kanoulas (2020). As observed in Yang et al. (2021a, §2), we argue that interventional methods are (i) less efficient than AL (i.e., they usually trade off annotation costs for a safer recall estimation) and (ii) less applicable in real-case scenarios, where reviewers are often limited to the AL methods (e.g., relevance sampling) provided by a specific software product. We now give an overview of the most relevant methods, which are then compared to our proposal in the experiments.

2.3.1 The Knee and Budget methods

The Knee method was first proposed by Cormack and Grossman (2016). The method is based on a gain curve for a one-phase TAR workflow, i.e., a plot of how the number of positive documents increases as more documents are reviewed, during the AL process. The method, based on Satopaa et al. (2011), empirically finds “knees” in the plot, ideally stopping the process when the effort of continuing to review documents is not supported by the retrieval of a sufficient number of positive documents.

The Budget method (Cormack and Grossman 2016) is a heuristic variant of the Knee method, in which the process is stopped no earlier than when at least 70% of the document collection has been reviewed. This follows the observation that, if we were to review by random sampling, we would expect to achieve a recall of 0.7 when reviewing 70% of the collection; by using an AL technique, we expect the recall to be much higher. After the 70% threshold has been reached, the Budget method stops the review if a knee test passes (detailed in Cormack and Grossman (2016); Yang et al. (2021a)), and if the number of relevant items found, Rel(L), is sufficiently large, i.e., \({\vert }L{\vert } \ge 10 \cdot {\vert }P{\vert }/Rel(L)\). Neither method allows users to specify a target recall.
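A sketch of the Budget stopping check (hypothetical signature; we read P in the inequality above as the pool size \({\vert }P{\vert }\), and treat the knee test as a boolean input):

```python
def budget_stop(n_reviewed, pool_size, rel_found, knee_test_passed):
    """Sketch of the Budget method's stopping check."""
    reviewed_70pct = n_reviewed >= 0.7 * pool_size          # minimum review effort
    # |L| >= 10 * |P| / Rel(L): enough relevant items found
    enough_relevant = rel_found > 0 and n_reviewed >= 10 * pool_size / rel_found
    return reviewed_70pct and knee_test_passed and enough_relevant
```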

2.3.2 The Callaghan Müller–Hansen (CMH) method

Callaghan and Müller-Hansen (2020) propose a stopping heuristic based on estimating the probability of having reached the target recall and comparing it against a confidence level. The CMH method consists of a first part of the process, which uses an AL method and a confidence level that, once reached, stops the AL process, and a second part, which continues reviewing using random sampling and a higher confidence level. Following an approach similar to that of Yang et al. (2021a) for picking our baselines, we do not use the random sampling part of CMH in our experiments, and only use its heuristic method as a stopping rule; notice, however, that the random sampling and confidence level estimation which follow the heuristic stopping criterion are applicable to any of the other methods we explore in this paper (including our method, presented in Sect. 4).

The CMH heuristic treats batches of previously screened documents as if they were random samples (an assumption somewhat similar to the one we make in Sect. 5.1); for subsets \(A_i = \{d_{N_{seen}-1},...,d_{N_{seen}-i}\}\) of these documents, it computes \(p = \Pr (X \le k)\), where \(X \sim \textrm{Hypergeometric}(N, K_{tar}, n)\): n is the size of the subsample, k is the number of relevant documents it contains, N is the total number of documents, and \(K_{tar} = \lfloor \frac{\rho _{seen}}{R} - \rho _{AL} + 1 \rfloor \) represents the minimum number of relevant documents remaining at the start of sampling. This is done for all sets \(A_i\) with \(i \in \{N_{seen}-1, \dots , 1\}\); \(p_{min}\) is the lowest of these p values, i.e., the one for which the null hypothesis (recall being below target) is least likely. The review of documents proceeds with AL until \(p_{min} < 1 - \alpha \), where \(\alpha \) is a confidence level, set to 95%.
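The core of this test is a hypergeometric tail probability, which can be computed with the standard library alone; the wiring of the \(A_i\) subsets and of \(K_{tar}\) into the full stopping rule is omitted in this sketch:

```python
from math import comb

def hypergeom_cdf(k, N, K, n):
    """Pr(X <= k) for X ~ Hypergeometric(N, K, n): the probability of seeing
    k or fewer relevant docs in a random sample of n docs drawn from N docs,
    of which K are relevant (math.comb returns 0 when the top index is exceeded)."""
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k + 1)) / comb(N, n)

# Illustration: if at least K_tar = 20 relevant docs remained among N = 1000,
# seeing k = 0 of them in a random subsample of n = 100 docs would be unlikely:
p = hypergeom_cdf(0, 1000, 20, 100)
```

A low p is evidence against the null hypothesis that many relevant documents remain, i.e., evidence that the target recall has been reached.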

2.3.3 The QuantCI method

The QuantCI method proposed by Yang et al. (2021a) leverages the classifier predictions (a logistic regression) to estimate the current recall, computes a confidence interval based on variance in the predictions, and stops the reviewing process when the lower bound of the confidence interval reaches the target recall.

More specifically, the estimated recall \({\hat{R}}\) is computed as:

$$\begin{aligned} {\hat{R}} = \frac{{\widehat{Rel}}(L)}{{\widehat{Rel}}(P)} = \frac{\sum _{x \in L} \Pr (\oplus {\vert }x)}{\sum _{x \in P} \Pr (\oplus {\vert }x)} \end{aligned}$$
(1)

This estimate is based on modeling the relevance of a document \(x_i\) as the outcome of a Bernoulli random variable \(D_i \sim Bernoulli(\Pr (\oplus {\vert }x_i))\). The 95% confidence interval (CI) is then computed as:

$$\begin{aligned} \pm 2 \sqrt{\frac{1}{{\widehat{Rel}}(P)^2} Var(D_{L}) + \frac{{\widehat{Rel}}(L)^2}{{\widehat{Rel}}(P)^4} \left( Var(D_L) + Var(D_U)\right) } \end{aligned}$$
(2)

The authors tested their method with and without the confidence interval (i.e., using the recall estimate as is, not lowering it with the confidence interval) resulting in two stopping techniques, called QuantCI and Quant.
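Equations (1) and (2) can be computed directly from the two posterior vectors; a sketch under the Bernoulli model above (assuming independence across documents, so that the individual Bernoulli variances sum):

```python
from math import sqrt

def quant_estimate(post_L, post_U):
    """Recall estimate (Eq. 1) and 95% CI half-width (Eq. 2), from the
    posteriors Pr(+|x) on the labelled (L) and unlabelled (U) documents."""
    rel_L = sum(post_L)                         # \hat{Rel}(L)
    rel_P = rel_L + sum(post_U)                 # \hat{Rel}(P), with P = L ∪ U
    r_hat = rel_L / rel_P                       # Eq. (1)
    var_L = sum(p * (1 - p) for p in post_L)    # Var(D_L): sum of Bernoulli variances
    var_U = sum(p * (1 - p) for p in post_U)    # Var(D_U)
    half = 2 * sqrt(var_L / rel_P**2 + rel_L**2 / rel_P**4 * (var_L + var_U))
    return r_hat, half
```

QuantCI stops when `r_hat - half` reaches the target recall; Quant uses `r_hat` as is.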

2.3.4 The QBCB method

The Quantile Binomial Confidence Bound (QBCB) stopping rule, proposed by Lewis et al. (2021), leverages the theory of quantile estimation to define a stopping rule that is a function of the target recall, a confidence value, and the size r of a human-annotated sample of relevant documents \(P_r\). Given these three values, QBCB returns the minimum number of relevant documents \(j\le r\) from \(P_r\) that have to be found while annotating P in order to have a statistical guarantee of reaching the target recall on P within the given confidence value. The actual equation used by QBCB is:

$$\begin{aligned} \sum _{k=0}^{j-1}\left( \begin{array}{l} r \\ k \end{array}\right) t^k(1-t)^{r-k} \ge 1-\alpha \end{aligned}$$
(3)

where t is the target recall and \(1-\alpha \) is the confidence level. The equation is tested for increasing values of j until it is satisfied. This method thus requires an initial phase of human annotation of documents randomly sampled from P in order to define the set \(P_r\). A larger \(P_r\) raises this initial annotation cost, yet it produces a more accurate estimation of j, with a lower expected annotation cost for the main annotation phase. Experiments in Lewis et al. (2021) on various datasets of different difficulty and prevalence yield different optimal values for r (i.e., values which minimize the average overall annotation cost), in the range of 30 to 60 samples (see Lewis et al. 2021, Fig. 3), with a more accurate targeting of recall for larger r values.
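A sketch of the computation of j (reading t in Eq. 3 as the target recall):

```python
from math import comb

def qbcb_j(r, t, alpha):
    """Smallest j such that Eq. (3) holds: finding j of the r known-relevant
    documents then guarantees target recall t at confidence 1 - alpha."""
    for j in range(1, r + 1):
        # binomial CDF: sum_{k=0}^{j-1} C(r,k) t^k (1-t)^(r-k)
        if sum(comb(r, k) * t**k * (1 - t)**(r - k) for k in range(j)) >= 1 - alpha:
            return j
    return r   # no smaller j gives the guarantee: all r must be found
```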

2.3.5 The IPP method

The Inhomogeneous Poisson Process Power (IPP) method was recently proposed by Sneyd and Stevenson (2021). The authors actually propose several stopping rules based on counting processes, that is, stochastic models of the number of occurrences of an event over time (Sneyd and Stevenson 2021, §3). In order to apply them to TAR, the authors treat the position in a search ranking as “time”, and the occurrences of relevant documents as “events”.

Poisson processes assume that the number of occurrences follows a Poisson distribution: if the rate at which events occur varies, the process is said to be inhomogeneous. Furthermore, a Poisson process has a rate function \(\lambda \), representing the frequency with which events occur over the space (i.e., our ranking). More precisely, the expected number of events in an interval is

$$\begin{aligned} \Lambda (a, b) = \int _a^b \lambda (x)dx \end{aligned}$$
(4)

If we indicate with \(N(a,b)\) a random variable denoting the number of events occurring in the interval \((a, b]\), then:

$$\begin{aligned} P(N(a,b)=r) = \frac{[\Lambda (a,b)]^r}{r!} e^{-\Lambda (a,b)}, \end{aligned}$$
(5)

where N has a Poisson distribution with expected value \(\Lambda (a,b)\). The \(\lambda \) function is unknown, and the goal is to choose a suitable one for the problem at hand.
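As an illustration of Eqs. (4) and (5), a sketch for a power-curve rate \(\lambda (x) = \alpha x^{\beta }\) (an illustrative parameterization on our part; the actual IPP method fits its rate function to the observed data):

```python
from math import exp, factorial

def big_lambda(a, b, alpha, beta):
    """Eq. (4): expected event count on (a, b] for lambda(x) = alpha * x**beta."""
    antideriv = lambda x: alpha * x ** (beta + 1) / (beta + 1)   # beta != -1
    return antideriv(b) - antideriv(a)

def prob_events(a, b, r, alpha, beta):
    """Eq. (5): Pr(N(a, b) = r) under the inhomogeneous Poisson process."""
    lam = big_lambda(a, b, alpha, beta)
    return lam ** r / factorial(r) * exp(-lam)
```

With `beta = 0` the rate is constant and the process reduces to a homogeneous Poisson process.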

The IPP method uses a Poisson process with a power curve as the rate function. Given a counting process such as IPP, the stopping criterion is then applied as in Algorithm 2 [which we report from Sneyd and Stevenson (2021, Algorithm 1)].

[Algorithm 2: the counting-process-based stopping criterion, reported from Sneyd and Stevenson (2021, Algorithm 1)]

2.4 Using the SLD algorithm in AL processes

The Saerens–Latinne–Decaestecker (SLD) algorithm (Saerens et al. 2002) was proposed as a technique to adjust a priori and a posteriori probabilities (hereafter, priors and posteriors) in prior probability shift (PPS) scenarios, i.e., when the prior probability \(\Pr (y)\) diverges between the labelled set L and the unlabelled set U (Moreno-Torres et al. 2012). The algorithm works by iteratively and jointly updating the prior and posterior probabilities (see Algorithm 3).

[Algorithm 3: the SLD algorithm]

AL policies, such as relevance and uncertainty sampling, naturally tend to generate a high PPS: in particular, when \(\Pr _P(\oplus )\) is fairly low to start with (as it usually is in TAR applications), the AL process generates a PPS such that \(\Pr _L(\oplus ) \gg \Pr _U(\oplus )\). Hence, using SLD to improve both our prevalence estimates and our posteriors might seem like a promising idea. However, recent works (Esuli et al. 2022; Molinari et al. 2023) have shown that in this context SLD has a disastrous effect on the posteriors. In the next sections, we analyze this behaviour (Sect. 3) and propose a solution (Sect. 4) that enables using SLD in AL (and hence, in TAR).
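Concretely, the binary-case SLD loop can be sketched as follows (our rendering of the EM procedure of Saerens et al. 2002; variable names are ours):

```python
def sld(posteriors, prior_L, epsilon=1e-6, max_iter=1000):
    """SLD for the binary case: rescale the classifier posteriors Pr(+|x) on U
    by the ratio of the current estimated prior on U to the training prior,
    renormalize, and iterate until the estimated prior converges."""
    prior_U = prior_L                  # \hat{Pr}^{(0)}_U(+) initialized to Pr_L(+)
    adjusted = list(posteriors)
    for _ in range(max_iter):
        adjusted = []
        for p in posteriors:           # E-step: scale by the priors ratios
            pos = (prior_U / prior_L) * p
            neg = ((1 - prior_U) / (1 - prior_L)) * (1 - p)
            adjusted.append(pos / (pos + neg))          # renormalized posterior
        new_prior = sum(adjusted) / len(adjusted)       # M-step: \hat{Pr}^{(s)}_U(+)
        converged = abs(new_prior - prior_U) < epsilon
        prior_U = new_prior
        if converged:
            break
    return adjusted, prior_U
```

Note how, when the mean posterior sits below the training prior, the loop keeps pushing the estimated prior (and the posteriors) further down; this is the mechanism behind the degenerate behaviour analyzed in Sect. 3.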

3 An analysis of SLD shortcomings in active learning scenarios

Two recent studies (Esuli et al. 2022; Molinari et al. 2023) have shown that SLD can have a disastrous behaviour in AL contexts, leading to an extremization of the posterior probability distribution, i.e., to most (if not all) posterior probabilities \(\Pr (\oplus {\vert }x)\) being dragged to either 0 or 1. Experiments from Esuli et al. (2022) show that, when SLD is applied to the Rand(pol) version of an AL policy, be it either ALvRS or ALvUS, this corruption of the posteriors does not happen. The cause of the issue is thus to be found in the document selection policy, and in the fact that both ALvRS and ALvUS produce a sampling bias (Dasgupta and Hsu 2008; Krishnan et al. 2021), which leads to a dataset shift where \(\Pr _L(y) \ne \Pr _U(y)\) and \(\Pr _L(x \vert y) \ne \Pr _U(x \vert y)\).

Sampling bias emerges from AL algorithms due to both the selection policy pol and the initial seed S. S is usually very small (in many TAR applications, it may consist of a single positive instance). Considering the ALvRS policy, the classifier trained on S will be highly confident on documents similar to the few contained in S. As the AL process continues, the training set L will diverge from the underlying data distribution. A classifier trained on such a biased training set can hardly make sense of the whole document pool. Indeed, Dasgupta and Hsu (2008, §2) show that the classifier can be overly confident in attributing the negative label to a cluster of data which actually contains several positive instances.

Fig. 1: LSA visualization of documents in the pool of an ALvRS experiment. Yellow/blue indicates relevant/non-relevant documents; green/red circles indicate relevant/non-relevant documents selected via RS. A strong sampling bias selects documents in a restricted region. See Molinari et al. (2023, Fig. 1) (color figure online)

A visual representation of the sampling bias is shown in Fig. 1, from Molinari et al. (2023), which shows how the document selection is extremely focused on one small region of the positive instances.

When it comes to the classifier's capability to estimate the prevalence of relevant documents in the unlabelled part of the pool U, this means that its estimates will be much lower than those of a classifier trained on a random, representative sample of the same population. We can see this in Table 1, where we compare the prevalence estimates of a calibrated SVM classifier trained on the ALvRS training set (at different sizes) with those of the same classifier trained on a controlled random sample of the pool with the same size and the same positive prevalence (we call the two SVMs trained on these two training sets ALvRS SVM and Rand(RS) SVM, respectively). The estimate is the average of the posterior probabilities for the relevant class \(\Pr (\oplus {\vert }x)\) for all \(x \in U\).

Table 1 Prevalence estimates of an SVM classifier trained on a ALvRS and Rand(RS) training set respectively, compared to true prevalences of the L and U sets

The last two columns of the table report the true prevalence of the positive class in L and U, showing the strong PPS generated by AL techniques; clearly, the shift is stronger if \(\Pr _P(\oplus )\) is already fairly low to start with (which is usually the case in many TAR applications). In this scenario, the classifier usually overestimates \(\Pr _U(\oplus )\), given its bias towards the \(\Pr _L(\oplus )\) prevalence, which is what we see for the Rand(RS) SVM. The values in the table seem to indicate that the ALvRS-based estimates are better than the Rand(RS) ones. We argue that this is actually due to sampling bias: the classifier is very likely underestimating the prevalence of those positive clusters it does not know about; the end result is that its prevalence estimate is much lower than the Rand(RS) one and, incidentally, closer to the true prevalence of U.

In order to better show how sampling bias affects the AL-trained classifier, we give a visual representation of this phenomenon on a synthetic dataset:

Fig. 2: ALvRS and Rand(RS) applied to synthetic data. a The ALvRS training set (marked with X) and the prevalence estimation of an SVM trained on it; the estimation of a Rand(RS)-trained classifier is reported as well, for completeness. The blue and yellow clusters are the “positive” clusters, whereas the pink and purple ones are the “negative” ones. b The Rand(RS) training set (marked with X), used for the Rand classifier in (a)

  1. We generate an artificial dataset consisting of 10,000 data items with four clusters (blue, yellow, purple and pink in Fig. 2). Blue and yellow clusters are the positive clusters (i.e., every item in these clusters has a positive \(\oplus \) label); purple and pink clusters are the negative clusters (i.e., every item in these clusters has a negative \(\ominus \) label). Notice that negative clusters are much more populated than positive ones (i.e., the overall positive prevalence is low);

  2. We start the active learning process with two positive items coming from one of the positive clusters and 10 negative items, randomly sampled from the negative clusters. We then annotate 500 documents with the ALvRS policy and generate an analogous training set with the Rand(RS) policy. The two training sets are shown with “X” markers in Fig. 2a, b, for ALvRS and Rand(RS) respectively;

  3. We show the estimated (and the true) proportion of positive items remaining in each cluster, for a calibrated SVM trained on the ALvRS and Rand(RS) training sets respectively. Furthermore, we show in the title the true \(\textrm{P}_U(\oplus )\) and the estimated prevalences (\(\textrm{P}_U^{\textrm{ALvRS}}(\oplus )\) and \(\textrm{P}_U^{Rand}(\oplus )\)) on the test set.

In Fig. 2a we see how the AL process annotates a specific subregion of positive items. This in turn “misleads” the classifier into outputting a much lower prevalence than the Rand(RS) classifier for the clusters it has never seen during training; notice that, the overall positive prevalence being quite low, the prevalence estimates of the AL classifier seem better than Rand(RS)'s.

3.1 How is sampling bias related to SLD failures in active learning contexts?

The reason why sampling bias is a key element in our analysis lies in the main assumption that SLD makes on the posterior probability distribution of the classifier: SLD implicitly assumes the scenario represented by the Rand(RS) policy, rather than the AL one. More precisely, it assumes that there is no dataset shift on the conditional probability \(\Pr (x\vert y)\) between the labelled set L and U, i.e., that \(\Pr _L(x \vert y) = \Pr _U(x \vert y)\). If this assumption holds, the trained classifier will have a bias on L which will “uniformly” translate (i.e., in our previous example, the bias is consistent for all regions of the graph) to the posterior probabilities \(\Pr _{U}(\oplus \vert x)\) on the unlabelled set. In a PPS scenario, this means that the classifier's prevalence estimates (i.e., the average of the classifier posteriors) will be closer to \(\Pr _L(\oplus )\), and that they can be “adjusted” consistently across the whole distribution. However, when using an active learning policy such as ALvRS or ALvUS, we not only generate prior probability shift, but we also affect the distribution of the conditional probability \(\Pr (x \vert y)\), such that \(\Pr _L(x \vert y) \ne \Pr _U(x \vert y)\).

Let us now focus on one of the key updates in SLD: the priors ratio (Line 13 of Algorithm 3), which is later multiplied by the posteriors. This is defined as:

$$\begin{aligned} \frac{{\hat{\Pr }}_{U}^{(s)}(\oplus )}{\Pr _{L}(\oplus )} \end{aligned}$$
(6)

This linear relation is represented by the red line of Fig. 3. When \({\hat{\Pr }}_{U}(\oplus ) = \Pr _{L}(\oplus )\), the ratio is 1, and the posteriors do not change. However, as \({\hat{\Pr }}_{U}(\oplus )\) drifts further away from \(\Pr _{L}(\oplus )\) (recall that this latter quantity is constant across all SLD iterations), the ratio becomes progressively smaller, resulting in a multiplication of \(\Pr _U(\oplus {\vert }x)\) by a number very close to 0. We argue this is one of the main culprits of the degenerate outputs of SLD.

In other words, let us assume that \(\Pr _L(x\vert y) = \Pr _U(x\vert y)\), and that L is then a representative sample of U. A classifier trained and biased on L will tend to shift any prevalence estimate toward \(\Pr _{L}(\oplus )\). When the classifier estimated \({\hat{\Pr }}_{U}(\oplus )\) is lower (or higher) than \(\Pr _{L}(\oplus )\), SLD deems the true \(\Pr _{U}(\oplus )\) to be even lower (or higher). Indeed, this works very well when the previous assumption holds, e.g., for Rand(pol). As a matter of fact, SLD has been a state-of-the-art technique for prior and posterior probabilities adjustments in PPS for 20 years. However, ALvRS and similar techniques generate a \(\Pr _L(x\vert y) \ne \Pr _U(x\vert y)\) type of shift (as well as PPS), and, as a result, we cannot apply SLD update with confidence: we should rather find a way to apply a milder correction, when possible, or no correction at all when we have no way of estimating how “far” we are from the SLD assumption.

4 Adapting the SLD algorithm to active learning

As discussed in the previous section, in AL scenarios the priors ratio defined in the SLD algorithm should be milder, in order to avoid extreme behaviours. A simple but possibly effective action is to directly add a correction factor \(\tau \) to the priors ratio equation, so that:

  • when \(\tau = 1\) we get the original SLD algorithm;

  • when \(\tau = 0\) the ratio always equals 1, i.e., we do not apply any correction;

  • all other intermediate values adjust the ratio, making it milder with respect to the original SLD ratio.

We thus define a correction to the priors ratio of SLD:

$$\begin{aligned} \delta = -\left[ \tau \cdot \left( -\frac{{\hat{\Pr }}^{(s)}_{U}(y)}{{\hat{\Pr }}^{(0)}_{U}(y)} + 1\right) -1 \right] \end{aligned}$$
(7)

The new ratio, which we call \(\delta \), is then multiplied by the posteriors at every SLD iteration (note that \({\hat{\Pr }}^{(0)}_{U}(y)\) is initialized to \(\Pr _{L}(y)\), so the fraction in Eq. 7 is the priors ratio of Eq. 6). We show how \(\tau \) affects the slope of the ratio in Fig. 3. We call our method “SLD for Active Learning”, or SAL\(_\tau \) for short. The complete algorithm is reported in Algorithm 4.
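Note that Eq. (7) simplifies to \(\delta = \tau \cdot r + (1 - \tau )\), with r the priors ratio, i.e., a linear interpolation between 1 and the original SLD ratio; as a one-line sketch:

```python
def sal_ratio(sld_ratio, tau):
    """Eq. (7): delta = -(tau * (-ratio + 1) - 1) = tau * ratio + (1 - tau),
    i.e. the SLD priors ratio softened toward 1 as tau goes from 1 to 0."""
    return -(tau * (-sld_ratio + 1) - 1)
```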

Fig. 3
figure 3

The \(\tau \)-based correction to the priors ratio of SLD that we propose. The value of this ratio (i.e., the y axis) is multiplied by the posteriors during SLD iterations. Notice that when \(\tau = 1\) we get the SLD original ratio, whereas when \(\tau = 0\) we multiply the posteriors by 1, i.e. we do not change the classifier posteriors

[Algorithm 4: the SAL\(_\tau \) algorithm]

In the next section we detail how we set the value of \(\tau \).

4.1 Estimating SAL\(_\tau \)’s \(\tau \) across active learning iterations

We have seen that SLD works well on the output of a Rand(RS)-based classifier (Esuli et al. 2022). We can build our estimate of \(\tau \) for SAL\(_\tau \) by measuring how similar the distribution of the ALvRS classifier posteriors is to that of a Rand(RS) classifier. The more the ALvRS posteriors diverge from the Rand(RS) ones, the milder the SLD correction should be, up to the point where we do not use SLD at all.

Let us consider the posteriors for the relevant class, i.e., \(\Pr (\oplus {\vert }x)\): we collect a vector \({\mathcal {A}}\) of posteriors \(\Pr (\oplus {\vert }x)\) for all documents in U for the classifier trained on the ALvRS training set, and an analogous one \({\mathcal {R}}\) for the classifier trained on the Rand(RS) training set. We define \(\tau \) as the cosine similarity between these two vectors:

$$\begin{aligned} \tau = \mathrm {cosine\ similarity}({\mathcal {A}},{\mathcal {R}}) = \frac{{\mathcal {A}} \cdot {\mathcal {R}}}{{\vert }{\vert }{\mathcal {A}}{\vert }{\vert } \ {\vert }{\vert }{\mathcal {R}}{\vert }{\vert }} \end{aligned}$$
(8)

Since \(\Pr (\oplus {\vert }x) \ge 0\) by definition of probability, the cosine similarity is naturally bounded between 0 and 1, a required property of our \(\tau \) parameter. In other words, we apply the SLD update when AL posteriors are similar to posteriors for which we know the assumption made in SLD holds (i.e., \(\Pr _L(x\vert y) = \Pr _U(x\vert y)\), see Sect. 3); we apply an accordingly milder correction the further the AL posteriors are from this assumption.
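Equation (8) is a standard cosine similarity; a minimal sketch follows (the posterior values are made-up placeholders — in the paper, \({\mathcal {A}}\) and \({\mathcal {R}}\) would be collected from the ALvRS and Rand(RS) classifiers on U):

```python
import math

def cosine_similarity(a, b):
    """Eq. (8): tau as the cosine similarity of two posterior vectors.
    Since posteriors are non-negative, the result lies in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# identical posterior vectors -> tau = 1, i.e., apply the full SLD correction
assert abs(cosine_similarity([0.9, 0.1, 0.4], [0.9, 0.1, 0.4]) - 1.0) < 1e-12
# orthogonal vectors -> tau = 0, i.e., no correction at all
assert cosine_similarity([1.0, 0.0], [0.0, 1.0]) == 0.0
```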

Equation (8) would be a good solution, were it not an impossible one in practice: the Rand(pol) policy requires knowledge of the labels (e.g., relevance) for the entire data pool.

We thus resort to a heuristic formulation based on the evolution of the classifier during the iterations of the AL process.Footnote 3 Given the batch of documents \(B_i\) reviewed at the i-th iteration, we define \({\mathcal {A}}_{\phi _i}\) and \({\mathcal {A}}_{\phi _{i-1}}\) as:

$$\begin{aligned} {\mathcal {A}}_{\phi _i}&= \{ {\Pr }^{(i)}(\oplus \vert x) = \phi _i(x) \ \vert \ x \in B_i \} \end{aligned}$$
(9)
$$\begin{aligned} {\mathcal {A}}_{\phi _{i-1}}&= \{ {\Pr }^{(i-1)}(\oplus \vert x) = \phi _{i-1}(x) \ \vert \ x \in B_i \} \end{aligned}$$
(10)

i.e., the posteriors on \(B_i\) returned by the classifiers \(\phi _i\) and \(\phi _{i-1}\). The \(\tau \) parameter is then defined as the cosine similarity between \({\mathcal {A}}_{\phi _i}\) and \({\mathcal {A}}_{\phi _{i-1}}\). The assumption we make is that:

  • at early iterations, there will likely be no substantial differences between the two vectors (making the cosine similarity uninformative);

  • however, as the active learning process progresses, the divergence between them can provide an estimate of the sampling bias affecting the classifier.

Let us consider the ALvRS policy, in a scenario similar to the one depicted in Fig. 2a, that is, overall prevalence is low and we start with a small seed set: in the first iterations we are likely going to find many positive items; that is, when we compare \({\phi _i}\) and \({\phi _{i-1}}\) predictions on \(B_i\), they are likely to be similar, since the “clusters” of data available to \({\phi _i}\) are probably the same that were available to \({\phi _{i-1}}\). However, once we have reviewed all the positive items in a cluster, the process is forced to explore items in the neighbouring clusters. This is when the cosine similarity should be effective: by comparing the posterior distribution of \(\phi _i\) to that of \(\phi _{i-1}\) for the same \(B_i\) documents, we should be able to assess the impact of the new documents on the classifier. Indeed, if \(\phi _i\) accessed previously unseen clusters, the posteriors on \(B_i\) might radically change and, in turn, roughly give us an estimate of how much sampling bias was affecting \(\phi _{i-1}\) predictions.

Finally, let us consider one last issue: we said that in the first AL iterations \({\mathcal {A}}_{\phi _i}\) will likely be similar to \({\mathcal {A}}_{\phi _{i-1}}\), and their distance cannot be used as an indication of sampling bias. How can we establish when to use our method and when to fall back to the classifier posteriors? Lacking better solutions (whose study we defer to future work), we introduce a hyperparameter \(\alpha \):

  • At every iteration i, we apply SAL\(_\tau \) and obtain a new set of posterior probabilities \(\Pr ^{\textrm{SAL}_\tau }(\oplus \vert x)\) and a prevalence estimate \(\Pr ^{\textrm{SAL}_\tau }(\oplus )\) (computed as \(\frac{\sum _{x \in X} \Pr ^{\textrm{SAL}_\tau }(\oplus \vert x)}{\vert X\vert }\));

  • We measure the Normalized Absolute Error (NAE) between the estimated prevalence and the true prevalence of batch \(B_i\), where \(\textrm{NAE}(\Pr _{B_i}(y),{\hat{\Pr }}_{B_i}^{\textrm{SAL}_\tau }(y))\) is defined as:

    $$\begin{aligned} \textrm{NAE} = \frac{\sum _{j=1}^{\vert Y\vert }\vert \Pr _{B_i}(y_{j})-{\hat{\Pr }}_{B_i}^{\textrm{SAL}_\tau }(y_{j})\vert }{2\left( 1-\displaystyle \min _{y_{j}\in Y}{\Pr }_{B_i}(y_{j})\right) } \end{aligned}$$
    (11)
  • If \(\textrm{NAE} > \alpha \) we do not use our SAL\(_\tau \) method.

The simple intuition behind this heuristic is that when the NAE between the true prevalence and the prevalence estimation of SAL\(_\tau \) is too high, then the estimates of SAL\(_\tau \) are likely going to be poor on the rest of the pool as well.
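In the binary TAR setting (relevant vs. non-relevant), the NAE gate can be sketched as follows; the function names and the default \(\alpha = 0.3\) are illustrative, and we assume the standard NAE normalization by the true batch prevalence:

```python
def normalized_absolute_error(true_prev, est_prev):
    """NAE (Eq. 11) for the binary case: the two classes are
    (relevant, non-relevant), so each distribution has two values."""
    p = (true_prev, 1.0 - true_prev)   # true prevalence of batch B_i
    q = (est_prev, 1.0 - est_prev)     # SAL_tau prevalence estimate
    num = sum(abs(pj - qj) for pj, qj in zip(p, q))
    return num / (2.0 * (1.0 - min(p)))

def accept_sal_tau(true_prev, est_prev, alpha=0.3):
    """Fall back to the raw classifier posteriors when NAE > alpha."""
    return normalized_absolute_error(true_prev, est_prev) <= alpha

# a close estimate passes the gate; a wildly off one does not
assert accept_sal_tau(0.1, 0.12, alpha=0.3)
assert not accept_sal_tau(0.1, 0.6, alpha=0.3)
```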

To recap, we give below an overview of how SAL\(_\tau \) integrates into the active learning process (see Algorithm 5):

  1. At each iteration i we employ an active learning policy, annotating a batch \(B_i\) of b documents, which are added to the training set L;

  2. We train a classifier \(\phi _{i}\) on L;

  3. At each iteration \(i > 1\), we compute the cosine similarity between the two vectors of scores \({\mathcal {A}}_{\phi _i} = \langle \Pr ^{\phi _i}(\oplus \vert x) \ \forall \ x \in B_i \rangle \) and \({\mathcal {A}}_{\phi _{i-1}} = \langle \Pr ^{\phi _{i-1}}(\oplus \vert x) \ \forall \ x \in B_i \rangle \). The cosine similarity is used as the \(\tau \) parameter in SAL\(_\tau \);

     (a) We obtain a new set of posterior probabilities \(\Pr ^{\textrm{SAL}_\tau }(\oplus \vert x)\) and a new prevalence estimate \(\Pr ^{\textrm{SAL}_\tau }(\oplus )\) on U, using SAL\(_\tau \) on the posteriors coming from \(\phi _{i-1}\);

  4. We compute the NAE between the SAL\(_\tau \) prevalence estimate for \(B_i\) and the true prevalence of \(B_i\). If this is lower than a threshold \(\alpha \), we consider the SAL\(_\tau \)-based probabilities to be the correct ones; otherwise we fall back to the \(\phi _i\)-based probabilities. In a TAR process this means that we can use the SAL\(_\tau \)-based probabilities to estimate the recall and decide when to stop reviewing documents.
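Once accepted, the SAL\(_\tau \) posteriors can drive the stopping decision. One common way to estimate recall in a one-phase TAR process, sketched below, is to compare the positives found so far against the positives estimated to remain in U by summing the posteriors (the helper names are ours, not from the paper):

```python
def estimated_recall(n_found, posteriors_unlabeled):
    """Estimate recall as positives found so far over (found + estimated
    remaining), where the remaining positives in the unlabeled pool U are
    estimated by summing the per-document posteriors Pr(+|x)."""
    est_remaining = sum(posteriors_unlabeled)
    return n_found / (n_found + est_remaining)

def should_stop(n_found, posteriors_unlabeled, target_recall):
    """Stop the review as soon as the estimated recall reaches the target."""
    return estimated_recall(n_found, posteriors_unlabeled) >= target_recall

# 80 positives found, posteriors summing to 20 -> estimated recall 0.8
assert estimated_recall(80, [10.0, 10.0]) == 0.8
assert should_stop(80, [10.0, 10.0], 0.8)
assert not should_stop(80, [10.0, 10.0], 0.9)
```

Better posteriors (here, the SAL\(_\tau \)-corrected ones) thus translate directly into a more reliable stopping point.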


4.2 Mitigating SAL\(_\tau \) recall overestimation: SAL\(_\tau ^m\)

As will become clear in the results section (Sect. 6), SAL\(_\tau \) achieves significant improvements with respect to the compared methods. However, in some cases, and especially for higher recall targets, SAL\(_\tau \) also tends to overestimate the recall, stopping the TAR process too early. Leaving the study of more complex approaches to future work, we propose a simple way to mitigate recall overestimation: raising the target recall given as input to the method by a margin value. We call this variant SAL\(_\tau ^m\) (SAL\(_\tau \) with margin m); it trades off an increase in annotation costs for a safer TAR process with fewer early stops. SAL\(_\tau ^m\) is simply SAL\(_\tau \) with the only difference being the use of a target recall \(R^m\), determined as a function of the target recall R and the margin m:

$$\begin{aligned} R^m= R+(1-R)m \end{aligned}$$
(12)

The margin m ranges from 0 to 1. When \(m=0\), \(\textrm{SAL}_\tau ^m= \textrm{SAL}_\tau \); when \(m=1\), \(R^m=1\). A low value means fully trusting SAL\(_\tau \); a high value means accepting to label more documents to avoid early stops. In order to avoid adding a free parameter to the method, we set \(m=R\), following the intuition that the closer R is to 1, the more crucial it is to actually guarantee the target recall. Equation (12) thus becomes:

$$\begin{aligned} R^m = R+(1-R)R = 2R - R^2 \end{aligned}$$
(13)

We call this configuration SAL\(_\tau ^R\) and comment upon its results in Sect. 6. We defer to future works the exploration of more informed methods to set m, which could possibly enable a more convenient trade-off between annotation costs and proper recall targeting.
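Equations (12) and (13) amount to a one-line adjustment; a minimal sketch (the function name is ours), with the SAL\(_\tau ^R\) configuration \(m = R\) as the default:

```python
def margined_target(R, m=None):
    """Eq. (12): R^m = R + (1 - R) * m.
    With m = R (the SAL_tau^R configuration) this is Eq. (13): 2R - R^2."""
    if m is None:
        m = R
    return R + (1.0 - R) * m

# R = 0.8 under SAL_tau^R: 2*0.8 - 0.8^2 = 0.96
assert abs(margined_target(0.8) - 0.96) < 1e-12
# m = 0 recovers plain SAL_tau; m = 1 demands full recall
assert margined_target(0.9, m=0.0) == 0.9
assert abs(margined_target(0.9, m=1.0) - 1.0) < 1e-12
```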

5 Experiments

5.1 Using SAL\(_\tau \) to stop a TAR process

The SAL\(_\tau \) algorithm can be tested in any AL scenario where we need to improve priors and posteriors. In this paper, we focus on testing SAL\(_\tau \) capabilities for TAR: our goal is to stop the review process as soon as a target recall R is reached, lowering the review cost.

We test the SAL\(_\tau \) algorithm with three configurations:

  1. The SAL\(_\tau \) formulation of Algorithm 5;

  2. SAL\(_\tau ^R\), i.e., with margin, as described in Sect. 4.2;

  3. SAL\(_\tau \)CI, a variant of SAL\(_\tau \) that uses the confidence interval heuristic from the QuantCI technique (Yang et al. 2021a).

We compare against the following well-known methods (see Sect. 2):

  1. The Knee method by Cormack and Grossman (2015);

  2. The Budget method by Cormack and Grossman (2015);

  3. The CHM method by Callaghan and Müller-Hansen (2020);

  4. The QuantCI method by Yang et al. (2021a);

  5. The QBCB method by Lewis et al. (2021);

  6. The IPP method by Sneyd and Stevenson (2021).

For all methods that require a confidence interval, we set it to \(95\%\). The positive sample size r of the QBCB method (see Sect. 2.3.4) is instead set at 50, which in the results reported in the original paper appears to be a good trade-off between a low overall cost and a low variance in such cost across different samples.

5.2 The active learning workflow

We run the same active learning workflow for all tested methods. In most TAR applications (Yang et al. 2021a; Cormack and Grossman 2015), we usually have only a single positive document to start the active learning with, called the initial seed: for our experiments, we decide to seed the active learning process with an additional negative document randomly sampled from the document pool P; that is, our initial seed set S consists of a positive and a negative document, randomly sampled from P.

As mentioned earlier (Sect. 2), the active learning policy we pick is CAL (Cormack and Grossman 2015) (a variation of ALvRS), since this is the most common policy used in TAR tasks. For the batch size b, we follow Yang et al. (2021a) and set it to 100. The classifier we use in all our experiments is a standard logistic regression, as this is also the classifier of choice in most (if not all) TAR applications. We test the target recall values \(\{0.8, 0.9, 0.95\}\).

5.3 Datasets

We run our experimentsFootnote 4 on two well-known datasets: the RCV1-v2 and the CLEF Technology-Assisted Reviews in Empirical Medicine (EMED) datasets (specifically, the dataset made available for the CLEF 2019 edition). Both datasets have already been used to test TAR frameworks and algorithms, e.g., the MINECORE framework (Oard et al. 2018), the QuantCI stopping technique (Yang et al. 2021a), and the sampling methodology of Li and Kanoulas (2020). We also use a sample of the Jeb Bush Email collection to set the \(\alpha \) hyperparameter of SAL\(_\tau \).

All text is converted into a vector representation by first lowercasing it, then removing English stopwords and any term occurring in more than 90% of the documents in P; the resulting vectors are weighted using tf-idf.
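A rough, self-contained sketch of this preprocessing pipeline follows (with a toy stopword list; in practice one would use a library vectorizer with an English stopword list and a `max_df` cutoff):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # toy subset

def tfidf_vectors(docs, max_df=0.9):
    """Lowercase, drop stopwords and terms occurring in more than max_df
    of the documents, then weight each remaining term by tf * idf."""
    tokens = [[t for t in d.lower().split() if t not in STOPWORDS]
              for d in docs]
    n = len(docs)
    df = Counter()                       # document frequency per term
    for toks in tokens:
        df.update(set(toks))
    vocab = sorted(t for t in df if df[t] / n <= max_df)
    col = {t: j for j, t in enumerate(vocab)}
    vectors = []
    for toks in tokens:
        vec = [0.0] * len(vocab)
        for t, c in Counter(toks).items():
            if t in col:
                vec[col[t]] = c * math.log(n / df[t])
        vectors.append(vec)
    return vectors, vocab

vecs, vocab = tfidf_vectors(["The cat sat", "the dog ran", "the cat ran"])
assert vocab == ["cat", "dog", "ran", "sat"]
```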

5.3.1 RCV1-v2

RCV1-v2 (Lewis et al. 2004) is a publicly available collection of 804,414 news stories from the late nineties, published on the Reuters website. RCV1-v2 is a multi-label multi-class collection, i.e., every document can be assigned to one or more classes from a set \({\mathcal {C}}\) of 103 classes. Since for our experiments we need binary classification datasets (i.e., a document can either be relevant or not), for each class \(c \in {\mathcal {C}}\) we consider each document d as either belonging to c or not, thus obtaining 103 binary datasets. In the experiments, we use a random sample of 10,000 documents, for efficiency and to keep RCV1 pool size close to that of CLEF.

5.3.2 CLEF EMED 2019

The CLEF EMED datasets were made publicly availableFootnote 5 for the TAR in EMED tasks run from 2017 to 2019. The goal of the task was to assess TAR algorithms aimed at supporting the production of systematic reviews in empirical medicine. Following Li and Kanoulas (2020), we use the Diagnostic Test Accuracy (DTA) portion of the dataset, working with the abstract relevance assessments (i.e., the first phase of the review, where the physician only assesses abstracts): the dataset consists of 72 “Training” topics and 8 “Testing” topics.

The texts of the reviewed documents are not available for download from the GitHub platform: they have to be downloaded from PubMed. While an HTTP API is available, the full texts of documents are often behind a paywall: hence our choice of focusing on the abstract reviews only. Moreover, we encountered several issues in downloading some of the abstracts and were unable to retrieve the whole dataset: in total we retrieved abstracts for 60 topics (between “Training” and “Testing”), downloading a collection of 264,750 documents. Since we do not need to distinguish between training and testing phases with different topics (e.g., for transfer learning), we merge the “Training” and “Testing” subcollections.Footnote 6

5.3.3 Jeb Bush Email collection

The Jeb Bush email collection consists of 290,099 emails sent and received by the former governor of Florida, Jeb Bush. We use the subset published by Grossman et al.Footnote 7: the sample consists of 9 topics, with 50,000 documents. For each document and topic, a relevance judgment is available. We do not run experiments on this sample; we only use it to set the hyperparameter \(\alpha \) (see Sect. 4.1).

5.4 Evaluation measures

We evaluate the tested methods using three measures:

  1. The Mean Square Error (MSE) between the recall at stopping \(R_s\) and the target recall R:

     $$\begin{aligned} \textrm{MSE} = (R - R_s)^2 \end{aligned}$$
     (14)

  2. The Relative Error (RE):

     $$\begin{aligned} \textrm{RE} = \frac{{\vert }R - R_s{\vert }}{R} \end{aligned}$$
     (15)

  3. The “idealized” cost (IC) presented by Yang et al. (2021b). This measure specifically evaluates a stopping method on its capability of saving effort and money when stopping the TAR process.

The first two measures are well-known and used in many different tasks in IR and machine learning literature, and they have been used to evaluate TAR tasks (Li and Kanoulas 2020; Yang et al. 2021a).
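For completeness, Eqs. (14) and (15) amount to the following (the per-run squared error of Eq. (14) is averaged over runs to yield the MSE; function names are ours):

```python
def squared_error(target_recall, stop_recall):
    """Eq. (14): per-run squared error between target and achieved recall."""
    return (target_recall - stop_recall) ** 2

def relative_error(target_recall, stop_recall):
    """Eq. (15): absolute error normalized by the target recall."""
    return abs(target_recall - stop_recall) / target_recall

# stopping at recall 0.90 with target 0.95
assert abs(squared_error(0.95, 0.90) - 0.0025) < 1e-12
assert abs(relative_error(0.8, 0.6) - 0.25) < 1e-12
```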

The IC metric is defined as follows. It relies on a cost structure (similar to what was previously done in Oard et al. (2018)), a four-tuple \(s = (\alpha _p, \alpha _n, \beta _p, \beta _n)\). Subscripts p and n indicate the cost of reviewing positive (relevant) and negative (non-relevant) documents; \(\alpha \) and \(\beta \) represent the costs of reviewing a document in the first or second phase. In a one-phase TAR process, this “second” phase is referred to as the failure penalty: it is the cost of continuing the review with an optimal second phase, ranking documents with the model trained in the first one.

Let Q be the minimum number of relevant documents that must be found to reach the recall target R. Say we review batches of size b and we stop at iteration t: let \(Q_t\) be the number of relevant documents found when the method stopped the review. If \(Q_t < Q\), we have a deficit of \(Q - Q_t\) relevant documents; let \(\rho _t\) be the number of documents that need to be reviewed,Footnote 8 following the ranking, to find the additional \(Q - Q_t\) relevant documents. The total cost of the review is then:

$$\begin{aligned} \textrm{IC} = \alpha _pQ_t + \alpha _n(bt - Q_t) + I[Q_t < Q]\left( \beta _p(Q - Q_t) + \beta _n(\rho _t - Q + Q_t)\right) \end{aligned}$$
(16)

where \(I[Q_t < Q]\) is 0 if Q relevant documents were found in the first t iterations, and 1 otherwise.
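Equation (16) can be sketched directly in code; the function name and the argument values in the checks are ours, purely illustrative:

```python
def idealized_cost(s, Q, Q_t, reviewed, rho_t):
    """Eq. (16). s = (alpha_p, alpha_n, beta_p, beta_n); Q_t relevant
    documents were found among `reviewed` = b*t documents; rho_t is the
    number of extra documents an optimal second phase must review,
    following the ranking, to recover the missing Q - Q_t positives."""
    alpha_p, alpha_n, beta_p, beta_n = s
    cost = alpha_p * Q_t + alpha_n * (reviewed - Q_t)
    if Q_t < Q:  # failure penalty: the target was not reached in time
        cost += beta_p * (Q - Q_t) + beta_n * (rho_t - (Q - Q_t))
    return cost

# uniform costs, target met: IC is simply the number of reviewed documents
assert idealized_cost((1, 1, 1, 1), Q=50, Q_t=50, reviewed=500, rho_t=0) == 500
# expensive training, 10 positives missing, found within 30 extra documents:
# 10*40 + 10*460 first phase, plus 10 positives and 20 negatives in phase two
assert idealized_cost((10, 10, 1, 1), Q=50, Q_t=40, reviewed=500, rho_t=30) == 5030
```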

Following both Yang et al. (2021b) and Oard et al. (2018), we evaluate our results with three cost structures:

  • a uniform cost structure, where \(s = (1, 1, 1, 1)\), which we call \(\textrm{Cost}_u\). This assumes that there is no difference between the different phases of review, and that reviewing positive and negative documents has the same cost (we keep this latter assumption in all our cost structures). As argued in Yang et al. (2021b, §4.1), this cost structure is common in many review scenarios;

  • the expensive training cost structure, where \(s = (10, 10, 1, 1)\), which we call \(\textrm{Cost}_e\). This assumes that reviewing a document in the first phase is 10 times more expensive than in the second phase. According to Yang et al. (2021b, §4.2) this is fairly common in systematic reviews in empirical medicine;

  • a MINECORE-like cost structure, where \(s = (1, 1, 5, 5)\), which we call \(\textrm{Cost}_m\). This cost structure reflects MINECORE cost structure 2 (Oard et al. 2018, Table 5), where we assume that reviewing in the second stage (in MINECORE case, it was reviewing by privilege) is 5 times as expensive as in the first stage.

6 Results

We run each of the experiments defined in the previous section 20 times, using a different randomly generated seed set S each time (the same random seed sets are used across all compared methods); we set \(\alpha = 0.3\) (Sect. 4.1), as this was the best-performing value in a hyperparameter search conducted on the Jeb Bush dataset. The reported results for a dataset are thus the evaluation measures averaged over all runs and all classes of that dataset. We also define three equal-sized bins sorted by class prevalence (Very Low, Low, and Medium) to show how the different methods perform with different levels of imbalance between relevant and non-relevant documents. The average prevalence for the classes in the three bins is 0.002, 0.012, and 0.084 for RCV1, and 0.005, 0.027, and 0.117 for CLEF.

Table 2 MSE and RE results on RCV1. For each target recall, bold indicates the best result, underline indicates the second best
Table 3 MSE and RE results on CLEF. For each target recall, bold indicates the best result, underline indicates the second best

Tables 2 and 3 report the MSE and RE values, and Tables 4 and 5 report the IC values. The most competitive methods are SAL\(_\tau \), SAL\(_\tau ^R\), Budget, QBCB, and IPP. On the MSE and RE metrics, our SAL\(_\tau \) and SAL\(_\tau ^R\) stopping rules perform on par with, or better than, the other state-of-the-art techniques. In more detail, for the lower target recalls (\(R=0.8\) and \(R=0.9\)) the above-mentioned methods show comparable performance on both the RCV1-v2 and CLEF datasets; when \(R = 0.95\), the Budget method performs slightly better instead.

Table 4 IC measure on RCV1. For each target recall, bold indicates the best result, underline indicates the second best
Table 5 IC measure on CLEF. For each target recall, bold indicates the best result, underline indicates the second best
Fig. 4 Box plots of actual recall reached by the methods, given a target recall: 0.8 (top), 0.9 (center), 0.95 (bottom). Left column is RCV1, right column is CLEF

With respect to the IC measure, SAL\(_\tau \) shows the lowest costs on average (and for the Very Low prevalence bin) for all proposed cost structures, closely followed by SAL\(_\tau ^R\) and IPP. Notice that the IPP method, despite achieving good cost results, was not as strong in the earlier comparison on the MSE and RE metrics. On the other hand, the QBCB method, which achieves state-of-the-art performance on MSE and RE, incurs very high costs due to the annotation of the pre-review random sample it requires. Furthermore, the most substantial cost reduction shown by SAL\(_\tau \) and SAL\(_\tau ^R\) with respect to the compared methods is in the Very Low prevalence bin. This is an important result, as the “needle in a haystack” scenario is the most common in TAR. The Budget method has comparable costs for high target recall values and in the higher prevalence bins.

Overall, we argue that our SAL\(_\tau ^R\) method seems to strike the best trade-off between proper recall targeting (i.e., actually stopping the review at the given recall target) and low costs.

Indeed, as we anticipated, SAL\(_\tau \) tends to stop the TAR process too early and, as a result, cannot be used consistently in all scenarios despite achieving lower costs. This can be seen in the box plots in Fig. 4, which show the real recall values at which the different stopping rules halted the review process. The average recall value for SAL\(_\tau \) is always higher than the target recall, i.e., the expected value of the recall satisfies the target recall requirement. Yet, it is evident that for higher recall targets the distribution of recall values produced by SAL\(_\tau \) falls below the target not only in the tail of the distribution, as happens for Budget and Knee, but also in a portion of its center. Most TAR tasks require matching the target recall in a much larger portion of cases. By simply adding a margin that is a function of the target recall, SAL\(_\tau ^R\) shifts the distribution of the reached recall values up. Its distribution is similar to that of the Budget method, with slightly higher costs than SAL\(_\tau \) but still lower than the other tested methods.

Finally, notice that the IPP method, despite achieving good costs in some cases (and good MSE/RE values in some other), cannot stop the review process at the right moment, often greatly overshooting the recall estimate.

To summarize, SAL\(_\tau \) enables the use of SLD in AL and TAR processes, solving the issues observed in Esuli et al. (2022) and Molinari et al. (2023). SAL\(_\tau \) brings consistent and substantial improvements, especially for medium/high (0.8, 0.9) target recalls; for higher targets, it tends to stop too early. SAL\(_\tau ^R\) solves this issue without a significant increase in annotation costs with respect to SAL\(_\tau \), finding a very good trade-off between annotation costs and proper target recall matching. In future works we propose to investigate more informed methods to mitigate SAL\(_\tau \) target recall overestimation, as well as testing SAL\(_\tau \) and SAL\(_\tau ^R\) with other sampling techniques (e.g., Li and Kanoulas 2020).

7 Conclusions

In this paper, we introduced a new method called SAL\(_\tau \) (and its variants SAL\(_\tau ^R\) and SAL\(_\tau \)CI) to improve posterior probabilities and prevalence estimates in an active learning review process; more specifically, we tested our method as a “when-to-stop” rule for TAR tasks. SAL\(_\tau \) is a variant of the well-known SLD algorithm (Saerens et al. 2002): our algorithm was designed to enable the use of SLD in AL and TAR processes, solving the issues observed in Esuli et al. (2022) and Molinari et al. (2023). Experiments have shown that SAL\(_\tau \) still tends to slightly overestimate the true recall as it gets close to 1, stopping the process too early. SAL\(_\tau ^R\) solves this issue without significantly increasing the review cost compared to SAL\(_\tau \). In our experiments, SAL\(_\tau ^R\) consistently outperformed state-of-the-art methods: by improving the prevalence estimates, it stops the TAR process much earlier than the other methods while still achieving the target recall.

For future work, we propose to investigate more informed methods to mitigate SAL\(_\tau \)'s overestimation of the target recall, as well as to test SAL\(_\tau \) and SAL\(_\tau ^R\) with other sampling techniques (e.g., Li and Kanoulas 2020).