1 Introduction

In high recall retrieval tasks, the goal is to find all (or almost all) the documents which are relevant to a given information need, from an unlabelled set of documents (often called the pool P). Examples of these tasks are e-discovery, systematic reviews in empirical medicine, and online content moderation.

In these scenarios, the simplest strategy to guarantee a high recall target is to have in-domain human experts review the whole pool of documents, that is, label each document as either relevant or non-relevant. When working with large collections, however, this operation becomes incredibly expensive, both in terms of time and cost: indeed, the number of data items that can be annotated is usually limited by either the availability of the reviewers or the money one is willing to invest in the process (the annotation budget).

Given some time/cost budget, annotating a random sample of the data is a suboptimal approach: the reviewing process is usually aided by one of the many machine learning techniques which go under the name of “Active Learning” (AL, Dasgupta and Hsu 2008; Huang et al. 2014; Lewis and Gale 1994). AL methods adopt a human-in-the-loop model, in which the human expert labels items selected by an automatic classifier. The classifier is then updated, exploiting the additional knowledge coming from new labels, in an iterative process. In the high-recall scenarios mentioned earlier, the human-in-the-loop annotation workflow is usually referred to as Technology-Assisted Review (TAR, Cormack et al. 2010; Grossman and Cormack 2011; Kanoulas et al. 2019).

One of the most challenging issues in TAR applications is the so-called “when-to-stop” problem: that is, we need to choose when to stop the AL process in order to jointly minimize the annotation effort and satisfy the information need, e.g., a target recall value. Recently, the IR literature has proposed many stopping methods (Cormack and Grossman 2016; Li and Kanoulas 2020; Oard et al. 2018; Yang et al. 2021a): the when-to-stop issue is usually tackled by changing the sampling policy of the AL algorithm (see Sect. 2.1), by crafting task-specific heuristics, and/or by estimating the currently achieved recall. In this paper, we focus on the latter approach, and propose a new technique based on the Saerens–Latinne–Decaestecker (SLD) algorithm (Saerens et al. 2002), adapting it to the AL workflow typically leveraged in TAR processes.

The paper is structured as follows: Sect. 2 describes the related work; we then analyze the shortcomings of the SLD algorithm when used “as-is” in AL scenarios (Sect. 3), and propose our own method as a solution to this problem (Sect. 4). Experiments and results are discussed in Sects. 5 and 6. Section 7 concludes.

2 Background and related work

2.1 Active learning: relevance sampling

AL algorithms prioritize the annotation of certain data items over others. Two of the most well-known AL techniques, still used to this day, were presented in 1994 by Lewis and Gale (1994): Active Learning via Relevance Sampling (ALvRS) and Active Learning via Uncertainty Sampling (ALvUS).

Algorithm 2.1 shows the typical structure of an AL process with a stopping rule. Given a data pool of unlabelled documents P, the reviewer annotates a “seed” set of documents \(S\subset P\) that defines the initial training set L. An iterative procedure is then started, in which L is used to train a classifier \(\phi \), which is then exploited by an AL policy pol to select the next set of documents to be presented to the reviewer. In ALvRS, \(\phi \) is used to rank the documents in \((P\setminus L)\) in decreasing order of their posterior probability of relevance \(\Pr (\oplus {\vert }{\textbf{x}})\). The reviewer is asked to annotate the b documents for which \(\Pr (\oplus {\vert }{\textbf{x}})\) is highest (with b the batch size), which, once annotated, are added to the training set L. The classifier is retrained on the new training set and the process repeats until a stopping condition is met, e.g., the annotation budget is exhausted or a target recall is reached. ALvRS is most effective, and has mostly been used, when we are interested in finding all the items relevant to a given information need as quickly as possible.

[Algorithm 2.1: the generic AL process with a stopping rule]

The ALvUS policy is a variation of ALvRS, where documents are ranked in ascending order of \({\vert }\Pr (\oplus {\vert }{\textbf{x}})-0.5{\vert }\), i.e., we top-rank the documents which the classifier is most uncertain about. ALvUS can be useful when we want to build a high-quality training set to later train a machine learning model on it, and has not been employed as often as ALvRS in TAR applications. While many other AL techniques have been proposed over the years (Dasgupta and Hsu 2008; Huang et al. 2014; Konyushkova et al. 2017), in this work we focus especially on ALvRS, and in particular on its variant called Continuous Active Learning (CAL), proposed by Cormack and Grossman (2015), which is specifically tailored to TAR applications.
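As a minimal, self-contained sketch of this loop (with hypothetical helper names: `train_fn` plays the role of training \(\phi \) on the current L, `label_fn` plays the role of the human reviewer), the two policies differ only in the sort key:

```python
import random

def al_loop(pool, label_fn, train_fn, policy="ALvRS",
            batch_size=10, budget=100, seed_size=2):
    """Minimal sketch of an AL process with a budget-based stopping rule."""
    # seed set S, annotated by the reviewer, defines the initial training set L
    labelled = {d: label_fn(d) for d in random.sample(list(pool), seed_size)}
    while len(labelled) < budget:
        clf = train_fn(labelled)                      # phi: returns Pr(+|x)
        unlabelled = [d for d in pool if d not in labelled]
        if policy == "ALvRS":                         # highest Pr(+|x) first
            unlabelled.sort(key=lambda d: -clf(d))
        else:                                         # ALvUS: most uncertain first
            unlabelled.sort(key=lambda d: abs(clf(d) - 0.5))
        for d in unlabelled[:batch_size]:             # reviewer annotates the batch
            labelled[d] = label_fn(d)
    return labelled
```

Here the stopping condition is simply budget exhaustion; the stopping methods of Sect. 2.3 would replace that check.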

Both ALvRS and ALvUS policies suffer from what is called sampling bias (Dasgupta and Hsu 2008; Krishnan et al. 2021), i.e. the fact that, due to the document selection policy, the sample of annotated items L is not representative of P, nor of the unlabelled set U (see Sect. 3 for a more thorough explanation and analysis). In order to investigate how and when a classifier is affected by this bias, Esuli et al. (2022) have introduced a policy called Rand(pol). The Rand(pol) policy is an oracle policy, i.e., it requires knowing the true labels of all documents in the pool. It is thus a synthetic policy designed to investigate the issue of sampling bias of a given AL policy pol.

Rand(pol) observes the prevalences of labels in the L set produced by pol and produces its own \(L^{Rand}\) set which exhibits the same prevalences, but using a random document selection policy. The idea is that substituting the selection policy of pol with random sampling keeps the content of elements in \(L^{Rand}\) unbiased with respect to P, while using the same prevalences produced by pol. The comparison of results produced by pol and Rand(pol) is thus a way to understand whether and how much sampling bias is affecting the decisions of a pol-based classifier: being a controlled random sampling, the Rand(pol) policy should produce a “sampling bias free” dataset.
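Since Rand(pol) is an oracle policy, it can be sketched directly, assuming access to the true labels of the whole pool:

```python
import random

def rand_pol(pool, true_labels, L_pol):
    """Sketch of Rand(pol): a random sample of the pool with the same class
    prevalences as the L set produced by pol (oracle labels assumed)."""
    n_pos = sum(1 for d in L_pol if true_labels[d])
    n_neg = len(L_pol) - n_pos
    positives = [d for d in pool if true_labels[d]]
    negatives = [d for d in pool if not true_labels[d]]
    # same prevalences as L, but with an unbiased (random) selection
    return random.sample(positives, n_pos) + random.sample(negatives, n_neg)
```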

2.2 TAR tasks and workflows

TAR processes usually operate in a “needle in a haystack” scenario, i.e., the number of relevant items is just a tiny fraction of the whole collection of documents. Three of the main real-world TAR applications are: e-discovery, in the legal domain; the production of systematic reviews, in empirical medicine; and online content moderation. In this work, we focus on the first two tasks.

2.2.1 TAR for e-discovery

E-discovery is an important aspect of civil litigation in many (but not only) common law countries: in e-discovery, a large pool of documents P needs to be reviewed in order to find all items “responsive” (i.e., relevant) to the litigation matter. The documents labelled as responsive are “produced” by the defendant party, and disclosed to the other party in the civil litigation. However, the former party holds the right to keep some of these documents “hidden”, putting them in a private log only available to the jury: this is only allowed if the “logged” documents are deemed to contain “privileged” information (e.g., intellectual property). Different misclassification errors (e.g., producing a document which contains privileged information) can bring about different costs for the defendant party, based on the severity of the error committed. It is worth noticing, however, that only a few works really take e-discovery costs into consideration (Oard et al. 2018; Yang et al. 2021b).

In e-discovery, the review usually happens in two stages: (i) documents are first reviewed for responsiveness (i.e., relevance) by a team of junior reviewers; (ii) the documents judged as responsive are then passed on to a second team of senior reviewers (with an hourly rate several times higher than the junior team’s), who mainly re-review the documents for privilege. As can be inferred, annotating documents for privilege is usually a much more costly and delicate operation than annotating for responsiveness (Oard et al. 2018).

2.2.2 TAR for systematic reviews

In empirical medicine, a systematic review discusses (ideally) all the medical literature relevant to a given research question. The production of a systematic review is usually carried out by one or more physicians and can even span years (Michelson and Reuter 2019; Shemilt et al. 2016). A systematic review usually collects a large pool of documents P by issuing a boolean query to a (medical literature) search engine. Then, similarly to e-discovery, systematic reviews are conducted in two stages: (i) a first one, where doctors review document abstracts to determine their relevance, and (ii) a second one, where the documents which passed the first phase are reviewed in their entirety.

The production of systematic reviews has recently attracted the interest of the IR community (Callaghan and Müller-Hansen 2020; O’Mara-Eves et al. 2015; Lease et al. 2016; Wang et al. 2022), which has focused on several aspects of the process, from improving the formulation of the queries issued to search engines, to finding the optimal stopping criterion, thus reducing the annotation costs (and the time spent on a systematic review).

2.2.3 One-phase TAR and two-phase TAR workflows

TAR workflows can usually be divided in “one-phase” and “two-phase” approaches (Yang et al. 2021a), where:

  1. In one-phase TAR workflows, we assume that there is a single review process that is stopped when some condition is met. Relevance sampling is usually the preferred AL technique, since the aim is to annotate the highest number of relevant items in the shortest amount of time possible;

  2. In two-phase TAR workflows, a review team annotates a sample of the data pool, on which a classifier is trained and later used to help a second review team complete the process. The two review teams may, and usually do, have different per-document costs.

Note that the number of phases is not related to how many stages are required by the specific task: both approaches work with multi-stage review (see also Yang et al. (2021b) on when either of the two workflows might be preferred).

This paper focuses on one-phase TAR workflows, although the new method we propose might as well be used in two-phase workflows. This choice is in line with the recent literature, which has mostly focused on one-phase reviews (Cormack and Grossman 2016, 2020; Li and Kanoulas 2020; Yang et al. 2021a) [with the notable exception of Oard et al. (2018)].

2.3 Stopping methods for TAR

Finding a way to stop the AL process as soon as a target recall R is reached is a key aspect in lowering the cost of human review, the other being the use of an AL policy that efficiently selects relevant documents over non-relevant ones. The target recall is usually very high, in many cases equal or close to 100%, turning the task into a “total recall” task; other high target recalls, such as 80% or 90%, are also common in TAR applications (Li and Kanoulas 2020; Yang et al. 2021a, b).

The scope of this paper is limited to stopping algorithms that work inside an AL process without changing its sampling policy. We thus do not consider “interventional stopping rules”, e.g., Li and Kanoulas (2020). As observed in Yang et al. (2021a, §2), we argue that interventional methods are (i) less efficient than AL (i.e., they usually trade off annotation costs for a safer recall estimation) and (ii) less applicable in real-case scenarios, where reviewers are often limited to the AL methods (e.g., relevance sampling) provided by a specific software product. We now give an overview of the most relevant methods, which are then compared to our proposal in the experiments.

2.3.1 The Knee and Budget methods

The Knee method was first proposed by Cormack and Grossman (2016). The method is based on a gain curve for a one-phase TAR workflow, i.e., a plot of how the number of positive documents increases as more documents are reviewed, during the AL process. The method, based on Satopaa et al. (2011), empirically finds “knees” in the plot, ideally stopping the process when the effort of continuing to review documents is not supported by the retrieval of a sufficient number of positive documents.

The Budget method (Cormack and Grossman 2016) is a heuristic variant of the Knee method, in which the process is stopped no earlier than when at least 70% of the document collection has been reviewed. This follows the observation that, if we were to review by random sampling, we would expect to achieve a recall of 0.7 when reviewing 70% of the collection; by using an AL technique, we expect the recall to be much higher. After the 70% threshold has been reached, the Budget method stops the review if a knee test passes (detailed in Cormack and Grossman (2016); Yang et al. (2021a)), and if the number of relevant items found, Rel(L), is sufficiently large, i.e., \({\vert }L{\vert } \ge 10 \cdot {\vert }P{\vert }/Rel(L)\). Neither method allows users to specify a target recall.
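A sketch of the Budget stopping check (hypothetical signature; we read P in the inequality above as the pool size \({\vert }P{\vert }\), and treat the knee test as a boolean input):

```python
def budget_stop(n_reviewed, pool_size, rel_found, knee_test_passed):
    """Sketch of the Budget method's stopping check."""
    reviewed_70pct = n_reviewed >= 0.7 * pool_size          # minimum review effort
    # |L| >= 10 * |P| / Rel(L): enough relevant items found
    enough_relevant = rel_found > 0 and n_reviewed >= 10 * pool_size / rel_found
    return reviewed_70pct and knee_test_passed and enough_relevant
```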

2.3.2 The Callaghan Müller–Hansen (CMH) method

Callaghan and Müller-Hansen (2020) propose a stopping heuristic based on estimating the probability of having reached the target recall and comparing it against a confidence level. The CMH method consists of a first part of the process, which uses an AL method and a confidence level that, once reached, stops the AL process, and a second part, which continues reviewing using random sampling and a higher confidence level. Following an approach similar to that of Yang et al. (2021a) for picking our baselines, we do not use the random sampling part of CMH in our experiments, and only use its heuristic method as a stopping rule; notice, however, that the random sampling and confidence level estimation which follow the heuristic stopping criterion are applicable to any of the other methods we explore in this paper (including our method, presented in Sect. 4).

The CMH heuristic treats batches of previously screened documents as if they were random samples (an assumption somewhat similar to the one we make in Sect. 5.1); for subsets \(A_i = \{d_{N_{seen}-1},...,d_{N_{seen}-i}\}\) of these documents, it computes \(p = \Pr (X \le k)\), where \(X \sim \textrm{Hypergeometric}(N, K_{tar}, n)\): n is the size of the subsample, k is the number of relevant documents it contains, N is the total number of documents, and \(K_{tar} = \lfloor \frac{\rho _{seen}}{R} - \rho _{AL} + 1 \rfloor \) represents the minimum number of relevant documents remaining at the start of sampling. This is done for all sets \(A_i\) with \(i \in \{N_{seen}-1, \dots , 1\}\); \(p_{min}\) is the lowest of these p values, i.e., the one for which the null hypothesis (recall being below target) is least likely. The review of documents proceeds with AL until \(p_{min} < 1 - \alpha \), where \(\alpha \) is a confidence level, set to 95%.
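The core of this test is a hypergeometric tail probability, which can be computed with the standard library alone; the wiring of the \(A_i\) subsets and of \(K_{tar}\) into the full stopping rule is omitted in this sketch:

```python
from math import comb

def hypergeom_cdf(k, N, K, n):
    """Pr(X <= k) for X ~ Hypergeometric(N, K, n): the probability of seeing
    k or fewer relevant docs in a random sample of n docs drawn from N docs,
    of which K are relevant (math.comb returns 0 when the top index is exceeded)."""
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k + 1)) / comb(N, n)

# Illustration: if at least K_tar = 20 relevant docs remained among N = 1000,
# seeing k = 0 of them in a random subsample of n = 100 docs would be unlikely:
p = hypergeom_cdf(0, 1000, 20, 100)
```

A low p is evidence against the null hypothesis that many relevant documents remain, i.e., evidence that the target recall has been reached.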

2.3.3 The QuantCI method

The QuantCI method proposed by Yang et al. (2021a) leverages the classifier predictions (a logistic regression) to estimate the current recall, computes a confidence interval based on variance in the predictions, and stops the reviewing process when the lower bound of the confidence interval reaches the target recall.

More specifically, the estimated recall \({\hat{R}}\) is computed as:

$$\begin{aligned} {\hat{R}} = \frac{{\widehat{Rel}}(L)}{{\widehat{Rel}}(P)} = \frac{\sum _{x \in L} \Pr (\oplus {\vert }x)}{\sum _{x \in P} \Pr (\oplus {\vert }x)} \end{aligned}$$
(1)

This estimate is based on modeling the relevance of a document \(x_i\) as the outcome of a Bernoulli random variable \(D_i \sim Bernoulli(\Pr (\oplus {\vert }x_i))\). The 95% confidence interval (CI) is then computed as:

$$\begin{aligned} \pm 2 \sqrt{\frac{1}{{\widehat{Rel}}(P)^2} Var(D_{L}) + \frac{{\widehat{Rel}}(L)^2}{{\widehat{Rel}}(P)^4} \left( Var(D_L) + Var(D_U)\right) } \end{aligned}$$
(2)

The authors tested their method with and without the confidence interval (i.e., using the recall estimate as is, not lowering it with the confidence interval) resulting in two stopping techniques, called QuantCI and Quant.
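Equations (1) and (2) can be computed directly from the two posterior vectors; a sketch under the Bernoulli model above (assuming independence across documents, so that the individual Bernoulli variances sum):

```python
from math import sqrt

def quant_estimate(post_L, post_U):
    """Recall estimate (Eq. 1) and 95% CI half-width (Eq. 2), from the
    posteriors Pr(+|x) on the labelled (L) and unlabelled (U) documents."""
    rel_L = sum(post_L)                         # \hat{Rel}(L)
    rel_P = rel_L + sum(post_U)                 # \hat{Rel}(P), with P = L ∪ U
    r_hat = rel_L / rel_P                       # Eq. (1)
    var_L = sum(p * (1 - p) for p in post_L)    # Var(D_L): sum of Bernoulli variances
    var_U = sum(p * (1 - p) for p in post_U)    # Var(D_U)
    half = 2 * sqrt(var_L / rel_P**2 + rel_L**2 / rel_P**4 * (var_L + var_U))
    return r_hat, half
```

QuantCI stops when `r_hat - half` reaches the target recall; Quant uses `r_hat` as is.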

2.3.4 The QBCB method

The Quantile Binomial Confidence Bound (QBCB) stopping rule, proposed by Lewis et al. (2021), leverages the theory of quantile estimation to define a stopping rule that is a function of the target recall, a confidence value, and the size r of a human-annotated sample of relevant documents \(P_r\). Given these three values, QBCB returns the minimum number of relevant documents \(j\le r\) from \(P_r\) that have to be found while annotating P in order to have a statistical guarantee of reaching the target recall on P within the given confidence value. The actual equation used by QBCB is:

$$\begin{aligned} \sum _{k=0}^{j-1}\left( \begin{array}{l} r \\ k \end{array}\right) t^k(1-t)^{r-k} \ge 1-\alpha \end{aligned}$$
(3)

where t is the target recall and \(1-\alpha \) is the confidence level. The equation is tested for increasing values of j until it is satisfied. This method thus requires an initial phase of human annotation of documents randomly sampled from P in order to define the set \(P_r\). A larger \(P_r\) raises this initial annotation cost, yet it produces a more accurate estimation of j, with a lower expected annotation cost for the main annotation phase. Experiments in Lewis et al. (2021) on various datasets of different difficulty and prevalence yield different optimal values for r (i.e., values which minimize the average overall annotation cost), in the range of 30 to 60 samples (see Lewis et al. 2021, Fig. 3), with a more accurate targeting of recall for larger r values.
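A sketch of the computation of j (reading t in Eq. 3 as the target recall):

```python
from math import comb

def qbcb_j(r, t, alpha):
    """Smallest j such that Eq. (3) holds: finding j of the r known-relevant
    documents then guarantees target recall t at confidence 1 - alpha."""
    for j in range(1, r + 1):
        # binomial CDF: sum_{k=0}^{j-1} C(r,k) t^k (1-t)^(r-k)
        if sum(comb(r, k) * t**k * (1 - t)**(r - k) for k in range(j)) >= 1 - alpha:
            return j
    return r   # no smaller j gives the guarantee: all r must be found
```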

2.3.5 The IPP method

The Inhomogeneous Poisson Process Power (IPP) method was recently proposed by Sneyd and Stevenson (2021). The authors actually propose several stopping rules based on counting processes, that is, stochastic models of the number of occurrences of an event over time (Sneyd and Stevenson 2021, §3). In order to apply them to TAR, the authors treat the position in a search ranking as “time”, and the occurrences of relevant documents as “events”.

Poisson processes assume that the number of occurrences follows a Poisson distribution: if the rate at which events occur varies, the process is said to be inhomogeneous. Furthermore, a Poisson process has a rate function \(\lambda \), representing the frequency with which events occur over the space (i.e., our ranking). More precisely, the expected number of events in an interval is

$$\begin{aligned} \Lambda (a, b) = \int _a^b \lambda (x)dx \end{aligned}$$
(4)

If we indicate with \(N(a,b)\) a random variable denoting the number of events occurring in the interval \((a, b]\), then:

$$\begin{aligned} P(N(a,b)=r) = \frac{[\Lambda (a,b)]^r}{r!} e^{-\Lambda (a,b)}, \end{aligned}$$
(5)

where N has a Poisson distribution with expected value \(\Lambda (a,b)\). The \(\lambda \) function is unknown, and the goal is to choose a suitable one for the problem at hand.
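As an illustration of Eqs. (4) and (5), a sketch for a power-curve rate \(\lambda (x) = \alpha x^{\beta }\) (an illustrative parameterization on our part; the actual IPP method fits its rate function to the observed data):

```python
from math import exp, factorial

def big_lambda(a, b, alpha, beta):
    """Eq. (4): expected event count on (a, b] for lambda(x) = alpha * x**beta."""
    antideriv = lambda x: alpha * x ** (beta + 1) / (beta + 1)   # beta != -1
    return antideriv(b) - antideriv(a)

def prob_events(a, b, r, alpha, beta):
    """Eq. (5): Pr(N(a, b) = r) under the inhomogeneous Poisson process."""
    lam = big_lambda(a, b, alpha, beta)
    return lam ** r / factorial(r) * exp(-lam)
```

With `beta = 0` the rate is constant and the process reduces to a homogeneous Poisson process.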

The IPP method uses a Poisson process with a power curve as the rate function. Given a counting process such as IPP, the stopping criterion is then applied as in Algorithm 2 [which we report from Sneyd and Stevenson (2021, Algorithm 1)].

[Algorithm 2: the counting-process-based stopping criterion, reported from Sneyd and Stevenson (2021, Algorithm 1)]

2.4 Using the SLD algorithm in AL processes

The Saerens–Latinne–Decaestecker (SLD) algorithm (Saerens et al. 2002) was proposed as a technique to adjust a priori and a posteriori probabilities (hereafter, priors and posteriors) in prior probability shift (PPS) scenarios, i.e., when the prior probability \(\Pr (y)\) diverges between the labelled set L and the unlabelled set U (Moreno-Torres et al. 2012). The algorithm works by iteratively and jointly updating the prior and posterior probabilities (see Algorithm 3).

[Algorithm 3: the SLD algorithm]

AL policies, such as relevance and uncertainty sampling, naturally tend to generate a high PPS: in particular, when \(\Pr _P(\oplus )\) is fairly low to start with (as it usually is in TAR applications), the AL process generates a PPS such that \(\Pr _L(\oplus ) \gg \Pr _U(\oplus )\). Hence, using SLD to improve both our prevalence estimates and our posteriors might seem like a promising idea. However, recent works (Esuli et al. 2022; Molinari et al. 2023) have shown that in this context SLD has a disastrous effect on the posteriors. In the next sections, we analyze this behaviour (Sect. 3) and propose a solution (Sect. 4) that enables using SLD in AL (and hence, in TAR).
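Concretely, the binary-case SLD loop can be sketched as follows (our rendering of the EM procedure of Saerens et al. 2002; variable names are ours):

```python
def sld(posteriors, prior_L, epsilon=1e-6, max_iter=1000):
    """SLD for the binary case: rescale the classifier posteriors Pr(+|x) on U
    by the ratio of the current estimated prior on U to the training prior,
    renormalize, and iterate until the estimated prior converges."""
    prior_U = prior_L                  # \hat{Pr}^{(0)}_U(+) initialized to Pr_L(+)
    adjusted = list(posteriors)
    for _ in range(max_iter):
        adjusted = []
        for p in posteriors:           # E-step: scale by the priors ratios
            pos = (prior_U / prior_L) * p
            neg = ((1 - prior_U) / (1 - prior_L)) * (1 - p)
            adjusted.append(pos / (pos + neg))          # renormalized posterior
        new_prior = sum(adjusted) / len(adjusted)       # M-step: \hat{Pr}^{(s)}_U(+)
        converged = abs(new_prior - prior_U) < epsilon
        prior_U = new_prior
        if converged:
            break
    return adjusted, prior_U
```

Note how, when the mean posterior sits below the training prior, the loop keeps pushing the estimated prior (and the posteriors) further down; this is the mechanism behind the degenerate behaviour analyzed in Sect. 3.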

3 An analysis of SLD shortcomings in active learning scenarios

Two recent studies (Esuli et al. 2022; Molinari et al. 2023) have shown that SLD can have a disastrous behaviour in AL contexts, leading to an extremization of the posterior probability distribution, i.e., to most (if not all) posterior probabilities \(\Pr (\oplus {\vert }x)\) being dragged to either 0 or 1. Experiments from Esuli et al. (2022) show that, when SLD is applied to the Rand(pol) version of an AL policy, be it either ALvRS or ALvUS, this corruption of the posteriors does not happen. The cause of the issue is thus to be found in the document selection policy, and in the fact that both ALvRS and ALvUS produce a sampling bias (Dasgupta and Hsu 2008; Krishnan et al. 2021), which leads to a dataset shift where \(\Pr _L(y) \ne \Pr _U(y)\) and \(\Pr _L(x \vert y) \ne \Pr _U(x \vert y)\).

Sampling bias emerges from AL algorithms due to both the selection policy pol and the initial seed S. S is usually very small (in many TAR applications, it may consist of a single positive instance). Considering the ALvRS policy, the classifier trained on S will be highly confident on documents similar to the few contained in S. As the AL process continues, the training set L will diverge from the underlying data distribution. A classifier trained on such a biased training set can hardly make sense of the whole document pool. Indeed, Dasgupta and Hsu (2008, §2) show that the classifier can be overly confident in attributing the negative label to a cluster of data which actually contains several positive instances.

Fig. 1: LSA visualization of documents in the pool of an ALvRS experiment. Yellow/blue indicates relevant/non-relevant documents; green/red circles indicate relevant/non-relevant documents selected via RS. A strong sampling bias selects documents in a restricted region. See Molinari et al. (2023, Fig. 1) (color figure online)

A visual representation of the sampling bias is shown in Fig. 1, from Molinari et al. (2023), which shows how the document selection is extremely focused on one small region of the positive instances.

When it comes to the classifier's capability to estimate the prevalence of relevant documents in the unlabelled part of the pool U, this means that its estimates will be much lower than those of a classifier trained on a random, representative sample of the same population. We can see this in Table 1, where we compare the prevalence estimates of a calibrated SVM classifier trained on the ALvRS training set (at different sizes) with those of the same classifier trained on a controlled random sample of the pool with the same size and the same positive prevalence (we call the two SVMs trained on these two training sets ALvRS SVM and Rand(RS) SVM, respectively). The estimate is the average of the posterior probabilities for the relevant class \(\Pr (\oplus {\vert }x)\) for all \(x \in U\).

Table 1 Prevalence estimates of an SVM classifier trained on a ALvRS and Rand(RS) training set respectively, compared to true prevalences of the L and U sets

The last two columns of the table report the true prevalence of the positive class in L and U, showing the strong PPS generated by AL techniques; clearly, the shift is stronger if \(\Pr _P(\oplus )\) is already fairly low to start with (which is usually the case in many TAR applications). In this scenario, the classifier usually overestimates \(\Pr _U(\oplus )\), given its bias towards the \(\Pr _L(\oplus )\) prevalence, which is what we see for the Rand(RS) SVM. The values in the table seem to indicate that the ALvRS-based estimates are better than the Rand(RS) ones. We argue that this is actually due to sampling bias: the classifier is very likely underestimating the prevalence of those positive clusters it does not know about; the end result is that its prevalence estimate is much lower than the Rand(RS) one and, incidentally, closer to the true prevalence of U.

In order to better show how sampling bias affects the AL-trained classifier, we give a visual representation of this phenomenon on a synthetic dataset:

Fig. 2: ALvRS and Rand(RS) applied to synthetic data. a The ALvRS training set (marked with X) and the prevalence estimation of an SVM trained on it; the estimation of a Rand(RS)-trained classifier is reported as well, for completeness. The blue and yellow clusters are the “positive” clusters, whereas the pink and purple ones are the “negative” ones. b The Rand(RS) training set (marked with X), used for the Rand classifier in (a)

  1. We generate an artificial dataset consisting of 10,000 data items with four clusters (blue, yellow, purple and pink in Fig. 2). Blue and yellow clusters are the positive clusters (i.e., every item in these clusters has a positive \(\oplus \) label); purple and pink clusters are the negative clusters (i.e., every item in these clusters has a negative \(\ominus \) label). Notice that negative clusters are much more populated than positive ones (i.e., the overall positive prevalence is low);

  2. We start the active learning process with two positive items coming from one of the positive clusters and 10 negative items, randomly sampled from the negative clusters. We then annotate 500 documents with the ALvRS policy and generate an analogous training set with the Rand(RS) policy. The two training sets are shown with “X” markers in Fig. 2a, b, for ALvRS and Rand(RS) respectively;

  3. We show the estimated (and the true) proportion of positive items remaining in each cluster, for a calibrated SVM trained on the ALvRS and Rand(RS) training sets respectively. Furthermore, we show in the title the true \(\textrm{P}_U(\oplus )\) and the estimated prevalences (\(\textrm{P}_U^{\textrm{ALvRS}}(\oplus )\) and \(\textrm{P}_U^{Rand}(\oplus )\)) on the test set.

In Fig. 2a we see how the AL process annotates a specific subregion of positive items. This in turn “misleads” the classifier into outputting a much lower prevalence than the Rand(RS) classifier for the clusters it has never seen during training; notice that, the overall positive prevalence being quite low, the prevalence estimates of the AL classifier seem better than Rand(RS)'s.

3.1 How is sampling bias related to SLD failures in active learning contexts?

The reason why sampling bias is a key element in our analysis lies in the main assumption that SLD makes on the posterior probability distribution of the classifier: SLD implicitly assumes the scenario represented by the Rand(RS) policy, rather than the AL one. More precisely, it assumes that there is no dataset shift on the conditional probability \(\Pr (x\vert y)\) between the labelled set L and U, i.e., that \(\Pr _L(x \vert y) = \Pr _U(x \vert y)\). If this assumption holds, the trained classifier will have a bias on L which will “uniformly” translate (i.e., in our previous example, the bias is consistent for all regions of the graph) to the posterior probabilities \(\Pr _{U}(\oplus \vert x)\) on the unlabelled set. In a PPS scenario, this means that the classifier's prevalence estimates (i.e., the average of the classifier posteriors) will be closer to \(\Pr _L(\oplus )\), and that they can be “adjusted” consistently across the whole distribution. However, when using an active learning policy such as ALvRS or ALvUS, we not only generate prior probability shift, but we also affect the distribution of the conditional probability \(\Pr (x \vert y)\), such that \(\Pr _L(x \vert y) \ne \Pr _U(x \vert y)\).

Let us now focus on one of the key updates in SLD: the priors ratio (Line 13 of Algorithm 3), which is later multiplied by the posteriors. This is defined as:

$$\begin{aligned} \frac{{\hat{\Pr }}_{U}^{(s)}(\oplus )}{\Pr _{L}(\oplus )} \end{aligned}$$
(6)

This linear relation is represented by the red line of Fig. 3. When \({\hat{\Pr }}_{U}(\oplus ) = \Pr _{L}(\oplus )\), the ratio is 1, and the posteriors do not change. However, as \({\hat{\Pr }}_{U}(\oplus )\) drifts further away from \(\Pr _{L}(\oplus )\) (recall that this latter quantity is constant across all SLD iterations), the ratio becomes progressively smaller, resulting in a multiplication of \(\Pr _U(\oplus {\vert }x)\) by a number very close to 0. We argue this is one of the main culprits of the degenerate outputs of SLD.

In other words, let us assume that \(\Pr _L(x\vert y) = \Pr _U(x\vert y)\), and that L is then a representative sample of U. A classifier trained and biased on L will tend to shift any prevalence estimate toward \(\Pr _{L}(\oplus )\). When the classifier estimated \({\hat{\Pr }}_{U}(\oplus )\) is lower (or higher) than \(\Pr _{L}(\oplus )\), SLD deems the true \(\Pr _{U}(\oplus )\) to be even lower (or higher). Indeed, this works very well when the previous assumption holds, e.g., for Rand(pol). As a matter of fact, SLD has been a state-of-the-art technique for prior and posterior probabilities adjustments in PPS for 20 years. However, ALvRS and similar techniques generate a \(\Pr _L(x\vert y) \ne \Pr _U(x\vert y)\) type of shift (as well as PPS), and, as a result, we cannot apply SLD update with confidence: we should rather find a way to apply a milder correction, when possible, or no correction at all when we have no way of estimating how “far” we are from the SLD assumption.

4 Adapting the SLD algorithm to active learning

As discussed in the previous section, in AL scenarios the priors ratio defined in the SLD algorithm should be milder, in order to avoid extreme behaviours. A simple but possibly effective action is to directly add a correction factor \(\tau \) to the priors ratio equation, so that:

  • when \(\tau = 1\) we get the original SLD algorithm;

  • when \(\tau = 0\) the ratio always equals 1, i.e., we do not apply any correction;

  • all other intermediate values adjust the ratio, making it milder with respect to the original SLD ratio.

We thus define a correction to the priors ratio of SLD:

$$\begin{aligned} \delta = -\left[ \tau \cdot \left( -\frac{{\hat{\Pr }}^{(s)}_{U}(y)}{{\hat{\Pr }}^{(0)}_{U}(y)} + 1\right) -1 \right] \end{aligned}$$
(7)

The new ratio, which we call \(\delta \), is then multiplied by the posteriors at every SLD iteration (note that \({\hat{\Pr }}^{(0)}_{U}(y)\) is initialized to \(\Pr _{L}(y)\), so the fraction in Eq. 7 is the priors ratio of Eq. 6). We show how \(\tau \) affects the slope of the ratio in Fig. 3. We call our method “SLD for Active Learning”, or SAL\(_\tau \) for short. The complete algorithm is reported in Algorithm 4.
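Note that Eq. (7) simplifies to \(\delta = \tau \cdot r + (1 - \tau )\), with r the priors ratio, i.e., a linear interpolation between 1 and the original SLD ratio; as a one-line sketch:

```python
def sal_ratio(sld_ratio, tau):
    """Eq. (7): delta = -(tau * (-ratio + 1) - 1) = tau * ratio + (1 - tau),
    i.e. the SLD priors ratio softened toward 1 as tau goes from 1 to 0."""
    return -(tau * (-sld_ratio + 1) - 1)
```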

Fig. 3
figure 3

The \(\tau \)-based correction to the priors ratio of SLD that we propose. The value of this ratio (i.e., the y axis) is multiplied by the posteriors during SLD iterations. Notice that when \(\tau = 1\) we get the SLD original ratio, whereas when \(\tau = 0\) we multiply the posteriors by 1, i.e. we do not change the classifier posteriors

[Algorithm 4: the SAL\(_\tau \) algorithm]

In the next section we detail how we set the value of \(\tau \).

4.1 Estimating SAL\(_\tau \)’s \(\tau \) across active learning iterations

We have seen that SLD works well on the output of a Rand(RS)-based classifier (Esuli et al. 2022). We can build our estimate of \(\tau \) for SAL\(_\tau \) by measuring how similar the distribution of the ALvRS classifier posteriors is to that of a Rand(RS) classifier. The more the ALvRS posteriors diverge from the Rand(RS) ones, the milder the SLD correction should be, up to the point where we do not use SLD at all.

Let us consider the posteriors for the relevant class, i.e., \(\Pr (\oplus {\vert }x)\): we collect a vector \({\mathcal {A}}\) of posteriors \(\Pr (\oplus {\vert }x)\) for all documents in U for the classifier trained on the ALvRS training set, and an analogous one \({\mathcal {R}}\) for the classifier trained on the Rand(RS) training set. We define \(\tau \) as the cosine similarity between these two vectors:

$$\begin{aligned} \tau = \mathrm {cosine\ similarity}({\mathcal {A}},{\mathcal {R}}) = \frac{{\mathcal {A}} \cdot {\mathcal {R}}}{{\vert }{\vert }{\mathcal {A}}{\vert }{\vert } \ {\vert }{\vert }{\mathcal {R}}{\vert }{\vert }} \end{aligned}$$
(8)

Since \(\Pr (\oplus {\vert }x) \ge 0\) by definition of probability, the cosine similarity is naturally bounded between 0 and 1, a required property of our \(\tau \) parameter. In other words, we apply the SLD update when AL posteriors are similar to posteriors for which we know the assumption made in SLD holds (i.e., \(\Pr _L(x\vert y) = \Pr _U(x\vert y)\), see Sect. 3); we apply an accordingly milder correction the further the AL posteriors are from this assumption.
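Equation (8) is a standard cosine similarity; a minimal sketch follows (the posterior values are made-up placeholders — in the paper, \({\mathcal {A}}\) and \({\mathcal {R}}\) would be collected from the ALvRS and Rand(RS) classifiers on U):

```python
import math

def cosine_similarity(a, b):
    """Eq. (8): tau as the cosine similarity of two posterior vectors.
    Since posteriors are non-negative, the result lies in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# identical posterior vectors -> tau = 1, i.e., apply the full SLD correction
assert abs(cosine_similarity([0.9, 0.1, 0.4], [0.9, 0.1, 0.4]) - 1.0) < 1e-12
# orthogonal vectors -> tau = 0, i.e., no correction at all
assert cosine_similarity([1.0, 0.0], [0.0, 1.0]) == 0.0
```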

Equation (8) would be a good solution, were it not an impossible one in practice: the Rand(pol) policy requires knowledge of the labels (e.g., relevance) for the entire data pool.

We thus resort to a heuristic formulation based on the evolution of the classifier during the iterations of the AL process.Footnote 3 Given the batch of documents \(B_i\) reviewed at the i-th iteration, we define \({\mathcal {A}}_{\phi _i}\) and \({\mathcal {A}}_{\phi _{i-1}}\) as:

$$\begin{aligned} {\mathcal {A}}_{\phi _i}&= \{ {\Pr }^{(i)}(\oplus \vert x) = \phi _i(x) \ \vert \ x \in B_i \} \end{aligned}$$
(9)
$$\begin{aligned} {\mathcal {A}}_{\phi _{i-1}}&= \{ {\Pr }^{(i-1)}(\oplus \vert x) = \phi _{i-1}(x) \ \vert \ x \in B_i \} \end{aligned}$$
(10)

i.e., the posteriors on \(B_i\) returned by the classifiers \(\phi _i\) and \(\phi _{i-1}\). The \(\tau \) parameter is then defined as the cosine similarity between \({\mathcal {A}}_{\phi _i}\) and \({\mathcal {A}}_{\phi _{i-1}}\). The assumption we make is that:

  • at early iterations, there will likely be no substantial differences between the two vectors (making the cosine similarity uninformative);

  • however, as the active learning process progresses, the divergence between them can provide an estimate of the sampling bias affecting the classifier.

Let us consider the ALvRS policy, in a scenario similar to the one depicted in Fig. 2a, that is, overall prevalence is low and we start with a small seed set: in the first iterations we are likely going to find many positive items; that is, when we compare \({\phi _i}\) and \({\phi _{i-1}}\) predictions on \(B_i\), they are likely to be similar, since the “clusters” of data available to \({\phi _i}\) are probably the same that were available to \({\phi _{i-1}}\). However, once we have reviewed all the positive items in a cluster, the process is forced to explore items in the neighbouring clusters. This is when the cosine similarity should be effective: by comparing the posterior distribution of \(\phi _i\) to that of \(\phi _{i-1}\) for the same \(B_i\) documents, we should be able to assess the impact of the new documents on the classifier. Indeed, if \(\phi _i\) accessed previously unseen clusters, the posteriors on \(B_i\) might radically change and, in turn, roughly give us an estimate of how much sampling bias was affecting \(\phi _{i-1}\) predictions.

Finally, let us consider one last issue: we said that in the first AL iterations \({\mathcal {A}}_{\phi _i}\) will likely be similar to \({\mathcal {A}}_{\phi _{i-1}}\), and their distance cannot be used as an indication of sampling bias. How can we establish when to use our method and when to fall back to the classifier posteriors? Lacking better solutions (whose study we defer to future work), we introduce a hyperparameter \(\alpha \):

  • At every iteration i, we apply SAL\(_\tau \) and obtain a new set of posterior probabilities \(\Pr ^{\textrm{SAL}_\tau }(\oplus \vert x)\) and a prevalence estimate \(\Pr ^{\textrm{SAL}_\tau }(\oplus )\) (computed as \(\frac{\sum _{x \in X} \Pr ^{\textrm{SAL}_\tau }(\oplus \vert x)}{\vert X\vert }\));

  • We measure the Normalized Absolute Error (NAE) between the estimated prevalence and the true prevalence of batch \(B_i\), where \(\textrm{NAE}(\Pr _{B_i}(y),{\hat{\Pr }}_{B_i}^{\textrm{SAL}_\tau }(y))\) is defined as:

    $$\begin{aligned} \textrm{NAE} = \frac{\sum _{j=1}^{\vert Y\vert }\vert \Pr _{B_i}(y_{j})-{\hat{\Pr }}_{B_i}^{\textrm{SAL}_\tau }(y_{j})\vert }{2\left( 1-\displaystyle \min _{y_{j}\in Y}{\Pr }_{B_i}(y_{j})\right) } \end{aligned}$$
    (11)
  • If \(\textrm{NAE} > \alpha \) we do not use our SAL\(_\tau \) method.

The simple intuition behind this heuristic is that when the NAE between the true prevalence and the prevalence estimation of SAL\(_\tau \) is too high, then the estimates of SAL\(_\tau \) are likely going to be poor on the rest of the pool as well.
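In the binary TAR setting (relevant vs. non-relevant), the NAE gate can be sketched as follows; the function names and the default \(\alpha = 0.3\) are illustrative, and we assume the standard NAE normalization by the true batch prevalence:

```python
def normalized_absolute_error(true_prev, est_prev):
    """NAE (Eq. 11) for the binary case: the two classes are
    (relevant, non-relevant), so each distribution has two values."""
    p = (true_prev, 1.0 - true_prev)   # true prevalence of batch B_i
    q = (est_prev, 1.0 - est_prev)     # SAL_tau prevalence estimate
    num = sum(abs(pj - qj) for pj, qj in zip(p, q))
    return num / (2.0 * (1.0 - min(p)))

def accept_sal_tau(true_prev, est_prev, alpha=0.3):
    """Fall back to the raw classifier posteriors when NAE > alpha."""
    return normalized_absolute_error(true_prev, est_prev) <= alpha

# a close estimate passes the gate; a wildly off one does not
assert accept_sal_tau(0.1, 0.12, alpha=0.3)
assert not accept_sal_tau(0.1, 0.6, alpha=0.3)
```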

To recap, we give below an overview of how SAL\(_\tau \) integrates into the active learning process (see Algorithm 5):

  1. At each iteration i we employ an active learning policy, annotating a batch \(B_i\) of b documents, which are added to the training set L;

  2. We train a classifier \(\phi _{i}\) on L;

  3. At each iteration \(i > 1\), we compute the cosine similarity between the two vectors of scores \({\mathcal {A}}_{\phi _i} = \langle \Pr ^{\phi _i}(\oplus \vert x) \ \forall \ x \in B_i \rangle \) and \({\mathcal {A}}_{\phi _{i-1}} = \langle \Pr ^{\phi _{i-1}}(\oplus \vert x) \ \forall \ x \in B_i \rangle \). The cosine similarity is used as the \(\tau \) parameter in SAL\(_\tau \);

     (a) We obtain a new set of posterior probabilities \(\Pr ^{\textrm{SAL}_\tau }(\oplus \vert x)\) and a new prevalence estimate \(\Pr ^{\textrm{SAL}_\tau }(\oplus )\) on U, using SAL\(_\tau \) on the posteriors coming from \(\phi _{i-1}\);

  4. We compute the NAE between the SAL\(_\tau \) prevalence estimate for \(B_i\) and the true prevalence of \(B_i\). If this is lower than a threshold \(\alpha \), we consider the SAL\(_\tau \)-based probabilities to be the correct ones; otherwise we fall back to the \(\phi _i\)-based probabilities. In a TAR process this means that we can use the SAL\(_\tau \)-based probabilities to estimate the recall and decide when to stop reviewing documents.
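Once accepted, the SAL\(_\tau \) posteriors can drive the stopping decision. One common way to estimate recall in a one-phase TAR process, sketched below, is to compare the positives found so far against the positives estimated to remain in U by summing the posteriors (the helper names are ours, not from the paper):

```python
def estimated_recall(n_found, posteriors_unlabeled):
    """Estimate recall as positives found so far over (found + estimated
    remaining), where the remaining positives in the unlabeled pool U are
    estimated by summing the per-document posteriors Pr(+|x)."""
    est_remaining = sum(posteriors_unlabeled)
    return n_found / (n_found + est_remaining)

def should_stop(n_found, posteriors_unlabeled, target_recall):
    """Stop the review as soon as the estimated recall reaches the target."""
    return estimated_recall(n_found, posteriors_unlabeled) >= target_recall

# 80 positives found, posteriors summing to 20 -> estimated recall 0.8
assert estimated_recall(80, [10.0, 10.0]) == 0.8
assert should_stop(80, [10.0, 10.0], 0.8)
assert not should_stop(80, [10.0, 10.0], 0.9)
```

Better posteriors (here, the SAL\(_\tau \)-corrected ones) thus translate directly into a more reliable stopping point.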


4.2 Mitigating SAL\(_\tau \) recall overestimation: SAL\(_\tau ^m\)

As will become clear in the results section (Sect. 6), SAL\(_\tau \) achieves significant improvements with respect to the compared methods. However, in some cases, and especially for higher recall targets, SAL\(_\tau \) also tends to overestimate the recall, stopping the TAR process too early. Leaving the study of more complex approaches to future work, we propose a simple way to mitigate recall overestimation: raising the target recall given as input to the method by a margin value. We call this variant SAL\(_\tau ^m\) (SAL\(_\tau \) with margin m); it trades off an increase in annotation costs for a safer TAR process with fewer early stops. SAL\(_\tau ^m\) is simply SAL\(_\tau \) with the only difference being the use of a target recall \(R^m\), determined as a function of the target recall R and the margin m:

$$\begin{aligned} R^m= R+(1-R)m \end{aligned}$$
(12)

The margin m ranges from 0 to 1. When \(m=0\), \(\textrm{SAL}_\tau ^m= \textrm{SAL}_\tau \); when \(m=1\), \(R^m=1\). A low value means fully trusting SAL\(_\tau \); a high value means accepting to label more documents to avoid early stops. In order to avoid adding a free parameter to the method, we set \(m=R\), following the intuition that the closer R is to 1, the more crucial it is to actually guarantee the target recall. Equation (12) thus becomes:

$$\begin{aligned} R^m = R+(1-R)R = 2R - R^2 \end{aligned}$$
(13)

We call this configuration SAL\(_\tau ^R\) and comment upon its results in Sect. 6. We defer to future works the exploration of more informed methods to set m, which could possibly enable a more convenient trade-off between annotation costs and proper recall targeting.
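Equations (12) and (13) amount to a one-line adjustment; a minimal sketch (the function name is ours), with the SAL\(_\tau ^R\) configuration \(m = R\) as the default:

```python
def margined_target(R, m=None):
    """Eq. (12): R^m = R + (1 - R) * m.
    With m = R (the SAL_tau^R configuration) this is Eq. (13): 2R - R^2."""
    if m is None:
        m = R
    return R + (1.0 - R) * m

# R = 0.8 under SAL_tau^R: 2*0.8 - 0.8^2 = 0.96
assert abs(margined_target(0.8) - 0.96) < 1e-12
# m = 0 recovers plain SAL_tau; m = 1 demands full recall
assert margined_target(0.9, m=0.0) == 0.9
assert abs(margined_target(0.9, m=1.0) - 1.0) < 1e-12
```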

5 Experiments

5.1 Using SAL\(_\tau \) to stop a TAR process

The SAL\(_\tau \) algorithm can be tested in any AL scenario where we need to improve priors and posteriors. In this paper, we focus on testing SAL\(_\tau \) capabilities for TAR: our goal is to stop the review process as soon as a target recall R is reached, lowering the review cost.

We test the SAL\(_\tau \) algorithm with three configurations:

  1. The SAL\(_\tau \) formulation of Algorithm 5;

  2. SAL\(_\tau ^R\), i.e., with margin, as described in Sect. 4.2;

  3. SAL\(_\tau \)CI, a variant of SAL\(_\tau \) that uses the confidence interval heuristic from the QuantCI technique (Yang et al. 2021a).

We compare against the following well-known methods (see Sect. 2):

  1. The Knee method by Cormack and Grossman (2015);

  2. The Budget method by Cormack and Grossman (2015);

  3. The CHM method by Callaghan and Müller-Hansen (2020);

  4. The QuantCI method by Yang et al. (2021a);

  5. The QBCB method by Lewis et al. (2021);

  6. The IPP method by Sneyd and Stevenson (2021).

For all methods that require a confidence interval, we set it to \(95\%\). The positive sample size r of the QBCB method (see Sect. 2.3.4) is instead set at 50, which in the results reported in the original paper appears to be a good trade-off between a low overall cost and a low variance in such cost across different samples.

5.2 The active learning workflow

We run the same active learning workflow for all tested methods. In most TAR applications (Yang et al. 2021a; Cormack and Grossman 2015), we usually have only a single positive document to start the active learning with, called the initial seed: for our experiments, we decide to seed the active learning process with an additional negative document randomly sampled from the document pool P; that is, our initial seed set S consists of a positive and a negative document, randomly sampled from P.

As mentioned earlier (Sect. 2), the active learning policy we pick is CAL (Cormack and Grossman 2015) (a variation of ALvRS), since this is the most common policy used in TAR tasks. For the batch size b, we follow Yang et al. (2021a) and set it to 100. The classifier we use in all our experiments is a standard logistic regression, as this is also the classifier of choice in most (if not all) TAR applications. We test the target recall values \(\{0.8, 0.9, 0.95\}\).

5.3 Datasets

We run our experimentsFootnote 4 on two well-known datasets: the RCV1-v2 and the CLEF Technology-Assisted Reviews in Empirical Medicine (EMED) datasets (specifically, the dataset made available for the CLEF 2019 edition). Both datasets have already been used to test TAR frameworks and algorithms, e.g., the MINECORE framework (Oard et al. 2018), the QuantCI stopping technique (Yang et al. 2021a), and the sampling methodology of Li and Kanoulas (2020). We also use a sample of the Jeb Bush Email collection to set the \(\alpha \) hyperparameter of SAL\(_\tau \).

All text is converted into a vector representation by first lowercasing it, then removing English stopwords and any term occurring in more than 90% of the documents in P; the resulting vectors are weighted using tf-idf.
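A rough, self-contained sketch of this preprocessing pipeline follows (with a toy stopword list; in practice one would use a library vectorizer with an English stopword list and a `max_df` cutoff):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # toy subset

def tfidf_vectors(docs, max_df=0.9):
    """Lowercase, drop stopwords and terms occurring in more than max_df
    of the documents, then weight each remaining term by tf * idf."""
    tokens = [[t for t in d.lower().split() if t not in STOPWORDS]
              for d in docs]
    n = len(docs)
    df = Counter()                       # document frequency per term
    for toks in tokens:
        df.update(set(toks))
    vocab = sorted(t for t in df if df[t] / n <= max_df)
    col = {t: j for j, t in enumerate(vocab)}
    vectors = []
    for toks in tokens:
        vec = [0.0] * len(vocab)
        for t, c in Counter(toks).items():
            if t in col:
                vec[col[t]] = c * math.log(n / df[t])
        vectors.append(vec)
    return vectors, vocab

vecs, vocab = tfidf_vectors(["The cat sat", "the dog ran", "the cat ran"])
assert vocab == ["cat", "dog", "ran", "sat"]
```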

5.3.1 RCV1-v2

RCV1-v2 (Lewis et al. 2004) is a publicly available collection of 804,414 news stories from the late nineties, published on the Reuters website. RCV1-v2 is a multi-label multi-class collection, i.e., every document can be assigned to one or more classes from a set \({\mathcal {C}}\) of 103 classes. Since for our experiments we need binary classification datasets (i.e., a document can either be relevant or not), for each class \(c \in {\mathcal {C}}\) we consider each document d as either belonging to c or not, thus obtaining 103 binary datasets. In the experiments, we use a random sample of 10,000 documents, for efficiency and to keep RCV1 pool size close to that of CLEF.

5.3.2 CLEF EMED 2019

The CLEF EMED datasets were made publicly availableFootnote 5 for the TAR in EMED tasks run from 2017 to 2019. The goal of the task was to assess TAR algorithms aimed at supporting the production of systematic reviews in empirical medicine. Following Li and Kanoulas (2020), we use the Diagnostic Test Accuracy (DTA) portion of the dataset, working with the abstract relevance assessments (i.e., the first phase of the review, where the physician only assesses abstracts): the dataset consists of 72 “Training” topics and 8 “Testing” topics.

The texts of the reviewed documents are not available for download from the GitHub platform: they have to be downloaded from PubMed. While an HTTP API is available, the full texts of documents are often behind a paywall: hence our choice of focusing on the abstract reviews only. Moreover, we encountered several issues in downloading some of the abstracts and were unable to retrieve the whole dataset: in total we retrieved abstracts for 60 topics (between “Training” and “Testing”), downloading a collection of 264,750 documents. Since we do not need to distinguish between training and testing phases with different topics (e.g., for transfer learning), we merge the “Training” and “Testing” subcollections.Footnote 6

5.3.3 Jeb Bush Email collection

The Jeb Bush email collection consists of 290,099 emails sent and received by the former governor of Florida, Jeb Bush. We use the subset published by Grossman et al.Footnote 7: the sample consists of 9 topics, with 50,000 documents. For each document and topic, a relevance judgment is available. We do not run experiments on this sample; we only use it to set the hyperparameter \(\alpha \) (see Sect. 4.1).

5.4 Evaluation measures

We evaluate the tested methods using three measures:

  1. The Mean Square Error (MSE) between the recall at stopping \(R_s\) and the target recall R:

     $$\begin{aligned} \textrm{MSE} = (R - R_s)^2 \end{aligned}$$
     (14)

  2. The Relative Error (RE):

     $$\begin{aligned} \textrm{RE} = \frac{{\vert }R - R_s{\vert }}{R} \end{aligned}$$
     (15)

  3. The “idealized” cost (IC) presented by Yang et al. (2021b). This measure specifically evaluates a stopping method on its capability of saving effort and money when stopping the TAR process.

The first two measures are well-known and used in many different tasks in IR and machine learning literature, and they have been used to evaluate TAR tasks (Li and Kanoulas 2020; Yang et al. 2021a).
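For completeness, Eqs. (14) and (15) amount to the following (the per-run squared error of Eq. (14) is averaged over runs to yield the MSE; function names are ours):

```python
def squared_error(target_recall, stop_recall):
    """Eq. (14): per-run squared error between target and achieved recall."""
    return (target_recall - stop_recall) ** 2

def relative_error(target_recall, stop_recall):
    """Eq. (15): absolute error normalized by the target recall."""
    return abs(target_recall - stop_recall) / target_recall

# stopping at recall 0.90 with target 0.95
assert abs(squared_error(0.95, 0.90) - 0.0025) < 1e-12
assert abs(relative_error(0.8, 0.6) - 0.25) < 1e-12
```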

The IC metric is defined as follows. It relies on a cost structure (similar to what was previously done in Oard et al. (2018)), a four-tuple \(s = (\alpha _p, \alpha _n, \beta _p, \beta _n)\). Subscripts p and n indicate the cost of reviewing positive (relevant) and negative (non-relevant) documents; \(\alpha \) and \(\beta \) represent the costs of reviewing a document in the first or second phase. In a one-phase TAR process, this “second” phase is referred to as the failure penalty: it is the cost of continuing the review with an optimal second phase, ranking documents with the model trained in the first one.

Let Q be the minimum number of relevant documents that must be found to reach the recall target R. Say we review batches of size b and we stop at iteration t: let \(Q_t\) be the number of relevant documents found when the method stopped the review. If \(Q_t < Q\), we have a deficit of \(Q - Q_t\) relevant documents; let \(\rho _t\) be the number of documents that need to be reviewed,Footnote 8 following the ranking, to find the additional \(Q - Q_t\) relevant documents. The total cost of the review is then:

$$\begin{aligned} \textrm{IC} = \alpha _pQ_t + \alpha _n(bt - Q_t) + I[Q_t < Q]\left( \beta _p(Q - Q_t) + \beta _n(\rho _t - Q + Q_t)\right) \end{aligned}$$
(16)

where \(I[Q_t < Q]\) is 0 if Q relevant documents were found in the first t iterations, and 1 otherwise.
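Equation (16) can be sketched directly in code; the function name and the argument values in the checks are ours, purely illustrative:

```python
def idealized_cost(s, Q, Q_t, reviewed, rho_t):
    """Eq. (16). s = (alpha_p, alpha_n, beta_p, beta_n); Q_t relevant
    documents were found among `reviewed` = b*t documents; rho_t is the
    number of extra documents an optimal second phase must review,
    following the ranking, to recover the missing Q - Q_t positives."""
    alpha_p, alpha_n, beta_p, beta_n = s
    cost = alpha_p * Q_t + alpha_n * (reviewed - Q_t)
    if Q_t < Q:  # failure penalty: the target was not reached in time
        cost += beta_p * (Q - Q_t) + beta_n * (rho_t - (Q - Q_t))
    return cost

# uniform costs, target met: IC is simply the number of reviewed documents
assert idealized_cost((1, 1, 1, 1), Q=50, Q_t=50, reviewed=500, rho_t=0) == 500
# expensive training, 10 positives missing, found within 30 extra documents:
# 10*40 + 10*460 first phase, plus 10 positives and 20 negatives in phase two
assert idealized_cost((10, 10, 1, 1), Q=50, Q_t=40, reviewed=500, rho_t=30) == 5030
```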

Following both Yang et al. (2021b) and Oard et al. (2018), we evaluate our results with three cost structures:

  • a uniform cost structure, where \(s = (1, 1, 1, 1)\), which we call \(\textrm{Cost}_u\). This assumes that there is no difference between the different phases of review, and that reviewing positive and negative documents has the same cost (we keep this latter assumption in all our cost structures). As argued in Yang et al. (2021b, §4.1), this cost structure is common in many review scenarios;

  • the expensive training cost structure, where \(s = (10, 10, 1, 1)\), which we call \(\textrm{Cost}_e\). This assumes that reviewing a document in the first phase is 10 times more expensive than in the second phase. According to Yang et al. (2021b, §4.2) this is fairly common in systematic reviews in empirical medicine;

  • a MINECORE-like cost structure, where \(s = (1, 1, 5, 5)\), which we call \(\textrm{Cost}_m\). This cost structure reflects MINECORE cost structure 2 (Oard et al. 2018, Table 5), where we assume that reviewing in the second stage (in MINECORE case, it was reviewing by privilege) is 5 times as expensive as in the first stage.

6 Results

We run each of the experiments defined in the previous section 20 times, using a different randomly generated seed set S each time (the same random seed sets are used across all compared methods); we set \(\alpha = 0.3\) (Sect. 4.1), as this was the best-performing value in a hyperparameter search conducted on the Jeb Bush dataset. The reported results for a dataset are thus the evaluation measures averaged over all runs and all classes of that dataset. We also define three equal-sized bins sorted by class prevalence (Very Low, Low, and Medium) to show how the different methods perform with different levels of imbalance between relevant and non-relevant documents. The average prevalence for the classes in the three bins is 0.002, 0.012, and 0.084 for RCV1, and 0.005, 0.027, and 0.117 for CLEF.

Table 2 MSE and RE results on RCV1. For each target recall, bold indicates the best result, underline indicates the second best
Table 3 MSE and RE results on CLEF. For each target recall, bold indicates the best result, underline indicates the second best

Tables 2 and 3 report the MSE and RE values, and Tables 4 and 5 report the IC values. The most competitive methods are SAL\(_\tau \), SAL\(_\tau ^R\), Budget, QBCB, and IPP. On the MSE and RE metrics, our SAL\(_\tau \) and SAL\(_\tau ^R\) stopping rules perform on par with, or better than, the other state-of-the-art techniques. In more detail, for the lower target recalls (\(R=0.8\) and \(R=0.9\)) the above-mentioned methods show comparable performance on both the RCV1-v2 and CLEF datasets; when \(R = 0.95\), the Budget method performs slightly better instead.

Table 4 IC measure on RCV1. For each target recall, bold indicates the best result, underline indicates the second best
Table 5 IC measure on CLEF. For each target recall, bold indicates the best result, underline indicates the second best
Fig. 4 Box plots of actual recall reached by the methods, given a target recall: 0.8 (top), 0.9 (center), 0.95 (bottom). Left column is RCV1, right column is CLEF

With respect to the IC measure, SAL\(_\tau \) shows the lowest costs on average (and for the Very Low prevalence bin) for all proposed cost structures, closely followed by SAL\(_\tau ^R\) and IPP. Notice that the IPP method, despite achieving good cost results, was not as strong in the earlier comparison on the MSE and RE metrics. On the other hand, the QBCB method, which achieves state-of-the-art performance on MSE and RE, incurs very high costs due to the annotation of the pre-review random sample it requires. Furthermore, the most substantial cost reduction shown by SAL\(_\tau \) and SAL\(_\tau ^R\) with respect to the compared methods is in the Very Low prevalence bin. This is an important result, as the “needle in a haystack” scenario is the most common in TAR. The Budget method has comparable costs for high target recall values and in the higher prevalence bins.

Overall, we argue that our SAL\(_\tau ^R\) method seems to strike the best trade-off between proper recall targeting (i.e., actually stopping the review at the given recall target) and low costs.

Indeed, as we anticipated, SAL\(_\tau \) tends to stop the TAR process too early and, as a result, cannot be used consistently in all scenarios despite achieving lower costs. This can be seen in the box plots in Fig. 4, which show the real recall values at which the different stopping rules halted the review process. The average recall value for SAL\(_\tau \) is always higher than the target recall, i.e., the expected value of the recall satisfies the target recall requirement. Yet, it is evident that for higher recall targets the distribution of recall values produced by SAL\(_\tau \) falls below the target not only in the tail of the distribution, as happens for Budget and Knee, but also in a portion of its center. Most TAR tasks require matching the target recall in a much larger portion of cases. By simply adding a margin that is a function of the target recall, SAL\(_\tau ^R\) shifts the distribution of the reached recall values up. Its distribution is similar to that of the Budget method, with slightly higher costs than SAL\(_\tau \) but still lower than the other tested methods.

Finally, notice that the IPP method, despite achieving good costs in some cases (and good MSE/RE values in some other), cannot stop the review process at the right moment, often greatly overshooting the recall estimate.

To summarize, SAL\(_\tau \) enables the use of SLD in AL and TAR processes, solving the issues observed in Esuli et al. (2022) and Molinari et al. (2023). SAL\(_\tau \) brings consistent and substantial improvements, especially for medium/high (0.8, 0.9) target recalls; for higher targets, it tends to stop too early. SAL\(_\tau ^R\) solves this issue without a significant increase in annotation costs with respect to SAL\(_\tau \), finding a very good trade-off between annotation costs and proper target recall matching. In future works we propose to investigate more informed methods to mitigate SAL\(_\tau \) target recall overestimation, as well as testing SAL\(_\tau \) and SAL\(_\tau ^R\) with other sampling techniques (e.g., Li and Kanoulas 2020).

7 Conclusions

In this paper, we introduced a new method called SAL\(_\tau \) (and its variants SAL\(_\tau ^R\) and SAL\(_\tau \)CI) to improve posterior probabilities and prevalence estimates in an active learning review process; more specifically, we tested our method as a “when-to-stop” rule for TAR tasks. SAL\(_\tau \) is a variant of the well-known SLD algorithm (Saerens et al. 2002): our algorithm was designed to enable the use of SLD in AL and TAR processes, solving the issues observed in Esuli et al. (2022) and Molinari et al. (2023). Experiments have shown that SAL\(_\tau \) still tends to slightly overestimate the true recall as it gets close to 1, stopping the process too early. SAL\(_\tau ^R\) solves this issue without significantly increasing the review cost compared to SAL\(_\tau \). In our experiments, SAL\(_\tau ^R\) consistently outperformed state-of-the-art methods: by improving the prevalence estimates, it stops the TAR process much earlier than the other methods while still achieving the target recall.

For future work, we propose to investigate more informed methods to mitigate SAL\(_\tau \)'s overestimation of the target recall, as well as to test SAL\(_\tau \) and SAL\(_\tau ^R\) with other sampling techniques (e.g., Li and Kanoulas 2020).