A review and experimental analysis of active learning over crowdsourced data

Training data creation is increasingly a key bottleneck for developing machine learning, especially for deep learning systems. Active learning provides a cost-effective means for creating training data by selecting the most informative instances for labeling. Labels in real applications are often collected from crowdsourcing, which engages online crowds for data labeling at scale. Despite the importance of using crowdsourced data in the active learning process, an analysis of how the existing active learning approaches behave over crowdsourced data is currently missing. This paper aims to fill this gap by reviewing the existing active learning approaches and then testing a set of benchmarking ones on crowdsourced datasets. We provide a comprehensive and systematic survey of the recent research on active learning in the hybrid human–machine classification setting, where crowd workers contribute labels (often noisy) to either directly classify data instances or to train machine learning models. We identify three categories of state of the art active learning methods according to whether and how predefined queries employed for data sampling, namely fixed-strategy approaches, dynamic-strategy approaches, and strategy-free approaches. We then conduct an empirical study on their cost-effectiveness, showing that the performance of the existing active learning approaches is affected by many factors in hybrid classification contexts, such as the noise level of data, label fusion technique used, and the specific characteristics of the task. Finally, we discuss challenges and identify potential directions to design active learning strategies for hybrid classification problems.


Introduction
Despite remarkable advances in machine learning (ML), training data remains a key bottleneck for the successful application of ML techniques. Obtaining a large amount of highquality training data is usually a long, laborious, and costly process. Active learning (AL) provides an effective means to accelerate the process, by iterating data labeling and model training, and identifying at each iteration which data to label next, to converge more rapidly and effectively to an accurate model. Crowdsourcing is often used in conjunction with ML, both as a way to collect labeled data efficiently and as a way to assist trained models for predictions where the model confidence is not deemed sufficient (Callaghan et al. 2018;Krivosheev et al. 2021). Despite their joint usage, the interaction between AL and crowdsourcing has been largely unexplored. This interaction is non-trivial for many reasons: for example, crowdsourcing typically produces rather noisy labels and the impact of such noise on ML algorithm confidence estimation and calibration is still unclear. Furthermore, at every AL iteration, we are faced with several choices, from how to aggregate crowd votes on a label to whether we should ask the crowd to label new data items or verify (reduce the noise on) already labeled items-and these choices may impact AL performance.
This paper reviews existing AL approaches and investigates their performance in the hybrid human-machine classification setting, where crowd workers contribute labels (often noisy) to either directly classify data instances or to train an ML model for classification. Unlike existing surveys (Settles 2010;Aggarwal et al. 2014) that focus on the algorithmic design of AL and the review paper about integrating AL with deep learning (Budd et al. 2019), here we aim at re-evaluating existing AL approaches in terms of cost-effectiveness when data labels are crowdsourced, and so given the constraints of a limited crowdsourcing budget and noisiness of worker-contributed labels.
To this end, we first review existing AL approaches under three categories, based on their reliance on a strategy, that is, specific queries to get the sample of interest from the data. These categories are: (i) fixed-strategy approaches, that apply a specific item selection method regardless of the data or problem, (ii) dynamic-strategy approaches that have a portfolio of strategies and choose one each time they need to sample a batch of items for labeling, based on past performance on that specific problem and data, (iii) strategy-free approaches that do not have any apriori selected portfolio of strategies, but rather learn the best strategy from scratch based on the problem, data, and prior experiences. We also review proposals in the literature that discuss how AL can deal with noisy labels. We then report the results of an extensive experimental comparison evaluating the performance of the different AL approaches in human-machine classification.
We evaluate the performance of AL approaches under two different scenarios: (1) ML only: this is the "traditional" approach of training a model with AL and then testing its performance on a pool of items. (2) Hybrid, where crowd and ML interact also in the classification phase, not just in data labeling. Indeed, many problems we face have a finite pool, where the set of items to classify is finite, and there is therefore a trade-off between spending our budget or effort to train an ML model (using AL methods) versus spending that budget to directly classify items in the pool via the crowd, or using a combination of crowd and ML.
To run this comparison, we developed a library of AL approaches collecting implementations provided by the authors when available, and re-implementing them when we could not find existing code. As part of this process, we also created a collection of crowdsourced datasets containing micro-level information (i.e., individual crowd votes), by aggregating the publicly available ones and adding the ones we collected (note that most available crowdsourced datasets do not make individual votes available). Both the software library 1 and the collection of benchmarking datasets 2 are made freely available to the scientific community. We believe that both the implementations and the crowdsourced micro-data will be an important contribution in their own right given the difficulty we had in obtaining both, despite extensive searches.
An important lesson we learned from the experimentation is that prior conclusions on the performance of AL approaches obtained in non-crowd labeling settings cannot be blindly extended to crowdsourced data. Specifically, strategy-free approaches that have shown to be effective in many contexts do not achieve the best performances across crowdsourced datasets. We speculate that this can be due to the impact of noise in ML model calibration and uncertainty estimation. We also observed that hybrid classification improves the performance of AL approaches over crowdsourced datasets.
In summary, we make the following contributions: • We identify three categories of AL approaches in the literature and analyze their characteristics and effectiveness. • We contribute a library of implementations of state-of-the-art AL algorithms and a collection of benchmarking datasets for human-machine classification. • We report the results of an extensive experimental evaluation, providing insights on the performance of existing AL strategies in hybrid human-machine classification contexts. • We provide a critical discussion on the main insights that emerged from our analysis, highlighting relevant open challenges and potential future directions to address them.
2 Active learning strategies: a review AL (Cohn et al. 1996) has been a very lively research field over the last decade. Given a set of items I and an ML algorithm M, AL aims at defining a strategy to progressively sample items from I on which to obtain true labels for, so that M can be trained with a smaller dataset with respect to random item sampling. The underlying assumption is that obtaining training data is costly, and therefore minimizing the size of such dataset for a given target accuracy is highly beneficial (Johnson et al. 2018). In the following, we review existing AL approaches by grouping them in terms of the type of strategy used to choose the sample of interest.

Fixed-strategy approaches
Pioneering fixed-strategy approaches have been proposed in the 1990s. Seung et al. (1992) proposed the query by committee (QBC) approach, which polls a committee of different classifiers trained on the current set of labeled items to predict the label of each unlabeled item. Then, items to label are selected based on the maximum degree of disagreement among the classifiers. They showed that the prediction error decreases exponentially fast in the number of queries. The approach and experimentation are however limited to parametric learning models with continuously varying weights and cases where learning is perfectly realizable, and the learning algorithm is Gibbs algorithm (Haussler et al. 1991).
In Freund et al. (1997), they proved that QBC such an exponential decrease is guaranteed for a general class of learning problems. They used two machine classifiers as the "committee", and determined general bounds on both the number of queries and the number of instances to be labeled. Specifically, the paper defines higher and lower bounds for the expected information gain of QBC and proves that if the queries have high expected information gain then the prediction error is guaranteed to decrease rapidly with the number of queries. McCallum and Nigam (1998) follow a similar "committee-based" approach but apply expectation-maximization (EM) to determine class probabilities and the extent to which classifiers disagree, and weigh item selection by their density, defined as the distance from a document to the others. With this approach, they reduced the required number of labeled documents by 42% over the previous QBC approaches. Lewis and Gale (1994) presented the idea of uncertainty sampling, where the intuition is to sample items on which a model M is more uncertain, and this approach has long been a de-facto standard in the AL literature. They showed that aiming at reducing the uncertainty of M significantly decreases the number of items that must be labeled to achieve the target accuracy. Cohn et al. (1994) proposed selective sampling, based on the idea of identifying uncertain regions in a vector space used to represent items, and then on selecting items (points in space) from such uncertain regions to minimize them. In the case of support vector machines, uncertainty sampling is implemented by selecting instances that are closest to the decision boundary (Tsai et al. 2010). Roy and McCallum (2001) introduced the error-reduction sampling approach, aiming at selecting items that will reduce the expected error of the active learner in the next test examples. They computed the expected error rate of an item either by using the entropy of the posterior class distribution (log-loss), or by using the posterior probability of the most likely class (0-1 loss). Mozafari et al. (2014) focused on the same objective and proposed the MinExpError approach that uses the theory of non-parametric bootstrap (Efron and Tibshirani 1993) to design generic and scalable sampling strategies. First, bootstraps are created and assigned to different classifiers; then, the expected error of these classifiers for every single item in the unlabeled data is computed; finally, the items that minimize the expected error are selected. The authors showed that the MinExpError algorithm requires significantly fewer labeled items than existing approaches back then.
Probabilistic Active Learning (PAL) (Deroski et al. 2014) combined the idea of uncertainty sampling and the expected error reduction with smoothness assumption (Chapelle et al. 2010). The underlying assumption is that if two items are close in the feature space, then their labels should also be close. An item is represented by two attributes; (i) the total number of labeled instances in the neighborhood of the item, and (ii) posterior estimate for the total number of the positive labeled neighbor set. This approach uses probabilistic estimates to investigate the neighborhood statistics of an item (label statistics), and measures the overall gain in classification performance (probabilistic gain) in terms of a userdefined point classification performance (Parker 2011). It then selects items that improve the expected probabilistic gain most within their neighborhood. Its time complexity is comparable to uncertainty sampling, and it provides fast and stable performance.
Saar-Tsechansky and Provost (2004) proposed the Bootstrap-LV approach, which detects the variance in the probability estimates of bootstrap samples and uses weighted sampling to find the most informative items. Another weighted-sampling approach is known as Importance-Weighted Active Learning (IWAL) (Beygelzimer et al. 2009) which applies an adaptive rejection sampling to each instance and assigns an importance weight (the inverse probability of being retained) to each retained item. Beygelzimer et al. (2010a) improved IWAL by using a rejection threshold based on the importance-weighted error estimates that minimize the prediction error. They showed that this approach improves the label complexity which reflects the intrinsic difficulty of the learning problem (Wang 2011;Yan et al. 2019).
Additional strategies include clustering instances and selecting cluster representatives as the most informative items (Brew et al. 2010), or combining representativeness and informativeness of instances to minimize the maximum possible classification loss (i.e. Query Informative and Representative Examples (QUIRE) (Huang et al. 2010)).
Although many of these approaches explicitly target generalization performance improvement, by means of expected error reduction (Roy and McCallum 2001;Zhu et al. 2003;Guo and Greiner 2007;Roy and McCallum 2001;Mozafari et al. 2014), or variance reduction (Schein and Ungar 2007;Settles and Craven 2008;Hoi et al. 2006), fixedstrategy approaches are unlikely to work on all scenarios (Baram et al. 2004;Hsu and Lin 2015). The reason is that they rely on intuitions and heuristics that do not generalize to all datasets and ML problems. For example, even in our experiments, discussed next, uncertainty sampling performs well when false positives and false negatives have the same "cost", but less so when errors, and specifically errors of a specific type are more costly than others. This reveals that fixed-strategy approaches are not able to adapt to the data and the problem at hand.

Dynamic-strategy approaches
Approaches to dynamic strategy selection are in essence based on progressively learning which AL approach works best for the data at hand. This kind of "learning to learn" approach was first proposed by Baram et al. (2004), who showed that one single strategy cannot perform well on all problems. Their approach, named COMB, combines a group of AL strategies and dynamically evaluates them to achieve the best possible performance on the problem at hand. Although the online selection of strategies expedites the AL process, combining multiple strategies and evaluating their performance brings two challenges (Baram et al. 2004). First, as dynamic strategies choose the next action by estimating the performance of each AL approach given the current state and the past observations, the quality of such estimation becomes crucial. However, items selected by the active learner are biased to be the "hard" ones and do not reflect the exact distribution of the items, and as such the quality estimation in absence of a test dataset (which is rarely available in AL) is biased. Second, at each batch only the label of the instances proposed by the selected AL approach are available; there is no way to know the consequences of labeling other instances proposed by other strategies. To handle these challenges, COMB (Baram et al. 2004) is designed as an adversarial multi-armed bandit problem (MAB) (Auer et al. 1995(Auer et al. , 2003Audibert and Bubeck 2009) combined with the EXP4 algorithm (Auer et al. 2003), where AL strategies are considered as "experts" and unlabeled items are the "slot machines". Thus, the selection of an item is based on the opinion of all experts. Hsu and Lin (2015) proposed the Active Learning by Learning (ALBL) algorithm as an extension of COMB. It represents each bandit machine as an AL approach and uses the EXP4.P algorithm (Beygelzimer et al. 2010b) to select a machine adaptively. The main differences between COMB and ALBL are as follows: (i) while COMB represents each machine as a single unlabeled instance, a machine corresponds to an AL approach in ALBL, (ii) both COMB and ALBL adopt the EXP4 algorithm, but COMB restricts each machine to being pulled only once, while a machine can be pulled many times in ALBL, and (iii) COMB uses human-designed evaluation criteria based on entropy, while ALBL uses an unbiased estimator of the test accuracy (weighted accuracy) to decide the rewards of the single strategies. Hsu and Lin (2015) showed that ALBL gives either comparable or better results than COMB. In general, the above papers show that ALBL works better than fixed-strategy approaches when the problem is easier to learn, while it is comparable for harder problems (where strategy selection is also more challenging).
While ALBL probabilistically blends the items suggested by different AL strategies to select the most informative one, Chu and Lin (2016) proposed blending the strategies themselves to build an aggregated strategy. Their approach, called LSA (Linear Strategy Aggregation), combines LinUCB (linear upper-confidence-bound) (Li et al. 2010), a stateof-the-art MAB approach, with the task at hand. They represent experience as the weights with which to aggregate strategies, and adaptively adjust these weights when tackling a new problem. They aim to transfer this experience learned from the model to other AL tasks through biased regularization. They proved that the transfer of the learned experience is beneficial to achieve better performance.
Although dynamic-strategy approaches use bandits to ensemble multiple strategies in the learning process (Baram et al. 2004;Hsu and Lin 2015;Chu and Lin 2016), they still assume that there is a single best combination in each batch (stationary bandits). In so doing they are not robust to non-stationary cases, where the weighting proportions must be adapted over time in the learning process. Recently, strategy-free approaches have been proposed to overcome these limitations.

Strategy-free approaches
Strategy-free AL processes use prior experience (meta-data) to learn new tasks (active meta-learning). The main difference between dynamic and strategy-free approaches is that the latter does not rely on any human-designed strategies.
In this category, Konyushkova et al. (2017) devised a novel data-driven AL algorithm, named Learning Active Learning (LAL). LAL is formulated as a regression problem that learns how to predict the reduction in the expected generalization error when we add a new label to the training set. It uses Monte-Carlo sampling to correlate the test performance directly with the classifier and item properties. Both the classifier and the items are represented with a set of parameters so that LAL can sense any change in the training set. As a result, the learning state is continuously tracked as a vector whose elements depend on the state of the current classifier and the selected item. A drawback of this algorithm is being classifier-specific, which is designed as a random forest regressor. Woodward and Finn (2017) and Fang et al. (2017) use reinforcement learning (RL) for learning an active learner in a data-driven approach. They adopted a stream-based AL process in which the agent observes the data in sequence and decides whether a single item should be labeled by the agent itself or it should be asked to an oracle. Based on the decision, the agent receives a reward and a prediction model is adopted to be used in new tasks. They improved the performance of models, but they tend to learn only from related datasets and domains (Konyushkova et al. 2018. Many other data-driven approaches for pool-based AL processes have been proposed recently. While Bachman et al. (2017) and Pang et al. (2018b) used RL to build the learning model, Liu et al. (2018) formulated learning AL strategies as an imitation learning problem (i.e., the machine is trained to perform a task from demonstrations by learning a mapping between observations and actions), Contardo et al. (2017) and Ravi and Larochelle (2018) applied few-shot learning (i.e. classifying a new data having seen only a few training examples), and Pang et al. (2018a) extended the LSA approach using non-stationary multi-armed bandit with expert advice.
Active meta-learning approaches have also been developed in application-specific scenarios. Sun-Hosoya et al. (2018) developed (ActivMetaL), an active meta-learning recommender system. ActivMetaL keeps the scores of multiple AL approaches on given tasks in a sparsely populated collaborative matrix, predicts the performance of each approach for a new task, and then fills the corresponding row of the matrix for this task. In this way, they predict which algorithm will perform best for the new task/dataset.
While capable of generalizing across learning task, these meta-learning approaches still have many limitations, such as being classifier specific (Bachman et al. 2017;Contardo et al. 2017;Ravi and Larochelle 2018), having a greedy approach and missing a long-term reward (Liu et al. 2018) or being limited to specific domains (i.e. imitation learning, or few-shot learning) (Bachman et al. 2017;Fang et al. 2017;Liu et al. 2018;Sun-Hosoya et al. 2018).
To overcome these limitations, (Konyushkova et al. 2018) proposed an RL variant of their LAL algorithm, named LAL-RL, that defines AL as a Markov Decision Process and tries to find the optimal and general-purpose strategy. LAL-RL is independent of the dataset and ML classifier (contrarily to LAL that is designed for Random Forests), and its objective does not depend on a specific performance measure. The authors show how LAL-RL can transfer learned strategies across substantially different datasets.
Recently, Desreumaux and Lemaire (2020) tested the performance of LAL-RL on 20 real-world datasets and compared it to random sampling and uncertainty sampling. Although LAL-RL shows very good performance on average, it is not always better than random sampling, especially in the case of highly unbalanced datasets (Desreumaux and Lemaire 2020). In addition, their results show that the choice of the model is decisive (i.e. random forest classifier gives better results than logistic regression). They also report that LAL-RL is sensitive to the metric used to evaluate the performance, and it requires optimizing many hyper-parameters. This analysis shows that even general-purpose strategyfree approaches have limitations when dealing with real-world problems.
Although the main objective of these meta-learning approaches is to adapt the prediction/learning model to new environments/tasks, they do not consider the characteristics of the target environment in the prediction model. Hence, they are likely to be effective in similar environments only. Vu et al. (2019) proposed a new approach that learns a good policy directly based on the target environment either by using a pre-trained AL model or learning a new policy from scratch considering the budget for human annotation. They showed that this approach is more effective than the previous work (Fang et al. 2017;Liu et al. 2018) when the source task and the target task are different. Rudovic et al. (2019) proposed a deep Q-learning approach that is capable of dealing with multiple environments by learning a multi-modal AL strategy. They focus on the task of engagement estimation from real-world child-robot interactions during autism therapy. They employ an LSTM network to classify the individual modalities into engagement levels (i.e. low, medium, or high) and feed its predictions into the deep RL agent, making it capable of efficiently personalizing the interaction strategy to the target user.
In summary, the state of the art shows that the existing active meta-learning approaches can outperform the fixed-strategy and strategy-free approaches, especially when the dataset is not highly unbalanced. However, most of them do not present comprehensive benchmarks to prove the transferability of the learned policies into realworld tasks (i.e. when we do not have a separate test set) (Desreumaux and Lemaire 2020). While showing a noticeable improvement in terms of generalization ability with respect to previous approaches, meta-learning approaches still cannot cope with all challenges that are to be faced in real-world scenarios.
Specifically, nearly any real-world application we encountered has two aspects that haven't received much attention so far in AL research: the first is that the amount of noise in the labels is much higher with respect to standard settings where labels are provided by domain experts, and in crowdsourcing there are several trade-offs we can make to reduce the noise and to balance budget vs noise trade-offs (for example, we can collect more votes for the same items and aggregate them to get a more reliable label, or we can pay more budget to ask for highly rated workers, or we can instead focus on getting a larger number of items labeled at low cost, although with higher noise). The second is that the cost of false positive and false negatives (and more generally of different types of errors) is rarely the same and that low calibration error is often a key quality of a good ML model. There are however a few contributions on dealing with noisy labels and we discuss them next.

Dealing with noisy labels
While crowdsourced labels can be made to be very precise via redundancy and aggressive worker selection/crowd testing strategies, the individual votes are often noisy. There indeed exists a line of work in the AL community that explicitly focuses on dealing with noisy labels. Zhao et al. (2011) proposed combining uncertainty and inconsistency (entropy of the label distribution) measures to select instances. The underlying idea is that the learning strategy should select items that are in unexplored regions or near the ones that may have been mislabeled. They also propose relabeling the mislabeled items via crowdsourcing. They show that when the labels are noisy and the aggregated label is not trustful (i.e. the aggregated label is provided by less than 50% of the workers, who annotated the corresponding item, in binary classification setting), then relabeling significantly improves the performance of AL. Bouguelia et al. (2016) proposed a method for identifying and mitigating mislabeling errors, where they derive an informativeness measure to see how much a queried label would be useful if it was corrected. They show that this approach is more efficient in characterizing label noise compared to the commonly used entropy measure. Then, Bouguelia et al. (2018) extended this approach by measuring how much the queried item's label is likely to be wrong, based on disagreement with the current classification model, without relying on crowdsourcing.
A related line of work aims at addressing label noise by explicitly modeling the uncertainty of annotators. Yan et al. (2010) introduce a model that jointly learns a classifier and infers annotators' reliability, in an active learning setting that involves both data instance and annotators selection. Fang et al. (2012) consider the case when annotators can learn from one another to improve their annotation reliability. Zhong et al. (2015) model a scenario where workers can explicitly express their annotation confidence, by allowing them to choose an unsure option. Yan et al. (2016) further consider labeler properties such as consistency. Yang et al. (2018) extend the problem to enabling deep active learning from crowds, i.e., enabling deep neural networks to actively learn from crowd workers. However, none of them tried to actively estimate the noise level of data during the learning process.

Experimental work
As we have seen, the near totality of existing AL approaches (i) assume that oracles provide the gold (ground truth) labels (ii) strive for accuracy as a metric, and (iii) assume classification is done by ML only. In reality, the situation and needs are different, especially when we have access to a large set of human annotators of different reliability. First, labels can be noisy; second, the trade-off between the benefit of a correct classification and cost of an error varies greatly by application, and achieving a low calibration error may be as important as achieving high accuracy; finally, in many scenarios, we do have the option of relying on human classification when ML is uncertain, and in those contexts it is important that ML learns to know when it doesn't know.
In this section, we examine the behavior of AL approaches in crowdsourcing settings. Specifically, we focus on problems where we start from a blank slate, have a pool of items to classify and a crowd at our disposal, and need not only to choose/assess AL approaches but also to assess if the crowd is leveraged only to get labeled data for training or also to perform classification at inference time, as done in hybrid classification contexts (Krivosheev et al. 2018a;Callaghan et al. 2018).

Problem formulation
We focus on hybrid binary classification problems where classification is accomplished via the combined contribution of humans and machines. The problem can be formulated as follows. We are given a tuple of (I, M, Q, B), where: • I is a pool of unlabeled items to be classified. • M is an untrained machine learning classifier. • Q is an active learning query strategy. • B is the budget available (expressed in terms of the total number of crowd votes we can ask).
Our aim is to classify all items I via the ML classifier M or/and crowd workers, assuming we do not have any training data to start with. To achieve this, we apply AL strategy Q for querying the most informative items T ⊂ I that will be annotated by crowd workers on money B. The annotated items T will be used as training data for machine M and finally, I will be classified based on M and the feedback collected from the crowd. Our hybrid AL workflow is described in detail in Algorithm 1.

A new approach: Block Certainty
Most of the AL approaches do not consider cost-sensitive learning scenarios, in which different types of errors can be associated with different costs. This is the typical scenario in medical screening tests for instance, where we try to uncover the presence of a disease. A false negative (FN) error (or missed alarm) is in this case extremely critical and much more costly than a false positive (FP) one (false alarm). The same is true in many enterprise contexts such as the ones some of the authors face daily, where companies are ok if a customer request is not understood and has to be routed to an agent, but not ok if ML predicts the wrong intent and gives the wrong answer to the customer. An effective ML classifier for this scenario should be trained using a cost-sensitive loss, that potentially trades precision for recall. We thus propose a simple cost-sensitive AL approach, that we name "Block Certainty", which intrinsically considers the relative harm of FN over FP errors during the AL process.
Let k indicate the relative harm of FN over FP errors, i.e., a single FN error is k times more costly than a FP one. Let p be the probability of an item being positive according to the machine M. Let d be the decision threshold for M. We first investigate how to set d so as to minimize the risk of the classifier M given k. Assuming p > d (item classified as positive), we define the risk score for this item as risk = 1 − p . Similarly, the probability of an item being negative is 1 − p . Assuming p ≤ d (item classified as negative), the risk score for this item will be risk = k ⋅ p . Figure 1 shows the relation between p and the risk score . The optimal classification threshold d is given by the value of p for which the lines risk = k ⋅ p and risk = 1 − p intersect, i.e., d = 1∕(1 + k).  Then, in Algorithm 2 we define the "Block Certainty" sampling, where at every AL iteration we query predicted positive and negative items with the lowest risk score (according to k) for further annotation.

AL approaches and ML classifier
We examine the following seven AL approaches: (i) random (R) (items are randomly sampled), (ii) uncertainty (UC) (Lewis and Gale 1994), (iii) certainty (C) (the reverse of uncertainty sampling, i.e., the most certain items are sampled), (iv) block certainty (BC) (our proposal for cost-sensitive sampling), (v) QUIRE (Q) (Huang et al. 2010), (vi) MinExpError (MinExp) (Mozafari et al. 2014), and (vii) meta-learning (LAL) (Konyushkova et al. 2017). We used R, UC, C, Q, and MinExp as the state of the art fixed-strategies that have been used in most of the comparative analyses in the literature. Since Konyushkova et al. (2017) already proved that LAL approach outperforms the state of the art dynamic strategy approach ALBL (Hsu and Lin 2015), we tested only LAL to have an intuition about adaptive approaches.
Since the LAL (Konyushkova et al. 2017) approach is specifically designed to work with the random forest classifier, we chose random forests as the underlying classifier for all AL approaches. We used the implementation provided by the scikit-learn library (Pedregosa et al. 2011) with the following parameters: n_estimators = 100, criterion = "gini", max_ depth = None, bootstrap = True, class_weight = "balanced", and random_state = 2020. To evaluate the effect of the choice of the classifier on the performance, we additionally evaluated a subset of the AL approaches (excluding LAL) using a support vector machine as the underlying classifier.

Crowdsourcing scenarios
We consider the following two crowdsourcing scenarios in our experiments: • Unlimited votes Every single item can be selected an unlimited number of times for a crowdsourced vote, as long as we have an available budget and voters. • Limited votes There is a maximum number of votes maxVote that every single item can receive ( maxVote = 3 in the experiments). Upon reaching this limit, the corresponding item is removed from I and cannot be queried anymore. In principle, this limit could be automatically inferred via meta-learning approaches, but this out of the scope of this paper.

Evaluation scenarios
We evaluate the results under two different cases: • ML (only): We use the trained M to predict the label of each item in the pool I and evaluate its performance compared to the ground truth labels. • ML+C: We take crowdsourced labels for items LI and use M to predict the label for the unlabelled items in UI. We evaluate the performance of this combined crowdmachine classifier by comparing its predictions with the ground truth labels.

Label fusion methods
Since we use crowd answers instead of gold labels in the learning phase, it is important to consider that they may yield low-quality or noisy labels (Zheng et al. 2017). In the crowdsourcing literature, this problem is addressed by assigning each task to multiple crowd workers and then aggregating the votes (answers) to obtain the correct label. This process is called truth inference and relies on an aggregation strategy to combine labels. Several label fusion strategies has been investigated in the literature (Aydin et al. 2014;Demartini et al. 2012;Callison-Burch 2009;Fan et al. 2015;Li et al. 2014;Liu et al. 2012;Ma et al. 2015). Because of its simplicity and effectiveness, the most popular label fusion method is Majority Voting (MV) (Tu et al. 2019), which selects the label voted by the majority of the workers as the correct answer (Franklin et al. 2011;Parameswaran et al. 2012;Marcus et al. 2011). We thus use majority voting as the label fusion strategy in our experimental evaluation. A known limitation of majority voting is the fact that it assumes that all workers provide the same quality of answers. To measure whether the results depend on this simplifying assumption, we also ran an additional experimental investigation using a more refined label fusion method (Sect. 3.7).

Metrics
We use F1, F3, Accuracy, and Loss metrics to evaluate the performance of the machine classifier (ML) and of the hybrid crowd-machine classifier (ML+C). We define the Loss of classification as the following (Nguyen et al. 2015;Krivosheev et al. 2018b): where k denotes how much the cost of a false negative outweighs the one of a false positive, FNN is the number of false negatives, FPN is the number of false positives, and |I| is the number of items in I. The Loss summarizes the subjective perspective of the risks for False Positive/Negative errors. This is especially common in real-world applications, such as potential credit card fraud, identifying tweets linked to criminal activities, or literature reviews where screening out a relevant paper is considered to be a serious error affecting the quality of the review, while a falsely included paper just requires some extra work by the authors. (1)

Datasets
Crowdsourced datasets can either include the answer (label) of each crowd worker to each item or just provide the aggregated labels (a single discrete label for each item, determined by combining votes from multiple crowd workers). The latter type of datasets is less informative and doesn't allow to test strategies that involve queries to individual workers. However, most of the available crowdsourced datasets are of this type. We thus created an open repository 2 of the available crowdsourced datasets with individual crowd votes; we also added the datasets we collected. We provide a standard format for accessing the datasets so that they can be used in the experiments without any workload in preprocessing. Researchers can benefit from this repository for hybrid human-machine classification and ranking tasks, truth discovery based on crowdsourced data, estimation of the crowd bias, and active learning. Table 1 shows the properties of each dataset we used in our experiments; the number of tasks, number of workers, total number of votes, minimum vote count per item, data proportion in terms of the number of positives and negatives, batch count (a batch is a predefined number of instances, where the batch count is the number of batches that defines the total number of iterations), size of a batch, and the number of experiments (repetitions) for each dataset. We show the distribution of labels for each dataset in Fig. 2.
The task in the Recognizing Textual Entailment (RTE) dataset (Snow et al. 2008) is to identify whether a given hypothesis sentence is implied by the information in the given text 3 .
The Emotion 3 dataset (Snow et al. 2008) is about rating the emotion ("anger") of a given text. Each rating is a value between 0 and 100, and we converted them to binary form (0 if rating ≤ 49, else 1).
The Amazon Sentiment-1 4 dataset (Krivosheev et al. 2018a) includes annotations about deciding whether the given product review belongs to a book or not. Similarly, the Amazon Sentiment-2 4 dataset (Krivosheev et al. 2018a) includes annotations about whether the given product review has a negative or positive sentiment. 1 3 The Crisis-1 5 dataset (Imran et al. 2013) consists of human-labeled tweets collected during the 2012 Hurricane Sandy and the 2011 Joplin tornado. The task is to decide whether the author of the tweet seems to be an eyewitness of the event. Similarly, Crisis-2 5 dataset  (Imran et al. 2013) contains annotations about deciding the type of the message (tweet). The task in Crisis-3 5 dataset (Imran et al. 2013) is to analyze hurricane-related tweets and decide whether the tweet is informative or not.
Finally, the Exergame dataset includes annotations about whether the given paper describes a study that uses an exergame. An exergame is a form of interactive gaming where people do physical activities while playing a video game, that is, physical exercises by way of video games.

Results
The elaborated experiment results, all datasets, and the source code for reproducing the experiments are available online 1 . In addition, we present the visual experiment results in a notebook 6 , while summarizing the important outcomes below. Tables 2 and 3 show F1 scores of each AL approach in Scenario 1 and Scenario 2, respectively (bold cells show the best performing AL strategies). We observed that ML+C prediction outperforms the ML prediction on each dataset, regardless of the AL approach. When we compare F1 and F3 scores, the best AL approach remains the same. That is why we present F1 scores here, while F3 scores can be seen in the results sheet. 7 Comparing the Loss with different k values, we noticed that (i) when the harm of FP and FN is the same ( k = 1 ) uncertainty sampling performs 35% better in average than other approaches, (ii) when the problem is characterized by high k value ( k ≥ 10 ) then block certainty sampling provides 25% better performance in average on five datasets, while uncertainty sampling provides 36% better performance in average across all datasets, and (iii) block certainty sampling approach outperforms others on five datasets with small k values ( k ≤ 0.1 ) with an improvement of 34.5% with respect to the average performance. We only present the results for k = 100 here (see Table 4), while keeping others in the notebook (see footnote 6). When we analyze the performance of the approaches individually, we draw the following conclusions:  (Huang et al. 2010) approaches did not show a promising performance over crowdsourced data in terms of accuracy, F1, and F3 scores. 3. Random sampling performed comparable or sometimes better (i.e. in the Crisis-1 dataset) over imbalanced data with more negatives. 4. Although it is claimed that LAL (Konyushkova et al. 2017) approach outperforms uncertainty sampling (Lewis and Gale 1994), results show that uncertainty sampling outperforms others in most cases for the finite-pool hybrid classification over crowdsourced data. 5. Table 5 shows that uncertainty sampling and random sampling created the biggest number of training sets (note that we may relabel the training data and we may end up with a different number of votes per item; this affects the size of training data after the learning process ends). 6. Limited votes scenario increased the size of training sets for all cases. This shows that sampling strategies may be stuck at some items if we do not limit the maximum number of votes per item.

Further analysis
The previous experimental analysis was run using random forests as the underlying classifier and majority voting as the label fusion strategy. In this section, we investigate whether changing the classifier or fusion strategy affects the overall picture. Since the LAL (Konyushkova et al. 2017) approach is specifically designed to be used with random forests, we omit it from the following analysis. We also omit minExpError (Mozafari et al. 2014) as its complexity is very high and it did not show to be competitive with less expensive approaches. Hence, we focused on the following AL approaches: (i) random (R), (ii) uncertainty (UC) Lewis and Gale (1994), (iii) certainty (C), (iv) block certainty (BC), and (v) QUIRE (Q) (Huang et al. 2010).

3
We repeated all experiments using an SVM classifier, utilizing an implementation provided by the scikit-learn library (Pedregosa et al. 2011) with the following parameters: class_weight ='balanced' and C=0.1. As an alternative to majority voting, we used the Dawid-Skene (DS) (Dawid and Skene 1979) strategy, a popular label fusion method that models each worker as a confusion matrix and uses an expectation-maximization approach to decide the correct label.
Results 8 confirmed that the combination of humans and the machine (ML + C) outperforms the machine-only case (ML). For this reason, in the following, we discuss the results of the ML + C case only. Table 6 shows which (classifier, AL approach) pair performs best on each dataset in terms of Accuracy, F1, and Loss (K=100) metrics. We evaluate results in two . Results show that this pair may change even on the same dataset with respect to the metric used. For example, when we look at the Crisis-1 dataset random sampling with SVM classifier and DS label fusion method performs best in terms of accuracy while uncertainty sampling with RF classifier is the best in terms of F1 score. Summing up, these results suggest that the performance that AL strategies exhibit in standard settings cannot be directly transferred to hybrid classification problems. Many factors that may affect the behavior of an AL approach, such as the machine classifier being used, the amount of labeling noise in the data, the characteristics of the problem, and the label fusion method. For example, when we look at the F1 scores in Tables 2 and 3, we see that uncertainty sampling is not the best method for Emotion, Amazon-2 (in Scenario 1), and Crisis-1 datasets. These three datasets are the most unbalanced datasets we used. In addition, the noise level in Emotion and Crisis-1 datasets are very high (please see Fig. 2). These observations show that uncertainty sampling performs best when the dataset is balanced and the noise level of the data is low.

Conclusions and open issues
In this paper, we first reviewed the existing AL approaches under three categories: (i) fixedstrategy approaches, (ii) dynamic-strategy approaches, and (iii) strategy-free approaches. We then investigated the performance of a set of representative approaches for different strategies in the hybrid human-machine classification setting. The aim was to discover if and how the performance of the existing approaches can be transferred into the hybrid classification context.
Experimental results showed that as expected, hybrid human-machine classification always improves over purely machine-based classification. When comparing different AL approaches, however, no clear winner emerges. Even a state-of-the-art meta-learning approach like LAL fails to show consistent improvements over the alternatives when evaluated across different datasets. We observed that if we have a finite pool classification problem with noisy crowd labels (i.e. RTE, Emotion, Crisis-1, and Crisis-2 datasets), then we can simply start by picking the RF classifier, UC approach, and DS label fusion method to achieve an acceptable performance (see Table 6). In addition, if the cost of FN and FP errors are very different for the problem at hand, then we can consider using a BC approach instead of UC (see Table 4, where BC improved the performance on RTE, Emotion, and Exergame datasets with a big k value).
Hybrid crowd-machine classification is promising but needs more investigation. Most of the existing AL approaches assume that enough high-quality labeled data exist. However, gathering high-quality labeled data is challenging and labels, especially if crowdsourced, can be noisy. For this reason, we analyzed what happens if we use noisy labels instead of gold data in finite-pool hybrid classification problems. We have shown that in the presence of noisy data the conclusions from the state of the art need to be revisited and cannot be taken at face value when data is crowdsourced. This points to the need for further work in the community on what is the role of noise in the success of specific AL algorithms and how much noise they can tolerate before the assumptions and intuitions on which they are based do not hold any longer. The community also needs to analyze how different types of noise (i.e. random noise or biased crowdsourced answers) in data affect the performance of AL, and how the effect of such noise can be smoothed. Indeed, label smoothing can be a promising direction to pursue as it helps to improve the calibration of ML models, which is central for many AL algorithms.
In addition, our analysis highlights several open issues and challenges that we believe can shape future research on this topic, and specifically: (i) what are the trade-offs between having a smaller but accurately labeled dataset vs a larger but more noisy one? (ii) Would a dynamic assessment of crowd accuracy help strategy-free approaches? and, (iii) How do we operate at the start of an AL process when we have no idea of the crowd accuracy?
Another direction is that of balancing the AL and human vs machine contribution in hybrid crowd-ML classification problems where the pool of items to classify is finite: here the challenge is deciding when to stop spending our budget for collecting samples to optimize our AL process and obtain a stronger ML model, and spend the budget for classifying the remaining items in the pool, by applying the current ML model and resorting to humans for items on which the ML model is unsure about.
In summary, this paper has pointed to several interesting research paths that we need to undertake if we want to address problems that companies face when developing their ML solutions and that are needed to make AL a mainstream part of ML pipelines.
Funding Open access funding provided by Università degli Studi di Trento within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.