Data Mining and Knowledge Discovery

Volume 33, Issue 6, pp 1625–1673

More for less: adaptive labeling payments in online labor markets

  • Tomer Geva
  • Maytal Saar-Tsechansky
  • Harel Lustiger
Open Access


Abstract

In many predictive tasks where human intelligence is needed to label training instances, online crowdsourcing markets have emerged as promising platforms for large-scale, cost-effective labeling. However, these platforms also introduce significant challenges that must be addressed in order for these opportunities to materialize. In particular, it has been shown that different trade-offs between payment offered to labelers and the quality of labeling arise at different times, possibly as a result of different market conditions and even the nature of the tasks themselves. Because the underlying mechanism giving rise to different trade-offs is not well understood, for any given labeling task and at any given time, it is not known which labeling payments to offer in the market so as to produce accurate models cost-effectively. Importantly, because in these markets the acquired labels are not always correct, determining the expected effect of labels acquired at any given payment on the improvement in model performance is particularly challenging. Effective and robust methods for dealing with these challenges are essential to enable a growing reliance on these promising and increasingly popular labor markets for large-scale labeling. In this paper, we first present this new problem of Adaptive Labeling Payment (ALP): how to learn and sequentially adapt the payment offered to crowd labelers before they undertake a labeling task, so as to produce a given predictive performance cost-effectively. We then develop an ALP approach and discuss the key challenges it aims to address so as to yield consistently good performance. We evaluate our approach extensively over a wide variety of market conditions. Our results demonstrate that the ALP method we propose yields significant cost savings and robust performance across different settings. As such, our ALP approach can be used as a benchmark for future mechanisms to determine cost-effective selection of labeling payments.


Keywords: Machine learning · Supervised learning · Label acquisition · Crowdsourcing · Online labor markets · Adaptive labeling payments

1 Introduction

Predictive modeling has radically impacted a wide variety of industries, becoming integral to the operations and competitive strategies of firms and giving rise to entirely new business platforms. Supervised learning of predictive models requires labeled training data—namely, instances for which the dependent variable value (label) is known. However, in many important applications, costly human intelligence is also needed to determine the labels for training data. Such labeling tasks include, for example, image classification tasks (e.g., whether an image contains a car, for applications such as surveillance, autonomous driving, and scene understanding), and text classification problems (e.g., whether a text contains hate remarks, or is humorous or sarcastic). Many similar labeling tasks are relatively intuitive and simple for humans to perform, but may require a large number of training examples for supervised learning to yield good performance. For such tasks, online crowdsourcing marketplaces, such as Amazon Mechanical Turk (AMT), have emerged as promising platforms for large-scale labeling, offering unprecedented scalability and immediacy. Markets such as AMT offer substantial savings and agility by allowing employers (“requesters”) to offer simple micro-tasks to hundreds of thousands of online workers simultaneously, thereby allowing labels to be acquired cheaply and relatively quickly. In these markets, requesters typically post the task description and payment offered to all workers who meet certain criteria, and workers can review this information when choosing which tasks to undertake.

A fundamental challenge for the cost-effective use of such markets, however, is that for any given task and at any given time, it is not known what level of labeling quality can be acquired on the market for any given payment per label. Prior studies have found that different prevailing trade-offs arise between payment and quality for different tasks and at different times (e.g., Kazai 2011; Kazai et al. 2013; Mason and Watts 2010). Yet, the mechanisms giving rise to these different trade-offs are not well understood, and can be affected by a host of properties such as the labeling task itself and the market conditions—such as the competing tasks and the composition and availability of workers on the market at any given time.

Because the cumulative costs of routine use of crowdsourcing labor markets for labeling can be substantial, a robust, data-driven methodology for identifying advantageous payments for labeling that yield a desired predictive performance in a cost-effective manner is essential.

In this paper, we refer to this problem as Adaptive Labeling Payment (ALP). Specifically, given a budget for label acquisition, ALP refers to data-driven, sequential learning and adaptation of payments for labeling. At each step \( i \), a payment \( p^{i} \) per label is offered to crowd workers before they undertake the labeling task, and labels for \( b \) training instances are acquired on the market at that payment. The objective in selecting payments \( p^{1} ,p^{2} , \ldots ,p^{I} \) is that the ultimate predictive model induced from the set \( L \) of labeled instances, acquired over \( I \) steps, yields the best performance without exceeding the acquisition budget (see the complete formal problem formulation in Sect. 3.1).

The market we consider here is an online crowdsourcing labor market, where a labeling task for a given payment is offered to all workers, and where no prior knowledge on an individual worker’s performance for the particular task may be available (e.g., when a new worker is hired). In addition, we consider markets such as AMT, where the population of available workers and the competitive settings (e.g., alternative tasks and workers) may also vary over time.

Note that at the outset no data is available. Thus, payments are adapted sequentially, after each step, based on the data acquired thus far. Furthermore, because the cost-effectiveness of a given payment for labeling can change over time due to changes in the population of workers and competing tasks, the payment offered per label to workers can also be revised to adapt to these changes.

The ALP problem is related to active learning (Lewis and Gale 1994; Abe and Mamitsuka 1998; Kong and Saar-Tsechansky 2014), and to prior work on online labor markets (e.g., Raykar et al. 2010; Wang et al. 2017). However, as we discuss in detail below, neither stream of prior work aimed to adaptively identify advantageous payments per label to improve model performance cost-effectively. The most closely related prior work considered theoretical settings in which individual labelers produce a predetermined level of quality and require a predetermined price that is known to the requester (Yang and Carbonell 2012); however, such settings do not hold in many real-world crowdsourcing labor markets (e.g., Amazon Mechanical Turk) that we consider here, where new workers with no prior history are frequently encountered, workers’ labeling quality for different payments is unknown a priori and can change over time, and where tasks are offered to all workers on the market (or to all workers who meet certain criteria).

The ALP problem presents several challenges, which we discuss in the remainder of the paper. Perhaps the most fundamental challenge pertains to the objective of identifying the labeling payments that cost-effectively improve predictive performance. The choice of payment per label that will yield the most cost-effective improvement in model performance is ultimately affected by a host of factors that impact model performance and its cost. These factors include the quality of the labels that can be acquired at different costs, the predictive task, the inductive modeling algorithm, and the labels purchased thus far as also reflected by the current position on the learning curve.1

Identifying advantageous payments per label does not correspond merely to learning the effect of payment per label on the quality of the labels, so as to infer what payment(s) can produce a given labeling quality. This is because, as has been noted in prior work, acquiring labels of a given quality and cost can yield different benefits to model learning under different settings (Lin and Weld 2014); consequently, to assess the benefits of acquiring labels at a given quality and cost, one must consider the impact of these prospective acquisitions on model performance. To identify advantageous payments at any given time, in this work we propose to estimate the effect of any given payment per label directly on the predictive performance of the model. This allows an ALP approach to pursue different payments and labeling qualities under different settings to improve the model’s performance cost-effectively. Specifically, as we will see later on, for some domains and market settings, such a data-driven approach with a focus on model performance allows an ALP approach to either acquire fewer and higher-quality labels, or pursue a larger number of cheaper labels, if it estimates such acquisitions will yield comparable (or better) performance for a given cost.

In this paper, we develop an ALP algorithm that aims to select the payment per label that is estimated to yield the most cost-effective improvement in model performance. We then evaluate our approach’s performance and robustness relative to alternatives under different settings. Our results demonstrate that the ALP method we propose often yields substantial cost savings compared to the existing benchmark, and that its performance is robust over a wide variety of settings. In addition, we find that when the underlying trade-off between payment and labeling quality changes over time, our ALP method effectively adapts and continues to maintain robust performance. Overall, our results show that the proposed ALP method constitutes both an effective and robust approach for adapting payments to labelers in dynamic crowdsourcing markets.

The contributions of this work are as follows. Our study is the first to introduce the problem of Adaptive Labeling Payment for online crowdsourcing labor markets. We propose a data-driven ALP method that both learns the benefits of alternative payments and continuously adapts the payment per label, with the goal of improving the model’s predictive performance cost-effectively. Importantly, our ALP approach aims to directly estimate the expected benefits to predictive performance from labels acquired at different costs. To do this, our algorithm introduces a novel approach for estimating the expected change in model performance from future acquisitions of labels at different payments; thus, our approach uses previously acquired, noisy, labeled data, and it does not rely on the additional acquisition of costly “gold standard” data. The ALP method we develop here is also generic, and can be applied to cost-effectively acquire labeled training data for a given task, induction algorithm, population of workers, or set of market conditions. Furthermore, because our approach aims to identify advantageous payments for label acquisition, it can also be applied in conjunction with methods that address complementary problems, such as repeated labeling methods that aggregate multiple labels per instance to improve the labeling quality, but which do not address what payment per label to offer labelers; in the empirical evaluations, we demonstrate how such methods can be effectively combined. Finally, we conduct an extensive set of experiments to offer insights into our approach’s performance and choices of acquisitions under different settings.

The remainder of the paper is organized as follows. We review related research in the prior work section. We then discuss the desired properties of an ALP method, followed by the development of our proposed ALP approach. In the empirical evaluation section, we discuss in detail the setting and procedures we use to evaluate our ALP method’s performance compared to alternatives in different settings. We then report our results, followed by conclusions and a discussion of the implications of our work and directions for future research.

2 Prior work

Prior work has not considered how to determine and continuously adapt the payments offered to labelers before they undertake a task, so as to improve the model’s predictive performance cost-effectively. Existing work on online labor markets has discussed a variety of mechanisms to improve work quality, such as ways to screen workers and improve task design, and methods for acquiring multiple labels for the same instance so as to increase the probability of obtaining a correct label (e.g., Kazai 2011; Downs et al. 2010; Lee et al. 2013; Paolacci et al. 2010).

One stream of research focuses on repeated acquisition of multiple labels for the same instance, assuming that the payment per label is fixed and pre-determined (e.g., Ipeirotis et al. 2014; Dai et al. 2013; Lin et al. 2012; Zhang et al. 2016). Such repeated acquisitions can be used, for example, to infer the most likely label from multiple noisy ones. In contrast to the study we present here, these works did not address the problem of determining the payment per label to offer workers. Within this stream of research, some studies suggest methods to infer the likely label so as to learn better models using data instances that undergo repeated labeling (Dalvi et al. 2013; Kumar and Lease 2011; Raykar et al. 2010; Rodrigues et al. 2013; Zhang et al. 2015; Zhou et al. 2012). Some repeated labeling methods have also been applied together with active learning methods, in order to reduce the number of instances for which multiple labels are acquired for a pre-determined payment (Ipeirotis et al. 2014; Karger et al. 2011, 2014; Sheng et al. 2008; Wauthier and Jordan 2011; Lin et al. 2016). However, none of these methods identify which payment per label is advantageous to offer labelers; rather, they assume that the payment, and the resulting quality of each label, is predetermined and remains fixed.

Our work differs from this stream of research in several important ways. As noted earlier, the key difference is that prior work did not aim to identify advantageous payment per label to be offered to crowd workers so as to cost-effectively yield a given predictive performance. Some prior work proposed means to assess work quality retrospectively, after workers have performed the task (e.g., Raykar et al. 2010; Wang et al. 2017). In this work, we aim to identify advantageous payments offered to workers per label before they decide whether or not to take on the task. We also do not assume the availability of individual worker-performance history. Therefore, the method we propose is suitable for recommending payments in popular crowdsourcing labor-market settings such as Amazon Mechanical Turk, where employers continuously encounter new workers, and where tasks and the corresponding payment are offered to all workers who meet certain criteria (e.g., workers are from a certain country). We also consider settings where the work quality that can be obtained for a given payment may not remain the same indefinitely, but rather that market conditions, and consequently the labeling quality that can be obtained for different payment levels, may vary over time. To remain cost effective over time, our approach continuously adapts the payment so as to yield cost-effective improvements in model performance. Finally, in contrast to prior work that aimed to improve label quality (e.g., Wang et al. 2017), we aim to cost-effectively improve the predictive performance of the model induced from the acquired labels. Indeed, as we discuss in more detail below, the same quality of labels may have different implications for model performance across modeling tasks, induction algorithms, and across different points along the learning curve. Hence, our approach proposes to assess directly and in a data-driven manner the cost-effectiveness of different label acquisitions toward the model’s performance. Consequently, in some settings our approach may recommend buying fewer labels of higher quality and in other settings it may recommend buying more labels of the same or lower quality.

The ALP problem we consider here is also related to active learning (e.g., Lewis and Gale 1994; Abe and Mamitsuka 1998; Saar-Tsechansky and Provost 2004; Kong and Saar-Tsechansky 2014). Active learning methods typically assume that labels for unlabeled instances can be acquired at some fixed, pre-determined cost, and that the acquired labels are also correct; consequently, active learning methods aim to identify the training instances for which to acquire labels, so as to produce the best model performance for any given number of acquisitions. Therefore, while ALP aims to determine how much to offer workers per label to produce cost-effective models, active learning addresses a complementary problem, where the payment per label is assumed to be predetermined and fixed, and thus the goal is to identify the instances for which to acquire labels. Importantly for the challenge we address here, most active learning methods also assume that the acquired labels are correct; thus, they do not typically consider the challenge of inferring the benefit of alternative acquisitions when the labels acquired are noisy.

Our study is perhaps most closely related to work by Yang and Carbonell (2012), but they make assumptions that do not hold in real online labor markets. Specifically, Yang and Carbonell (2012) do not consider the settings we study here, in which new crowd workers with no prior history are encountered continuously, workers’ labeling quality for different payments is unknown a priori and can change over time, and tasks are offered to all workers on the market (or to all workers who meet certain criteria); rather, they assume theoretical settings in which labelers’ label quality for a given price is predetermined and known to the requester.

In what follows, we discuss the desired properties of an Adaptive Labeling Payment approach in our setting, followed by our proposed ALP approach from which we derive a specific ALP method.

3 The adaptive labeling payment (ALP) approach

In this section, we outline the ALP problem and develop our approach to address it.

3.1 ALP problem formalization

We consider online crowdsourcing labor markets, where a labeling task is offered to all workers for a given payment per label, where no prior knowledge on an individual worker’s performance for the particular task may be available, and where the population of workers available at any given time as well as the competitive settings (e.g., the number and properties of other tasks available on the market) may vary over time. Given the market above, a set of unlabeled instances UL for a given classification task, a budget for label acquisition, a model inducer of choice \( M \), and a model performance measure of choice, \( Performance \) (e.g., AUC), the ALP problem refers to sequential decisions, where at each step, \( i \), a batch of \( b \) labeled instances are acquired and added to the set \( L \) of labeled training instances. At each step \( i \), one has to decide what payment level \( p^{i} \in {\text{C}} \) from a set of payment options \( C = \{ c_{1} ,c_{2} , \ldots ,c_{po} \} \) to offer workers for labeling a single training instance, before they undertake the labeling task. The goal of selecting payments \( p^{1} ,p^{2} , \ldots p^{I} \) is that the model \( M\left( L \right) \), induced from the set of labeled instances \( L \), yields the best model \( Performance \) for a given budget.2 Specifically, our goal is to select payments \( p^{1} ,p^{2} , \ldots p^{I} \) such that: \( \mathop {\text{arg max}}\limits_{{\left\{ {p^{1} ,p^{2} , \ldots ,p^{I} } \right\}}} Performance\left( {M\left( L \right)} \right) \), subject to \( \mathop \sum \limits_{i = 1}^{I} p^{i} \cdot b \le budget \).
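To make the formulation concrete, the sequential decision process can be sketched as a simple budget-constrained loop. This is only an illustration of the problem's shape, not the paper's implementation; `select_payment` (the payment-selection policy) and `acquire_labels` (a stand-in for the market) are hypothetical callables supplied by the caller.

```python
def alp_acquisition_loop(payment_options, b, budget, select_payment, acquire_labels):
    """Sketch of the ALP sequential decision process (Sect. 3.1).

    At each step i, a payment p_i is chosen from the option set C, and a
    batch of b labels is bought at that payment, until the budget is spent.
    """
    L = []            # labeled training set, empty at the outset
    spent = 0.0       # total labeling cost incurred so far
    payments = []     # the chosen payments p^1, p^2, ..., p^I
    while spent + b * min(payment_options) <= budget:
        p_i = select_payment(L, payment_options)   # data-driven choice
        if spent + b * p_i > budget:               # cannot afford this batch
            break
        L.extend(acquire_labels(p_i, b))           # buy b labels on the market
        spent += b * p_i
        payments.append(p_i)
    return L, payments, spent
```

Note that the budget constraint \( \sum_{i} p^{i} \cdot b \le budget \) is enforced before each acquisition, mirroring the constraint in the formalization above.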

A key property of the ALP problem is that it is data-driven: an ALP method should learn from the set of instances labeled thus far and continuously adapt its payment choices accordingly. In light of prior research findings that different trade-offs between labeling payment and quality arise in different settings, one cannot assume any given trade-off between payment per label and labeling quality for an arbitrary task or set of market conditions. Therefore, a data-driven approach should identify cost-effective payments without making a priori assumptions about the prevailing trade-off. Furthermore, because market conditions, and consequently the relationship between pay and quality per label, may vary over time, to remain cost-effective an ALP approach should also allow the payment per label to be adapted to changing market conditions.

It is useful to note that while a model’s generalization performance can be estimated by empirically measuring its performance on a correctly labeled test set, the labels available in our setting are acquired via online crowdsourcing markets, and are thus inherently noisy. An alternative is to acquire “gold standard” data; however, such data are costly and may become obsolete when the underlying concept being learned changes over time. Therefore, in this paper, we aim to develop an ALP method that assesses generalization performance directly from noisy labels acquired via online crowdsourcing markets, without the acquisition of additional data for model evaluation.

Finally, because ALP methods aim to determine advantageous labeling payments that improve model performance cost-effectively, they can be applied alongside methods addressing complementary problems, such as the acquisition of multiple labels for the same instance, as we demonstrate in our empirical evaluations, or with active learning for selecting advantageous instances to label.

3.2 Adaptive labeling payment algorithm

In this section, we develop an ALP algorithm to address the ALP problem and to adaptively select the payment at which labels are acquired. Recall that the ALP problem objective is to identify advantageous payments per label so as to yield the best model performance for a given budget. Because the benefits to learning from different payments per label are unknown at the outset and because such benefits may also change over time, our approach employs a sequential, myopic heuristic. Specifically, at each acquisition phase, a set of labels is procured on the market, and our approach aims to identify the payment per label to offer labelers that is likely to yield the largest marginal improvement in performance per unit cost.

The approach we propose is iterative, where at each phase \( i = 1,2, \ldots ,I \), the labels of \( b \) instances are acquired on the market. At each phase, the labels for \( b \) instances are acquired at a payment level \( p^{i} \) per label, where \( p^{i} \) is selected from a fixed and predetermined set of payment level options C = \( \left\{ {{\text{c}}_{1} , \ldots ,{\text{c}}_{\text{po}} } \right\} \), such that it is estimated to yield the largest marginal improvement in the model’s generalization performance per unit cost. The acquisition of labels proceeds sequentially, until either the budget is exhausted or the model reaches a desired level of performance. Table 1 lists the key notations we use throughout the paper.
Table 1

Key notations

\( C = \{ c_{k} \}, \quad k = 1, \ldots ,po \): Set of alternative payment levels per label
\( n \): Number of labeled instances acquired
\( p^{i} \): Selected payment per label at phase \( i \)
\( Tc \): Total cost incurred for labeling payments
\( S_{n} \): Set of labeled instances acquired so far
\( b \): Number of instances labeled at each acquisition phase
\( B_{c_{k}}^{j} \subset S_{n} \): j-th random draw of \( b \) instances from \( S_{n} \), labeled previously at payment \( c_{k} \)
\( m \): Number of subsets \( B_{c_{k}}^{j} \)
Number of recent acquisitions used to determine payment
\( M\left( \cdot \right) \): Model induced from training set \( \left( \cdot \right) \)
\( budget \): Labeling acquisition budget
Number of batches of labeled instances required for initialization

3.2.1 Initialization

To compile an initial dataset from which learning proceeds, our ALP algorithm is first initialized by acquiring labels at pay rates drawn uniformly from the set of payment levels \( \left\{ {{\text{c}}_{1} , \ldots ,{\text{c}}_{\text{po}} } \right\} \), so as to yield a representative set of instances labeled at different payment levels. As described below, following the initialization phase, subsequent acquisitions are made iteratively, such that at each acquisition phase the next batch of \( b \) labels is acquired at the payment level estimated to yield the best improvement in performance per unit cost.
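A minimal sketch of this initialization step, assuming uniform random draws over the payment options (the text does not prescribe a particular sampling routine, so the function name and seeding here are illustrative):

```python
import random

def initialization_payments(payment_options, num_init_batches, seed=0):
    """Assign each initialization batch a payment level drawn uniformly
    from C = {c_1, ..., c_po}, so that the initial labeled set contains
    instances labeled at the different payment levels."""
    rng = random.Random(seed)
    return [rng.choice(payment_options) for _ in range(num_init_batches)]
```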

3.2.2 Estimating model performance improvement (EPI)

Because our aim is to improve the model’s predictive accuracy in a cost-effective manner, our ALP approach first aims to directly estimate the expected impact on model performance of a batch of labels acquired at each of the possible payments \( \left\{ {{\text{c}}_{1} , \ldots ,{\text{c}}_{\text{po}} } \right\} \) per label. Specifically, ALP aims to estimate the expected model performance if the next batch of labels is acquired at each alternative payment level \( \left\{ {{\text{c}}_{1} , \ldots ,{\text{c}}_{\text{po}} } \right\} \). The next batch of labels is then acquired at the payment expected to yield the greatest improvement per unit cost.

To estimate the expected effect on model performance of acquiring labeled instances at a given payment per label, our approach omits instances labeled previously at the corresponding payment. Specifically, our approach estimates the expected effect on model performance of acquiring the next batch of \( b \) labels at a payment \( {\text{c}}_{\text{k}} \in \left\{ {{\text{c}}_{1} , \ldots ,{\text{c}}_{\text{po}} } \right\} \), at the current point along the learning curve, by the change in model performance resulting from removing from the current training data a set of \( b \) instances previously labeled at the corresponding payment \( {\text{c}}_{\text{k}} \). The motivation underlying this omission-based approach is that if labeled instances acquired at payment \( {\text{c}}_{\text{k}} \) are more beneficial for induction, their omission from the training data will result in a greater drop in model performance at the current point along the learning curve.

Formally, let \( {\text{s}}_{\text{n}} \) denote the set of all labeled training instances acquired so far (until phase \( i \)), let \( {\text{s}}_{\text{n}} \backslash {\text{B}}_{{{\text{c}}_{\text{k}} }} \) denote the set of training instances after removing from \( {\text{s}}_{\text{n}} \) a subset \( {\text{B}}_{{{\text{c}}_{\text{k}} }} \) of b instances previously labeled at payment option \( {\text{c}}_{\text{k}} \), and \( {\text{M}}\left( \cdot \right) \) be a model induced from training set \( \left( \cdot \right) \) via the induction technique M. ALP approximates the expected change in model performance from acquiring the next batch of b labeled instances at payment option \( {\text{c}}_{\text{k}} \) by the Expected Performance Improvement (\( {\text{EPI}}_{{{\text{c}}_{\text{k}} }} \)): the difference between the estimated performance of the current model, \( {\text{M}}({\text{s}}_{\text{n}} ) \), and the estimated performance of a model, \( {\text{M}}\left( {{\text{s}}_{\text{n}} \backslash {\text{B}}_{{{\text{c}}_{\text{k}} }}^{{}} } \right) \), induced after omitting the set \( {\text{B}}_{{{\text{c}}_{\text{k}} }} \) of b instances previously labeled at payment option \( {\text{c}}_{\text{k}} \). Formally:
$$ EPI_{c_{k}} = Performance\left( M\left( s_{n} \right) \right) - Performance\left( M\left( s_{n} \backslash B_{c_{k}} \right) \right) \quad (1) $$

In Eq. 1, \( {\text{Performance}}\left( {{\text{M}}\left( \cdot \right)} \right) \) corresponds to any relevant measure of model performance, such as Area Under the ROC Curve exhibited by model \( {\text{M}}\left( \cdot \right) \). The more advantageous a batch of labels acquired at payment option \( {\text{c}}_{\text{k}} \) is for subsequent learning, the greater the drop in performance between model \( {\text{M}}\left( {{\text{s}}_{\text{n}} } \right) \), induced from all labeled data acquired so far, and model \( {\text{M}}\left( {{\text{s}}_{\text{n}} \backslash {\text{B}}_{{{\text{c}}_{\text{K}} }} } \right) \), induced after removing the subset \( {\text{B}}_{{{\text{c}}_{\text{k}} }} \).3
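Eq. 1 can be read as a simple omission experiment. The sketch below assumes caller-supplied `induce` (the inducer M) and `performance` (the chosen measure, e.g., AUC) functions; it illustrates the shape of the estimator, not the paper's implementation:

```python
def expected_performance_improvement(s_n, batch_ck, induce, performance):
    """EPI_ck (Eq. 1): the drop in performance when a batch B_ck of
    instances previously labeled at payment c_k is omitted from the
    training set s_n. A larger drop suggests labels bought at c_k are
    more beneficial at the current point on the learning curve."""
    reduced = [x for x in s_n if x not in batch_ck]   # s_n \ B_ck
    return performance(induce(s_n)) - performance(induce(reduced))
```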

Figure 1 illustrates our proposed approximation of the expected impact on performance of new instances labeled at payment options \( {\text{c}}_{1} \) and \( {\text{c}}_{2} \). The learning curve shows the model’s performance, measured here by the Area Under the ROC Curve (AUC), obtained at different acquisition phases. Point A corresponds to the AUC obtained by a model induced from all the data acquired so far, \( Performance\left( M\left( s_{n} \right) \right) \), and points 1 and 2 reflect \( Performance\left( M\left( s_{n} \backslash B_{c_{1}} \right) \right) \) and \( Performance\left( M\left( s_{n} \backslash B_{c_{2}} \right) \right) \)—namely, the AUC of the models induced after excluding a batch of labels acquired at payments \( {\text{c}}_{1} \) and \( {\text{c}}_{2} \), respectively. As shown, omitting a set of labels acquired at payment option \( {\text{c}}_{1} \) results in a larger drop in performance. Using this approach, ALP assesses the effect on model performance of acquiring new labeled instances at the different payment options \( {\text{c}}_{\text{k}} \in C \).
Fig. 1

Illustration of ALP’s approximation of the Expected Impact on performance from acquiring labels at payment \( {\text{c}}_{1} \) and \( {\text{c}}_{2} \)

3.2.3 Selecting payment level

As mentioned above, our ALP method seeks to select the payment that yields the best improvement in model performance per unit cost. The particular measure we propose to capture the cost-effectiveness of a labeling payment, called “Maximum Total Ratio” (MTR), corresponds to the ratio between the model’s (estimated) performance after acquiring \( b \) labels at payment \( {\text{c}}_{\text{k}} \), and the total labeling cost incurred after the next batch of \( b \) labels is acquired. Formally, at each acquisition phase and for each payment option \( {\text{c}}_{\text{k}} \), \( MTR_{c_{k}} \) corresponds to the cost-effectiveness of acquiring a batch of \( b \) instances labeled at payment \( {\text{c}}_{\text{k}} \), and is given by:
$$ MTR_{c_{k}} = \frac{Performance\left( M\left( s_{n} \right) \right) + EPI_{c_{k}}}{Tc + \left( b \cdot c_{k} \right)}, $$
where \( Tc \) denotes the total labeling payments incurred thus far. At each phase \( i \), our approach selects the payment option \( {\text{c}}_{\text{k}} \) that yields the maximum \( MTR_{c_{k}} \)—namely, the selected payment is given by \( p^{i} = \mathop{\arg\max}\limits_{c_{k}} \left\{ MTR_{c_{k}} \right\} \). Henceforth, we refer to our approach as alp-mtr.
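Given EPI estimates for each payment option (e.g., from the omission procedure above), the alp-mtr payment choice reduces to an argmax over the options. A minimal sketch, with all argument names illustrative:

```python
def select_payment_mtr(payment_options, b, total_cost, current_perf, epi_by_payment):
    """Pick the payment c_k maximizing
    MTR_ck = (Performance(M(s_n)) + EPI_ck) / (Tc + b * c_k),
    where total_cost is Tc and epi_by_payment maps c_k to its EPI estimate."""
    def mtr(c_k):
        return (current_perf + epi_by_payment[c_k]) / (total_cost + b * c_k)
    return max(payment_options, key=mtr)
```

Note how the denominator grows with the payment: a pricier option is chosen only if its estimated performance gain outweighs its added cost.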

In what follows, we discuss the remaining elements of our alp-mtr approach, address how it adapts to changing market conditions, and describe how we improve the estimation of model performance, \( Performance(M(s_{n} )) \), in the presence of noisy labels.

3.3 Estimating EPI with noisy data

A key challenge in estimating a model’s performance, and, consequently, in estimating \( {\text{EPI}}_{{c_{k} }} \) (Eq. 1) as well, is the presence of noisy labels. Recall from our earlier discussion that we aim to use the noisy labels acquired via crowdsourcing, without relying on the availability of “gold standard” labels. Relying on labels acquired from crowd workers adversely affects the accuracy of the model performance estimation. In particular, errors in the labels contribute to a higher estimation variance compared to when the estimation relies on correctly labeled training instances. To improve this estimation, our approach incorporates several elements that aim to reduce the variance in the model performance estimation.

Recall that our approach approximates the expected change in performance from acquiring b labels at a cost \( c_{k} \) by estimating \( {\text{Performance}}\left( {{\text{M}}\left( {s_{n} \backslash B_{{c_{k} }} } \right)} \right) \); namely, by estimating the performance of a model induced after omitting from our training data b instances labeled at payment \( c_{k} \). To reduce the variance in estimating \( {\text{Performance}}\left( {{\text{M}}\left( {s_{n} \backslash B_{{c_{k} }} } \right)} \right) \), we repeat the estimation multiple times, each time excluding a random subset of b instances previously acquired at payment \( c_{k} \) and estimating the model’s performance after this omission. \( {\text{Performance}}\left( {{\text{M}}\left( {s_{n} \backslash B_{{c_{k} }} } \right)} \right) \) is then estimated by averaging the models’ performances measured over the multiple repetitions. Specifically, we randomly draw with replacement \( m \) subsets, \( B_{{c_{k} }}^{j} , j \in 1 \ldots m \), of b instances previously labeled at payment \( c_{k} \). At each repetition \( j \), a different subset \( B_{{c_{k} }}^{j} \) is removed from \( s_{n} \), and a measure of performance of the model \( {\text{M}}\left( {s_{n} \backslash B_{{c_{k} }}^{j} } \right) \), induced from the reduced training set \( s_{n} \backslash B_{{c_{k} }}^{j} \), is estimated via cross-validation.
For now, we assume that the Area Under the ROC Curve, denoted \( {\text{AUC}}\left( {{\text{M}}\left( \cdot \right)} \right) \), is the relevant performance measure for a model M induced from training set \( \left( \cdot \right) \); however, any other performance measure of interest can be used. The expected model’s performance after omitting a batch of b instances labeled at payment level \( {\text{c}}_{\text{k}} \) is estimated as the average over the \( m \) repeated experiments above:
$$ {\text{Performance}}\left( {{\text{M}}\left( {s_{n} \backslash B_{{c_{k} }} } \right)} \right) = \frac{{\mathop \sum \nolimits_{j = 1}^{m} {\text{AUC}}\left( {{\text{M}}\left( {s_{n} \backslash B_{{c_{k} }}^{j} } \right)} \right)}}{m}. $$

Finally, to further reduce the error in the estimation of both \( {\text{AUC}}\left( {{\text{M}}\left( {s_{n} } \right)} \right) \) and \( {\text{AUC}}\left( {{\text{M}}\left( {s_{n} \backslash B_{{c_{k} }}^{j} } \right)} \right) \), we perform repeated applications (with different random seeds) of cross-validation and estimate the model’s performance as the average over these applications. As noted earlier, cross-validation to evaluate \( {\text{AUC}}\left( {{\text{M}}\left( {s_{n} } \right)} \right) \) and \( {\text{AUC}}\left( {{\text{M}}\left( {s_{n} \backslash B_{{c_{k} }}^{j} } \right)} \right) \) is done using previously labeled training sets (\( s_{n} \) and \( s_{n} \backslash B_{{c_{k} }}^{j} \), respectively).
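The repeated-omission estimate can be sketched as follows, assuming scikit-learn for model induction and cross-validation; the function name, the synthetic dataset, and the parameter values (m, b, the fold count) are illustrative choices rather than the exact experimental configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def performance_after_omission(X, y, idx_ck, b, m, rng, cv=8):
    """Estimate Performance(M(s_n \\ B_ck)): average cross-validated AUC of
    models induced after omitting, in each of m repetitions, a random subset
    B_ck^j of b instances previously labeled at payment c_k."""
    aucs = []
    for _ in range(m):
        drop = rng.choice(idx_ck, size=b, replace=False)  # one subset B_ck^j
        keep = np.setdiff1d(np.arange(len(y)), drop)
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        aucs.append(cross_val_score(model, X[keep], y[keep],
                                    cv=cv, scoring="roc_auc").mean())
    return float(np.mean(aucs))

# Illustrative usage on synthetic data.
X, y = make_classification(n_samples=200, random_state=0)
idx_ck = np.arange(50)  # indices of instances labeled at payment c_k (assumed)
est = performance_after_omission(X, y, idx_ck, b=10, m=3,
                                 rng=np.random.default_rng(0), cv=4)
```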

3.4 Continuous adaptation of payments

Recall that changes in market settings over time, such as changes in the population of workers or in competitive conditions, may give rise to different trade-offs between payment per label and labeling quality. To adapt labeling payments to the prevailing trade-off, ALP estimates the EPI (Eq. 1) at pay \( c_{k} \) based on a subset of recently labeled instances. More specifically, recall that our ALP approach estimates the EPI from labels acquired at different payment levels \( c_{k} \) by estimating the average change in performance between a model induced from the complete set of instances, \( {\text{M}}\left( {s_{n} } \right) \), and a model, \( {\text{M}}\left( {s_{n} \backslash B_{{c_{k} }}^{j} } \right) \), induced after omitting b instances labeled for a payment \( c_{k} \). To facilitate adaptation, rather than draw subsets \( B_{{c_{k} }}^{j} \) from all prior acquisitions at pay \( c_{k} \), each \( B_{{c_{k} }}^{j} \) is drawn from a set \( D_{{c_{k} }} \) of the most recent \( h \) instances acquired at pay \( c_{k} \) (\( B_{{c_{k} }}^{j} \subset D_{{c_{k} }} \)), and we subsequently evaluate their average impact on performance. In the empirical evaluations that follow, we evaluate the benefits of this feature in settings where the trade-off between payment and quality per label changes over time, as well as when this trade-off remains the same.
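The sliding history can be implemented with a bounded queue per payment level. The sketch below is our own illustration of the bookkeeping (the class name and interface are assumptions; only the default of h = 100 instances follows the experimental setting described later):

```python
from collections import deque

class PaymentHistory:
    """Tracks, per payment level c_k, the set D_ck of the most recent h
    instances acquired at that payment; omission subsets B_ck^j are drawn
    only from D_ck, so stale market patterns are ignored."""

    def __init__(self, h=100):
        self.h = h
        self.recent = {}  # payment level -> deque of instance indices

    def record(self, payment, indices):
        d = self.recent.setdefault(payment, deque(maxlen=self.h))
        d.extend(indices)  # oldest entries fall off once h is exceeded

    def candidates(self, payment):
        """Return D_ck, the pool eligible for omission at this payment."""
        return list(self.recent.get(payment, ()))
```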

Algorithm 1 outlines the pseudo code for our ALP approach, alp-mtr. Note that our proposed approach estimates the cost-effectiveness of different payment alternatives by the Maximum Total Ratio measure (Eq. 2). However, other ALP methods may use alternative measures of cost-effectiveness.

4 Experimental setup

We evaluate our method’s performance, compared to alternatives, under a variety of simulated settings corresponding to different labeling tasks and trade-offs between payment and quality that have been reported in prior work on crowdsourcing labor markets. In addition, because different trade-offs may arise at different times, we also evaluate our approach’s performance under conditions in which the prevailing trade-off between payment and quality changes over time. Across multiple settings, we aim to identify whether a given method yields robust performance. Specifically, a robust ALP method that can be relied on in practice ought to yield consistent performance across settings and should not fail miserably in some settings.

Toward these evaluations, we simulate different trade-offs between payment and quality identified in prior work. These include the “Asymptotic”, “Concave”, and “Fixed” trade-offs illustrated in Fig. 2a–c. In Fig. 2 and throughout the paper, quality is characterized by the likelihood of error. Specifically, Kazai (2011) and later Kazai et al. (2013) found an “asymptotic” trade-off between payment and quality, illustrated in Fig. 2a. A “concave” trade-off, illustrated in Fig. 2b, where quality improves initially with increasing payments but then degrades beyond a certain payment level, was reported in several prior works, including Kazai (2011), Kazai et al. (2013), and Feng et al. (2009). Kazai et al. (2013) conducted analyses to understand the possible causes for this trade-off; they identified that while individual workers tend to produce better quality at higher payment, higher payment levels also draw increasingly unethical and opportunistic workers, which can reduce overall work quality. These two opposing effects may also explain the asymptotic trade-off observed by Kazai (2011) and by Kazai et al. (2013). The third trade-off we consider here is a “Fixed” trade-off, documented by Mason and Watts (2010) and illustrated in Fig. 2c, where different payments yield the same quality.
Fig. 2

Different trade-offs between payment and quality observed in real experiments

Because different trade-offs arise in different market settings that can vary over time, it is also useful to examine the robustness of an ALP approach when the underlying trade-off between payment and quality shifts from one trade-off to another. In the evaluations reported below, we report alp-mtr’s performance when a given trade-off switches to another after 50% of the acquisition budget has been used.

In the simulation studies we report below, once the payment to be offered to labelers for the next batch of labels is selected, the probability q that a label acquired at that payment is correct is determined by the prevailing trade-off. Thus, the correct label for an instance is assigned with probability q, and the label is reversed with probability \( 1 - q \). We also followed prior work for the payment range and the number of payment alternatives from which payment methods can select. Specifically, prior studies considered 2–4 payment levels, ranging from a minimum payment of $0.01–$0.03 to a maximum payment of $0.1–$0.25 (e.g., Feng et al. 2009; Kazai 2011; Kazai et al. 2013; Mason and Watts 2010; Rogstadius et al. 2011); thus, in the evaluations that follow, we considered three payment alternatives within this range: $0.02, $0.14, and $0.25, reflecting low, mid, and high payment levels.
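The labeling noise model just described reduces to a one-line rule; the sketch below assumes binary labels in {0, 1} and illustrative names:

```python
import random

def noisy_label(true_label, q, rng):
    """Return the correct label with probability q and the reversed label
    with probability 1 - q (binary labels in {0, 1})."""
    return true_label if rng.random() < q else 1 - true_label

# Illustrative check: at q = 0.8, roughly 80% of simulated labels are correct.
rng = random.Random(0)
labels = [noisy_label(1, q=0.8, rng=rng) for _ in range(1000)]
```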

For each of the trade-offs outlined above, as well as for settings where the trade-off changes over time, we report experimental results for labeling tasks corresponding to three publicly available datasets. The first two, Pen Digits and SPAM (Lichman 2013), were selected because they reflect typical labeling tasks for which humans can readily produce labels. The third dataset, Mushroom (Lichman 2013), is used for additional robustness and has been used in prior simulation-based research on labeling via crowdsourcing markets (Ipeirotis et al. 2014). Table 2 provides details regarding the different datasets. In “Appendix F” we also provide empirical results that show the number of instances required to reach various levels of model performance (measured by AUC) as a function of different levels of labeling quality.
Table 2

Datasets used for evaluation

Dataset | Task | Number of observations | Number of variables | Percent positive (%)
Spam | Classifying email as spam or non-spam | | |
Pen digits | Classify whether an image displays a digit | | |
Mushroom | Classify whether a mushroom is edible | | |
4.1 Evaluation procedure and measures

Our evaluation procedure is illustrated in Fig. 3 and includes three modules: (a) a Payment Method module, (b) a Simulated Trade-off module, which simulates the prevailing trade-off in the market between payment and quality at any given time, and (c) a Reporting and Measurement module.
Fig. 3

Overview of the Evaluation Platform

The evaluation begins by randomly partitioning a labeled dataset into an Unlabeled dataset (70% of instances), for which the true labels are hidden and can later be acquired by the different payment methods, and a Holdout dataset (the remaining 30%) containing instances with the correct labels. The correctly labeled holdout set is used exclusively for evaluating the models produced from labeled instances acquired by the alternative approaches (and is thus not used by any approach to inform its payment selection). For control, the same holdout set is used to evaluate alp-mtr and the baseline methods.

At each iteration of batch label acquisition, the payment for purchasing the labels of a batch of 10 unlabeled data instances from the Unlabeled set is selected by a payment method (e.g., alp-mtr or a baseline). Once the payment is determined, the Simulated Trade-off module uses the prevailing trade-off between payment and quality per label (asymptotic, fixed, or concave) to determine the labeling quality \( q \). As described above, this probability is then used to produce the label, such that the correct label for each instance is assigned with probability q. Once the labels are determined, the labeled instances are added to the corresponding method’s training set.

Also, for control, at each iteration (acquisition phase), the labels for the same instances, drawn at random from the unlabeled set, are acquired by alp-mtr and the baseline. Hence, any difference in cost-effectiveness can be attributed exclusively to the payment selected by each approach and the corresponding labeling quality produced as a result.
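One iteration of this acquisition loop can be sketched as below; the function name, the callbacks, and the data structures are illustrative assumptions about the harness, not its actual implementation:

```python
import random

def acquire_batch(pool_idx, true_labels, choose_payment, tradeoff, rng, b=10):
    """Simulate one acquisition phase: a payment method picks a price, the
    prevailing trade-off maps it to label quality q, and b noisy labels are
    produced for instances drawn from the unlabeled pool."""
    payment = choose_payment()          # e.g., alp-mtr or a baseline policy
    q = tradeoff(payment)               # prevailing payment -> quality curve
    batch = [pool_idx.pop() for _ in range(b)]
    labels = {i: true_labels[i] if rng.random() < q else 1 - true_labels[i]
              for i in batch}
    return labels, b * payment          # labeled batch and its cost
```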

For each approach, once a new set of labeled instances is acquired and added to the corresponding approach’s training set, a new model is induced from the augmented training set, and we report its performance over the correctly labeled holdout data set, so as to assess the improvement in model performance achieved by each approach from all of its label acquisitions.

We report the average, over 20 repetitions, of the learned model’s Area Under the ROC Curve (AUC) as a function of the cumulative labeling costs incurred by each approach to achieve this performance. AUC is evaluated over the correctly labeled holdout set.

Because the goal of developing ALP methods is to reduce labeling costs, we report the labeling Cost Saving (CS) achieved by alp-mtr in reaching the highest AUC obtained by the baseline alternative. CS is illustrated using the stylized plot shown in Fig. 4: the benchmark method (B) incurs costs of $130 to obtain its highest AUC of 0.945, while the ALP method (A) achieves the same level of performance after incurring costs of $63, yielding a cost saving of 52%.
Fig. 4

Illustration of label acquisition costs incurred by two methods, A (red curve) and B (black curve), to achieve different model performances. The horizontal line reflects the Cost Saving measure: the difference between the labeling costs incurred by B to yield its best performance, and the costs incurred by A to yield the same performance (Color figure online)
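The CS measure can be computed directly from the two learning curves; the sketch below uses the numbers from the stylized Fig. 4 example, and the curve points other than ($130, 0.945) and ($63, 0.945) are illustrative:

```python
def cost_saving(curve_a, curve_b):
    """Cost Saving of method A over benchmark B: the fraction of B's cost to
    reach its best AUC that A avoids by reaching the same AUC at lower cost.
    Curves are lists of (cumulative cost, AUC) pairs."""
    best_b_auc = max(auc for _, auc in curve_b)
    cost_b = min(c for c, auc in curve_b if auc == best_b_auc)
    cost_a = min((c for c, auc in curve_a if auc >= best_b_auc), default=None)
    if cost_a is None:
        return None  # A never reaches B's best performance
    return (cost_b - cost_a) / cost_b

# Fig. 4 example: B reaches AUC 0.945 at $130; A reaches it at $63.
curve_b = [(50, 0.90), (90, 0.93), (130, 0.945)]
curve_a = [(30, 0.92), (63, 0.945), (100, 0.95)]
saving = cost_saving(curve_a, curve_b)  # about 0.52, as in the text
```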

We also report the statistical significance of the difference between the areas under the AUC curves, plotted as a function of labeling costs, generated by two competing methods. To this end, we use the BCa bootstrap method implemented in R, with 10,000 repetitions.
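A Python analogue of this test can be sketched with SciPy’s BCa bootstrap (our analysis used the bca implementation in R; the per-repetition areas below are synthetic, and a 99% confidence interval excluding zero corresponds to significance at p < 0.01):

```python
import numpy as np
from scipy.stats import bootstrap

# Synthetic per-repetition areas under the AUC-vs-cost curve (illustrative).
rng = np.random.default_rng(0)
area_alp = rng.normal(0.92, 0.01, size=20)                  # alp-mtr, 20 reps
area_uniform = area_alp - rng.normal(0.03, 0.005, size=20)  # uniform baseline

# BCa bootstrap confidence interval for the mean paired difference,
# with 10,000 resamples.
diff = area_alp - area_uniform
res = bootstrap((diff,), np.mean, n_resamples=10_000,
                confidence_level=0.99, method="BCa", random_state=0)
significant = bool(res.confidence_interval.low > 0)
```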

In the experiments reported here, we induced predictive models using the R implementation of the popular Random Forest algorithm (R package: randomForest). We used the default parameter settings with 100 base trees; we later replicate our main results for SVM and Bagging models. In the experiments reported below, alp-mtr used eightfold cross-validation repeated four times toward this estimation. For the history parameter \( h \), we used the 100 most recently labeled instances to evaluate alternative payments. We later explore the robustness of alp-mtr’s performance when varying the values of these parameters.

4.2 Baseline payment policy

To the best of our knowledge, there are no existing solutions to the ALP problem. Hence, to identify a robust baseline policy, we evaluated three alternative baseline payment policies, two of which are optimal under some settings. The first policy, henceforth referred to as Minpay, always offers the lowest payment for labels. A second policy, Maxpay, always offers the highest payment. A third policy, henceforth referred to as Uniform, acquires labels at a representative set of payments: in each batch acquisition, Uniform draws uniformly at random the payment at which labels are acquired. In “Appendix A” we examine the robustness of the three alternative policies.

We find that both Minpay and Maxpay lack robustness: each exhibits very poor performance under some settings. By contrast, the Uniform policy yields robust performance and can be relied on in practice not to fail miserably. Henceforth, we evaluate alp-mtr’s performance relative to that of the robust Uniform policy.

5 Results

Table 3 summarizes the average cost savings achieved by alp-mtr in reaching the best performance exhibited by the Uniform policy for different datasets and trade-offs. As shown, alp-mtr yields a significant reduction in the costs necessary to reach the best model performance achieved by the uniform payment policy. Specifically, Table 3 shows that across different settings and labeling tasks, alp-mtr yields substantial savings of 35.6% in labeling costs, on average.
Table 3

Cost savings generated by alp-mtr to achieve uniform’s best performance

Trade-off function | Cost savings produced by alp-mtr (%)
(rows for each dataset, including Pen Digits, under each trade-off function; the final row reports the average savings)
Figure 5a–i shows the performance obtained for different label acquisition costs incurred by alp-mtr and the uniform policy, under different labeling tasks and trade-offs between payment and quality. As shown, the alp-mtr method exhibits consistent performance and is superior to the uniform policy. For the learning curves presented in Fig. 5, Table 4 shows significance test results for the difference between the areas under the AUC curves produced by alp-mtr and uniform. For all datasets and underlying trade-offs, the difference between alp-mtr and the uniform policy is statistically significant (p < 0.01).
Fig. 5

Comparison of alp-mtr and the uniform payment policy. alp-mtr is often significantly better than, and otherwise comparable to, the uniform policy

Table 4

Significance testing for differences in area under the AUC curves

Trade-off function | Significance of difference between alp-mtr and uniform (***p < 0.01)
(rows for each dataset, including Pen digits, under each trade-off function)

Bootstrap p values for the significance test on the difference between the AUC curves produced by alp-mtr and uniform, shown in Fig. 5. Bootstrap tests were performed using the bca method implemented in R. ***p < 0.01

Overall, as shown in Tables 3 and 4 and Fig. 5, across settings, the cost savings are substantial and the difference in performance between alp-mtr and uniform is statistically significant. Note that the absolute magnitude of the possible improvement in model performance is also a characteristic of the data domain, the underlying trade-off in the market between labeling payment and quality, and the induction algorithm—and not merely of the approach used to acquire labels. Specifically, for different data domains, market trade-offs, and induction algorithms, different improvements are feasible through different labeling payment selection strategies.

5.1 Changing trade-off between payment and quality per label

A key motivation for an Adaptive Labeling Payment policy is that, for any arbitrary labeling task and market settings, the trade-off between payment per label and labeling quality is unknown, and may also change over time (e.g., possibly due to changing market conditions). In what follows, we evaluate the proposed alp-mtr’s robustness in settings where the underlying trade-off between pay and label quality varies. These evaluations aim to establish whether under these conditions the ALP method continues to be advantageous and to yield robust performance. In the experiments reported here, the prevailing trade-off in the market changes once ALP incurs $75 in labeling costs.

Table 5 summarizes the average cost savings achieved by alp-mtr relative to the uniform policy when the underlying trade-off changes from one trade-off to another. As shown, alp-mtr exhibits robust behavior and is superior to the uniform policy, yielding a significant savings of 32.3% in labeling costs, on average.
Table 5

Cost savings achieved by alp-mtr to yield uniform’s best performance

Change in trade-off (From → to) | Cost savings produced by alp-mtr (%)
For each of the three datasets (Pen digits listed last), the rows cover: Concave → asymptotic; Concave → fixed; Asymptotic → concave; Asymptotic → fixed; Fixed → asymptotic; Fixed → concave. The final row reports the average savings.

Figure 6 shows curves of the predictive performance obtained for different labeling acquisition costs. In the interest of space, we present results for a subset of the settings, and include in “Appendix B” results for all remaining settings. As shown, when the trade-off varies, alp-mtr often achieves superior performance and is otherwise comparable to the uniform approach across settings. Figure 6f shows an interesting result in which alp-mtr’s performance briefly deteriorates soon after the underlying trade-off changes; alp-mtr then detects and adapts to the change, yielding cost-effective improvements in the model’s performance. For all the results reported in Figs. 6 and 12 (“Appendix B”), Table 6 shows significance test results for the difference between the area under the AUC curves produced by alp-mtr and uniform. In almost all settings, the difference is statistically significant. Importantly, in the single setting where alp-mtr did not yield statistically significant superior performance, alp-mtr yields performance comparable to that produced by the uniform policy.
Fig. 6

Performance of the alp-mtr and the uniform payment policy when the underlying trade-off between labeling payment and quality changes. As shown, alp-mtr’s performance is often significantly better than that of uniform, and alp-mtr is never worse than uniform

Table 6

Significance testing for difference in area under the curves

Change in trade-off functions (From → to) | Difference between alp-mtr and uniform
For each of the three datasets (Pen Digits listed last), the rows cover: Concave → asymptotic; Concave → fixed; Asymptotic → concave; Asymptotic → fixed; Fixed → asymptotic; Fixed → concave.

Bootstrap p-values for significance testing of the difference between the area under the AUC curves produced by alp-mtr and uniform, shown in Figs. 6 and 12 (in “Appendix B”). Bootstrap tests were performed using the bca method implemented in R. *p < 0.1, **p < 0.05, ***p < 0.01

5.2 Studies of alp-mtr’s performance, its elements, and parameter settings

In this section we evaluate the contribution of different elements of alp-mtr to performance. In addition, we discuss and provide empirical evaluations to assess the effect of different alp-mtr parameters on performance.

5.2.1 Generalization performance in the presence of noisy labels

A key element of alp-mtr is the assessment of a model’s generalization performance in the presence of noisy labels. To that end, alp-mtr performs repeated omissions of sets of instances that were previously labeled at a given payment level, drawn at random with replacement, and the effect of these repeated omissions on predictive performance is averaged.

We empirically evaluate the benefits of repeated omission for alp-mtr’s selection of cost-effective payments. Figure 7 shows the performance of alp-mtr when \( m = 10 \) repeated omissions are used to estimate the cost-effectiveness of each payment alternative (the default setting), and the performance of an alp-mtr variant, alp-mtr-single-omission, in which a single subset of instances is omitted toward this estimation. In the interest of space, Fig. 7 shows results for the Pen Digits dataset, which are representative of the comparison across datasets. As shown, alp-mtr’s repeated sampling of subsets for omission benefits its selection of cost-effective payments, often producing better, and otherwise comparable, performance relative to omitting a single set of instances.
Fig. 7

Comparison between alp-mtr and alp-mtr-single-omission. alp-mtr achieves superior or otherwise comparable performance to that of alp-mtr-single-omission

Another element of alp-mtr that aims to reduce variance in the estimation of the cost-effectiveness of different labeling payments is the repeated application of cross-validation for estimating the expected change in performance after omitting instances labeled at alternative payment levels. Figure 8 shows a comparison of the standard alp-mtr with four repetitions of cross-validation (the default setting) and alp-mtr-single-cv, in which only a single cross-validation is used in the estimation. In the interest of space, we present representative results for the Pen Digits dataset; these findings are consistent across datasets. As shown, repeated cross-validation benefits the selection of cost-effective labeling payments, often producing better, and otherwise comparable, performance compared to when this procedure is not applied. We therefore recommend applying this feature and averaging the measurements over the repeated cross-validation runs. Note, however, that repeated cross-validation increases run time, although the repetitions can be parallelized.
Fig. 8

Comparison between alp-mtr and alp-mtr-single-cv. alp-mtr yields better or comparable performance to that of alp-mtr-single-cv

5.2.2 The history parameter settings

Recall that alp-mtr evaluates alternative labeling payments by estimating the effect of omitting only labels acquired during the most recent h batches. The history parameter h aims to enable alp-mtr to adapt to recent changes in the underlying trade-off between labeling payment and quality and to ignore older patterns. A small value of h results in alp-mtr taking into account only recent market patterns, whereas a large value of h results in taking into account market patterns that may have occurred long ago.

Figure 9 shows a comparison of alp-mtr (using the default value of h = last 10 batches per payment alternative) with an alp-mtr variant, alp-mtr-full-history, which evaluates the cost-effectiveness of alternative payments based on the entire purchase history. Figure 9a–c shows results for settings in which the trade-off between labeling payment and quality remains constant, and Fig. 9d–i shows learning curves for settings in which the trade-off changes. In Fig. 9, we show results for the Pen Digits dataset; the other datasets produced the same findings. As shown, across different trade-off settings, alp-mtr (with the default value of h = last 10 batches) yields acquisitions that are either more cost-effective than or comparable to those of alp-mtr-full-history, suggesting that considering only recent purchases when assessing alternative payments is beneficial.
Fig. 9

Comparison between alp-mtr and alp-mtr-full-history. alp-mtr yields better or equivalent performance compared to alp-mtr-full-history

5.2.3 Batch size

The batch size parameter, \( b \), determines how many labels are acquired simultaneously at each acquisition phase. In “Appendix E” we present an evaluation of alp-mtr’s robustness to different batch sizes (5, the default setting of 10, 15, and 20). We find that in most cases altering the batch size does not significantly affect performance; however, in a few cases, a larger batch size slightly improves performance compared to a batch size of \( b = 5 \).

5.2.4 Initial random purchases

During the initialization phase, alp-mtr compiles an initial dataset from which learning proceeds; the number of batch acquisitions performed during initialization is controlled by the init_batch parameter. Setting a larger init_batch value can have contrasting effects on performance: on the one hand, it can help produce a more representative initial sample; on the other hand, because these acquisitions are uninformed, it also delays informed, cost-effective payment decisions. In “Appendix G”, we present experimental results exploring the effect on performance when the number of initial batch purchases is either increased or decreased by 30%. We find that performance is not significantly affected by this change in the init_batch parameter.

5.2.5 Number of payment alternatives

alp-mtr considers a set of payment levels \( \left\{ {c_{1} , \ldots ,c_{po} } \right\} \) and selects advantageous payments in each acquisition phase. In the experiments presented here, we considered three payment levels produced by evenly partitioning the range of payments per label into three alternatives. However, this partitioning can be refined to explore a larger number of alternatives. In “Appendix H” we evaluate the effect of refining this partition and considering a larger number of payment options (while maintaining the same number of instances in the initialization phase). We find that alp-mtr achieves better or comparable performance across settings when considering a larger number of alternative payment levels.

5.2.6 Induction algorithms

In “Appendix C”, we replicate our main results using two alternative classification algorithms, Support Vector Machine and Bagging, to induce predictive models from the labels acquired by alp-mtr. In both cases, alp-mtr yields significant cost reduction as compared to the baseline uniform policy. These results suggest that our method is generic and offers benefits for inducing different classification models.

5.3 Additional insights and analyses

In this section we discuss additional analyses of the alp-mtr approach, including insights into its acquisition choices, its runtime complexity, and an evaluation of a new variant. We also demonstrate how our approach can be used to determine payment per label in other settings, in which repeated labeling is used.

5.3.1 Payment regulation

Because labels are noisy in our settings, having more acquisitions at a given payment can improve alp-mtr’s estimation of the contribution to model induction of labels acquired at that payment. However, it is possible that, due to noise, a cost-effective payment is erroneously deemed undesirable. Further, because such a payment is then not selected, the estimation is not improved, and the advantageous payment will continue to be ignored. This problem may be particularly significant when the trade-off between labeling payment and quality varies over time. In “Appendix D”, we consider an alp-mtr variant, alp-mtr-payment-reg, which uses alp-mtr’s selection of payments as before, but also acquires labels at a given payment level if that level has not been selected in the most recent \( t \) consecutive batch acquisitions. Our results suggest that initiating acquisitions at different payment levels, as done by alp-mtr-payment-reg, can improve performance in some cases, but often yields performance inferior to that achieved by the standard alp-mtr.
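The variant’s regulation rule can be sketched as follows; the function name, the bookkeeping, and the default window t = 5 are illustrative assumptions, not the exact configuration evaluated in the appendix:

```python
def regulated_choice(mtr_choice, payments, last_used, batch_no, t=5):
    """alp-mtr-payment-reg sketch: acquire at the MTR-selected payment, and
    additionally at any payment level not selected for t or more consecutive
    batches, so its cost-effectiveness estimate does not go stale."""
    forced = [c for c in payments
              if c != mtr_choice and batch_no - last_used.get(c, batch_no) >= t]
    for c in [mtr_choice] + forced:
        last_used[c] = batch_no  # mark every acquired payment as fresh
    return [mtr_choice] + forced

# Example: payment $0.02 was last used at batch 0, so at batch 10 it is forced.
last_used = {0.02: 0, 0.14: 9, 0.25: 9}
chosen = regulated_choice(0.14, [0.02, 0.14, 0.25], last_used, batch_no=10)
```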

5.3.2 Insights into alp-mtr’s payment strategies

The motivation underlying the ALP problem and which we outline in the introduction, is that the choice of payment per label that will yield a cost-effective improvement in model performance is affected by a host of factors, including the prevailing trade-off in the market between payment and quality per label, the underlying predictive task (data domain), the inductive algorithm used, and the labels available thus far (the corresponding “position” along the learning curve). Consequently, our alp-mtr method aims to estimate directly, and in a data-driven manner, the effect of any given payment per label at any given time on the model’s predictive performance.

As we note in the introduction, an effective ALP approach ought to be able to acquire labels at a lower cost when this is estimated to yield more cost-effective improvements in performance, while in other settings the same approach may produce a given model performance at a lower cost by acquiring fewer, but higher quality, labels. We now present results that demonstrate alp-mtr’s flexibility to pursue different labeling qualities and costs under different settings so as to yield cost-effective improvements in model performance.

Figure 10 shows how alp-mtr achieves cost-effective model performance via versatile acquisition choices under different data domains and trade-offs between labeling payment and quality. For different trade-offs between payment and quality per label in the market, Fig. 10 shows the model performance, average label quality, and number of labels acquired as a function of acquisition cost, resulting from acquisitions made by alp-mtr and the Uniform policies. Specifically, for the Spam data set, under an asymptotic trade-off between labeling payment and quality, Fig. 10a–c show that alp-mtr acquires fewer, but higher quality, labels on average. For the same data set, Fig. 10d–f show that when the underlying trade-off between labeling quality and cost is fixed (the same label quality is produced for any given payment per label), and improvements in cost-effectiveness can only be achieved by acquiring less costly labels, alp-mtr purchases a larger number of cheaper labels. Finally, for a linear positive relationship between payment per label and labeling quality,6 Fig. 10g–i show that alp-mtr does not pursue the highest quality labels, but rather a larger number of lower quality labels. Overall, alp-mtr produces versatile label acquisition choices across different settings, which is fundamental to its ability to yield cost-effective improvements in model performance under arbitrary settings.
Fig. 10 Model performance, average label quality, and number of labels acquired

5.3.3 Runtime complexity

alp-mtr’s runtime complexity at each acquisition phase depends on the parameter choices reported above, as well as on the time required to train and evaluate a model with the induction algorithm of choice (e.g., Random Forest or SVM). Specifically, given:
  • T_m—the time to train and evaluate a single model with the induction algorithm of choice

  • po—the number of payment options

  • m—the number of subsets (batches) that are repeatedly omitted

  • folds—the number of cross-validation folds

  • R_cv—the number of times the entire cross-validation procedure is repeated

At each acquisition phase, the runtime complexity is given by:
$$T_m \cdot \text{folds} \cdot \left(1 + \text{po} \cdot m \cdot R_{cv}\right)$$

Yet, because all model inductions and evaluations can be performed simultaneously prior to each batch acquisition by alp-mtr, these computations can be fully parallelized. This yields an effective run time of approximately O(T_m). We timed alp-mtr using default settings and unoptimized code on an Intel E5-2630 2.2 GHz, 10-core machine.7 The average computation time at each acquisition phase was 15.86, 16.25, and 8.63 seconds for the Spam, Mushroom, and Pendigits datasets, respectively.
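To make the expression concrete, the sketch below computes the number of model inductions per acquisition phase under the default settings reported in footnote 7; since these runs are independent, they can in principle be parallelized to roughly O(T_m) wall-clock time:

```python
def trainings_per_phase(folds, po, m, r_cv):
    """Number of model train/evaluate runs per acquisition phase,
    per the complexity expression: folds * (1 + po * m * R_cv)."""
    return folds * (1 + po * m * r_cv)

# Default settings from footnote 7: po=3, m=10, folds=8, R_cv=4
print(trainings_per_phase(folds=8, po=3, m=10, r_cv=4))  # 968
```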

5.3.4 Using alp-mtr to select payment per label for repeated labeling

Repeated labeling refers to the acquisition of multiple noisy labels for the same instance so as to infer the most likely label and improve labeling accuracy. Importantly, methods for repeated labeling do not determine which payments to offer labelers per label, but assume that the payment per label is fixed and somehow pre-determined. Repeated labeling thus aims to reduce noise (error) in the data by acquiring multiple labels for the same instance at a fixed, pre-determined cost per label, and uses the set of labels available for a given instance to infer the most likely label. For example, a popular approach is to infer the majority label for each instance (e.g., Lee et al. 2013; Mason and Suri 2012). Since alp-mtr and repeated labeling address complementary tasks, they can be applied in conjunction; in “Appendix J” we demonstrate how alp-mtr can be applied to identify advantageous payments per label when repeated labeling is used. In particular, we apply alp-mtr first to identify the cost-effective payment per label at which to acquire the next batch of labels; multiple labels for the same instance are then purchased at the payment determined by alp-mtr, and these labels are aggregated by repeated labeling. We find that applying alp-mtr for advantageous payment selection and subsequently applying repeated labeling yields significant savings.
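For illustration, a minimal sketch (not the paper’s code) of the majority-vote aggregation step applied to the multiple noisy labels purchased for one instance at the alp-mtr-selected payment:

```python
from collections import Counter

def majority_label(labels):
    """Infer the most likely label for an instance from multiple noisy
    crowd labels via majority vote (ties resolved in favor of the
    label seen first)."""
    return Counter(labels).most_common(1)[0][0]

# Five hypothetical noisy labels acquired at the selected payment
print(majority_label(["spam", "ham", "spam", "spam", "ham"]))  # spam
```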

5.3.5 Potential for future improvements

alp-mtr’s labeling payment choices yield robust performance, but they may not be optimal. In “Appendix I” we offer insights into the potential for future improvements to our approach via a comparison, for a given setting, between the payment choices alp-mtr makes and those of an oracle that knows the optimal payment.

6 Conclusions, limitations, and future work

As machine learning becomes integral to the routine operations of firms and the products and services they provide, the immediacy and accessibility of online labor markets present unprecedented opportunities for on-demand labeling to be brought to bear on machine-learning tasks. Yet for these opportunities to materialize, it is important to devise solutions for the novel challenges these markets present. Given the different trade-offs that can arise between labeling payment and work quality under different labeling tasks and market conditions, in this paper we first formulate the problem of Adaptive Labeling Payments (ALP), then develop an approach to address it, and study extensively the performance of our proposed approach. Specifically, we develop an ALP approach, alp-mtr, designed to determine and continuously adapt the payment offered to crowd workers before they undertake a labeling task, so as to improve the performance of a model induced from the data cost-effectively. Our ALP approach estimates the effect on induction of omitting training instances previously acquired at different payments; it also incorporates elements that benefit this estimation, particularly in the presence of noisy labels and changing prevailing trade-offs between payment and quality.

We empirically evaluated the performance of alp-mtr relative to that of a robust alternative under a variety of market scenarios, reflecting different labeling tasks and trade-offs between labeling payments and quality documented in prior work, including settings where the trade-off changes over time. Our results show that alp-mtr yields robust performance across settings and that it offers meaningful and substantial cost savings, with an average savings of 33.4% across settings. We also demonstrate that the design elements of alp-mtr, which aim to improve its estimations of the expected benefits from alternative payments in the presence of noisy labels, and its adaptation to changing market conditions, indeed benefit alp-mtr’s performance. Given our method’s consistently robust benefits under different settings, it can be considered a benchmark for evaluating future mechanisms to determine labeling payments.

The practical implications of this research are important for enabling a growing reliance on instance labeling via crowdsourcing labor markets. Our ALP approach yields both robust performance and meaningful cost savings. Because crowdsourcing for labeling tasks is becoming increasingly popular in academic research as well, academic efforts can similarly benefit from our method’s efficiencies.

Our proposed approach offers a wealth of opportunities on which future research can build, both to improve and to extend our work. One interesting direction would be an adaptation of our ALP approach to regression tasks, in addition to the classification tasks we consider here. As prescribed by our ALP approach, the effect of target (dependent variable) values acquired at different payments may, in principle, be evaluated by estimating the effect on the regression model’s performance after omitting instances previously acquired at the corresponding payment, and the estimations can benefit from the same elements included in alp-mtr to improve estimation under noisy labels and changing market conditions. Similarly, while in this work we considered an important and popular measure of performance (namely, Area Under the ROC Curve) to evaluate a classification model’s performance, the framework we develop could accommodate other performance measures as well.

Our approach is designed to adapt to changing market conditions. However, inherent to all predictive modeling approaches is the assumption that the environment remains stable for some time so that the learned patterns can be exploited. Different solutions may be required in chaotic settings, where market conditions change significantly and very frequently.

Any data-driven-learning-based approach incurs costs for learning. In our context, a practical reference for assessing these costs is the non-data-driven and robust uniform alternative. A useful question to consider is whether our approach remains beneficial even if only a small number (e.g., a few hundred) of labels are acquired on the market. Indeed, alp-mtr includes an initialization phase, during which it acquires an initial set of labeled instances from which it begins learning; importantly, these initial label acquisitions are uninformed by an estimation of the expected benefits and may thus be rather costly. However, during the initialization phase, alp-mtr selects payments in the same manner as the robust uniform alternative. This is also demonstrated in our empirical results, where alp-mtr does not incur higher costs than a non-data-driven alternative early on. Nevertheless, it would be beneficial for future work to further improve alp-mtr’s performance as well as reduce the cost of (and shorten) the initialization phase.

alp-mtr was designed for and evaluated in challenging settings where all labels are potentially noisy and no gold standard labels are available. However, with trivial modifications, alp-mtr can be adjusted to operate under more favorable conditions, when correct (“gold standard”) labels are available for some instances. To improve performance, such gold standard data can be used as an internal holdout set for alp-mtr’s evaluation of \( {\text{EPI}}_{{{\text{c}}_{\text{k}} }} \) (the estimated expected performance improvement for a given payment level), instead of applying cross-validation over noisy instances. Future work can also explore how best to use gold standard data, including how to partition such data between training and alp-mtr’s internal holdout sets.

Recall that alp-mtr adapts the payment per label only across consecutive acquisition phases, and does not simultaneously offer different payments for the same task. Nevertheless, it is possible that adaptive labeling methods introduce “second-order effects”—i.e., that the mere change in payment offered at different times, rather than the payment level itself, might impact worker behavior. Prior work (Chen and Horton 2016) has shown that, in settings different from those we consider here, a worker who works continuously on a task and faces a wage cut is more likely to discontinue taking additional tasks. (Chen and Horton did not find evidence that this impacts work quality, unless workers are solicited to continue working in an inadequate manner.) Importantly, our approach’s performance may not be affected by an individual worker’s decision to discontinue work due to changes in pay, because payments are revised and offered to the entire market in consecutive phases, rather than revised for a given worker. Furthermore, because the market is often large (as in real markets such as Amazon Mechanical Turk), the potential effect of such adaptation may not be substantial. Nevertheless, it is possible that this or similar effects arise in different market conditions. It would therefore be beneficial for future work to model and measure any effects of revising the payment offered on the market over consecutive phases. Lastly, because our approach corresponds to a sequential process in which labeling tasks are offered sequentially on the market, it may be beneficial to alleviate such effects by revising the framing of the labeling task in subsequent phases.

Our approach does not consider the properties of individual instances, such as properties that make some instances harder for labelers to label correctly. If a given instance is less likely than others to be labeled correctly, it may be beneficial for future work to develop methods that identify advantageous payments for labeling individual instances, so as to cost-effectively improve model performance.

Another possible avenue for extending the ALP problem is to adapt it to domains in which the feature set can be compartmentalized into different subsets, each of which can be used to produce a different model, and for which a different trade-off between labeling payment and quality might arise. For example, in text classification tasks, some workers may be presented with just the headline and receive a lower payment, whereas other workers may be asked to classify an entire news article and receive a higher payment.

Finally, it would be interesting for future research to explore different ways of extending our approach. In particular, it would be beneficial to explore the possibility of removing instances labeled at certain payments when this could improve the model’s generalization performance. Similarly, it would be interesting to explore an extension of our approach that develops effective stopping criteria for acquisitions at a given time. Such stopping criteria could be based, for example, on the EPI measure outlined in Eq. 1, to assess whether acquiring labels at a given payment might undermine induction.
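As a sketch of such a stopping criterion (a hypothetical rule we suggest here, not part of alp-mtr), acquisitions could halt once no payment level is estimated to improve the model:

```python
def should_stop(epi_by_payment, eps=0.0):
    """Stop acquiring labels when the estimated performance improvement
    (EPI, Eq. 1) of every payment level is at or below a threshold,
    i.e., further labels may undermine induction."""
    return max(epi_by_payment.values()) <= eps

# Illustrative EPI estimates per payment level
print(should_stop({0.05: -0.004, 0.10: -0.001, 0.25: -0.010}))  # True
print(should_stop({0.05: 0.012, 0.10: -0.001}))                 # False
```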


  1.

    The marginal improvement in predictive performance from acquiring the same set of labeled training instances is likely to be quite different if we already have a large number of labeled training instances than if the training set is very small. Similarly, acquiring highly noisy labels may have a different impact on learning when the training set is small than when they are added to a large training set.

  2.

    Note that the number of acquisition phases, \( I \), is an outcome of the selected payments and the budget. Thus, if lower payments are selected for labeling, more instances can be acquired over a larger number of acquisition phases.

  3.

    Note that this notion holds for all learning curves, including those that are not strictly increasing. For example, in unusual cases when the learning curve decreases, our method can identify the least harmful payment option. In such cases, this notion can also be used to determine when to stop acquiring additional labels.

  4.

    These cross-validation parameters were selected for efficiency using 32 core machines.

  5.

    Figure 5 is adjusted to improve visibility of the relative performance of each approach. In addition, note that we aim to establish whether our method outperforms the baseline over multiple settings, rather than to support a single hypothesis of improvement over the baseline. Hence, the significance tests across all settings should be examined (i.e., in what proportion of the settings the improvement over the baseline was significant). Consequently, individual tests do not include a correction for False Discovery Rate.

  6.

    The linear relationship is given by \( Quality = 0.8265 + 0.1739 \cdot Cost \), where \( 0.83 \le Quality \le 0.87 \), and \( 0.02 \le Cost \le 0.25.\)

  7.

    Default settings include po = 3, m = 10, folds = 8, and R_cv = 4, indicating the potential for substantial run-time improvement given computational resources beyond our 10-core machine.



The authors are grateful for insightful comments and suggestions by the Associate Editor and three reviewers. The paper has also greatly benefited from comments and discussions with seminar participants at Cornell University’s Operation Technology and Information Management (OTIM) Workshop, Boston University, The University of Iowa, Temple University, as well as participants of the Statistical Challenges in E-Commerce Research (SCECR) workshop and of the Winter Conference on Business Intelligence (WCBA). The authors are grateful for financial support from the Jeremy Coller Foundation, Blavatnik ICRC at Tel Aviv University, and the Henry Crown Center for Business Research.


  1. Abe N, Mamitsuka H (1998) Query learning strategies using boosting and bagging. In: Proceedings of the international conference on machine learning (ICML). Morgan Kaufmann, pp 1–9
  2. Chen DL, Horton JJ (2016) Research note—are online labor markets spot markets for tasks? A field experiment on the behavioral response to wage cuts. Inf Syst Res 27(2):403–423
  3. Dai P, Lin CH, Mausam M, Weld DS (2013) POMDP-based control of workflows for crowdsourcing. Artif Intell 202:52–85
  4. Dalvi N, Dasgupta A, Kumar R, Rastogi V (2013) Aggregating crowdsourced binary ratings. In: Proceedings of the 22nd international conference on world wide web. ACM, New York, pp 285–294
  5. Downs JS, Holbrook MB, Sheng S, Cranor LF (2010) Are your participants gaming the system? Screening Mechanical Turk workers. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, New York, pp 2399–2402
  6. Feng D, Besana S, Zajac R (2009) Acquiring high quality non-expert knowledge from on-demand workforce. In: Proceedings of the 2009 workshop on the people’s web meets NLP: collaboratively constructed semantic resources. Association for Computational Linguistics, Stroudsburg, pp 51–56
  7. Ipeirotis PG, Provost F, Sheng VS, Wang J (2014) Repeated labeling using multiple noisy labelers. Data Min Knowl Disc 28(2):402–441
  8. Karger DR, Oh S, Shah D (2011) Iterative learning for reliable crowdsourcing systems. In: Proceedings of advances in neural information processing systems: 25th annual conference on neural information processing, December 12–14
  9. Karger DR, Oh S, Shah D (2014) Budget-optimal task allocation for reliable crowdsourcing systems. Oper Res 62(1):1–24
  10. Kazai G (2011) In search of quality in crowdsourcing for search engine evaluation. In: Clough P et al (eds) Advances in information retrieval. Springer, Berlin Heidelberg, pp 165–176
  11. Kazai G, Kamps J, Milic-Frayling N (2013) An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Inf Retrieval 16(2):138–178
  12. Kong D, Saar-Tsechansky M (2014) Collaborative information acquisition for data-driven decisions. Mach Learn 95(1):71–86
  13. Kumar A, Lease M (2011) Modeling annotator accuracies for supervised learning. In: Proceedings of the workshop on crowdsourcing for search and data mining (CSDM), at the fourth ACM international conference on web search and data mining (WSDM), pp 19–22
  14. Lee D, Hosanagar K, Nair H (2013) The effect of advertising content on consumer engagement: evidence from Facebook (working paper). Available at SSRN 2290802
  15. Lewis D, Gale W (1994) A sequential algorithm for training text classifiers. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval. ACM/Springer, pp 3–12
  16. Lichman M (2013) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine
  17. Lin CH, Weld DS (2014) To re(label), or not to re(label). In: Second AAAI conference on human computation and crowdsourcing
  18. Lin CH, Mausam M, Weld DS (2012) Crowdsourcing control: moving beyond multiple choice. In: UAI
  19. Lin CH, Mausam M, Weld DS (2016) Re-active learning: active learning with relabeling. In: AAAI, pp 1845–1852
  20. Mason W, Suri S (2012) Conducting behavioral research on Amazon’s Mechanical Turk. Behav Res Methods 44(1):1–23
  21. Mason W, Watts DJ (2010) Financial incentives and the performance of crowds. ACM SIGKDD Explor Newsl 11(2):100–108
  22. Paolacci G, Chandler J, Ipeirotis PG (2010) Running experiments on Amazon Mechanical Turk. Judgm Decis Mak 5(5):411–419
  23. Raykar VC, Yu S, Zhao LH, Valadez GH, Florin C, Bogoni L, Moy L (2010) Learning from crowds. J Mach Learn Res 1(11):1297–1322
  24. Rodrigues F, Pereira F, Ribeiro B (2013) Learning from multiple annotators: distinguishing good from random labelers. Pattern Recogn Lett 34(12):1428–1436
  25. Rogstadius J, Kostakos V, Kittur A, Smus B, Laredo J, Vukovic M (2011) An assessment of intrinsic and extrinsic motivation on task performance in crowdsourcing markets. ICWSM 11:17–21
  26. Saar-Tsechansky M, Provost F (2004) Active sampling for class probability estimation and ranking. Mach Learn 54(2):153–178
  27. Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 614–622
  28. Wang J, Ipeirotis PG, Provost F (2017) Cost-effective quality assurance in crowd labeling. Inf Syst Res 28:137–158
  29. Wauthier FL, Jordan MI (2011) Bayesian bias mitigation for crowdsourcing. In: Bartlett P, Pereira F, Shawe-Taylor J, Zemel R (eds) Advances in neural information processing systems (NIPS), pp 1800–1808
  30. Yang L, Carbonell J (2012) Adaptive proactive learning with cost-reliability trade-off. In: Seel NM (ed) Encyclopedia of the sciences of learning. Springer, New York, pp 121–127
  31. Zhang J, Wu X, Sheng VS (2015) Active learning with imbalanced multiple noisy labeling. IEEE Trans Cybern 45(5):1095–1107
  32. Zhang J, Wu X, Sheng VS (2016) Learning from crowdsourced labeled data: a survey. Artif Intell Rev 46(4):543–576
  33. Zhou D, Basu S, Mao Y, Platt JC (2012) Learning from the wisdom of crowds by minimax entropy. In: Advances in neural information processing systems, pp 2204–2212

Copyright information

© The Author(s) 2019

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Coller School of Management, Tel Aviv University, Tel Aviv, Israel
  2. McCombs School of Business, The University of Texas at Austin, Austin, USA
