1 Introduction

Large and complex models are widely used in today's machine learning applications, since they can exploit big data to achieve better performance. Indeed, they have significantly boosted the performance of many important tasks, such as image classification (Russakovsky et al. 2015), speech recognition (Hinton et al. 2012), dialogue systems (Sordoni et al. 2015) and autonomous driving (Bojarski et al. 2016). More recently, they even beat a human champion by a large margin in the game of Go (Silver et al. 2016). However, a primary question arises: how can we provide sufficient fuel (a plethora of annotated data) to propel our rocket (complex models)? The most appealing way may be crowdsourcing technology (Russakovsky et al. 2015; Zhong et al. 2015; Wang and Zhou 2016; Wang et al. 2017), since the annotation process is convenient and the annotation cost is low.

While crowdsourcing techniques (Li et al. 2016, 2017a, b) have been commonly used on many commercial platforms, such as Amazon Mechanical Turk (AMT), the quality of crowdsourced labels is not satisfactory (Ipeirotis et al. 2010). One reason is that workers may not be domain experts (Vuurens et al. 2011; Yan et al. 2014; Rodrigues et al. 2014). For example, it is hard for an average person to handle professional tasks, such as labeling bird images or medical data (Wais et al. 2010). Besides, some workers may simply be spammers, who respond to questions with arbitrary answers (Difallah et al. 2012; Raykar and Yu 2012). Such low-quality labels inevitably degrade the performance of subsequent learning models (Natarajan et al. 2013; Sukhbaatar et al. 2015; Han et al. 2016). For instance, noisy labels degrade the accuracy of deep neural networks by 20% to 40% (Patrini et al. 2017; Yu et al. 2017a, b).

Previous efforts have extensively focused on statistical inference, which aggregates crowdsourced labels after they have been collected (Karger et al. 2011; Liu et al. 2012; Chen et al. 2013; Zhou et al. 2014; Tian and Zhu 2015; Zhang et al. 2016b). However, as crowdsourced labels are intrinsically noisy, it is hard for statistical inference to guarantee that the aggregated labels are reliable. To improve label quality, many researchers have recently turned to a complementary direction, namely approaches that control the process of label collection (Singla et al. 2015; Litman et al. 2015; Chen et al. 2016; Pennock et al. 2016; Zheng et al. 2015b; Fan et al. 2015; Han et al. 2017).

These approaches aim to encourage workers to provide more reliable labels at the collection stage. For example, the skip-based approach encourages workers to skip uncertain tasks. However, if many tasks are difficult, the label requester may collect only a few labels, which are not enough for subsequent learning models (Shah and Zhou 2015; Ding and Zhou 2017). The self-corrected approach encourages workers to check whether they need to correct their answers after looking at references. However, references consisting of responses from other workers are noisy, which may mislead workers. Moreover, this approach has not been realized on crowdsourcing platforms (Shah and Zhou 2016). Therefore, existing approaches fail to acquire a sufficient number of high-quality labels on real tasks (Table 1). Besides, these approaches cannot detect high-quality workers and give them potentially larger payments. This matters because such workers are always preferred by crowdsourcing platforms, and thus they should be identified and paid more (Ipeirotis et al. 2010).

Table 1 Comparison of related approaches and our hint-guided approach (in bold)
Fig. 1 A task that requires workers to answer the question “Which one is the Sydney Harbour Bridge?”. Top panel: the proposed interface under the hybrid-stage setting, which consists of two options (“A” and “B”) and a “? & Hints” button. Bottom panel: when workers feel unsure of this question and click the button, the content of the hints (gray) becomes visible, which guides workers to make a choice

To address these issues, we draw inspiration from the “Guess-with-Hints” answer strategy in the Millionaire game show,Footnote 1 where a challenger can request hints from the show host when he/she feels unsure of a question. Based on this strategy, we introduce a hint-guided approach to improve the quality of crowdsourced labels. This approach encourages workers to get help from auxiliary hints when they answer questions that they are unsure of. Specifically, we introduce a hybrid-stage setting, which consists of the main stage and the hint stage. In the main stage, a worker answers each question directly when he/she feels confident, or jumps into the hint stage when he/she feels uncertain. Once in the hint stage, the worker is allowed to look up hints before answering the unsure question. The fewer times workers enter the hint stage, the higher their estimated quality. To realize this setting, we provide an explicit “? & Hints” button (the bottom panel in Fig. 1) for each question. For example, when the worker is unsure of the question in Fig. 1, he/she can click this button and answer the question with the help of hints (the gray sentence).

Nevertheless, the hybrid-stage setting alone is not enough to address all issues. For example, if hints are freely available in the hint stage, even high-quality workers may abuse free hints for higher accuracy and rewards. This issue causes the detection of high-quality workers to fail. To this end, under the hybrid-stage setting, we develop a hint-guided payment mechanism, which incentivizes workers to use the hints properly. Specifically, our mechanism penalizes workers who use the hints, so high-quality workers will answer most of the questions directly (without hints) for higher rewards. In this way, our mechanism assists our setting in detecting high-quality workers effectively. Moreover, we prove that our mechanism is unique under the hybrid-stage setting. Since our mechanism has a multiplicative form, it prevents spammers as well. Our contributions are summarized as follows.

  • In crowdsourcing, our hint-guided approach is the first attempt to improve the quality of labels by auxiliary hints, and detect the high-quality workers. Our approach is different in both setting and payment mechanism from existing approaches such as self-corrected and skip-based approaches.

  • We introduce a hybrid-stage setting. Under this setting, we propose a hint-guided payment mechanism, which incentivizes workers to use hints properly instead of abusing them. Moreover, we prove the uniqueness of our mechanism under the proposed setting.

  • We further give some general rules for task requesters, which help them easily design hints for their own tasks.

  • Unlike many machine learning papers on crowdsourcing, which do not or cannot perform experiments on real datasets, we conduct comprehensive real-world experiments on the AMT platform. Empirical results on three real tasks show that the proposed approach collects a sufficient number of high-quality labels at low expenditure. Meanwhile, our approach prevents spammers and detects high-quality workers as well.

The remainder of this paper is organized as follows. In Sect. 2, related literature is presented. Section 3 introduces the novel setup in crowdsourcing, namely the hybrid-stage setting. In Sect. 4, we propose a hint-guided payment mechanism under this setting. In Sect. 5, we provide the experiment setup and empirical results related to three real-world tasks. The conclusions are given in Sect. 6.

2 Related literature

2.1 Post-processed approach

In crowdsourcing, the statistical inference (post-processed) approach is widely used to improve the quality of labels (Zheng et al. 2015a, 2016, 2017; Zhang et al. 2016a). Such approaches try to find the correct label for each question only after noisy labels have been collected from the platform. Many methods have been developed along this line.

For example, Raykar et al. (2010) presented a two-coin probabilistic model, where each worker’s labels are generated by flipping the ground-truth labels with a certain probability. Yan et al. (2010) extended this two-coin model by making the flipping probability depend on the samples. Kajino et al. (2012) formulated a probabilistic multi-task model, where each worker is considered as a task. Zhou et al. (2012) proposed a minimax entropy model. Bi et al. (2014) employed a mixture probabilistic model for worker annotations, which learns a prediction model directly. Tian and Zhu (2015) extended weighted majority voting by the max-margin principle, which provides a geometric interpretation of the crowdsourcing margin. However, as labels are intrinsically noisy, it is hard for this type of approach to obtain a sufficient amount of correct labels by statistical inference.

2.2 Pre-processed approach

While previous efforts have extensively focused on statistical inference, the pre-processed approach has recently been developed as an alternative way to improve label quality. Namely, the crowdsourced setting is coupled with a payment mechanism, which incentivizes workers to provide more reliable labels at the stage of label collection. Thus, unlike the post-processed approach, the pre-processed one can directly reduce the noise in the obtained labels. Moreover, the post-processed approach can still be used to further reduce the noise in labels after they have been obtained by the pre-processed approach.

In this paper, we target the pre-processed approach from the perspective of machine learning (Buhrmester et al. 2011; Singla and Krause 2013; Goel et al. 2014; Ho et al. 2015; Lambert et al. 2015; Shah and Zhou 2015; Ding and Zhou 2017). The most related works are the skip-based (Shah and Zhou 2015; Ding and Zhou 2017) and self-corrected (Shah and Zhou 2016) approaches. In the skip-based approach, workers are allowed to select a skip option based on their confidence for each question. However, this in turn leads to insufficient label quantity. A two-stage setting is used in the self-corrected approach: workers first answer all questions in the first stage, and are then allowed to correct their first-stage answers after looking at a reference in the second stage. However, references consisting of responses from other workers are noisy, which may mislead workers into providing incorrect labels. Besides, as a reference needs to be set for each task, such a setting is not supported by the AMT platform, and only simulation results are reported in Shah and Zhou (2016). Finally, neither the skip-based nor the self-corrected approach can identify worker quality as our approach does.

The pre-processed approach has also been considered in the database area, but with a different focus. Typically, the goal there is to dynamically assign the optimal K (\( \le N\)) problems to each worker according to his/her work quality, where N is the total number of problems to be annotated (Zheng et al. 2015b; Fan et al. 2015; Hu et al. 2016). Thus, from the database viewpoint, worker quality control plays a fundamental role in crowdsourcing quality.

2.3 Worker quality control

As workers’ quality has a huge impact on the obtained labels, many researchers have tried to improve label quality by better controlling workers’ quality. For example, Raykar and Yu (2012) considered detecting spammers or adversarial behavior, and tried to eliminate them in subsequent iterations or phases. However, this method does not consider how to detect high-quality workers. Joglekar et al. (2013) devised techniques to generate confidence intervals for worker error rate estimates, thereby enabling a better evaluation of worker quality. However, this method is complex to deploy. In our hybrid-stage setting, the fewer times workers enter the hint stage, the higher their estimated quality.

3 Problem setup

Inspired by the “Guess-with-Hints” answer strategy, we introduce the hint-guided approach to improve the quality of crowdsourced labels and detect the high-quality workers at the same time. This approach encourages workers to get help from the useful hints when they answer uncertain questions (Fig. 1). Specifically, we realize this approach in Sect. 3.1, including the hybrid-stage setting and the payment mechanism. Then, easy usage of hints is discussed in Sect. 3.2. Finally, the rationality of our design is discussed in Sect. 3.3.

3.1 Hint-guided approach

Here, we describe our hint-guided approach from the following four aspects.

3.1.1 Hybrid-stage setting

We first set up definitions for the hybrid-stage setting that consists of the main stage and the hint stage. To model our setting, let us consider a simple example: each worker answers N binary-valued (objective) questions, and each question has precisely one correct answer, either “A” or “B”. Therefore, for every question \(i \in \{1, \ldots , N\}\), a worker chooses an answer matching his/her own belief under the following hybrid-stage setting.

  • The main stage (Fig. 1a): For question i, he/she should be incentivized to select the option that he/she feels confident about. When he/she feels unsure and clicks the “? & Hints” button, he/she jumps into the hint stage, formalized by the “H” option, namely,

    $$\begin{aligned} \text{select} \; {\left\{ \begin{array}{ll} \text{``A''} &{} \text{if}\; P_{A,i} \in \left[ \frac{1}{2} + \epsilon , 1\right),\\ \text{``B''} &{} \text{if}\; P_{A,i} \in \left( 0, \frac{1}{2} - \epsilon \right],\\ \text{``H''} &{} \text{otherwise}, \end{array}\right.} \end{aligned}$$

    where \(\epsilon \in [0, \frac{1}{2})\) models the worker’s uncertainty degree in this stage, \(P_{A,i}\) is the probability of the worker’s belief that the answer to the \(i\hbox {th}\) question is “A” (i.e., the probability that the worker believes “A” is the correct answer for the \(i\hbox {th}\) question).

  • The hint stage (Fig. 1b): When he/she feels unsure of the question, the worker clicks the “? & Hints” button and thereby enters the hint stage. Then, the worker picks “A” or “B” according to

    $$\begin{aligned} \text{select}\;{\left\{ \begin{array}{ll} \text{``A''} &{} \text{if}\; P_{A|H,i} \in [T,1),\\ \text{``B''} &{} \text{if}\; P_{B|H,i} \in [T,1), \end{array}\right.} \end{aligned}$$

    where \(T \in (\frac{1}{2}, 1)\) is the predefined threshold value of the worker’s belief in the hint stage, \(P_{A|H,i}\) is the probability of the worker’s belief that the answer to the \(i\hbox {th}\) question is “A” given hints, and \(P_{B|H,i}\) is the probability of the worker’s belief that the answer to the \(i\hbox {th}\) question is “B” given hints (\(P_{B|H,i} = 1 - P_{A|H,i}\)).

The above modeling of the decision process is summarized in Fig. 2. As we can see, \(\epsilon \) controls the decision in the main stage, while the decision in the hint stage depends on T. When \(\epsilon \) is large, i.e., \(\epsilon \rightarrow \frac{1}{2}\), more workers need hints to make their decision for each question. When \(\epsilon \) is small, i.e., \(\epsilon \rightarrow 0\), fewer workers need hints to make their decision for each question. Once a worker enters the hint stage, if T is set to a large value, i.e., \(T \rightarrow 1\), he/she will be more confident in his/her final decision for each question; if T is set to a small value, i.e., \(T \rightarrow \frac{1}{2}\), he/she will be less confident in his/her final decision.

Fig. 2 Mathematical model of the decision process under our hybrid-stage setting

Note that \(\epsilon \) is determined by T according to Proposition 2 in Sect. 4.2, while T is controlled by the mechanism designer. The choice of T depends on the application and is given to us. In the experiments, we empirically choose \(T = 0.75\) based on qualitative psychology (Smith 2007).
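To make the decision process concrete, a minimal sketch is given below. It only translates the two selection rules above into code; the function name choose, the probability arguments, and the default \(T = 0.75\) (the value used in our experiments) are illustrative choices rather than part of any released implementation, and \(\epsilon \) is tied to T via \(\epsilon _{\min }\) from Proposition 2.

```python
import math

def choose(p_a, p_a_given_h=None, T=0.75):
    """Minimal sketch of a worker's decision under the hybrid-stage setting.

    p_a         : worker's belief P_{A,i} that "A" is correct (main stage).
    p_a_given_h : worker's belief P_{A|H,i} after reading the hints
                  (only needed if the hint stage is entered).
    T           : confidence threshold in the hint stage; T = 0.75 is the
                  value used empirically in the experiments.
    """
    # epsilon is tied to T via Proposition 2 (epsilon = epsilon_min).
    eps = T - math.sqrt(T ** 2 - 0.25)

    # Main stage: answer directly when confident enough, otherwise ask for hints.
    if p_a >= 0.5 + eps:
        return "A"
    if p_a <= 0.5 - eps:
        return "B"

    # Hint stage: the worker reads the hints and answers with belief P_{A|H,i}.
    # Under Assumption 2, either P_{A|H,i} or P_{B|H,i} = 1 - P_{A|H,i} exceeds T.
    if p_a_given_h is None:
        raise ValueError("the worker entered the hint stage; a hint-informed belief is needed")
    return "A" if p_a_given_h >= T else "B"


# Example: an unsure worker (P_A = 0.55) enters the hint stage and, after
# reading the hints, believes "A" with probability 0.9, so "A" is selected.
print(choose(0.55, p_a_given_h=0.9))   # -> "A"
```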

3.1.2 Model assumption

Based on the hybrid-stage setting, we now introduce the corresponding payment mechanism, which is rooted in the following assumption.

Assumption 1

  1. (A)

    There are G “gold standard” questions (\(1 \le G \le N\)), whose answers are known to the requester, uniformly distributed at random positions among all N questions;

  2. (B)

    Each worker aims to maximize his/her expected payment for N questions;

Assumption 1 is standard in analyzing pre-processed approaches for crowdsourcing (Shah and Zhou 2015, 2016; Zhang et al. 2016a). Specifically, as the answers to the “gold standard” questions are known to the requester in advance, workers’ responses to them can be used to evaluate workers’ performance and decide their payments. This is the functionality of Assumption 1(A). Assumption 1(B) is a must for analyzing workers’ behavior; it originates from game theory (Nisan et al. 2007), and means that each worker wants to maximize his/her revenue.

Next, we make the following Assumption 2, which specifies our usage of hints. It is motivated by educational psychology (Koedinger and Aleven 2007), and means that the hints are useful enough to guide workers to a final decision.

Assumption 2

Workers have enough confidence to make a final decision after acquiring useful hints, i.e., \(T \in (\frac{5}{8}, 1)\) in the hint stage.

Note that the confidence of a random guess is \(T = \frac{1}{2}\); thus \(T > \frac{5}{8}\) means that the worker’s confidence in picking an answer is high after looking at the hint. This value (5/8) is related to the proof of Corollary 1. As an illustration, consider Fig. 1a. Workers outside Australia may not know which one is the Sydney Harbour Bridge. However, after reading the hints (gray) in Fig. 1b, workers should have enough confidence to make the final decision “A”, as the pylon structure is very distinctive. When T approaches 1, the belief induced by the hint is maximal, or equivalently, the hint provides the worker with a certain answer.

3.1.3 Payment mechanism

According to the model assumption, we are ready to introduce our payment mechanism based on the hybrid-stage setting. Specifically, after the worker answers all N questions in the hybrid-stage setting, his/her performance is evaluated by his/her responses to G (\(\le N\)) questions. Namely, his/her choice for each question in the gold standard gets evaluated to one of four states, denoted by \(\{\mathbb {D}_{+}, \mathbb {D}_{-}, \mathbb {H}_{+}, \mathbb {H}_{-}\}\). We define the four states as follows.

  • \(\mathbb {D}_{+}\): answer in the main stage and correct;

  • \(\mathbb {D}_{-}\): answer in the main stage and incorrect;

  • \(\mathbb {H}_{+}\): answer in the hint stage and correct;

  • \(\mathbb {H}_{-}\): answer in the hint stage and incorrect.

Note that “answer in the main stage” means that he/she feels confident in the main stage and answers directly; “answer in the hint stage” means that he/she feels unsure in the main stage and answers with hints in the hint stage. “Correct” or “incorrect” denotes whether or not the worker’s selection matches the gold standard answer.

Therefore, under the hybrid-stage setting, we can formulate any payment mechanism as function

$$\begin{aligned} f: \left\{ \mathbb {D}_{+}, \mathbb {D}_{-}, \mathbb {H}_{+}, \mathbb {H}_{-} \right\} ^G \rightarrow [\mu _{\min }, \mu _{\max }], \end{aligned}$$
(1)

where \(\min f(\cdot ) = \mu _{\min }\) and \(\max f(\cdot ) = \mu _{\max }\). We reserve the right to set \(\mu _{\min }\) and \(\mu _{\max }\), where \(0 \le \mu _{\min } \le \mu _{\max }\). In this paper, the goal is to design f such that a worker’s expected payment is strictly maximized when he/she answers according to the above setting.

3.1.4 Difference from previous approaches

The most related approach to ours is the self-corrected approach (Shah and Zhou 2016), since both have two phases in their settings. However, they differ completely in the probabilistic modeling. The self-corrected approach builds a two-stage setting, where workers are required to enter the second stage to check the reference answer of every question; whereas our approach builds a hybrid-stage setting, where workers need not enter the hint stage for questions they are confident about. Besides, since each payment mechanism is customized for designed goals (examples are in Sect. 4.1) under its corresponding setting, our hint-guided payment mechanism also differs from the one used in the self-corrected approach.

It is also interesting to discuss the advantages of the proposed approach over active learning (Yan et al. 2011) for crowdsourcing. Two points are worth highlighting. First, compared with active learning, hints in our approach may not be as strong as querying the ground-truth label; a hint only guides the worker to make a choice. Second, active learning is constrained to query which data sample should be labeled next and which annotator should be queried to benefit the learning model, whereas our approach is free of these restrictions.

3.2 General rules of hints

Motivated by instructional hints in the educational psychology (Koedinger and Aleven 2007), to make the hints useful and reduce interface designers’ workloads, we offer three general rules here:

  1. (A)

    The hints should be easily accessible to interface designers;

  2. (B)

    The hints should be discriminative and concise for workers; and

  3. (C)

    The hints should be irrelevant to the number of annotated samples in each task.

We adopt these three rules in designing the hints in our experiments. We take the three practical datasets in our experimental setup (Sect. 5.1) to justify that these requirements are reasonable in the real world. First, for Sydney Bridge, as an interface designer, we easily acquire the content of hints from Wikipedia, which includes discriminative and concise phrases, such as “concrete pylons” and “around Sydney Opera House”. Second, for Stanford Dogs, we build a lookup table as hints, which includes the characteristics of four breeds of dogs, such as prick ears for Norwich Terrier. This means that the hints in this dataset are irrelevant to the number of annotated samples, but relevant to the number of classes. Third, for Speech Clips, a freely available online tool roughly recognizes each speech clip, and we save the concise keywords (\(\le 4\)) as the hints.

3.3 Needs of hybrid-stage setting

It should also be noted that, in designing a pre-processed mechanism, high-quality worker detection is very important for collecting a sufficient number of high-quality labels. If tasks can be assigned to each worker according to his/her work quality, the annotation quality will increase accordingly. Also, if we can detect high-quality workers and give more weight to their annotations, we can achieve better label aggregation. Here, we show that this may not be achieved by a single-stage setting with hints (i.e., only Fig. 1b and no Fig. 1a). Later, we also empirically demonstrate this point in Sect. 5.2.3.

Specifically, by Assumption 2, if we want to collect more correct labels, the more natural solution is to directly attach visible hints to every single question. This removes the need for the hybrid-stage setting designed here. However, high-quality workers are always preferred by crowdsourcing platforms, and thus they should be identified and paid more. Such a fundamental goal may not be achievable by a simple single-stage setting with visible hints, for the following reason. Under the single-stage setting, both high-quality and low-quality workers can easily read the visible hints to answer questions, so we cannot differentiate between them. Under the hybrid-stage setting, however, high-quality workers may not read the hints frequently. Namely, the fewer times workers enter the hint stage, the higher their estimated quality. Thus, we can track high-quality workers with our setting.

Note that the setting alone may encounter a problem: if the hints are freely available in the hint stage, by Assumption 1(B), even high-quality workers may abuse free hints for higher accuracy and rewards. This issue causes the detection of high-quality workers by the hybrid-stage setting to fail. Therefore, under the hybrid-stage setting, we develop a payment mechanism (Sect. 4) that incentivizes workers to use the hints properly. Specifically, this mechanism penalizes workers who use the hints. Then, high-quality workers will answer most of the questions directly for higher rewards. As a result, this mechanism helps our setting detect high-quality workers effectively.

4 Hint-guided payment mechanism

In Sect. 4.1, we first give two important definitions that guide the design of the payment function. The designed payment function is given in Sect. 4.2. Furthermore, we prove that our incentive-compatible payment mechanism is unique under the hybrid-stage setting. Finally, in Sect. 4.3, we clarify that more restrictive design goals cannot be realized here.

4.1 Design principles

Incentive compatibility (Definition 1) and the mild no-free-lunch axiom (Definition 2) are important for designing a payment mechanism for pre-processed approaches, and they are also widely used in previous works (Shah and Zhou 2015, 2016).

Definition 1

(Incentive compatibility) A payment mechanism f is incentive-compatible only if the following two conditions are satisfied: (i) f gives an incentive to a worker to choose all answers by his/her belief; (ii) The expected payment, from the worker’s belief, is strictly maximized in both the main stage and the hint stage.

Definition 1, which is adapted from the standard game theoretical assumption (Nisan et al. 2007), describes incentive compatibility. Basically, it means that f should encourage a worker to select the option he/she believes most likely to be correct.

Definition 2

(Mild no-free-lunch axiom) If all answers attempted by a worker in “gold standard” questions are either wrong or based on hints, then the payment for the worker should be zero, unless all answers attempted by the worker are correct. More formally, \(f(\mathbf {a}) = 0\), \(\forall \mathbf {a} \in \{ \mathbb {D}_{-}, \mathbb {H}_{+}, \mathbb {H}_{-}\}^G\backslash \{\mathbb {H}_{+}\}^G\).

Definition 2 is a variant of the no-free-lunch axioms for our hybrid-stage setting. It requires that f should not pay a worker who has bad performance on “gold standard” questions. This helps to reject spammers and keep high-quality workers, since answers to these questions are known to the platform and spammers are likely to give wrong answers while high-quality workers are not.

Our aim is to design the payment mechanism f, defined in Eq. (1), such that it simultaneously satisfies the above two definitions.

4.2 Proposed payment mechanism

To design the payment mechanism, we first consider the simplest case, i.e., how a worker should be paid for a single question under our hybrid-stage setting. This helps us find specific rules under Definition 1 for our setting under Assumption 1; such rules are given in Proposition 1 below, and its proof is in Appendix A.1.

Proposition 1

Let \(f: \{\mathbb {D}_{+}, \mathbb {D}_{-}, \mathbb {H}_{+}, \mathbb {H}_{-}\} \rightarrow [0, \mu _{max}]\), \(d_+ = f\left( \mathbb {D}_{+} \right) \), \(d_- = f\left( \mathbb {D}_{-} \right) \), \(h_+ = f\left( \mathbb {H}_{+} \right) \) and \(h_- = f(\mathbb {H}_{-})\). When \(N = G = 1\), f satisfies Definition 1 if it meets the following pricing constraints:

  1. (A)

    \(d_+> d_-, h_+> h_-, d_+ > h_+\).

  2. (B)

    \(\frac{d_+ - d_-}{1-2\epsilon } \ge \frac{h_+ - h_-}{2\epsilon }\).

  3. (C)

    \(d_+ - d_- \le \frac{2T-1}{1/2 - \epsilon }\left( h_+ - h_- \right) \).

Condition (A) highlights that, for each question, the payment \(h_+\) for an indirect correct answer (after reading hints) should be less than the payment \(d_+\) for a direct correct answer. Condition (B) bridges the per-unit income gap \(\frac{d_+ - d_-}{1-2\epsilon }\) in the main stage and \(\frac{h_+ - h_-}{2\epsilon }\) in the hint stage, and the inequality encourages a worker to directly answer questions that he/she feels confident about in the main stage. Condition (C) incentivizes him/her to leverage the hints before answering questions that he/she is unsure of. Thus, conditions (A) and (B) encourage workers to directly answer questions without hints if they are confident enough; and when a worker has really low confidence, condition (C) encourages him/her to use hints.
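For reference, the three pricing constraints can be checked mechanically. The following sketch is a direct transcription of conditions (A)–(C) for \(N = G = 1\); the function name and the example prices are illustrative only and not taken from the paper.

```python
def satisfies_proposition_1(d_plus, d_minus, h_plus, h_minus, eps, T):
    """Check the pricing constraints (A)-(C) of Proposition 1 for N = G = 1."""
    # (A) Correct answers pay more than incorrect ones, and a direct correct
    #     answer pays more than a hint-assisted correct answer.
    cond_a = d_plus > d_minus and h_plus > h_minus and d_plus > h_plus
    # (B) The per-unit income gap in the main stage dominates that in the hint
    #     stage, so a confident worker prefers to answer directly.
    cond_b = (d_plus - d_minus) / (1 - 2 * eps) >= (h_plus - h_minus) / (2 * eps)
    # (C) When confidence is low, using the hints remains worthwhile.
    cond_c = d_plus - d_minus <= (2 * T - 1) / (0.5 - eps) * (h_plus - h_minus)
    return cond_a and cond_b and cond_c


# Illustrative prices only: with T = 0.75 and eps = 0.25, paying 1 for a direct
# correct answer, 0.7 for a hint-assisted correct answer and 0 otherwise
# satisfies all three conditions.
print(satisfies_proposition_1(1.0, 0.0, 0.7, 0.0, eps=0.25, T=0.75))  # True
```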

Remark 1

In condition (A), we cannot set \(d_+ = h_+\). If \(d_+ = h_+\), even high-quality workers may abuse hints for higher accuracy and rewards, which makes the detection of high-quality workers by the hybrid-stage setting fail. Therefore, \(d_+ > h_+\) ensures that high-quality workers will answer most of the questions directly for higher rewards. Under the hybrid-stage setting, whether hints are used can then be taken as a criterion to detect high-quality workers. Thus, \(d_+ > h_+\) assists our setting in detecting high-quality workers, which is verified by the experiments in Sect. 5.2.3.

From Proposition 1, we can see that f relies on workers’ uncertainty degree \(\epsilon \) in the main stage and their confidence T in the hint stage. When \(\epsilon \) is set to a large value, more workers need hints to make their decision for each question. The disadvantage of large \(\epsilon \) is that the overall payments for workers may be low due to leveraging too many hints. When \(\epsilon \) is set to a small value, fewer workers need hints to make their decision for each question. The disadvantage of small \(\epsilon \) is that the quality of crowdsourced labels may be poor since more workers avoid hints for higher payments. Thus, we need to find \(\epsilon \) to achieve a good tradeoff such that most workers are balanced, neither too cautious nor too careless.

However, Proposition 1 only makes use of Assumption 1 to find rules for f and does not specify the relationship between \(\epsilon \) and T. Proposition 2 below connects \(\epsilon \) and T, and gives a lower bound on \(\epsilon \). Its proof is in Appendix A.2.

Proposition 2

Under Assumption 1, f satisfies both Definitions 1 and 2 if \(\epsilon \in [\epsilon _{\min }, 1/2)\) where \(\epsilon _{\min } = T - \sqrt{T^2 - 1/4}\).

Moreover, based on the above proposition, we can derive the following corollary, which relies on Assumption 2. Its proof is in Appendix A.3.

Corollary 1

Under Assumption 2, \((1/2 - \epsilon _{min}) < (2T - 1)\).
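As a concrete check (not part of the original derivation), plugging in the experimental choice \(T = 0.75\) from Sect. 3.1.1 gives

$$\begin{aligned} \epsilon _{\min } = 0.75 - \sqrt{0.75^2 - 0.25} \approx 0.191, \quad \text{so} \quad \frac{1}{2} - \epsilon _{\min } \approx 0.309 < 2T - 1 = 0.5, \end{aligned}$$

which is consistent with Corollary 1.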

Finally, we show that when \(\epsilon = \epsilon _{\min }\), i.e., the boundary condition in Proposition 2 is attained, a hint-guided payment mechanism f can be designed (Algorithm 1). The function g, which sets how a single question is paid, is defined at step 1 of Algorithm 1. Note that \(g(\mathbb {H}_{+}) < g(\mathbb {D}_{+}) = 1\) due to Corollary 1, which is also consistent with condition (A) in Proposition 1. Responses from workers on “gold standard” questions are collected in step 2, and the budget is set in step 3. A multiplicative form of g is adopted in step 4, which is inspired by Shah and Zhou (2015). It incentivizes workers to use hints properly and also makes the smallest payment to spammers. The reasons are highlighted in Remark 2.

Algorithm 1 The hint-guided payment mechanism f

Remark 2

The benefits of the multiplicative form are detailed as follows. A spammer responds to each question with an arbitrary answer, so he/she gets the minimum payment once any answer in the “gold standard” is wrong. For a normal worker, to obtain the highest payment, he/she is encouraged to use hints as little as possible, because the payment for a correct answer after using hints is \(g(\mathbb {H}_{+})\), which is smaller than 1, i.e., \(g(\mathbb {D}_+)\) (Corollary 1). Thus, the more hints are used, the smaller the maximum payment a worker can get. Besides, such a multiplicative form also helps us identify and pay more to high-quality workers, as those workers naturally use fewer hints.
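Since Algorithm 1 itself is rendered as a figure, the sketch below only illustrates its multiplicative structure as described above. The function name, the parameter g_hint_correct (standing in for \(g(\mathbb {H}_{+})\), which Algorithm 1 fixes in terms of T at step 1), and the linear scaling of the reward into \([\mu _{\min }, \mu _{\max }]\) are our own illustrative choices, not the paper’s exact specification.

```python
def hint_guided_payment(states, g_hint_correct, mu_min=0.0, mu_max=1.0):
    """Illustrative sketch of the multiplicative hint-guided payment.

    states         : the worker's evaluated states on the G gold standard
                     questions, each one of "D+", "D-", "H+", "H-".
    g_hint_correct : per-question score for a hint-assisted correct answer;
                     Corollary 1 requires it to be smaller than g(D+) = 1.
    mu_min, mu_max : minimum and maximum payments reserved by the requester.
    """
    # Per-question scores (step 1, sketched): a wrong answer zeroes the reward,
    # which realizes the mild no-free-lunch axiom and rejects spammers.
    g = {"D+": 1.0, "H+": g_hint_correct, "D-": 0.0, "H-": 0.0}

    # Multiplicative form over the gold standard questions (step 4, sketched).
    reward = 1.0
    for state in states:
        reward *= g[state]

    # Scale into [mu_min, mu_max]; in practice the worker receives at least
    # mu_min (see the note after Theorem 1).
    return mu_min + (mu_max - mu_min) * reward


# Answering all three gold questions directly and correctly earns the full
# budget; one hint lowers the reward; any wrong answer drops it to mu_min.
print(hint_guided_payment(["D+", "D+", "D+"], g_hint_correct=0.6))  # 1.0
print(hint_guided_payment(["D+", "H+", "D+"], g_hint_correct=0.6))  # 0.6
print(hint_guided_payment(["D+", "D-", "H+"], g_hint_correct=0.6))  # 0.0
```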

The design of Algorithm 1 is further supported by Theorem 1 below. Its proof is in Appendix A.4. Thus, our algorithm is the unique one satisfying both Definitions 1 and 2, and \(\epsilon = \epsilon _{\min }\) is also the required choice here. Note that, in practice, the algorithm makes the minimum payment \(\mu _{\min }\) instead of the 0 in Definition 2 if one or more attempted answers in the gold standard are wrong. This operation is without loss of generality.

Theorem 1

Under Assumptions 1 and 2, f in Algorithm 1 satisfies both Definitions 1 and 2 if and only if \(\epsilon = \epsilon _{\min }\).

4.3 No other compatible mechanism

Definition 1 is a must for designing a payment mechanism. However, under our hybrid-stage setting, there exists another popular “harsh no-free-lunch” axiom (Definition 3), which is adapted from Definition 2 in Shah and Zhou (2016).

Definition 3

(Harsh no-free-lunch axiom) If all answers attempted by the worker in “gold standard” questions are either wrong or based on hints, then the payment for the worker should be zero. More formally, \(f(\mathbf {a}) = 0\), \(\mathbf {a} \in \{\mathbb {D}_{-}, \mathbb {H}_{+}, \mathbb {H}_{-}\}^G\).

Compared to the “mild no-free-lunch” axiom, Definition 3 encourages the worker to answer without hints no matter whether he/she is unsure. Thus, it is stronger than the “mild no-free-lunch” axiom and could be used to replace Definition 2. One may wonder whether another payment function satisfying this more restrictive condition can be found. However, Theorem 2 below shows that no such mechanism exists. Its proof is in Appendix A.5.

Theorem 2

Under Assumptions 1 and 2, there is no mechanism that satisfies both Definitions 1 and 3.

Therefore, the “harsh no-free-lunch” axiom is too strong for the existence of any incentive-compatible payment mechanism here. This further illustrates the uniqueness of the proposed payment mechanism.

5 Numerical experiments

We conduct real-world experiments on Amazon Mechanical Turk,Footnote 2 which is the leading platform for collecting crowdsourced labels. We compare our hint-guided approach with: (1) the baseline approach: a single-stage setting with an additive payment mechanism (details are in “Appendix B”); (2) the skip-based approach (Shah and Zhou 2015; Ding and Zhou 2017): a skip-stage setting with a skip-based payment mechanism. Note that the skip-based payment mechanism is multiplicative. The self-corrected approach (Shah and Zhou 2016) has not been verified on AMT tasks, since there are no criteria for how to set references; therefore, we do not include it in our comparison. Note that additive and multiplicative payment mechanisms are denoted as “\(+\)” and “\(\times \)”, respectively, for subsequent use in Sect. 5.2.4.

5.1 Experimental setup

All these datasets are collected by us on Amazon MTurk, where the hints are easily designed according to the rules in Sect. 3.2. We conducted the following three real tasks.

  • Sydney Bridge (binary-choice questions): we collect 30 images of various bridges. Each image contains one bridge. The task is to identify whether the bridge in each image is the Sydney Bridge. The content of hints includes discriminative phrases, such as “concrete pylons” and “around Sydney Opera House”.

  • Stanford Dogs (multiple-choice questions): we collect 100 images of four breeds of dogs. The task is to identify the breed of dogs in each image. We build a lookup table as hints, which includes the characteristics of four breeds of dogs, such as “prick ears” for Norwich Terrier.

  • Speech Clips (subjective questions): we collect 10 speech clips. Each speech clip consists of 1 or 2 short sentences (15 words). The task is to recognize each speech clip and write down the corresponding sentence. We leverage an open toolFootnote 3 to roughly recognize each speech clip and save the keywords (\(\le 4\)) as the hints.

We verify the effectiveness of our hint-guided approach from three perspectives (Table 1), where each perspective includes one or two metrics (in parentheses): requester (“label quantity” and “label quality”), worker (“worker quality detection” and “spammer prevention”) and platform (“money cost”). Except for “worker quality detection”, these metrics have been widely used in previous works (Shah and Zhou 2015, 2016). They are detailed as follows.

  • Label quantity: we evaluate the label quantity by the percentage of completion of the three tasks. In the skip-stage setting, a worker yields unlabeled (uncompleted) data by skipping unsure questions. In the single-stage and hybrid-stage settings, for objective questions, a worker yields (a few) unlabeled data points because he/she forgets or ignores a few questions; for subjective questions, a worker yields (more) unlabeled data by inputting invalid answers, for example writing sentences such as “I do not know” in the answer box.

  • Label quality: we evaluate the label quality from two aspects: (i) the percentage of correct and incorrect answers on the three tasks; and (ii) the error of aggregated labels (Shah and Zhou 2015). For the \(i\hbox {th}\) question, where \(i \in \{1,\ldots , n\}\), if \(m_i\) options tie after majority voting and the ground-truth label is one of these \(m_i\) options, then we consider \(\frac{1}{m_i}\) of the \(i\hbox {th}\) question to be correct. Therefore, the error of aggregated labels is \(1- (\sum _{i=1}^{n} 1/m_i)/n\) (a code sketch of this metric is given after this list). Since text answers cannot be aggregated by majority voting, we do not report the error of aggregated labels on Speech Clips.

  • Worker quality detection: we evaluate the worker quality detection of the hint-guided approach implicitly, by the error rate (in \(\%\)) of aggregating the original and rescaled crowdsourced labels. For example, Sydney Bridge (origin) means the original labels collected by our approach. For Sydney Bridge (rescale), we rank the worker quality from high to low by the usage frequency of the hints during the collection of the original labels. Then, we rescale the original labels by adaptive weights: labels from the top \(20\%\) (bottom \(20\%\)) of workers are empirically rescaled by 1.8 (0.2), and the remaining labels stay unchanged. If the error rate on the rescaled dataset decreases, then we speculate that our hint-guided approach indeed detects worker quality; namely, less usage of hints indicates a higher-quality worker.

  • Spammer prevention and money cost: we evaluate the spammer prevention and the money cost by the average payment to each worker. Note that the payment consists of two parts: fixed payment and reward payment. Reward payment is based on a worker’s responses to G gold standard questions. All payment parameters are in “Appendix C”.
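The aggregation-error metric described above can be stated compactly in code. The sketch below only illustrates the tie-aware majority-voting error; the function name and the tiny example data are ours and not taken from the experiments.

```python
from collections import Counter

def aggregated_label_error(responses, ground_truth):
    """Tie-aware error of majority-voted (aggregated) labels.

    responses    : list of lists; responses[i] holds the workers' answers to
                   question i.
    ground_truth : list of correct answers, one per question.

    If m_i options tie for the most votes on question i and the true label is
    among them, the question counts as 1/m_i correct; otherwise it counts as 0.
    """
    n, correct = len(ground_truth), 0.0
    for answers, truth in zip(responses, ground_truth):
        votes = Counter(answers)
        top = max(votes.values())
        tied = [option for option, count in votes.items() if count == top]
        if truth in tied:                      # the m_i tied options
            correct += 1.0 / len(tied)
    return 1.0 - correct / n


# Question 1: clean majority for the true label "A"; question 2: a two-way tie
# that includes the true label "B" -> error = 1 - (1 + 1/2) / 2 = 0.25.
print(aggregated_label_error([["A", "A", "B"], ["A", "B"]], ["A", "B"]))  # 0.25
```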

5.2 Experimental results

We demonstrate the effectiveness of our hint-guided approach from the following five aspects. Specifically, Sect. 5.2.1 verifies whether our approach provides a sufficient number of labels. Section 5.2.2 shows whether our approach provides high-quality labels. Section 5.2.3 examines whether our approach can detect worker quality. Section 5.2.4 shows whether our approach prevents spammers. Section 5.2.5 demonstrates whether our approach saves money.

5.2.1 Label quantity

Table 2 reports the percentage of completion of the three tasks. The first two tasks (Sydney Bridge and Stanford Dogs) consist of objective questions, while the last task (Speech Clips) consists of subjective questions. Objective questions can be answered by a random guess; therefore, the percentage of completion for objective questions is much higher than that for subjective questions. In addition, the hint-guided approach achieves a high percentage of completion on both objective and subjective questions. Our approach encourages workers to finish the questions, ensuring the quantity of crowdsourced labels.

Table 2 Evaluation of the label quantity. The percentage of completion on the three tasks is provided

5.2.2 Label quality

Figure 3 plots the percentage of correct answers and incorrect answers on three tasks. First, on all tasks, the percentage of correct answers in the hint-guided approach is higher than that in the baseline and skip-based approaches. Second, on Speech Clips, the percentage of incorrect answers is extremely low in the skip-based approach. The reason is that most people skip difficult speech clips, and answer several easy ones. Third, compared with other approaches, our hint-guided approach ensures a sufficient number of high-quality labels.

Fig. 3 Evaluation of the label quality. The percentage (in %) of correct and incorrect answers on the three tasks is provided. Note that we do not plot the percentage of unlabeled questions

Fig. 4 Evaluation of the label quality: error of aggregated labels. Results on Speech Clips are not reported, as text answers cannot be aggregated by majority voting

Figure 4a, b plot the error of aggregated labels on the Sydney Bridge and Stanford Dogs tasks. The number of workers (abbreviated as n_workers) is set to \(\{5,6,7,8,9,10\}\), since the error of aggregated labels comes from majority voting among multiple workers (Shah and Zhou 2015), and the number of workers depends on varying situations. For each combination of task and n_workers, we repeat the following procedure 200 times. Each time, for all questions, we randomly select n_workers workers and perform majority voting on their responses to yield the aggregated labels. The plotted error of aggregated labels is averaged across the 200 results. We observe that the hint-guided approach consistently outperforms the baseline and skip-based approaches, and the performance gap between the baseline and hint-guided approaches is particularly pronounced on Stanford Dogs.

5.2.3 Worker quality detection

Table 3 reports the error of aggregating the original and rescaled crowdsourced labels. For the rescaled crowdsourced labels, labels from estimated high-quality workers are adaptively given more weight, and vice versa. From Table 3, we can see that the error of aggregating the rescaled labels is lower than the error of aggregating the original labels, which demonstrates that our hint-guided approach can detect high-quality workers effectively. The error decreases more significantly on Sydney Bridge, whose size is relatively small (30 questions) compared to Stanford Dogs (100 questions). We believe that the informative hints for Stanford Dogs may guide low-quality workers to make more accurate decisions, so the performance gap between high-quality and low-quality workers is insignificant there, and the effect of label rescaling is marginal on this dataset.

Table 3 Evaluation of the worker quality detection of the hint-guided approach. Error rate (in %) is provided for aggregating original and rescaled crowdsourced labels

5.2.4 Spammer prevention

The baseline and hint-guided approaches are represented as Single(\(+\)) and Hybrid(\(\times \)), respectively. We provide one extra combination: the single-stage setting with the “\(\times \)” mechanism (Single(\(\times \))), with all parameters kept consistent. Figure 5a explores how our approach prevents spammers. It plots the average payment to each worker under the three approaches. We observe that the payments of Single(\(\times \)) and Hybrid(\(\times \)) are lower than that of Single(\(+\)), since once any answer among the G gold standard questions is incorrect, the reward of the “\(\times \)” mechanism becomes zero. Since spammers answer each question randomly, the “\(\times \)” mechanism used by our approach makes the smallest payment to them. Thus, our approach prevents spammers.

Fig. 5 Evaluation of a the spammer prevention and b the money cost. The average payment to each worker on all three tasks is provided

5.2.5 Money cost

Figure 5b plots the average payment to each worker under the three approaches. The higher the payment, the worse the economy of the approach. The payment is calculated as the average of the payments across 200 random selections of the G questions; this mitigates the distortion caused by the randomness in the choice of the G questions. We can see that the payments of the skip-based and hint-guided approaches are comparable and lower than the payment of the baseline approach, especially on the Stanford Dogs task, since both the skip-based and hint-guided approaches use the multiplicative mechanism while the baseline approach uses the additive one. Thus, from the perspective of saving money, we should not employ the baseline approach. Note that, on the Sydney Bridge and Stanford Dogs tasks, although the payment in the skip-based approach is slightly lower than that in the hint-guided approach, the number of high-quality labels from the hint-guided approach is clearly higher than that from the skip-based approach (Fig. 4).

6 Conclusions

To improve the label quality, we proposed a hint-guided approach that encourages workers to use hints when they answer unsure questions. Our approach consists of the hybrid-stage setting and the hint-guided payment mechanism. We proved the incentive compatibility and uniqueness of our mechanism. Besides, our approach can detect the high-quality workers for more accurate result aggregation. Comprehensive experiments conducted on Amazon MTurk revealed the effectiveness of our approach and validated the simple and practical deployment of our approach. These merits are critical for the success of many machine learning applications in practice.

As for future work, first, the hint-guided approach is designed under the assumption of worker independence. It would be interesting to extend the hint-guided approach to the case of worker dependence, where the reward of a worker depends on the answers of the other workers. Second, we hope to extend the hybrid-stage setting from binary choice to multiple choice, with the corresponding theoretical results. Third, we consider providing hints at different levels for all questions; specifically, we will provide hints from coarse to fine, which correspond to different expected payments. Finally, since some workers may still be confused even with hints, we may add an unsure option in the hint stage to further improve the label quality.