
Answer validation for generic crowdsourcing tasks with minimal efforts

Abstract

Crowdsourcing has been established as an essential means to scale human computation in diverse Web applications, ranging from data integration to information retrieval. Yet, crowd workers have wide-ranging levels of expertise. Large worker populations are heterogeneous and comprise a significant number of faulty workers. As a consequence, quality assurance for crowd answers is commonly seen as the Achilles heel of crowdsourcing. Although various techniques for quality control have been proposed in recent years, a post-processing phase in which crowd answers are validated is still required. Such validation, however, is typically conducted by experts, whose availability is limited and whose work incurs comparatively high costs. This work aims at guiding an expert in the validation of crowd answers. We present a probabilistic model that helps to identify the most beneficial validation questions in terms of both improvement in result correctness and detection of faulty workers. By seeking expert feedback on the most problematic cases, we obtain a set of high-quality answers, even if the expert does not validate the complete answer set. Our approach is applicable to a broad range of crowdsourcing tasks, including classification and counting. Our comprehensive evaluation on both real-world and synthetic datasets demonstrates that our techniques save up to 60% of expert effort compared to baseline methods when striving for perfect result correctness. In absolute terms, for most cases, we achieve close to perfect correctness after expert input has been sought for only 15% of the crowdsourcing tasks.

References

  1. Amsterdamer, Y., Grossman, Y., Milo, T., Senellart, P.: Crowd mining. In: SIGMOD, pp. 241–252 (2013)

  2. Arasu, A., Götz, M., Kaushik, R.: On active learning of record matching packages. In: SIGMOD, pp. 783–794 (2010)

  3. Callison-Burch, C.: Fast, cheap, and creative: evaluating translation quality using Amazon’s Mechanical Turk. In: EMNLP, pp. 286–295 (2009)

  4. Cao, C.C., She, J., Tong, Y., Chen, L.: Whom to ask?: jury selection for decision making tasks on micro-blog services. In: VLDB, pp. 1495–1506 (2012)

  5. CrowdFlower: http://www.crowdflower.com/ (2016)

  6. Davtyan, M., Eickhoff, C., Hofmann, T.: Exploiting document content for efficient aggregation of crowdsourcing votes. In: CIKM, pp. 783–790 (2015)

  7. Dawid, A.P., Skene, A.M.: Maximum likelihood estimation of observer error-rates using the EM algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 28(1), 20–28 (1979)

  8. Dekel, O., Shamir, O.: Vox populi: collecting high-quality labels from a crowd. In: COLT (2009)

  9. Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: WWW, pp. 469–478 (2012)

  10. Difallah, D.E., Demartini, G., Cudré-Mauroux, P.: Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms. In: CrowdSearch, pp. 26–30 (2012)

  11. Dong, X.L., Berti-Equille, L., Hu, Y., Srivastava, D.: Solomon: Seeking the truth via copying detection. In: VLDB, pp. 1617–1620 (2010)

  12. Dong, X.L., Berti-Equille, L., Srivastava, D.: Truth discovery and copying detection in a dynamic world. In: VLDB, pp. 562–573 (2009)

  13. Dong, X.L., Naumann, F.: Data fusion: resolving data conflicts for integration. In: VLDB, pp. 1654–1655 (2009)

  14. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1, 211–218 (1936)

  15. Galland, A., Abiteboul, S., Marian, A., Senellart, P.: Corroborating information from disagreeing views. In: WSDM, pp. 131–140 (2010)

  16. Garcin, F., Faltings, B., Jurca, R., Joswig, N.: Rating aggregation in collaborative filtering systems. In: Proceedings of the Third ACM Conference on Recommender Systems, pp. 349–352 (2009)

  17. Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: Hands-off crowdsourcing for entity matching. In: SIGMOD, pp. 601–612 (2014)

  18. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman, Boston (1989)

  19. Gomes, R.G., Welinder, P., Krause, A., Perona, P.: Crowdclustering. In: NIPS, pp. 558–566 (2011)

  20. Hu, Q., He, Q., Huang, H., Chiew, K., Liu, Z.: Learning from crowds under experts supervision. In: PAKDD, pp. 200–211 (2014)

  21. Hung, N.Q.V., Tam, N.T., Miklós, Z., Aberer, K.: On leveraging crowdsourcing techniques for schema matching networks. In: DASFAA, pp. 139–154 (2013)

  22. Hung, N.Q.V., Tam, N.T., Tran, L.N., Aberer, K.: An evaluation of aggregation techniques in crowdsourcing. In: WISE, pp. 1–15 (2013)

  23. Ipeirotis, P.G., Provost, F., Wang, J.: Quality management on Amazon Mechanical Turk. In: HCOMP, pp. 64–67 (2010)

  24. Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: SIGMOD, pp. 847–860 (2008)

  25. Joglekar, M., Garcia-Molina, H., Parameswaran, A.: Comprehensive and reliable crowd assessment algorithms. In: ICDE, pp. 195–206 (2015)

  26. Jung, H.J., Lease, M.: Improving quality of crowdsourced labels via probabilistic matrix factorization. In: HCOMP, pp. 101–106 (2012)

  27. Kajino, H., Tsuboi, Y., Sato, I., Kashima, H.: Learning from crowds and experts. In: HCOMP, pp. 107–113 (2012)

  28. Karger, D.R., Oh, S., Shah, D.: Iterative learning for reliable crowdsourcing systems. In: NIPS, pp. 1953–1961 (2011)

  29. Karger, D.R., Oh, S., Shah, D.: Budget-optimal task allocation for reliable crowdsourcing systems. Oper. Res. 62, 1–24 (2014)

  30. Karypis, G., Kumar, V.: Metis-unstructured graph partitioning and sparse matrix ordering system, version 2.0. Technical Report, University of Minnesota (1995)

  31. Kazai, G., Kamps, J., Milic-Frayling, N.: Worker types and personality traits in crowdsourcing relevance labels. In: CIKM, pp. 1941–1944 (2011)

  32. Kittur, A., Chi, E.H., Suh, B.: Crowdsourcing user studies with Mechanical Turk. In: CHI, pp. 453–456 (2008)

  33. Kschischang, F.R., Frey, B.J., Loeliger, H.A.: Factor graphs and the sum-product algorithm. In: TIT, pp. 498–519 (1998)

  34. Kulkarni, A., Can, M., Hartmann, B.: Collaboratively crowdsourcing workflows with turkomatic. In: CSCW, pp. 1003–1012 (2012)

  35. Kumar, A., Lease, M.: Modeling annotator accuracies for supervised learning. In: CSDM, pp. 19–22 (2011)

  36. Lam, S.K., Riedl, J.: Shilling recommender systems for fun and profit. In: WWW, pp. 393–402 (2004)

  37. Laws, F., Schätze, H.: Stopping criteria for active learning of named entity recognition. In: ICCL, pp. 465–472 (2008)

  38. Lee, K., Caverlee, J., Webb, S.: The social honeypot project: protecting online communities from spammers. In: WWW, pp. 1139–1140 (2010)

  39. Marcus, A., Parameswaran, A., et al.: Crowdsourced data management industry and academic perspectives. Found Trends Databases 6, 1–161 (2015)

  40. Mozafari, B., Sarkar, P., Franklin, M., Jordan, M., Madden, S.: Scaling up crowd-sourcing to very large datasets: a case for active learning. In: VLDB, pp. 125–136 (2014)

  41. Nguyen, Q.V.H., Do, S.T., Nguyen, T.T., Aberer, K.: Tag-based paper retrieval: minimizing user effort with diversity awareness. In: DASFAA, pp. 510–528 (2015)

  42. Nguyen, Q.V.H., Duong, C.T., Nguyen, T.T., Weidlich, M., Aberer, K., Yin, H., Zhou, X.: Argument discovery via crowdsourcing. VLDB J 26, 511–535 (2017)

  43. Nguyen, Q.V.H., Duong, C.T., Weidlich, M., Aberer, K.: Minimizing efforts in validating crowd answers. In: SIGMOD (2015)

  44. Nguyen, Q.V.H., Huynh, H.V., Nguyen, T.T., Weidlich, M., Yin, H., Zhou, X.: Computing crowd consensus with partial agreement. In: TKDE, pp. 1–14 (2017)

  45. Nguyen, Q.V.H., Nguyen, T.T., Miklós, Z., Aberer, K., Gal, A., Weidlich, M.: Pay-as-you-go reconciliation in schema matching networks. In: ICDE, pp. 220–231 (2014)

  46. Nguyen, Q.V.H., Nguyen Thanh, T., Lam, N.T., Do, S.T., Aberer, K.: A benchmark for aggregation techniques in crowdsourcing. In: SIGIR, pp. 1079–1080 (2013)

  47. Nguyen, T.T., Duong, C.T., Weidlich, M., Yin, H., Nguyen, Q.V.H.: Retaining data from streams of social platforms with minimal regret. In: IJCAI (2017)

  48. Nguyen, T.T., Nguyen, Q.V.H., Weidlich, M., Aberer, K.: Result selection and summarization for web table search. In: ICDE, pp. 231–242 (2015)

  49. Nushi, B., Singla, A., Gruenheid, A., Zamanian, E., Krause, A., Kossmann, D.: Crowd access path optimization: diversity matters. In: AAAI (2015)

  50. O’Mahony, M., Hurley, N., Kushmerick, N., Silvestre, G.: Collaborative recommendation: a robustness analysis. TOIT 4, 344–377 (2004)

  51. Pasternack, J., Roth, D.: Latent credibility analysis. In: WWW, pp. 1009–1020 (2013)

  52. Prelec, D., Seung, H.S., McCoy, J.: A solution to the single-question crowd wisdom problem. Nature 541, 532–535 (2017)

  53. Quinn, A.J., Bederson, B.B.: Human computation: a survey and taxonomy of a growing field. In: CHI, pp. 1403–1412 (2011)

  54. Quoc Viet Hung, N., Chi Thang, D., Weidlich, M., Aberer, K.: Erica: expert guidance in validating crowd answers. In: SIGIR, pp. 1037–1038 (2015)

  55. Raykar, V.C., Yu, S.: Ranking annotators for crowdsourced labeling tasks. In: NIPS, pp. 1809–1817 (2011)

  56. Raykar, V.C., Yu, S.: Eliminating spammers and ranking annotators for crowdsourced labeling tasks. J. Mach. Learn. Res. 13, 491–518 (2012)

  57. Reason, J.: Human Error. Cambridge University Press, Cambridge (1990)

  58. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-validation. In: Encyclopedia of database systems, pp. 532–538. Springer (2009)

  59. Ross, J., Irani, L., Silberman, M., Zaldivar, A., Tomlinson, B.: Who are the crowdworkers?: Shifting demographics in Mechanical Turk. In: CHI, pp. 2863–2872 (2010)

  60. Rubens, N., Kaplan, D., Sugiyama, M.: Active learning in recommender systems. In: Recommender Systems Handbook, pp. 735–767. Springer (2011)

  61. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Pearson Education, London (2003)

  62. Sarma, A.D., Jain, A., Nandi, A., Parameswaran, A., Widom, J.: Surpassing humans and computers with JELLYBEAN: crowd-vision-hybrid counting algorithms. In: HCOMP (2015)

  63. Shannon, C.E.: A mathematical theory of communication. SIGMOBILE 5, 3–55 (2001)

  64. Sheng, V.S., Provost, F.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: SIGKDD, pp. 614–622 (2008)

  65. Snow, R., O’Connor, B., Jurafsky, D., Ng, A.Y.: Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In: EMNLP, pp. 254–263 (2008)

  66. Sun, C., Rampalli, N., Yang, F., Doan, A.: Chimera: large-scale classification using machine learning, rules, and crowdsourcing. In: VLDB, pp. 1529–1540 (2014)

  67. Surowiecki, J.: The wisdom of crowds: why the many are smarter than the few and how collective wisdom shapes business. Econ. ESN 296, 63–65 (2004)

  68. TRAVAIL: Global Wage Report 2012–13. International Labour Organization (ILO) (2012)

  69. Amazon Mechanical Turk: http://www.mturk.com/ (2016)

  70. Vuurens, J., de Vries, A., Eickhoff, C.: How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In: CIR, pp. 48–55 (2011)

  71. Wang, D., Kaplan, L., Le, H., Abdelzaher, T.: On truth discovery in social sensing: a maximum likelihood estimation approach. In: IPSN, pp. 233–244 (2012)

  72. Welinder, P., Perona, P.: Online crowdsourcing: rating annotators and obtaining cost-effective labels. In: CVPRW, pp. 25–32 (2010)

  73. Wick, M., McCallum, A., Miklau, G.: Scalable probabilistic databases with factor graphs and mcmc. In: VLDB, pp. 794–804 (2010)

  74. Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. In: VLDB, pp. 279–289 (2011)

  75. Yan, T., Kumar, V., Ganesan, D.: Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones. In: MobiSys, pp. 77–90 (2010)

  76. Zaidan, O.F., Callison-Burch, C.: Crowdsourcing translation: professional quality from non-professionals. In: ACL, pp. 1220–1229 (2011)

  77. Zhang, C., Ré, C.: Towards high-throughput Gibbs sampling at scale: a study across storage managers. In: SIGMOD, pp. 397–408 (2013)

  78. Zhang, C.J., Chen, L., Jagadish, H.V., Cao, C.C.: Reducing uncertainty of schema matching via crowdsourcing. In: VLDB, pp. 757–768 (2013)

  79. Zhao, B., Rubinstein, B.I., Gemmell, J., Han, J.: A Bayesian approach to discovering truth from conflicting sources for data integration. In: VLDB, pp. 550–561 (2012)

Author information

Correspondence to Nguyen Quoc Viet Hung.

Appendices

A Further details on datasets

1.1 Real-world data and task design

We used real-world datasets from different domains, namely people (ppl), objects (obj), product reviews (prod), arguments (arg), and bluebirds (bb). Opting for a generic crowdsourcing setting, our task design uses the default multiple-choice question template from AMT [23, 32]. More complex task designs that account for human factors and exploit domain-specific knowledge are out of scope here; examples can be found in [34]. In the ppl dataset, workers count the number of people in a real-life image. The tasks of the obj dataset also require counting the number of people, but in digital-art pictures; these questions are more difficult than those of the ppl dataset because people in digital-art pictures are harder to recognize. In the prod dataset, workers annotate whether a review expresses a positive, neutral, or negative sentiment. The tasks of the arg dataset require crowd workers to extract claims and evidence related to a topic from Web articles. In the bb dataset, workers identify which of two types of birds appears in an image. The similarity function, an input of our model, is computed simply by uniformly normalizing the labels into natural number space. The ground truth and the expert validation are provided by experts in the field.
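
To make this normalization concrete, the following minimal sketch (in Python; not the code used in our experiments) derives such a similarity function for the counting labels of the ppl and obj datasets: labels are mapped to natural numbers and their distance is scaled uniformly into [0, 1].

```python
def label_similarity(labels):
    """Build a similarity function over an ordered label set, e.g. counts 0..5."""
    index = {l: i for i, l in enumerate(sorted(labels))}  # labels -> natural numbers
    spread = max(len(labels) - 1, 1)

    def sim(l1, l2):
        # Labels that are numerically close receive a similarity close to 1.
        return 1.0 - abs(index[l1] - index[l2]) / spread

    return sim

# Example with counting labels as in the ppl/obj datasets.
sim = label_similarity([0, 1, 2, 3, 4, 5])
print(sim(2, 3))   # 0.8 -- adjacent counts are highly similar
print(sim(0, 5))   # 0.0 -- extreme counts are dissimilar
```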

1.2 Synthetic data

We also used several generated datasets. Since these data should exhibit characteristics similar to those of real-world data, we considered several parameters for the data generation, in particular: (i) n, the number of objects; (ii) k, the number of workers; (iii) m, the number of labels; (iv) r, the reliability of normal workers, reflecting the probability that their answers are correct; (v) \(\sigma\), the percentage of spammers in the worker population; and (vi) sim, the similarity between labels, simulated as a uniform distribution over [0, 1]. For the synthetic datasets, we also simulated the ground truth (the correct labels) of the objects; it is hidden from the simulated workers and used only to simulate the answer validations.
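
As an illustration, the following sketch (Python/NumPy) sets up these parameters with the default values used in Appendix B; the exact generator used in our experiments may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of the synthetic data (defaults as used in Appendix B).
n, k, m = 50, 20, 3        # objects, workers, labels
r, sigma = 0.7, 0.25       # reliability of normal workers, fraction of spammers

# Hidden ground truth: one correct label per object (unknown to the simulated workers).
ground_truth = rng.integers(0, m, size=n)

# Pairwise label similarities drawn uniformly from [0, 1], symmetric, sim(l, l) = 1.
sim = rng.uniform(0.0, 1.0, size=(m, m))
sim = (sim + sim.T) / 2
np.fill_diagonal(sim, 1.0)
```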

An important part of our synthetic data is the crowd simulation. We follow a practical guideline [22] to simulate the different worker characteristics of the crowd. Specifically, we distribute the worker population into \(\alpha\%\) reliable workers, \(\beta\%\) sloppy workers, and \(\gamma\%\) spammers. Following a study of the crowd population at real-world crowdsourcing services [31], we assign the default values \(\alpha = 43\), \(\beta = 32\), and \(\gamma = 25\). Unless stated otherwise, the experiments use this distribution of worker types.
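
Continuing the sketch above (reusing rng, n, k, m, and ground_truth), the snippet below assigns worker types according to the 43/32/25 split and simulates their answers; the per-type accuracies for reliable and sloppy workers are illustrative assumptions rather than values prescribed by our model.

```python
# Worker types follow the default split alpha/beta/gamma = 43/32/25.
worker_types = rng.choice(["reliable", "sloppy", "spammer"],
                          size=k, p=[0.43, 0.32, 0.25])

def simulate_answer(worker_type, truth):
    """Simulate one worker's answer to a question whose correct label is `truth`."""
    if worker_type == "spammer":
        return int(rng.integers(0, m))                        # answers uniformly at random
    p_correct = 0.9 if worker_type == "reliable" else 0.6     # assumed per-type accuracies
    if rng.random() < p_correct:
        return int(truth)
    return int(rng.choice([l for l in range(m) if l != truth]))

# Answer matrix: rows are objects, columns are workers.
answers = np.array([[simulate_answer(t, ground_truth[o]) for t in worker_types]
                    for o in range(n)])
```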

B Evaluations of expert guidance (cont’d)

Fig. 15 Effect of number of workers

Fig. 16 Effect of worker reliability

In the following experiments, we analyze the effects of the guiding strategy under different crowdsourcing setups, varying the number of labels, the number of workers, worker reliability, question difficulty, and the presence of spammers. Since these experiments (except the one on question difficulty) require changing the workers' characteristics, which are not known for the real-world datasets, they are conducted on synthetic data.

We compare the results obtained with our guiding approach (hybrid) to a baseline guiding method that selects the object with the highest uncertainty to seek feedback (baseline):

$$\mathrm{select}(O) = \mathop{\mathrm{arg\,max}}_{o \in O} H(o)$$

Our hybrid approach differs from the baseline in that it additionally considers the consequences of a validation as well as the mutually reinforcing relation between worker reliability and assignment correctness.
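
For concreteness, the following is a small sketch of the baseline strategy: given the current assignment probabilities per object (here simply a given matrix; in our setting they come from the probabilistic aggregation model), it selects the object with maximal Shannon entropy \(H(o)\). The hybrid strategy is not shown, since it additionally scores the expected consequences of a validation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def select_baseline(assignment_probs):
    """assignment_probs: (n_objects, n_labels) matrix of P(label | object)."""
    uncertainties = np.apply_along_axis(entropy, 1, assignment_probs)
    return int(np.argmax(uncertainties))   # index of the next object to validate

# Example: object 1 has the most uncertain assignment and is selected first.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33],
                  [0.60, 0.30, 0.10]])
print(select_baseline(probs))   # -> 1
```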

1.1 Effect of the number of workers

The idea behind crowdsourcing is that individual crowd answers complement each other. Thus, the aggregation of answers should be closer to the truth as more workers participate [67]. To evaluate the effect of the number of workers on the performance of our approach, we rely on a synthetic dataset containing 50 objects and vary the number of workers k, each assigning one of three labels to the objects, from 20 to 40. Figure 15 illustrates an important finding: our approach leads to better results for any number of workers. For a fixed amount of expert input, precision increases if more workers are employed. The reason is the widely quoted “wisdom of the crowd” [67]. Another finding is that the precision improvement for the same amount of expert input is higher if we have more workers (rightmost plot in Fig. 15). This is expected since, by having more workers, we acquire more answers per question, which results in better estimates of assignment probabilities and worker reliabilities. Our approach thus has a higher chance of selecting the objects that lead to a large gain in correctness.

In sum, the two findings suggest that increasing the number of workers is beneficial not only for computing assignment probabilities, but also for guiding answer validation. For the remaining experiments, we fix the number of workers to the smallest tested value (\(k=20\)), which is the most challenging scenario.

Fig. 17 Effect of spammers

1.2 Effect of worker reliability

We further explored the effects of the worker reliability r on the effectiveness of our approach. As above, we used a dataset of 20 workers assigning one out of three labels to 50 objects. We then varied the reliability of the non-spammer workers from 0.65 to 0.75.

Figure 16 illustrates a significant improvement in precision using our approach (hybrid) compared to the baseline method. For instance, if the average worker reliability is 0.7, achieving a precision of 0.95 requires expert input for 20% of the objects with our approach, whereas the baseline method requires input for 50% of the objects. In other words, the baseline method requires 2.5 times the effort of our approach. Also, for the same amount of feedback, precision increases when the average reliability of the workers is higher (rightmost plot in Fig. 16). This is because an answer set provided by reliable workers requires less validation than an answer set coming from unreliable workers.

1.3 Effect of spammers

In this experiment, we studied the robustness of our guiding approach to spammers using the same dataset as in the previous experiment (20 workers, three labels, 50 objects). We varied the percentage of spammers \(\sigma\) in the worker population from 15 to 35%.

Independent of the percentage of spammers, our approach (hybrid) outperforms the baseline method (see Fig. 17). The largest difference between the two approaches is observed when the percentage of spammers is 15%. In that case, to achieve a precision of 0.95, our approach needs expert input for 20% of the objects, while the baseline method requires 50%. Regarding the precision improvement (rightmost plot in Fig. 17), the results are relatively similar across different percentages of spammers. For instance, using 50% of expert input, we are able to increase the precision of the deterministic assignment by 80%, independent of the percentage of spammers. Hence, our approach is indeed robust to the presence of spammers.

1.4 Effects of question difficulty

Besides worker reliability, another factor that can affect the performance of our method is question difficulty. For hard questions, even reliable workers may give incorrect answers. Hence, we analyze the effects of question difficulty on the performance of our approach. We compared our approach with the baseline approach on two datasets, ppl and obj, where the questions in the obj dataset are harder than those in the ppl dataset. The experimental results are shown in Fig. 18, where the x-axis depicts the expert effort and the y-axis the precision of the deterministic assignment.

We observe that our approach outperforms the baseline approach on both datasets, meaning that it is robust against question difficulty. For instance, for the ppl dataset with easy questions, our approach needs only 20% of expert effort to achieve a precision of 0.95, while the baseline approach needs over 60%. Also, the performance of our approach is better when the questions are easy than when they are hard. This is expected and can be explained as follows. In the dataset with easy questions, most of the workers give the correct answers, which keeps the uncertainty in the dataset low. As a result, with the same amount of feedback, we can improve the precision more than when the questions are hard.

C Cost trade-offs (cont’d)

We complement the experiments reported in Sect. 7.9 by studying the effects of question difficulty, spammers, and worker reliability when comparing the EV approach with the WO approach.

Fig. 18 Effects of question difficulty

1.1 Effects of question difficulty

In this experiment, we compare our EV approach with the WO approach with respect to the difficulty of the questions. We randomly remove answers from the answer matrix such that 13 answers remain per question (\(\phi_0 = 13\)). Then, to simulate the addition of answers in the WO approach, we add the removed answers back to the questions. We fix the expert-crowd cost ratio to \(\theta = 25\) and average the results over 100 experiment runs.
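
The cost accounting underlying this comparison can be sketched as follows, under the simplifying assumption that one additional crowd answer costs one unit and one expert validation costs \(\theta\) units, with the total cost normalized per question; our cost model in Sect. 7.9 may differ in detail.

```python
def normalized_cost(n_questions, extra_answers=0, expert_validations=0, theta=25):
    """Total cost of extra crowd answers plus expert validations, per question."""
    return (extra_answers * 1 + expert_validations * theta) / n_questions

# WO: add 7 more answers to each of 50 questions (e.g. from phi_0 = 13 up to 20).
print(normalized_cost(n_questions=50, extra_answers=50 * 7))    # 7.0
# EV: instead, ask the expert to validate 20% of the 50 questions.
print(normalized_cost(n_questions=50, expert_validations=10))   # 5.0
```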

Fig. 19 Effect of question difficulty on cost

The experimental results are shown in Fig. 19, where the x-axis depicts the normalized cost and the y-axis the precision improvement of the deterministic assignment. The precision improvement of the EV approach is always higher than that of the WO approach, indicating that our EV approach is robust against the effects of question difficulty.

1.2 Effects of spammers

In this experiment, we analyze the effects of spammers by varying the percentage of spammers in the dataset from 15 to 35%. The experiment is conducted on the synthetic dataset with \(\phi _0 = 13\), \(\theta =25\).

Fig. 20 Effects of spammers on cost

The results illustrated in Fig. 20 show the benefits of our approach for different percentages of spammers. The EV approach achieves a high precision improvement at a small cost. For instance, when \(\sigma = 35\%\), improving the precision by 80% requires a cost of 30 for the EV approach, while the WO approach needs twice that amount. Also, the more spammers are part of the population, the better the EV approach performs relative to the WO approach. For example, the difference in cost to achieve an 80% precision improvement is about 15 when the percentage of spammers is 15%, but it doubles to about 30 as the percentage of spammers increases to 35%. Again, the reason is that as the percentage of spammers increases, the WO approach suffers from adding more answers, since these answers are more likely to come from unreliable workers.

1.3 Effects of worker reliability

Worker reliability affects the quality of crowd answers and thus also the cost. If the worker reliability is high, the expert needs to spend less effort on feedback, as most of the answers are already correct. If the worker reliability is low, more expert feedback is required to achieve the same precision. In this experiment, we analyze the effects of worker reliability on the cost of validating the crowd answers by varying the reliability of the normal workers from 0.6 to 0.7. As in the previous experiment, we fix \(\phi_0 = 13\) and \(\theta = 25\), and the worker population is simulated as discussed in Sect. 7.1.

Fig. 21 Effects of worker reliability on cost

The results are illustrated in Fig. 21, which relates the cost normalized per question to the precision of the deterministic assignment. Interestingly, when the reliability of the normal workers is 0.6, the precision of the deterministic assignment under the WO approach converges to 0 as more answers are added. The reason is that lowering the worker reliability pushes the average reliability of the whole population below 0.5, which drives the precision towards 0. This shows that adding more answers to the answer set may not improve but rather reduce quality due to unreliable workers. When the reliability of the workers is 0.65, the precision under the WO approach improves only very slowly, as the average reliability of the whole population is about 0.5. When the reliability of the workers is 0.7, the precision of the WO approach converges to 1; yet, it requires a higher cost to reach the same precision as the EV approach. In summary, this experiment shows that our approach is robust to different levels of worker reliability.
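
The threshold effect around an average reliability of 0.5 can be illustrated with a quick, assumption-based simulation (binary labels, plain majority voting, independent workers); this is not our aggregation model, but it exhibits the same qualitative behavior: below 0.5, adding answers hurts, and above 0.5 it helps.

```python
import numpy as np

rng = np.random.default_rng(1)

def majority_vote_precision(avg_reliability, n_objects=1000, n_workers=101):
    """Precision of majority voting when each worker is correct with the given probability."""
    truth = rng.integers(0, 2, size=n_objects)
    correct = rng.random((n_objects, n_workers)) < avg_reliability
    answers = np.where(correct, truth[:, None], 1 - truth[:, None])
    voted = (answers.sum(axis=1) > n_workers / 2).astype(int)
    return (voted == truth).mean()

for rel in (0.45, 0.50, 0.55):
    print(rel, majority_vote_precision(rel))
# roughly 0.16, 0.50, 0.84 -- below 0.5 more answers hurt, above 0.5 they help
```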

Cite this article

Hung, N.Q.V., Thang, D.C., Tam, N.T. et al. Answer validation for generic crowdsourcing tasks with minimal efforts. The VLDB Journal 26, 855–880 (2017). https://doi.org/10.1007/s00778-017-0484-3
