Abstract
This paper proposes a crowdsourcing quality control method that preserves worker privacy. Crowdsourcing allows us to outsource tasks to a number of workers. The results obtained through crowdsourcing are often of low quality because workers differ in skill. Therefore, we need quality control methods that estimate reliable results from low-quality ones. In this paper, we point out privacy problems of workers in crowdsourcing: personal information of a worker can be inferred from the results that the worker provides. To formulate and address these privacy problems, we define the worker-private quality control problem, a variation of the quality control problem that preserves the privacy of workers. We propose a worker-private latent class protocol with which a requester can estimate the true results while worker privacy is preserved. The key ideas are decentralization of computation and the introduction of secure computation. We theoretically guarantee the security of the proposed protocol and experimentally examine its computational efficiency and accuracy.
References
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp 439–450
Bernstein M, Chi EH, Chilton L, Hartmann B, Kittur A, Miller RC (2011) Crowdsourcing and human computation: systems, studies and platforms. In: Proceedings of CHI 2011 Workshop on Crowdsourcing and Human Computation, pp 53–56
Burkhart M, Strasser M, Many D, Dimitropoulos X (2010) SEPIA: privacy-preserving aggregation of multi-domain network events and statistics. In: Proceedings of the 19th USENIX Conference on Security, pp 223–240
Damgård I, Jurik M (2001) A generalisation, a simplification and some applications of Paillier’s probabilistic public-key system. In: Proceedings of the 4th International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography, pp 119–136
Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J R Stat Soc Ser C 28(1):20–28
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
Ertekin S, Hirsh H, Rudin C (2012) Learning to predict the wisdom of crowds. In: Proceedings of Collective Intelligence 2012
Lease M (2011) On quality control and machine learning in crowdsourcing. In: Proceedings of the Third Human Computation Workshop, pp 97–102
Lin X, Clifton C, Zhu M (2005) Privacy-preserving clustering with distributed EM mixture modeling. Knowl Inf Syst 8(1):68–81
Lindell Y, Pinkas B (2000) Privacy preserving data mining. In: Advances in Cryptology - CRYPTO ’00, pp 36–54
Nabar SU, Kenthapadi K, Mishra N, Motwani R (2008) A survey of query auditing techniques for data privacy. In: Privacy-Preserving Data Mining: Models and Algorithms, pp 415–431
Raykar VC, Yu S, Zhao LH, Florin C, Bogoni L, Moy L (2010) Learning from crowds. J Mach Learn Res 11:1297–1322
Shamir A (1979) How to share a secret. Commun ACM 22(11):612–613. doi:10.1145/359168.359176
Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 614–622
Sweeney L (2002) k-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Syst 10(5):557–570. doi:10.1142/S0218488502001648
Varshney LR (2012) Privacy and reliability in crowdsourcing service delivery. In: Proceedings of the 2012 Annual SRII Global Conference, pp 55–60
Welinder P, Branson S, Belongie S, Perona P (2010) The multidimensional wisdom of crowds. Adv Neural Inf Process Syst 23:2424–2432
Whitehill J, Ruvolo P, Wu T, Bergsma J, Movellan J (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. Adv Neural Inf Process Syst 22:2035–2043
Yang B, Sato I, Nakagawa H (2012) Privacy-preserving EM algorithm for clustering on social network. In: Advances in Knowledge Discovery and Data Mining: 16th Pacific-Asia Conference, PAKDD 2012, pp 542–553
Acknowledgments
H. Kajino and H. Kashima were supported by the FIRST program.
Additional information
Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, and Rosa Meo.
Appendix 1: Extensions to multi-class and real-valued labels
We give the detailed update rules of the modified LC methods that handle multi-class and real-valued labels, and then explain how to extend the inference algorithms to preserve worker privacy.
Appendix 1.1: Multi-class labels
The LC method was originally proposed for multi-class labels by Dawid and Skene (1979). Let us assume a task to give a \(K\)-class label (\(K\ge 2\)). For each \(i\in {\mathcal I}\) and \(j\in {\mathcal {J}}\), a crowd label \(y_{i,j}\in \{0,\dots ,K-1\}(=:{\mathcal K})\) is generated by the multinomial distribution
$$\begin{aligned} \Pr [y_{i,j} = k \mid y_i = l] = \pi _{jkl}, \end{aligned}$$
where \(\sum _{k\in {\mathcal K}} \pi _{jkl} = 1\) holds for all \(l\in {\mathcal K}\), and we denote \(\Pi _{j} = \{\pi _{jkl} \mid k,l\in {\mathcal K}\}\). Also, for each \(i\in {\mathcal I}\), the true label \(y_i\in {\mathcal K}\) is generated by
$$\begin{aligned} \Pr [y_i = l] = p_l, \end{aligned}$$
where \(\sum _{l\in {\mathcal K}} p_l = 1\) holds. The model parameters \(\Pi =\bigcup _{j\in {\mathcal {J}}}\Pi _{j}\) and \(\{p_l \mid l\in {\mathcal K}\}\) and the posterior probabilities of the true labels \(\mu _{il} = \Pr [y_i = l \mid {\mathcal Y}, \Pi ]\) are estimated using the following EM algorithm.

E-step:
for each \(i\in {\mathcal I}\), update \(\{\mu _{il} \mid l\in {\mathcal K}\}\) as
$$\begin{aligned} \mu _{il}&= \dfrac{p_l \rho _{il}}{\sum _{l^{\prime }\in {\mathcal K}} p_{l^{\prime }}\rho _{il^{\prime }}},\\ \mathrm{where\ } \log \rho _{il}&= \sum _{j\in {\mathcal {J}}_{i}} \sum _{k\in {\mathcal K}} {\mathbf I}(y_{i,j}=k) \log \pi _{jkl}. \end{aligned}$$
M-step:
for each \(j\in {\mathcal {J}}\), update \(\Pi _j\) as
$$\begin{aligned} \pi _{jkl} = \dfrac{\sum _{i\in {\mathcal I}_j} \mu _{il} {\mathbf I}(y_{i,j} = k)}{\sum _{i\in {\mathcal I}_j} \mu _{il}}, \end{aligned}$$
and for each \(l\in {\mathcal K}\), update \(p_l\) as
$$\begin{aligned} p_l = \dfrac{1}{|{\mathcal I}|}\sum _{i\in {\mathcal I}}\mu _{il}. \end{aligned}$$
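The E-step and M-step above can be sketched in plain Python as follows. This is a minimal, centralized sketch without any privacy protection, not the authors' implementation. The function name `lc_em`, the dictionary layout of `labels`, the soft-majority-vote initialisation, and the small `smoothing` constant that keeps every \(\pi _{jkl}\) strictly positive are all assumptions made for illustration.

```python
import math


def lc_em(labels, num_classes, num_iters=50, smoothing=1e-6):
    """Latent class EM for K-class crowd labels.

    labels: dict mapping (task i, worker j) -> crowd label in {0, ..., K-1}.
    Returns (mu, pi, p) with mu[i][l], pi[j][k][l] and p[l] as in the text.
    """
    K = num_classes
    tasks = sorted({i for i, _ in labels})
    workers = sorted({j for _, j in labels})

    # Assumed initialisation: mu_il = fraction of votes for class l on task i.
    mu = {i: [0.0] * K for i in tasks}
    for (i, _), k in labels.items():
        mu[i][k] += 1.0
    for i in tasks:
        total = sum(mu[i])
        mu[i] = [v / total for v in mu[i]]

    pi, p = {}, []
    for _ in range(num_iters):
        # M-step: pi_jkl = sum_i mu_il I(y_ij = k) / sum_i mu_il (smoothed).
        num = {j: [[smoothing] * K for _ in range(K)] for j in workers}
        den = {j: [K * smoothing] * K for j in workers}
        for (i, j), k in labels.items():
            for l in range(K):
                num[j][k][l] += mu[i][l]
                den[j][l] += mu[i][l]
        pi = {j: [[num[j][k][l] / den[j][l] for l in range(K)]
                  for k in range(K)] for j in workers}
        # M-step: p_l = (1 / |I|) sum_i mu_il.
        p = [sum(mu[i][l] for i in tasks) / len(tasks) for l in range(K)]

        # E-step: log rho_il = sum_j sum_k I(y_ij = k) log pi_jkl.
        log_rho = {i: [0.0] * K for i in tasks}
        for (i, j), k in labels.items():
            for l in range(K):
                log_rho[i][l] += math.log(pi[j][k][l])
        # E-step: mu_il proportional to p_l * rho_il, normalised in log space.
        for i in tasks:
            scores = [math.log(max(p[l], 1e-300)) + log_rho[i][l]
                      for l in range(K)]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            z = sum(weights)
            mu[i] = [w / z for w in weights]
    return mu, pi, p
```

A call such as `mu, pi, p = lc_em({(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, num_classes=2)` returns the posteriors, the per-worker confusion parameters, and the class prior; taking the arg max of `mu[i]` gives a point estimate of the true label of task `i`.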
This algorithm can be extended to preserve worker privacy. In the E-step, the parties calculate \(\{\log \rho _{il} \mid i\in {\mathcal I}, l\in {\mathcal K}\}\) using our secure sum protocol, and the requester calculates and broadcasts \(\{\mu _{il}\mid i\in {\mathcal I}, l\in {\mathcal K}\}\). In the M-step, each worker \(j\) calculates \(\{\pi _{jkl} \mid k,l\in {\mathcal K}\}\), and the requester calculates \(\{p_l \mid {l\in {\mathcal K}}\}\).
Appendix 1.2: Real-valued labels
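The quantity \(\log \rho _{il}\) is a sum of per-worker terms (worker \(j\) contributes \(\log \pi _{jkl}\) for its own label \(k = y_{i,j}\)), which is what makes a secure sum applicable. This appendix does not restate the secure sum protocol itself, so the sketch below substitutes a generic additive secret-sharing sum over a prime modulus with fixed-point encoding. It is a stand-in for illustration, not the protocol of the paper: `MODULUS`, `SCALE`, and all function names are assumptions, and all parties are simulated in a single process.

```python
import random

MODULUS = 2**61 - 1  # assumed prime modulus, large enough for the sums below
SCALE = 10**6        # assumed fixed-point scale for real-valued inputs


def encode(x):
    """Map a real number to a fixed-point residue modulo MODULUS."""
    return round(x * SCALE) % MODULUS


def decode(v):
    """Inverse of encode; residues above MODULUS / 2 represent negatives."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def split_into_shares(value, n, rng):
    """Split a residue into n random shares that add up to it mod MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values, rng=None):
    """Sum one private value per worker without revealing any single value.

    Worker j sends its m-th share to worker m. Each worker then publishes
    only the sum of the shares it received, so the aggregator sees n
    partial sums that individually look uniformly random.
    """
    rng = rng or random.SystemRandom()
    n = len(private_values)
    shares = [split_into_shares(encode(x), n, rng) for x in private_values]
    partial_sums = [sum(shares[j][m] for j in range(n)) % MODULUS
                    for m in range(n)]
    return decode(sum(partial_sums) % MODULUS)
```

In this setting, `private_values` would hold the workers' terms \(\log \pi _{jkl}\) for a fixed task \(i\) and class \(l\), and the returned total would play the role of \(\log \rho _{il}\), up to the rounding introduced by `SCALE`.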
The LC method was modified to deal with real-valued labels by Raykar et al. (2010). Let us assume a task to give a real-valued label. For each \(i\in {\mathcal I}\) and \(j\in {\mathcal {J}}\), a crowd label \(y_{i,j}\in \mathbb {R}\) is generated by the normal distribution
where \(\tau _j (> 0)\) is the precision parameter of the normal distribution, which is interpreted as the ability of worker \(j\), and \(\gamma \) works as regularization. Let us denote \(1/\lambda _j := 1/\tau _j + 1/\gamma \). Assuming that the crowd labels were generated by this model, the true labels and the precision parameters are estimated by the following EM-like algorithm.

E-step: for each \(i\in {\mathcal I}\), update the true label \(y_i\) as
$$\begin{aligned} y_i = \dfrac{\sum _{j\in {\mathcal {J}}_i} \lambda _j y_{i,j}}{\sum _{j\in {\mathcal {J}}_i} \lambda _j}. \end{aligned}$$
M-step: for each \(j\in {\mathcal {J}}\), update \(\lambda _j\) by solving
$$\begin{aligned} \dfrac{1}{\lambda _j} = \dfrac{1}{|{\mathcal I}_j|}\sum _{i\in {\mathcal I}_j} (y_{i,j} - y_i)^2. \end{aligned}$$
This algorithm can also be extended to preserve worker privacy. In the E-step, the parties calculate \(\left\{ \sum _{j\in {\mathcal {J}}_i} \lambda _j y_{i,j}, \sum _{j\in {\mathcal {J}}_i} \lambda _j \mid {i\in {\mathcal I}}\right\} \) using our secure sum protocol, and the requester calculates and broadcasts \(\{y_i \mid {i\in {\mathcal I}}\}\). In the M-step, each worker \(j\) calculates \(\lambda _j\).
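The division of labour described here can be sketched in plain Python as follows. The sketch is a single-process simulation under stated assumptions, not the authors' implementation: the names `real_valued_lc` and `worker_terms`, the initial value \(\lambda _j = 1\), and the variance floor `min_var` (which keeps \(\lambda _j\) finite when a worker's residuals vanish) are hypothetical, and the plain additions marked in the comments stand in for the secure sum.

```python
def worker_terms(lam_j, y_ij):
    """Computed locally by worker j: its two E-step summands for one task."""
    return lam_j * y_ij, lam_j


def real_valued_lc(labels, num_iters=20, min_var=1e-9):
    """EM-like estimation of real-valued true labels.

    labels: dict mapping (task i, worker j) -> real-valued crowd label.
    Returns (y, lam): estimated true labels and per-worker precisions.
    """
    tasks = sorted({i for i, _ in labels})
    workers = sorted({j for _, j in labels})
    lam = {j: 1.0 for j in workers}  # assumed initial precisions
    y = {}
    for _ in range(num_iters):
        # E-step: the two sums below are the only cross-worker aggregates;
        # in a private run they would come from a secure sum, not from +=.
        num = {i: 0.0 for i in tasks}
        den = {i: 0.0 for i in tasks}
        for (i, j), y_ij in labels.items():
            weighted, weight = worker_terms(lam[j], y_ij)
            num[i] += weighted
            den[i] += weight
        # The requester forms y_i and broadcasts it to the workers.
        y = {i: num[i] / den[i] for i in tasks}

        # M-step: each worker uses only its own labels and the broadcast y_i.
        sq_err = {j: 0.0 for j in workers}
        count = {j: 0 for j in workers}
        for (i, j), y_ij in labels.items():
            sq_err[j] += (y_ij - y[i]) ** 2
            count[j] += 1
        lam = {j: 1.0 / max(sq_err[j] / count[j], min_var) for j in workers}
    return y, lam
```

Because \(1/\lambda _j\) is the mean squared deviation from the current estimates, the updates can concentrate almost all weight on a single worker whose labels the estimates then reproduce; the `min_var` floor only keeps the arithmetic finite in that case and is not part of the update rules above.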
Cite this article
Kajino, H., Arai, H. & Kashima, H. Preserving worker privacy in crowdsourcing. Data Min Knowl Disc 28, 1314–1335 (2014). https://doi.org/10.1007/s10618-014-0352-3
Keywords
 Crowdsourcing
 Quality control
 Privacy-preserving data mining
 EM algorithm