Abstract
In crowdsourcing scenarios, we can obtain multiple noisy labels for each instance from different crowd workers and then infer its integrated label via label aggregation. Despite the effectiveness of label aggregation methods, a certain level of noise still remains in the integrated labels. Thus, noise correction methods have been proposed in recent years to reduce the impact of this noise. However, to the best of our knowledge, existing methods rarely consider an instance’s information from both its features and its multiple noisy labels simultaneously when identifying a noise instance. In this study, we argue that the more distinguishable an instance’s features but the noisier its multiple noisy labels, the more likely it is a noise instance. Based on this premise, we propose a label distribution similarity-based noise correction (LDSNC) method. To measure whether an instance’s features are distinguishable, we obtain each instance’s predicted label distribution by building multiple classifiers using instances’ features and their integrated labels. To measure whether an instance’s multiple noisy labels are noisy, we obtain each instance’s multiple noisy label distribution from its multiple noisy labels. Then, we use the Kullback-Leibler (KL) divergence to calculate the similarity between the predicted label distribution and the multiple noisy label distribution, and identify instances with lower similarity as noise instances. Extensive experimental results on 34 simulated and four real-world crowdsourced datasets validate the effectiveness of our method.
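The core idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact formulation: the function names, the plain count-based noisy label distribution, and the fixed KL threshold are all assumptions made for the sake of the example.

```python
import numpy as np

def noisy_label_distribution(labels, n_classes):
    """Class-frequency distribution of one instance's multiple noisy labels."""
    counts = np.bincount(np.asarray(labels), minlength=n_classes).astype(float)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, smoothed to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def flag_noise_instances(pred_dists, worker_labels, n_classes, threshold):
    """Mark an instance as a noise instance when its predicted label
    distribution diverges strongly (high KL, i.e., low similarity)
    from its multiple noisy label distribution."""
    flags = []
    for p, labels in zip(pred_dists, worker_labels):
        q = noisy_label_distribution(labels, n_classes)
        flags.append(kl_divergence(p, q) > threshold)
    return flags
```

For example, an instance whose classifiers confidently predict class 0 (distribution [0.9, 0.1]) but whose workers mostly voted for class 1 (labels [1, 1, 0]) yields a high KL divergence and is flagged, whereas the same prediction paired with worker labels [0, 0, 1] is not.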
Acknowledgements
The work was partially supported by the National Natural Science Foundation of China (Grant No. 62276241) and Foundation of Key Laboratory of Artificial Intelligence, Ministry of Education, China (AI2022004).
Lijuan Ren is currently an MSc student at the School of Computer Science, China University of Geosciences, China. Her research interests mainly include machine learning and data mining (MLDM).
Liangxiao Jiang is currently a professor at the School of Computer Science, China University of Geosciences, China. His research interests mainly include machine learning and data mining (MLDM). In MLDM domains, he has already published more than 90 papers.
Wenjun Zhang is currently a PhD student at the School of Computer Science, China University of Geosciences, China. His main research interests include machine learning and data mining (MLDM). In MLDM domains, he has published two scientific articles in Science China Information Sciences and Journal of Computer Research and Development.
Chaoqun Li is currently an associate professor at the School of Mathematics and Physics, China University of Geosciences, China. Her research interests mainly include machine learning and data mining (MLDM). In MLDM domains, she has already published more than 50 papers.
Cite this article
Ren, L., Jiang, L., Zhang, W. et al. Label distribution similarity-based noise correction for crowdsourcing. Front. Comput. Sci. 18, 185323 (2024). https://doi.org/10.1007/s11704-023-2751-3