Crowd labeling latent Dirichlet allocation

  • Regular Paper
Knowledge and Information Systems

Abstract

Large, unlabeled datasets are abundant nowadays, but getting labels for those datasets can be expensive and time-consuming. Crowd labeling is a crowdsourcing approach for gathering such labels from workers whose suggestions are not always accurate. While a variety of algorithms exist for this purpose, we present crowd labeling latent Dirichlet allocation (CL-LDA), a generalization of latent Dirichlet allocation that can solve a more general set of crowd labeling problems. We show that it performs as well as other methods and at times better on a variety of simulated and actual datasets while treating each label as compositional rather than indicating a discrete class. In addition, prior knowledge of workers’ abilities can be incorporated into the model through a structured Bayesian framework. We then apply CL-LDA to the EEG independent component labeling dataset, using its generalizations to further explore the utility of the algorithm. We discuss prospects for creating classifiers from the generated labels.

Acknowledgements

The authors would like to thank Paul D. Pion, DVM, DipACVIM (Cardiology), for his input on which EEG components contain signals originating from heart activity. This work was supported in part by a gift from the Swartz Foundation (Old Field, NY) and by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1144086.

Author information

Corresponding author

Correspondence to Luca Pion-Tonachini.

About this article

Cite this article

Pion-Tonachini, L., Makeig, S. & Kreutz-Delgado, K. Crowd labeling latent Dirichlet allocation. Knowl Inf Syst 53, 749–765 (2017). https://doi.org/10.1007/s10115-017-1053-1

  • DOI: https://doi.org/10.1007/s10115-017-1053-1
