Knowledge and Information Systems, Volume 53, Issue 3, pp 749–765

Crowd labeling latent Dirichlet allocation

  • Luca Pion-Tonachini
  • Scott Makeig
  • Ken Kreutz-Delgado
Regular Paper


Large, unlabeled datasets are abundant nowadays, but getting labels for those datasets can be expensive and time-consuming. Crowd labeling is a crowdsourcing approach for gathering such labels from workers whose suggestions are not always accurate. While a variety of algorithms exist for this purpose, we present crowd labeling latent Dirichlet allocation (CL-LDA), a generalization of latent Dirichlet allocation that can solve a more general set of crowd labeling problems. We show that it performs as well as other methods and at times better on a variety of simulated and actual datasets while treating each label as compositional rather than indicating a discrete class. In addition, prior knowledge of workers’ abilities can be incorporated into the model through a structured Bayesian framework. We then apply CL-LDA to the EEG independent component labeling dataset, using its generalizations to further explore the utility of the algorithm. We discuss prospects for creating classifiers from the generated labels.
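To illustrate what a compositional label means, as opposed to a single discrete class, here is a minimal sketch. It is not CL-LDA and not the authors' method: it is plain vote counting smoothed with a symmetric Dirichlet prior. The function name `compositional_label`, the class names, the votes, and the `alpha` value are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def compositional_label(votes, classes, alpha=1.0):
    """Return class proportions for one item from its crowd votes.

    This is the posterior mean of a categorical distribution under a
    symmetric Dirichlet(alpha) prior: simple smoothed vote counting,
    used here only as an illustrative stand-in (assumption, not CL-LDA).
    """
    counts = Counter(votes)
    total = len(votes) + alpha * len(classes)
    return {c: (counts[c] + alpha) / total for c in classes}


# Hypothetical classes and worker votes for a single item.
classes = ["brain", "eye", "muscle"]
votes = ["brain", "brain", "eye", "brain"]

# Compositional label: a vector of proportions that sums to 1.
label = compositional_label(votes, classes)

# Discrete label: collapses the composition to its largest component.
hard_label = max(label, key=label.get)

print(label)
print(hard_label)
```

The compositional form keeps the minority vote as a nonzero proportion, whereas the discrete label discards it. A model such as the one described above would additionally account for differences in worker accuracy, which this sketch ignores.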


Crowd labeling; Generative model; Bayesian; Latent Dirichlet allocation; EEG



The authors would like to thank Paul D. Pion, DVM, DipACVIM (Cardiology), for his input on which EEG components contain signals originating from heart activity. This work was supported in part by a gift from the Swartz Foundation (Old Field, NY) and by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1144086.



Copyright information

© Springer-Verlag London 2017

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, University of California at San Diego, San Diego, USA
  2. Swartz Center for Computational Neuroscience, University of California at San Diego, San Diego, USA
  3. Calit2/QI Pattern Recognition Laboratory, University of California at San Diego, San Diego, USA
