Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey

  • Published in: Multimedia Tools and Applications

Abstract

Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles on topic modeling and have applied it in fields such as software engineering, political science, medical science, and linguistics. Of the various topic modeling methods, Latent Dirichlet Allocation (LDA) is one of the most popular, and researchers have proposed many models based on it. This paper therefore aims to serve as a useful introduction to LDA approaches in topic modeling. We investigate scholarly articles published between 2003 and 2016 on LDA-based topic modeling to discover the research development, current trends, and intellectual structure of the field. In addition, we summarize challenges and introduce well-known tools and datasets for LDA-based topic modeling.


References

  1. Ahmed A et al (2012) Scalable inference in latent variable models. In: Proceedings of the fifth ACM international conference on web search and data mining. ACM

  2. Alam MH, Ryu W-J, Lee S (2016) Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Inf Sci 339:206–223

    Article  Google Scholar 

  3. Alashri S et al (2016) An analysis of sentiments on facebook during the 2016 US presidential election. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2016. IEEE

  4. AlSumait L, Barbara D, Domeniconi C (2008) On-line lda: adaptive topic models for mining text streams with applications to topic detection and tracking. In: Eighth IEEE International Conference on Data Mining, 2008. ICDM’08. IEEE

  5. Asgari E, Chappelier J-C (2013) Linguistic Resources and Topic Models for the Analysis of Persian Poems in CLfL@ NAACL-HLT

  6. Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering, vol 1. ACM

  7. Bagheri A, Saraee M, De Jong F (2014) ADM-LDA: an aspect detection model based on topic modelling using the structure of review sentences. J Inf Sci 40 (5):621–636

    Article  Google Scholar 

  8. Balasubramanyan R et al (2012) Modeling polarizing topics: When do different political communities respond differently to the same news? in ICWSM

  9. Bauer S et al (2012) Talking places: Modelling and analysing linguistic content in foursquare. In: Privacy, security, risk and trust (PASSAT), 2012 international conference on and 2012 international confernece on social computing (SocialCom). IEEE

  10. Bhattacharya P et al (2014) Inferring user interests in the twitter social network. In: Proceedings of the 8th ACM conference on recommender systems. ACM

  11. Bisgin H et al (2014) A phenome-guided drug repositioning through a latent variable model. BMC Bioinforma 15(1):267

    Article  Google Scholar 

  12. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022

    MATH  Google Scholar 

  13. Blei DM, Jordan MI (2003) Modeling annotated data. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in informaion retrieval. ACM

  14. Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on machine learning. ACM

  15. Chaney AJ-B, Blei DM (2012) Visualizing Topic Models in ICWSM

  16. Chang J, Blei DM (2009) Relational topic models for document networks in international conference on artificial intelligence and statistics

  17. Chang J (2011) lda: collapsed Gibbs sampling methods for topic models. R

  18. Chen B et al (2010) What is an opinion about? Exploring political standpoints using opinion scoring model. In: AAAI

  19. Chen T-H et al (2012) Explaining software defects using topic models. In: 2012 9th IEEE working conference on mining software repositories (MSR), IEEE

  20. Chen L et al (2013) WT-LDA: user tagging augmented LDA for web service clustering. In: International conference on service-oriented computing. Springer

  21. Chen S-H et al (2015) Latent dirichlet allocation based blog analysis for criminal intention detection system. In: 2015 International Carnahan Conference on Security Technology (ICCST). IEEE

  22. Chen T-H, Thomas SW, Hassan AE (2016) A survey on the use of topic models when mining software repositories. Empir Softw Eng 21(5):1843–1919

    Article  Google Scholar 

  23. Cheng VC et al (2014) Probabilistic aspect mining model for drug reviews. IEEE Trans Knowl Data Eng 26(8):2002–2013

    Article  Google Scholar 

  24. Cheng X et al (2014) Btm: topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering 26(1):2928–2941

  25. Cheng Z, Shen J (2016) On effective location-aware music recommendation. ACM Transactions on Information Systems (TOIS) 34(2):13

    Article  MathSciNet  Google Scholar 

  26. Chien J-T, Chueh C-H (2011) Dirichlet class language models for speech recognition. IEEE Transactions on Audio Speech, and Language Processing 19 (3):482–495

    Article  Google Scholar 

  27. Chong W, Blei D, Li F-F (2009) Simultaneous image classification and annotation. In: IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE

  28. Choo J et al (2013) Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE transactions on visualization and computer graphics 19(12):1992–2001

    Article  Google Scholar 

  29. Chuang J, Manning CD, Heer J (2012) Termite: Visualization techniques for assessing textual topic models. In: Proceedings of the international working conference on advanced visual interfaces. ACM

  30. Cohen R, Ruths D (2013) Classifying political orientation on twitter: it’s not easy!. In: ICWSM

  31. Cohen R et al (2014) Redundancy-aware topic modeling for patient record notes. PloS one 9(2):e87555

    Article  Google Scholar 

  32. Cong Y et al (2012) Cross-modal information retrieval-a case study on Chinese wikipedia. In: International conference on advanced data mining and applications. Springer, Berlin

  33. Cordeiro M (2012) Twitter event detection: combining wavelet analysis and topic inference summarization in doctoral symposium on informatics engineering

  34. Cristani M et al (2008) Geo-located image analysis using latent representations. in Computer Vision and Pattern Recognition, 2008. CVPR, vol 2008. IEEE, IEEE Conference on

  35. Daud A et al (2010) Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of Computer Science in China 4(2):280–301

    Article  Google Scholar 

  36. Debortoli S et al (2016) Text mining for information systems researchers: an annotated topic modeling tutorial. CAIS 39:7

    Article  Google Scholar 

  37. Diao Q et al (2012) Finding bursty topics from microblogs. In: Proceedings of the 50th annual meeting of the association for computational linguistics: long papers-volume 1. Association for Computational Linguistics

  38. Eidelman V, Boyd-Graber J, Resnik P (2012) Topic models for dynamic translation model adaptation. In: Proceedings of the 50th annual meeting of the association for computational linguistics: short papers-volume 2. Association for computational linguistics

  39. Eisenstein J et al (2010) A latent variable model for geographic lexical variation. In: Proceedings of the 2010 conference on empirical methods in natural language processings. Association for computational linguistics

  40. Everingham M et al (2008) The pascal visual object classes challenge 2007 (voc 2007) results (2007)

  41. Everingham M et al (2010) The pascal visual object classes (voc) challenge. Int J Comput Vis 88(2):303–338

    Article  Google Scholar 

  42. Fang Y et al (2012) Mining contrastive opinions on political texts using cross-perspective topic model. In: Proceedings of the fifth ACM international conference on web search and data mining. ACM

  43. Fu X et al (2015) Dynamic non-parametric joint sentiment topic mixture model. Knowl-Based Syst 82:102–114

    Article  Google Scholar 

  44. Fu X et al (2016) Dynamic online HDP model for discovering evolutionary topics from Chinese social texts. Neurocomputing 171:412–424

    Article  Google Scholar 

  45. Gerber MS (2014) Predicting crime using Twitter and kernel density estimation. Decis Support Syst 61:115–125

    Article  Google Scholar 

  46. Gethers M, Poshyvanyk D (2010) Using relational topic models to capture coupling among classes in object-oriented software systems. In: 2010 IEEE international conference on software maintenance (ICSM). IEEE

  47. Giri R et al (2014) User behavior modeling in a cellular network using latent dirichlet allocation. In: International Conference on Intelligent Data Engineering and Automated Learning. Springer, Berlin

  48. Godin F et al (2013) Using topic models for twitter hashtag recommendation. In: Proceedings of the 22nd international conference on world wide web. ACM

  49. Greene D, Cross JP (2015) Unveiling the political agenda of the european parliament plenary: a topical analysis. In: Proceedings of the ACM web science conference. ACM

  50. Gretarsson B et al (2012) Topicnets: Visual analysis of large text corpora with topic modeling. ACM Transactions on Intelligent Systems and Technology (TIST) 3 (2):23

    Google Scholar 

  51. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(suppl 1):5228–5235

    Article  Google Scholar 

  52. Guo J et al (2009) Named entity recognition in query. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval. ACM

  53. Heintz I et al (2013) Automatic extraction of linguistic metaphor with lda topic modeling Inproceedings of the First Workshop on Metaphor in NLP

  54. Henderson K, Eliassi-Rad T (2009) Applying latent dirichlet allocation to group discovery in large graphs. In: 2009 Proceedings of the ACM symposium on applied computing. ACM

  55. Hong L, Dan O, Davison BD (2011) Predicting popular messages in twitter. In: Proceedings of the 20th international conference companion on world wide web. ACM

  56. Hong L, Frias-Martinez E, Frias-Martinez V (2016) Topic models to infer socio-economic maps in AAAI

  57. Hu Y et al (2012) ET-LDA: joint topic modeling for aligning events and their twitter feedback. In: AAAI

  58. Hu P et al (2014) Latent topic model for audio retrieval. Pattern Recogn 47 (3):1138–1143

    Article  Google Scholar 

  59. Hou L et al (2015) Newsminer: Multifaceted news analysis for event search. Knowl-Based Syst 76:17–29

    Article  Google Scholar 

  60. Huang Z, Lu X, Duan H (2013) Latent treatment pattern discovery for clinical processes. Journal of medical systems 37(2):9915

    Article  Google Scholar 

  61. Jagarlamudi J, Daume H III (2010) Extracting multilingual topics from unaligned comparable corpora. In: ECIR. Springer

  62. Jiang Z et al (2012) Using link topic model to analyze traditional chinese medicine clinical symptom-herb regularities. In: 2012 IEEE 14th international conference on e-health networking, applications and services (Healthcom). IEEE

  63. Jiang D et al (2015) SG-WSTD: a framework for scalable geographic web search topic discovery. Knowl-Based Syst 84:18–33

    Article  Google Scholar 

  64. Jo Y, Oh AH (2011) Aspect and sentiment unification model for online review analysis. In: Proceedings of the fourth ACM international conference on web search and data mining. ACM

  65. Kim Y, Shim K (2014) TWILITE: a recommendation system for twitter using a probabilistic model based on latent Dirichlet allocation. Inf Syst 42:59–77

    Article  Google Scholar 

  66. Kim M et al (2017) Topiclens: efficient multi-level visual topic exploration of large-scale document collections. IEEE Trans Vis Comput Graph 23(1):151–160

    Article  Google Scholar 

  67. Lacoste-Julien S, Sha F, Jordan MI (2009) DiscLDA: discriminative learning for dimensionality reduction and classification. In: Advances in neural information processing systems

  68. Lange D, Naumann F (2011) Frequency-aware similarity measures: why Arnold Schwarzenegger is always a duplicate. In: Proceedings of the 20th ACM international conference on Information and knowledge management. ACM

  69. Larkey LS, Connell ME (2001) Arabic information retrieval at UMass in TREC-10 in TREC

  70. Lee S et al (2016) LARGen: automatic signature generation for Malwares using latent Dirichlet allocation IEEE Transactions on Dependable and Secure Computing

  71. Levy KE, Franklin M (2014) Driving regulation: using topic models to examine political contention in the US trucking industry. Soc Sci Comput Rev 32(2):182–194

    Article  Google Scholar 

  72. Lewis DD (1997) Reuters-21578 text categorization collection

  73. Lewis DD et al (2004) Rcv1: a new benchmark collection for text categorization research. J Mach Learn Res 5(Apr):361–397

    Google Scholar 

  74. Li W, McCallum A (2006) Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd international conference on machine learning. ACM

  75. Li F, Huang M, Zhu X (2010) Sentiment Analysis with Global Topics and Local Dependency in AAAI

  76. Li R (2012) Towards social user profiling: unified and discriminative influence model for inferring home locations. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM

  77. Li J, Cardie C, Li S (2013) TopicSpam: a topic-model based approach for spam detection in ACL (2)

  78. Li Z et al (2013) Enhancing news organization for convenient retrieval and browsing. ACM Transactions on Multimedia Computing. Communications, and Applications (TOMM) 10(1):1

    Google Scholar 

  79. Li C et al (2015) The author-topic-community model for author interest profiling and community discovery. Knowl Inf Syst 44(2):359–383

    Article  Google Scholar 

  80. Li X, Ouyang J, Zhou X (2015) Supervised topic models for multi-label classification. Neurocomputing 149:811–819

    Article  Google Scholar 

  81. Li Y et al (2016) Design and implementation of Weibo sentiment analysis based on LDA and dependency parsing. China Communications 13(11):91–105

    Article  Google Scholar 

  82. Li C et al (2016) Hierarchical Bayesian nonparametric models for knowledge discovery from electronic medical records. Knowl-Based Syst 99:168–182

    Article  Google Scholar 

  83. Li Z et al (2016) Multimedia news summarization in search. ACM Transactions on Intelligent Systems and Technology (TIST) 7(3):33

    Google Scholar 

  84. Li Z, Tang J (2017) Weakly supervised deep matrix factorization for social image understanding. IEEE Trans Image Process 26(1):276–288

    Article  MathSciNet  MATH  Google Scholar 

  85. Li Z, Tang J, Mei T (2018) Deep collaborative embedding for social image understanding. IEEE transactions on pattern analysis and machine intelligence

  86. Lienou M, Maitre H, Datcu M (2010) Semantic annotation of satellite images using latent Dirichlet allocation. IEEE Geosci Remote Sens Lett 7(1):28–32

    Article  Google Scholar 

  87. Lin CX et al (2010) PET: a statistical model for popular events tracking in social communities. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining. ACM

  88. Lin J et al, Addressing cold-start in app recommendation: latent user models constructed from twitter followers (2013). In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval. ACM

  89. Linstead E et al (2007) Mining concepts from code with probabilistic topic models. ACM, Inproceedings of the twenty-second IEEE/ACM international conference on automated software engineering

  90. Linstead E, Lopes C, Baldi P (2008) An application of latent Dirichlet allocation to analyzing software evolution. In: 7th international conference on machine learning and applications, 2008. ICMLA’08. IEEE

  91. Liu B et al (2010) Identifying functional miRNA-mRNA regulatory modules with correspondence latent dirichlet allocation. Bioinformatics 26(24):3105–3111

    Article  Google Scholar 

  92. Liu Z et al (2011) Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3):26

    Google Scholar 

  93. Liu B, Zhang L (2012) A survey of opinion mining and sentiment analysis. In: Mining text data. Springer, pp 415–463

  94. Liu Y, Wang J, Jiang Y (2016) PT-LDA: a latent variable model to predict personality traits of social network users. Neurocomputing 210:155–163

    Article  Google Scholar 

  95. Liu Y et al (2016). In: AAAI, Fortune teller: predicting Your Career Path

  96. Lu H-M, Lee C-H (2015) The topic-over-time mixed membership model (TOT-MMM): a twitter hashtag recommendation model that accommodates for temporal clustering effects. IEEE Intell Sys 30(1):18–25

  97. Lu H-M, Wei C-P, Hsiao F-Y (2016) Modeling healthcare data using multiple-channel latent Dirichlet allocation. J Biomed Inform 60:210–223

    Article  Google Scholar 

  98. Lukins SK, Kraft NA, Etzkorn LH (2008) Source code retrieval for bug localization using latent dirichlet allocation. In: 15th working conference on reverse engineering, 2008. WCRE’08. IEEE

  99. Lukins SK, Kraft NA, Etzkorn LH (2010) Bug localization using latent dirichlet allocation. Inf Softw Technol 52(9):972–990

    Article  Google Scholar 

  100. Lui M, Lau JH, Baldwin T (2014) Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics 2:27–40

    Article  Google Scholar 

  101. Madan A et al (2011) Pervasive sensing to model political opinions in face-to-face networks. In: International conference on pervasive computing. Springer

  102. Manandhar S, Yuret D (2013) Second joint conference on lexical and computational semantics (* sem), volume 2: Proceedings of the seventh international workshop on semantic evaluation (semeval 2013). In: 2nd joint conference on lexical and computational semantics (* SEM), volume 2: proceedings of the 7th international workshop on semantic evaluation (SemEval 2013)

  103. Mao X-L et al, SSHLDA: a semi-supervised hierarchical topic model (2012). In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. Association for computational linguistics

  104. McCallum AK (2002), A machine learning for language toolkit, Mallet

  105. McCallum A, Corrada-Emmanuel A, Wang X (2005) Topic and role discovery in social networks. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp 786–791

  106. McFarland DA et al (2013) Differentiating language usage through topic models. Poetics 41(6):607–625

    Article  MathSciNet  Google Scholar 

  107. McInerney J, Blei DM (2014) Discovering newsworthy tweets with a geographical topic model in NewsKDD: Data Science for News Publishing workshop Workshop in conjunction with KDD2014 the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

  108. Miao J, Huang JX, Zhao J (2016) TopPRF: a probabilistic framework for integrating topic space into pseudo relevance feedback. ACM Transactions on Information Systems (TOIS) 34(4):22

    Article  Google Scholar 

  109. Millar JR, Peterson GL, Mendenhall MJ (2009) Document clustering and visualization with latent Dirichlet allocation and self-organizing maps in FLAIRS Conference

  110. Minka T, Lafferty J (2002) Expectation-propagation for the generative aspect model. In: Proceedings of the eighteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc

  111. Murdock J, Allen C (2015) Visualization Techniques for Topic Model Checking. In: AAAI

  112. Nakano T, Yoshii K, Goto M (2014) Vocal timbre analysis using latent Dirichlet allocation and cross-gender vocal timbre similarity. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014. IEEE

  113. Nguyen DQ et al (2015) Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics 3:299–313

    Article  Google Scholar 

  114. Panichella A et al (2013) How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. In: Proceedings of the 2013 international conference on software engineering. IEEE Press

  115. Paul M, Girju R (2010) A two-dimensional topic-aspect model for discovering multi-faceted topics. Urbana 51(61801):36

    Google Scholar 

  116. Paul MJ, Dredze M (2011) You are what you tweet: analyzing twitter for public health. Icwsm 20:265–272

    Google Scholar 

  117. Paul M, Factorial M. Dredze. (2012) LDA: Sparse multi-dimensional text models in advances in neural information processing systems

  118. Phan X-H, Nguyen C-T (2006) Jgibblda: a java implementation of latent dirichlet allocation (lda) using gibbs sampling for parameter estimation and inference

  119. Philbin J, Sivic J, Zisserman A (2011) Geometric latent dirichlet allocation on a matching graph for large-scale image datasets. Int J Comput Vis 95(2):138–153

    Article  MathSciNet  MATH  Google Scholar 

  120. Preotiuc-Pietro D et al (2017) Beyond binary labels: political ideology prediction of twitter users Inproceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  121. Prier KW et al (2011) Identifying health-related topics on twitter. in International Conference on Social Computing. Springer, Behavioral-Cultural Modeling, and Prediction

    Google Scholar 

  122. Qian S et al (2016) Multi-modal event topic model for social event analysis. IEEE Trans Multimedia 18(2):233–246

    Article  Google Scholar 

  123. Qin Z, Cong Y, Wan T (2016) Topic modeling of Chinese language beyond a bag-of-words. Computer Speech and Language 40:60–78

    Article  Google Scholar 

  124. Ramage D et al (2009) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 conference on empirical methods in natural language processing: volume 1-volume 1. Association for computational linguistics

  125. Ramage D, Manning CD, Dumais S (2011) Partially labeled topic models for interpretable text mining. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM

  126. Ramage D, Rosen E (2011) Stanford topic modeling toolbox

  127. Rao Y (2016) Contextual sentiment topic model for adaptive social emotion classification. IEEE Intell Syst 31(1):41–47

    Article  Google Scholar 

  128. Rao Y et al (2014) Building emotional dictionary for sentiment analysis of online news. World Wide Web 17(4):723–742

    Article  Google Scholar 

  129. Rehurek R, Sojka P (2011) Gensim-statistical semantics in python

  130. Ren Y, Wang R, Ji D (2016) A topic-enhanced word embedding for Twitter sentiment classification. Inf Sci 369:188–198

    Article  Google Scholar 

  131. Rennie J (2017) The 20 Newsgroups data set. http

  132. Roberts K et al (2012) EmpaTweet: annotating and detecting emotions on twitter. In: LREC

  133. Rosen-Zvi M et al (2004) The author-topic model for authors and documents. In: Proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press

  134. Sandhaus E (2008) The New York times annotated corpus. Linguistic Data Consortium, Philadelphia

    Google Scholar 

  135. Savage T et al (2010) Topic XP: exploring topics in source code using latent Dirichlet allocation. In: 2010 IEEE International Conference on software maintenance (ICSM). IEEE

  136. Sharma V et al (2015) Analyzing Newspaper Crime Reports for Identification of Safe Transit Paths in HLT-NAACL

  137. Shi B et al (2016) Detecting common discussion topics across culture from news reader comments in ACL (1)

  138. Siersdorfer S et al (2014) Analyzing and mining comments and comment ratings on the social web. ACM Trans Web (TWEB) 8(3):17

    Google Scholar 

  139. Sizov S (2010) Geofolk latent spatial semantics in web 2.0 social media. In: Proceedings of the third ACM international conference on web search and data mining. ACM

  140. Song M, Kim MC, Jeong YK (2014) Analyzing the political landscape of 2012 korean presidential election in twitter. IEEE Intell Syst 29(2):18–26

    Article  Google Scholar 

  141. Srijith P et al (2017) Sub-story detection in Twitter with hierarchical Dirichlet processes. Inf Process Manag 53(4):989–1003

    Article  Google Scholar 

  142. Steyvers M, Griffiths T (2007) Probabilistic topic models. Handbook of latent semantic analysis 427(7):424–440

    Google Scholar 

  143. Steyvers M, Griffiths T (2011) Matlab topic modeling toolbox 1.4. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

  144. Sun X et al (2016) Exploring topic models in software engineering data analysis: a survey. In: 2016 17th IEEE/ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing (SNPD). IEEE

  145. Sun S, Luo C, Chen J (2017) A review of natural language processing techniques for opinion mining systems. Information Fusion 36:10–25

    Article  Google Scholar 

  146. Tan S et al (2014) Interpreting the public sentiment variations on twitter. IEEE transactions on knowledge and data engineering 26(5):1158–1170

    Article  Google Scholar 

  147. Tang H et al (2013) A multiscale latent Dirichlet allocation model for object-oriented clustering of VHR panchromatic satellite images. IEEE Trans Geosci Remote Sens 51(3):1680–1692

    Article  Google Scholar 

  148. Thomas SW (2011) Mining software repositories using topic models. In: Proceedings of the 33rd international conference on software engineering. ACM

  149. Thomas SW et al (2011) Modeling the evolution of topics in source code histories. In: Proceedings of the 8th working conference on mining software repositories. ACM

  150. Tian K, Revelle M, Poshyvanyk D (2009) Using latent dirichlet allocation for automatic categorization of software. In: 6th IEEE International working conference on mining software repositories, 2009. MSR’09. IEEE

  151. Titov I, McDonald R (2008) Modeling online reviews with multi-grain topic models. In: Proceedings of the 17th international conference on world wide web. ACM

  152. Vaduva C, Gavat I, Datcu M (2013) Latent Dirichlet allocation for spatial analysis of satellite images. IEEE Trans Geosci Remote Sens 51(5):2770–2786

    Article  Google Scholar 

  153. Vulic I, De Smet W, Moens M-F (2011) Identifying word translations from comparable corpora using latent topic models. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: short papers-volume 2. Association for computational linguistics

  154. Wallach HM, Mimno DM, McCallum A (2009) Rethinking LDA: why priors matter. In: Advances in neural information processing systems

  155. Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM

  156. Wang C, Blei DM (2009) Decoupling sparsity and smoothness in the discrete hierarchical dirichlet process. In: Advances in neural information processing systems

  157. Wang Y, Mori G (2011) Max-margin latent Dirichlet allocation for image classification and annotation. In: BMVC

  158. Wang H et al (2011) Finding complex biological relationships in recent PubMed articles using Bio-LDA. PloS one 6(3):e17243

    Article  Google Scholar 

  159. Wang C, Blei DM (2011) Collaborative topic modeling for recommending scientific articles. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM

  160. Wang X, Gerber MS, Brown DE (2012) Automatic Crime Prediction Using Events Extracted from Twitter Posts. SBP 12:231–238

    Google Scholar 

  161. Wang Y-C, Burke M, Kraut RE (2013) Gender, topic, and audience response: an analysis of user-generated content on facebook. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM

  162. Wang J et al (2014) Image tag refinement by regularized latent Dirichlet allocation. Comput Vis Image Underst 124:61–70

    Article  Google Scholar 

  163. Wang T et al (2014) Product aspect extraction supervised with online domain knowledge. Knowl-Based Syst 71:86–100

    Article  Google Scholar 

  164. Wang S et al (2014) Cross media topic analytics based on synergetic content and user behavior modeling. In: IEEE International Conference on Multimedia and Expo (ICME), 2014. IEEE

  165. Wang Y et al (2016) Catching fire via” Likes”: inferring topic preferences of trump followers on twitter. In: ICWSM

  166. Weng J et al (2010) Twitterrank: finding topic-sensitive influential twitterers. In: Proceedings of the third ACM international conference on Web search and data mining. ACM

  167. Weng J, Lee B-S (2011) Event detection in twitter. ICWSM 11:401–408

    Google Scholar 

  168. Wick M, Ross M, Learned-Miller E (2007) Context-sensitive error correction: using topic models to improve OCR. In: 9th international conference on document analysis and recognition, 2007. ICDAR 2007. IEEE

  169. Wilson AT, Chew PA (2010) Term weighting schemes for latent dirichlet allocation. In: Human language technologies: the 2010 annual conference of the north american chapter of the association for computational linguistics. Association for Computational Linguistics

  170. Wu Y et al (2012) Ranking gene-drug relationships in biomedical literature using latent dirichlet allocation. In: Pacific symposium on biocomputing. NIH Public Access

  171. Wu H et al (2012) Locally discriminative topic modeling. Pattern Recogn 45(1):617–625

    Article  MATH  Google Scholar 

  172. Xianghua F et al (2013) Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowl-Based Syst 37:186–195

    Article  Google Scholar 

  173. Xiao C et al (2017) Adverse drug reaction prediction with symbolic latent dirichlet allocation in AAAI

  174. Xie P, Yang D, Xing EP (2015) Incorporating word correlation knowledge into topic modeling in HLT-NAACL

  175. Xie W et al (2016) Topicsketch: real-time bursty topic detection from twitter. IEEE Trans Knowl Data Eng 28(8):2216–2229

    Article  Google Scholar 

  176. Xu Z et al (2017) Crowdsourcing based social media data analysis of urban emergency events. Multimedia Tools and Applications 76(9):11567–11584

    Article  Google Scholar 

  177. Yan X et al (2013) A biterm topic model for short texts. In: Proceedings of the 22nd international conference on world wide web. ACM

  178. Yang M-C, Rim H-C (2014) Identifying interesting Twitter contents using topical analysis. Expert Syst Appl 41(9):4330–4336

    Article  Google Scholar 

  179. Yang M, Kiang M (2015) Extracting Consumer Health Expressions of Drug Safety from Web Forum. In: 2015 48th Hawaii international conference on system sciences (HICSS). IEEE

  180. Yang X et al (2017) Characterizing malicious Android apps by mining topic-specific data flow signatures. Information and Software Technology

  181. Yano T, Cohen WW, Smith NA (2009) Predicting response to political blog posts with topic models. In: Proceedings of human language technologies: the 2009 annual conference of the north american chapter of the association for computational linguistics. Association for computational linguistics

  182. Yano T, Smith NA (2010) What’s worthy of comment? Content and comment volume in political blogs. In: ICWSM

  183. Yeh J-F, Tan Y-S, Lee C-H (2016) Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation. Neurocomputing 216:310–318

  184. Yin Z et al (2011) Geographical topic discovery and comparison. In: Proceedings of the 20th international conference on world wide web. ACM

  185. Yin H et al (2014) A temporal context-aware model for user behavior modeling in social media systems. In: Proceedings of the ACM SIGMOD international conference on Management of data, 2014. ACM

  186. Yoshii K, Goto M (2012) A nonparametric Bayesian multipitch analyzer based on infinite latent harmonic allocation. IEEE Transactions on Audio, Speech, and Language Processing 20(3):717–730

  187. Yu K et al (2014) Mining hidden knowledge for drug safety assessment: topic modeling of LiverTox as a case study. BMC Bioinforma 15(17):S6

  188. Yu R, He X, Liu Y (2015) Glad: group anomaly detection in social media analysis. ACM Transactions on Knowledge Discovery from Data (TKDD) 10(2):18

  189. Yu X, Yang J, Xie Z-Q (2015) A semantic overlapping community detection algorithm based on field sampling. Expert Syst Appl 42(1):366–375

  190. Yuan B et al (2014). In: International conference on web information systems engineering. Springer, Berlin

  191. Yuan J et al (2015) Lightlda: big topic models on modest computer clusters. In: Proceedings of the 24th international conference on world wide web. International world wide web conferences steering committee

  192. Zhai Z, Liu B, Xu H, Jia P (2011) Constrained LDA for grouping product features in opinion mining. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 448–459

  193. Zhang H et al (2007) Probabilistic community discovery using hierarchical latent gaussian mixture model. In: AAAI

  194. Zhang X-P et al (2011) Topic model for Chinese medicine diagnosis and prescription regularities analysis: case on diabetes. Chinese Journal of Integrative Medicine 17(4):307–313

  195. Zhang J et al (2013) Social influence locality for modeling retweeting behaviors. In: IJCAI

  196. Zhang L, Sun X, Zhuge H (2015) Topic discovery of clusters from documents with geographical location. Concurrency and Computation: Practice and Experience 27(15):4015–4038

  197. Zhang Y et al (2017) iDoctor: personalized and professionalized medical recommendations based on hybrid matrix factorization. Futur Gener Comput Syst 66:30–35

  198. Zhao WX et al (2011) Comparing twitter and traditional media using topic models. In: European conference on information retrieval. Springer

  199. Zhao F et al (2016) A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Futur Gener Comput Syst 65:196–206

  200. Zhai K et al (2012) Mr. LDA: a flexible large scale topic modeling package using variational inference in mapreduce. In: Proceedings of the 21st international conference on world wide web. ACM

  201. Zheng X et al (2014) Incorporating appraisal expression patterns into topic modeling for aspect and sentiment word identification. Knowl-Based Syst 61:29–47

  202. Zeng J, Liu Z-Q, Cao X-Q (2016) Fast online EM for big topic modeling. IEEE Trans Knowl Data Eng 28(3):675–688

  203. Zhu J, Ahmed A, Xing EP (2009) MedLDA: maximum margin supervised topic models for regression and classification. In: Proceedings of the 26th annual international conference on machine learning. ACM

  204. Zirn C, Stuckenschmidt H (2014) Multidimensional topic analysis in political texts. Data and Knowledge Engineering 90:38–53

  205. Zoghbi S, Vulic I, Moens M-F (2016) Latent Dirichlet allocation for linking user-generated content and e-commerce data. Inf Sci 367:573–599

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61170035, 61272420, 81674099, 61502233), the Fundamental Research Fund for the Central Universities (30916011328, 30918015103), and the Nanjing Science and Technology Development Plan Project (201805036).

Author information

Corresponding author

Correspondence to Yongli Wang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jelodar, H., Wang, Y., Yuan, C. et al. Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimed Tools Appl 78, 15169–15211 (2019). https://doi.org/10.1007/s11042-018-6894-4

  • DOI: https://doi.org/10.1007/s11042-018-6894-4
