Abstract
The main aim of this article is to present the results of different experiments focused on the problem of model fitting process in topic modeling and its accuracy when applied to long texts. At the same time, in fact, the digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate the data coding and analysis processes. In the ambit of topic modeling, different procedures were born in order to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Therefore, through a series of different experiments, this article is based on the following consideration: taking into account Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), the problem of fitting model is crucial because the LDA algorithm demands that the number of topics is specified a priori. Needles to say, the number of topics to detect in a corpus is a parameter which affect the analysis results. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between texts’ length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in a form of a mathematical model.
Similar content being viewed by others
References
Arun, R., Suresh, V., Veni Madhavan, C.E., Narasimha Murthy, M.N.: On finding the natural number of topics with latent Dirichlet allocation some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, pp. 391–402. Springer, Berlin (2010)
Blei, D.M, Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120 (2006)
Blei, D., Lafferty, J.: A correlated topic model of Science. Ann. Appl. Stat. 1(1):17–35 (2007)
Blei, D.M., Lafferty, J.D.: Topic Models. In: Srivastava, A., Sahami, M. (eds.) Text Mining: Classification, Clustering, and Applications, pp. 71–93. Chapman & Hall/CRC Press, Cambridge (2009)
Blei, D.M., Ng, A., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Cao, J., Xia, T., Li, J., Zhang, Y., Tang, S.: A density-based method for adaptive LDA model selection. Neurocomputing 72(7–9), 1775–1781 (2009)
Deveaud, R., SanJuan, É., Bellot, P.: Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 17(1), 61–84 (2014)
Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1–54 (2008)
Giordan, G., Saint-Blancat, C., Sbalchiero, S.: Exploring the history of american sociology through topic modeling. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 45–64. Springer, Berlin (2018)
Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America (PNAS) 101(Supplement 1), 5228–5235 (2004)
Grün, B., Hornik, K.: Topicmodels: an R package for fitting topic models. J. Stat. Softw. 40(13), 1–30 (2011)
Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic models. In: EMNLP ‘08 Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 363–371 (2008)
Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the SIGKDD Workshop on SMA, pp. 80–88 (2010)
Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)
Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering. International Journal of Advance Research in Computer Science and Management Studies 1(6), 90–95 (2013)
Köhler, R., Galle, M.: Dynamic aspects of text characteristics. In: Hrebícek, L., Altmann, G. (eds.) Quantitative Text Analysis, pp. 46–53. Wissenschaftlicher, Trier (1993)
Lebart, L., Salem, A., Berry, L.: Exploring textual data. Kluwer Academic Publishers, Dordrecht (1998)
Li, W., McCallum, A.: Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 577–584 (2006)
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., Adam, S.: Applying LDA topic modeling in communication research: toward a valid and reliable methodology. Commun. Methods Meas. 12(2–3), 93–118 (2018)
Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2011)
Popescu, I., Macutek, J., Altmann, G.: Aspects of Word Frequencies. Studies in Quantitative Linguistics. RAM Verlag, Ludenscheid (2009)
Puschmann, C., Scheffler, T.: Topic modeling for media and communication research: a short primer. HIIG Discussion Paper Series No. 2016-05. Available at SSRN: https://doi.org/10.2139/ssrn.2836478 (2016)
R Development Core Team: R: a language and environment for statistical computing [software]. R foundation for statistical computing. Retrieved from http://www.r-project.org. Accessed Jan 2020 (2016)
Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494 (2004)
Savoy, J.: Authorship attribution based on a probabilistic topic model. Inf. Process. Manag. 49, 341–354 (2013)
Sbalchiero, S.: Finding topics: a statistical model and a quali-quantitative method. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 189–210. Springer, Berlin (2018)
Sbalchiero, S., Tuzzi, A.: What’s old and new? Discovering Topics in the American Journal of Sociology. In: Iezzi, D.F., Celdardo, L., Misuraca, M. (eds.) Proceedings of 14th International Conference on Statistical Analysis of Textual Data, pp. 724–732. UniversItalia Editore, Rome (2018)
Tong, Z., Zhang, H.: A text mining research based on LDA topic modelling. In: Jordery School of Computer Science, pp. 201–210 (2016)
Acknowledgements
The study was conducted at the intersection of two research projects: M.E. was founded by the Polish National Science Center (SONATA-BIS 2017/26/E/HS2/01019), whereas S.S. was supported by the COST Action "Distant Reading for European Literary History" (CA16204). We are grateful to prof. Arjuna Tuzzi (University of Padova, Italy) for the inspiring discussions and her valuable suggestions.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Sbalchiero, S., Eder, M. Topic modeling, long texts and the best number of topics. Some Problems and solutions. Qual Quant 54, 1095–1108 (2020). https://doi.org/10.1007/s11135-020-00976-w
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11135-020-00976-w