Topic modeling, long texts and the best number of topics. Some Problems and solutions

Quality & Quantity

Abstract

The main aim of this article is to present the results of several experiments on the model fitting process in topic modeling and on its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate data coding and analysis. In the field of topic modeling, different procedures have been developed to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. The experiments reported here start from the following consideration: in Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), model fitting is a crucial problem because the LDA algorithm requires the number of topics to be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter that affects the results of the analysis. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it as a mathematical model.
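To make the idea of a power-law relation concrete, the following minimal sketch fits a curve of the general form k = a · n^b, where n is the sample size and k the optimal number of topics, by ordinary least squares on log-transformed values. The function name `fit_power_law`, the sample sizes, and the coefficients a = 2.0 and b = 0.5 are purely illustrative assumptions; they are not the data or the parameters reported in the article, and the article's own estimation procedure may differ.

```python
import math


def fit_power_law(sizes, topics):
    """Fit topics ~ a * sizes**b by least squares on log-log values.

    Hypothetical helper for illustration only; returns the pair (a, b).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in topics]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    # The intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic data generated from an assumed power law (NOT the article's results).
sizes = [500, 1000, 2000, 5000, 10000]
topics = [2.0 * s ** 0.5 for s in sizes]

a, b = fit_power_law(sizes, topics)
print(f"estimated a = {a:.3f}, estimated b = {b:.3f}")
```

Because a power law is a straight line in log-log space, a linear fit on the logarithms is a common first check; with real, noisy estimates of the optimal number of topics the recovered coefficients would only approximate the underlying relation.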

Notes

  1. https://github.com/computationalstylistics/100_english_novels.

References

  • Arun, R., Suresh, V., Veni Madhavan, C.E., Narasimha Murthy, M.N.: On finding the natural number of topics with latent Dirichlet allocation: some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, pp. 391–402. Springer, Berlin (2010)

  • Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120 (2006)

  • Blei, D.M., Lafferty, J.D.: A correlated topic model of Science. Ann. Appl. Stat. 1(1), 17–35 (2007)

  • Blei, D.M., Lafferty, J.D.: Topic models. In: Srivastava, A., Sahami, M. (eds.) Text Mining: Classification, Clustering, and Applications, pp. 71–93. Chapman & Hall/CRC Press, Cambridge (2009)

  • Blei, D.M., Ng, A., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  • Cao, J., Xia, T., Li, J., Zhang, Y., Tang, S.: A density-based method for adaptive LDA model selection. Neurocomputing 72(7–9), 1775–1781 (2009)

  • Deveaud, R., SanJuan, É., Bellot, P.: Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 17(1), 61–84 (2014)

  • Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1–54 (2008)

  • Giordan, G., Saint-Blancat, C., Sbalchiero, S.: Exploring the history of American sociology through topic modeling. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 45–64. Springer, Berlin (2018)

  • Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America (PNAS) 101(Supplement 1), 5228–5235 (2004)

  • Grün, B., Hornik, K.: Topicmodels: an R package for fitting topic models. J. Stat. Softw. 40(13), 1–30 (2011)

  • Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic models. In: EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 363–371 (2008)

  • Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the SIGKDD Workshop on SMA, pp. 80–88 (2010)

  • Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)

  • Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering. International Journal of Advance Research in Computer Science and Management Studies 1(6), 90–95 (2013)

  • Köhler, R., Galle, M.: Dynamic aspects of text characteristics. In: Hrebícek, L., Altmann, G. (eds.) Quantitative Text Analysis, pp. 46–53. Wissenschaftlicher, Trier (1993)

  • Lebart, L., Salem, A., Berry, L.: Exploring Textual Data. Kluwer Academic Publishers, Dordrecht (1998)

  • Li, W., McCallum, A.: Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 577–584 (2006)

  • Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., Adam, S.: Applying LDA topic modeling in communication research: toward a valid and reliable methodology. Commun. Methods Meas. 12(2–3), 93–118 (2018)

  • Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2011)

  • Popescu, I., Macutek, J., Altmann, G.: Aspects of Word Frequencies. Studies in Quantitative Linguistics. RAM Verlag, Ludenscheid (2009)

  • Puschmann, C., Scheffler, T.: Topic modeling for media and communication research: a short primer. HIIG Discussion Paper Series No. 2016-05. Available at SSRN: https://doi.org/10.2139/ssrn.2836478 (2016)

  • R Development Core Team: R: a language and environment for statistical computing [software]. R Foundation for Statistical Computing. Retrieved from http://www.r-project.org. Accessed Jan 2020 (2016)

  • Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494 (2004)

  • Savoy, J.: Authorship attribution based on a probabilistic topic model. Inf. Process. Manag. 49, 341–354 (2013)

  • Sbalchiero, S.: Finding topics: a statistical model and a quali-quantitative method. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 189–210. Springer, Berlin (2018)

  • Sbalchiero, S., Tuzzi, A.: What's old and new? Discovering topics in the American Journal of Sociology. In: Iezzi, D.F., Celdardo, L., Misuraca, M. (eds.) Proceedings of 14th International Conference on Statistical Analysis of Textual Data, pp. 724–732. UniversItalia Editore, Rome (2018)

  • Tong, Z., Zhang, H.: A text mining research based on LDA topic modelling. In: Jordery School of Computer Science, pp. 201–210 (2016)

Acknowledgements

The study was conducted at the intersection of two research projects: M.E. was funded by the Polish National Science Center (SONATA-BIS 2017/26/E/HS2/01019), whereas S.S. was supported by the COST Action "Distant Reading for European Literary History" (CA16204). We are grateful to Prof. Arjuna Tuzzi (University of Padova, Italy) for the inspiring discussions and her valuable suggestions.

Author information

Corresponding author

Correspondence to Stefano Sbalchiero.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Sbalchiero, S., Eder, M. Topic modeling, long texts and the best number of topics. Some Problems and solutions. Qual Quant 54, 1095–1108 (2020). https://doi.org/10.1007/s11135-020-00976-w

  • DOI: https://doi.org/10.1007/s11135-020-00976-w
