Topic modeling, long texts and the best number of topics. Some Problems and solutions

Sbalchiero, Stefano; Eder, Maciej

doi:10.1007/s11135-020-00976-w

Published: 17 February 2020

Volume 54, pages 1095–1108, (2020)

Quality & Quantity

Abstract

This article presents the results of a series of experiments on the model-fitting process in topic modeling and on its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and the technological advances that have facilitated techniques for automating data coding and analysis. Within topic modeling, different procedures have been developed to analyze ever larger collections of texts (corpora), but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. The experiments start from the following consideration: in Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), model fitting is a crucial problem because the LDA algorithm requires the number of topics to be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter that affects the results of the analysis. Since experiments on long texts are lacking, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it as a mathematical model.
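The abstract reports a power-law relation between the optimal number of topics and the sample size, but this preview does not give the formula or its coefficients. As a minimal sketch only, assuming the generic form k = a * n^b (the names `a`, `b`, `fit_power_law` and `predict_topics` are illustrative, and the data points are synthetic rather than taken from the article), such a relation could be estimated by a least-squares fit on log-transformed values:

```python
import math


def fit_power_law(sizes, topics):
    """Fit topics ~ a * sizes**b by ordinary least squares on log-log values.

    Hypothetical helper for illustration; not the authors' procedure.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in topics]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_topics(a, b, size):
    """Number of topics suggested by the fitted power law for a given size."""
    return a * size ** b


# Synthetic data generated from a known power law (a=2.0, b=0.3),
# used only to show that the fit recovers the parameters.
sizes = [1000, 5000, 10000, 50000, 100000]
topics = [2.0 * s ** 0.3 for s in sizes]

a, b = fit_power_law(sizes, topics)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"suggested topics for 20000 words: {predict_topics(a, b, 20000):.1f}")
```

With real data, the `topics` values would be the optimal numbers of topics found empirically for each sample size, and the fitted curve would only be as reliable as those estimates.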

Notes

https://github.com/computationalstylistics/100_english_novels.

References

Arun, R., Suresh, V., Veni Madhavan, C.E., Narasimha Murthy, M.N.: On finding the natural number of topics with latent Dirichlet allocation some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, pp. 391–402. Springer, Berlin (2010)
Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120 (2006)
Blei, D., Lafferty, J.: A correlated topic model of Science. Ann. Appl. Stat. 1(1), 17–35 (2007)
Blei, D.M., Lafferty, J.D.: Topic Models. In: Srivastava, A., Sahami, M. (eds.) Text Mining: Classification, Clustering, and Applications, pp. 71–93. Chapman & Hall/CRC Press, Cambridge (2009)
Blei, D.M., Ng, A., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Cao, J., Xia, T., Li, J., Zhang, Y., Tang, S.: A density-based method for adaptive LDA model selection. Neurocomputing 72(7–9), 1775–1781 (2009)
Deveaud, R., SanJuan, É., Bellot, P.: Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 17(1), 61–84 (2014)
Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1–54 (2008)
Giordan, G., Saint-Blancat, C., Sbalchiero, S.: Exploring the history of American sociology through topic modeling. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 45–64. Springer, Berlin (2018)
Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America (PNAS) 101(Supplement 1), 5228–5235 (2004)
Grün, B., Hornik, K.: Topicmodels: an R package for fitting topic models. J. Stat. Softw. 40(13), 1–30 (2011)
Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic models. In: EMNLP ‘08 Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 363–371 (2008)
Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the SIGKDD Workshop on SMA, pp. 80–88 (2010)
Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)
Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering. International Journal of Advance Research in Computer Science and Management Studies 1(6), 90–95 (2013)
Köhler, R., Galle, M.: Dynamic aspects of text characteristics. In: Hrebícek, L., Altmann, G. (eds.) Quantitative Text Analysis, pp. 46–53. Wissenschaftlicher, Trier (1993)
Lebart, L., Salem, A., Berry, L.: Exploring textual data. Kluwer Academic Publishers, Dordrecht (1998)
Li, W., McCallum, A.: Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 577–584 (2006)
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., Adam, S.: Applying LDA topic modeling in communication research: toward a valid and reliable methodology. Commun. Methods Meas. 12(2–3), 93–118 (2018)
Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2011)
Popescu, I., Macutek, J., Altmann, G.: Aspects of Word Frequencies. Studies in Quantitative Linguistics. RAM Verlag, Ludenscheid (2009)
Puschmann, C., Scheffler, T.: Topic modeling for media and communication research: a short primer. HIIG Discussion Paper Series No. 2016-05. Available at SSRN: https://doi.org/10.2139/ssrn.2836478 (2016)
R Development Core Team: R: a language and environment for statistical computing [software]. R foundation for statistical computing. Retrieved from http://www.r-project.org. Accessed Jan 2020 (2016)
Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494 (2004)
Savoy, J.: Authorship attribution based on a probabilistic topic model. Inf. Process. Manag. 49, 341–354 (2013)
Sbalchiero, S.: Finding topics: a statistical model and a quali-quantitative method. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 189–210. Springer, Berlin (2018)
Sbalchiero, S., Tuzzi, A.: What’s old and new? Discovering Topics in the American Journal of Sociology. In: Iezzi, D.F., Celardo, L., Misuraca, M. (eds.) Proceedings of 14th International Conference on Statistical Analysis of Textual Data, pp. 724–732. UniversItalia Editore, Rome (2018)
Tong, Z., Zhang, H.: A text mining research based on LDA topic modelling. In: Jordery School of Computer Science, pp. 201–210 (2016)

Acknowledgements

The study was conducted at the intersection of two research projects: M.E. was funded by the Polish National Science Center (SONATA-BIS 2017/26/E/HS2/01019), whereas S.S. was supported by the COST Action "Distant Reading for European Literary History" (CA16204). We are grateful to Prof. Arjuna Tuzzi (University of Padova, Italy) for the inspiring discussions and her valuable suggestions.

Author information

Authors and Affiliations

Department of Philosophy, Sociology, Education and Applied Psychology (FISPPA) - Section of Sociology, University of Padova, Padova, Italy
Stefano Sbalchiero
Polish Academy of Sciences and Pedagogical University of Kraków, Kraków, Poland
Maciej Eder

Corresponding author

Correspondence to Stefano Sbalchiero.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sbalchiero, S., Eder, M. Topic modeling, long texts and the best number of topics. Some Problems and solutions. Qual Quant 54, 1095–1108 (2020). https://doi.org/10.1007/s11135-020-00976-w

Issue Date: August 2020
DOI: https://doi.org/10.1007/s11135-020-00976-w
