Clustering scientific documents with topic modeling

Abstract

Topic modeling is a type of statistical modeling that uses machine learning to discover the latent “topics” occurring in a collection of documents. Latent Dirichlet allocation (LDA) is currently a widely used modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents containing academic papers from several different fields and check whether papers in the same field are clustered together. We explore potential scientometric applications of such text analysis capabilities.

Notes

  1. K-means was run in R on an inverse document frequency weighted (TF-IDF) matrix with Euclidean distance, clustering the data into the known number of clusters (7).

  2. We take the topical Boolean search results as true; however, we recognize that this is not absolute: knowledgeable authors or readers might classify the articles differently.

  3. Overall, the sum of the F scores for the seven topics with LDA was 0.007 smaller. Although this difference is small, the larger challenge came from the qualitative analysis of topics. The term-clumped approach resulted in terms having a low frequency in any given topic and in more homogeneous overall results. This made assigning topics challenging.

  4. Words removed were: “results”, “paper”, “elsevier”, “rights”, “reserved”, “aim”, “aimed”, “aims”, “analyse”, “analysis”, “approach”, “approaches”, “data”, “describe”, “describes”, “discusses”, “discussion”, “dissemination”, “study”, “studies”, “suggests”, “theory”, “view”, “2010”, “2009”, “2008”, “2007”, “2006”, “2005”, “2004”, “2003”, “2002”, “2001”, “2000”, “1999”, “1998”, “1997”, “1996”, “1995”, “1994”, “1993”, “1992”, “1991”, “90”, “article”, “based”.
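
The baseline and preprocessing described in the notes above (stop-word removal, TF-IDF weighting, k-means with Euclidean distance, and an F score against the known fields) can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the authors' code. The toy documents, the function names, the `log(n / df)` form of the IDF weight, the fixed seed rows, and the use of the F1 variant of the F score are all assumptions made for illustration; only a subset of the removed-word list is reproduced.

```python
import math
import re
from collections import Counter

# Illustrative subset of the removed-word list in note 4 (not the full list).
STOP_WORDS = {"results", "paper", "elsevier", "rights", "reserved",
              "analysis", "approach", "data", "study", "article", "based"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is one document's TF-IDF vector.
    Assumed weighting: raw term count times log(n_docs / document_frequency)."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    df = Counter(t for doc in tokens for t in set(doc))
    idf = {t: math.log(len(docs) / df[t]) for t in vocab}
    rows = []
    for doc in tokens:
        tf = Counter(doc)
        rows.append([tf[t] * idf[t] for t in vocab])
    return vocab, rows

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(rows, seeds, iterations=20):
    """Lloyd's algorithm; `seeds` are row indices used as initial centroids
    (fixed here so the toy example is deterministic)."""
    centroids = [list(rows[i]) for i in seeds]
    k = len(centroids)
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: euclidean(r, centroids[c]))
                  for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def f_score(labels, truth, cluster, field):
    """F1 of one cluster against one known field (assumed F-score variant)."""
    tp = sum(1 for l, t in zip(labels, truth) if l == cluster and t == field)
    if tp == 0:
        return 0.0
    precision = tp / labels.count(cluster)
    recall = tp / truth.count(field)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy corpus: two fields instead of the seven used in the paper.
docs = [
    "solar cell efficiency results",
    "dye solar cell study",
    "gene expression protein data",
    "gene expression regulation analysis",
]
truth = ["solar", "solar", "bio", "bio"]

vocab, rows = tfidf_matrix(docs)
labels = kmeans(rows, seeds=[0, 2])
print(labels, f_score(labels, truth, 0, "solar"))
```

With seven fields, the same functions would be called with seven seed rows, and the per-field F scores could be summed as in note 3. Matching each cluster to a field (here cluster 0 to "solar") is left to the caller.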

Acknowledgments

We acknowledge support from the US National Science Foundation (NSF - Award #1064146). The findings and observations are those of the authors and do not necessarily reflect the views of NSF. Arho Suominen also acknowledges the support from the Finnish Funding Agency for Innovation (Project: "Co-evolution of knowledge creation systems and innovation pipelines (CEK)").

Author information

Corresponding author

Correspondence to Arho Suominen.

Cite this article

Yau, CK., Porter, A., Newman, N. et al. Clustering scientific documents with topic modeling. Scientometrics 100, 767–786 (2014). https://doi.org/10.1007/s11192-014-1321-8

Keywords

  • Topic modeling
  • Text analysis
  • Latent Dirichlet allocation