Abstract
Topic modeling is a machine learning technique that uses statistical models to discover the latent “topics” that occur in a collection of documents. Latent Dirichlet allocation (LDA) is currently a popular modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents consisting of academic papers from several different fields and test whether papers in the same field are clustered together. We explore potential scientometric applications of such text analysis capabilities.
Notes
K-means was run in R on a term frequency-inverse document frequency (TF-IDF) weighted matrix, using Euclidean distance to cluster the data into the known number of clusters (7).
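The baseline in this note can be illustrated with a minimal sketch. It is written in Python with only the standard library rather than R, so it is not the code used in the study. The toy corpus, the whitespace tokenizer, the particular TF-IDF weighting, the evenly spaced centroid initialization, and the function names `tfidf_rows` and `kmeans` are all assumptions made for illustration; the study clustered into 7 groups, while the toy example uses 2.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Build one dense TF-IDF vector per document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(len(docs) / df[t]) for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] / len(tokens) * idf[t] for t in vocab])
    return rows


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm with Euclidean distance; returns a label per row."""
    n = len(rows)
    # Assumption: naive deterministic start from evenly spaced rows.
    # Practical implementations use random restarts or k-means++.
    centroids = [list(rows[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical four-document corpus covering two fields.
docs = [
    "solar cell efficiency dye",
    "dye sensitized solar cell",
    "gene expression protein",
    "protein gene sequencing",
]
labels = kmeans(tfidf_rows(docs), k=2)
print(labels)
```

On this toy corpus the two solar-cell documents share one label and the two genetics documents share the other. With real abstracts the outcome depends strongly on preprocessing and on the initialization.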
We take the topical Boolean search results as the ground truth; however, we recognize that this is not absolute, since knowledgeable authors or readers might classify the articles differently.
Overall, the sum of the F scores for the seven topics was 0.007 smaller with LDA. Although this difference is small, the larger challenge came from the qualitative analysis of topics. The term-clumped approach resulted in terms having a low frequency in any given topic and in more homogeneous overall results, which made assigning topics challenging.
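As a rough illustration of how a per-topic F score and its sum over topics can be computed, the sketch below assumes the balanced F1 form (the harmonic mean of precision and recall); the notes do not state which variant was used. The document identifiers and the function name `f_score` are hypothetical.

```python
def f_score(true_ids, predicted_ids):
    """F1 score for one topic: harmonic mean of precision and recall."""
    true_ids, predicted_ids = set(true_ids), set(predicted_ids)
    true_positives = len(true_ids & predicted_ids)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_ids)
    recall = true_positives / len(true_ids)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: documents the Boolean search assigns to each field
# versus documents the model placed in the matching cluster.
truth = {"field_a": {1, 2, 3, 4}, "field_b": {5, 6, 7}}
clusters = {"field_a": {3, 4, 5}, "field_b": {6, 7}}

per_topic = {name: f_score(truth[name], clusters[name]) for name in truth}
total = sum(per_topic.values())
print(per_topic, total)
```

Summing per-topic scores, as in the note, gives a single figure for comparing two runs, but it hides which topics account for the difference.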
Words removed were: “results”, “paper”, “elsevier”, “rights”, “reserved”, “aim”, “aimed”, “aims”, “analyse”, “analysis”, “approach”, “approaches”, “data”, “describe”, “describes”, “discusses”, “discussion”, “dissemination”, “study”, “studies”, “suggests”, “theory”, “view”, “2010”, “2009”, “2008”, “2007”, “2006”, “2005”, “2004”, “2003”, “2002”, “2001”, “2000”, “1999”, “1998”, “1997”, “1996”, “1995”, “1994”, “1993”, “1992”, “1991”, “90”, “article”, “based”.
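A filter of this kind might look like the following sketch. The word list is the one given in this note; the lowercase alphanumeric tokenizer and the names `CUSTOM_STOPWORDS` and `remove_stopwords` are assumptions, since the notes do not describe how the text was tokenized.

```python
import re

# Custom stop words from the note above: generic academic vocabulary,
# publisher boilerplate, and year tokens.
CUSTOM_STOPWORDS = {
    "results", "paper", "elsevier", "rights", "reserved", "aim", "aimed",
    "aims", "analyse", "analysis", "approach", "approaches", "data",
    "describe", "describes", "discusses", "discussion", "dissemination",
    "study", "studies", "suggests", "theory", "view", "90", "article",
    "based",
} | {str(year) for year in range(1991, 2011)}  # "1991" through "2010"


def remove_stopwords(text):
    """Lowercase, split into alphanumeric tokens, drop custom stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in CUSTOM_STOPWORDS]


print(remove_stopwords("This paper describes results of LDA analysis in 2009."))
```

Standard English stop words such as "this" and "of" are left in place here; in practice they would typically be removed by a separate general-purpose list.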
Acknowledgments
We acknowledge support from the US National Science Foundation (NSF - Award #1064146). The findings and observations are those of the authors and do not necessarily reflect the views of NSF. Arho Suominen also acknowledges the support from the Finnish Funding Agency for Innovation (Project: "Co-evolution of knowledge creation systems and innovation pipelines (CEK)").
Cite this article
Yau, CK., Porter, A., Newman, N. et al. Clustering scientific documents with topic modeling. Scientometrics 100, 767–786 (2014). https://doi.org/10.1007/s11192-014-1321-8
DOI: https://doi.org/10.1007/s11192-014-1321-8