Abstract
Mining the content of scientific publications is increasingly used to investigate the practice of science and the evolution of research domains. Topic models, such as LDA (a statistical bag-of-words approach) and Top2Vec (an embedding-based approach), have notably been shown to provide rich insights into the thematic content of disciplinary fields, their structure, and their evolution through time. However, improving topic-modeling methods remains a major concern. Here we propose an alternative topic-modeling approach based on neural clustering and feature maximization with the F1-measure (in short: CFMf). We compare the performance of this approach to LDA and Top2Vec by applying all three methods to a reference corpus of full-text philosophy of science articles (N = 16,917). The results reveal significant improvements in terms of coherence measures, independently of the number of topics. Qualitative comparisons show an overall consistency in topical coverage across the three methods, yet with differences: in particular, CFMf appears affected by the presence of a large class, while Top2Vec generates some sets of top-words that are highly difficult to interpret. We discuss these results and outline future research directions.
Notes
In what follows, we use the term “cluster” interchangeably across the three topic-modeling approaches we compare. This usage fits the CFMf and Top2Vec approaches well, since both start by crisp-clustering the documents into specific clusters and then extract representative terms that are taken to represent a common topic shared by the documents of the same cluster. The LDA model, however, does not crisp-cluster documents into specific clusters but instead treats documents as probability distributions over topics. When using the term “cluster” or “document cluster” in the context of the LDA model, we simply mean the set of documents that share the same specific topic as dominant topic (i.e., the topic with the highest probability in these documents).
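The dominant-topic convention described in this note can be sketched in a few lines. The document-topic matrix below is purely illustrative (not taken from the paper's corpus); any LDA implementation that returns per-document topic probabilities would yield such a matrix.

```python
import numpy as np

# Hypothetical document-topic matrix from a fitted LDA model:
# rows = documents, columns = topics, entries = P(topic | document).
doc_topic = np.array([
    [0.70, 0.20, 0.10],   # document 0: topic 0 dominates
    [0.15, 0.25, 0.60],   # document 1: topic 2 dominates
    [0.05, 0.80, 0.15],   # document 2: topic 1 dominates
    [0.55, 0.30, 0.15],   # document 3: topic 0 dominates
])

# Dominant topic of each document: the topic with the highest probability.
dominant = doc_topic.argmax(axis=1)

# A "document cluster" for topic k is then the set of documents
# whose dominant topic is k.
clusters = {k: np.flatnonzero(dominant == k).tolist()
            for k in range(doc_topic.shape[1])}
print(clusters)  # {0: [0, 3], 1: [2], 2: [1]}
```

Unlike the crisp partitions produced by CFMf or Top2Vec, this assignment discards the remaining probability mass of each document's topic distribution; it is only a convenience for comparing the three methods on equal footing.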
Abbreviations
- CFMf: Neural clustering and feature maximization with F1-measure
- CFMc: Neural clustering and feature maximization with contrast
- DTM: Document-term matrix
- F-Max: Feature maximization
- GNG: Growing neural gas
- HDBSCAN: Hierarchical density-based spatial clustering of applications with noise
- LDA: Latent Dirichlet allocation
- MCMC: Markov chain Monte Carlo
- NPMI: Normalized pointwise mutual information
- t-SNE: t-distributed stochastic neighbor embedding
- UMAP: Uniform manifold approximation and projection
- USE: Universal sentence encoder
References
Angelov, D. (2020). Top2Vec: distributed representations of topics. arXiv Preprint. arXiv:2008.09470
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1), 993–1022.
Börner, K., Silva, F. N., & Milojević, S. (2021). Visualizing big science projects. Nature Reviews Physics, 3(11), Article 11. https://doi.org/10.1038/s42254-021-00374-7
Boyd-Graber, J. L., Hu, Y., & Mimno, D. (2017). Applications of topic models (Vol. 11). Now Publishers Incorporated.
Campello, R. J., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In J. Pei, V. S. Tseng, L. Cao, H. Motoda, & G. Xu (Eds.), Advances in knowledge discovery and data mining (Vol. 7819, pp. 160–172). Berlin: Springer. https://doi.org/10.1007/978-3-642-37456-2_14
Dugué, N., Lamirel, J.-C., & Chen, Y. (2021). Evaluating clustering quality using features salience: A promising approach. Neural Computing and Applications. https://doi.org/10.1007/s00521-021-05942-7
Fritzke, B. (1994). A growing neural gas network learns topologies. In Advances in neural information processing systems (Vol. 7). MIT Press.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101
Lamirel, J.-C., Chen, Y., Cuxac, P., Al Shehabi, S., Dugué, N., & Liu, Z. (2020). An overview of the history of Science of Science in China based on the use of bibliographic and citation data: A new method of analysis based on clustering with feature maximization and contrast graphs. Scientometrics, 125(3), 2971–2999. https://doi.org/10.1007/s11192-020-03503-8
Lamirel, J.-C., Cuxac, P., Chivukula, A. S., & Hajlaoui, K. (2015). Optimizing text classification through efficient feature selection based on quality metric. Journal of Intelligent Information Systems, 45(3), 379–396. https://doi.org/10.1007/s10844-014-0317-4
Lamirel, J.-C., Lareau, F., & Malaterre, C. (2023, 5/7). The CFMf topic-modeling method based on neural clustering with feature maximization: Comparison with LDA. In Proceedings of ISSI 2023. The 19th conference of the international society for scientometrics and informetrics, Bloomington, IN.
Lamirel, J.-C., Mall, R., Cuxac, P., & Safi, G. (2011). Variations to incremental growing neural gas algorithm based on label maximization. In The 2011 International joint conference on neural networks (pp. 956–965). https://doi.org/10.1109/IJCNN.2011.6033326
Malaterre, C., & Lareau, F. (2022). The early days of contemporary philosophy of science: Novel insights from machine translation and topic-modeling of non-parallel multilingual corpora. Synthese, 200(3), 242. https://doi.org/10.1007/s11229-022-03722-x
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.21236/ADA273556
McInnes, L., Healy, J., & Melville, J. (2020). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv Preprint. arXiv:1802.03426
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv Preprint. arXiv:1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (Vol. 26). https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 100–108).
Prouteau, T., Connes, V., Dugué, N., Perez, A., Lamirel, J.-C., Camelin, N., & Meignier, S. (2021). SINr: fast computing of sparse interpretable node representations is not a sin! In P. H. Abreu, P. P. Rodrigues, A. Fernández, & J. Gama (Eds.), Advances in intelligent data analysis XIX (Vol. 12695, pp. 325–337). Cham: Springer. https://doi.org/10.1007/978-3-030-74251-5_26
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the 8th ACM international conference on web search and data mining—WSDM ’15 (pp. 399–408). https://doi.org/10.1145/2684822.2685324
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (pp. 44–49). Manchester: Association for Computational Linguistics.
Talley, E. M., Newman, D., Mimno, D., Herr, B. W., Wallach, H. M., Burns, G. A. P. C., Leenders, A. G. M., & McCallum, A. (2011). Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6), 443–444. https://doi.org/10.1038/nmeth.1619
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605.
Acknowledgements
F.L. acknowledges funding from the Fonds de recherche du Québec Société et culture (FRQSC-276470) and the Canada Research Chair in Philosophy of the Life Sciences at UQAM. C.M. acknowledges funding from Canada Social Sciences and Humanities Research Council (Grant 430-2018-00899) and Canada Research Chairs (CRC-950-230795). The authors thank the audience of ISSI 2023 for most helpful comments on an earlier version of this paper published in the conference proceedings as: Lamirel et al. (2023).
Author information
Authors and Affiliations
Contributions
Conceptualization: JCL, FL, CM; data curation: FL; formal analysis and investigation: JCL, FL, CM; funding acquisition: JCL, CM; Investigation: JCL, FL, CM; Methodology: JCL, FL, CM; project administration: JCL, CM; resources: JCL, FL, CM; software: JCL, FL; supervision: JCL, FL, CM; validation: JCL, FL, CM; Visualization: FL, CM; writing—original draft preparation: JCL, FL, CM; writing—review and editing: JCL, FL, CM. All authors approved the final submitted manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
11192_2024_5017_MOESM1_ESM.xlsx
Supplementary Table ST1 contains the citation references of the top-10 documents per topic for each model (LDA, CFMf, and Top2Vec) computed at K = 25 (XLSX 48 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lamirel, JC., Lareau, F. & Malaterre, C. CFMf topic-model: comparison with LDA and Top2Vec. Scientometrics (2024). https://doi.org/10.1007/s11192-024-05017-z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11192-024-05017-z