Abstract
Mining the content of scientific publications is increasingly used to investigate the practice of science and the evolution of research domains. Topic models, such as LDA (a statistical bag-of-words approach) and Top2Vec (an embedding-based approach), have notably been shown to provide rich insights into the thematic content of disciplinary fields, their structure, and their evolution through time. However, improving topic-modeling methods remains a major concern. Here we propose an alternative topic-modeling approach based on neural clustering and feature maximization with the F1-measure (in short: CFMf). We compare the performance of this approach to LDA and Top2Vec by applying all three methods to a reference corpus of full-text philosophy of science articles (N = 16,917). The results reveal significant improvements in terms of coherence measures, independently of the number of topics. Qualitative comparisons show an overall consistency in topical coverage across the three methods, yet with differences: in particular, CFMf appears affected by the presence of a large class, while Top2Vec generates some sets of top-words that are highly difficult to interpret. We discuss these results and outline future research directions.
Notes
In what follows, we use the term “cluster” interchangeably across the three topic-modeling approaches we compare. This usage fits the CFMf and Top2Vec approaches well, since both start by crisp-clustering the documents into specific clusters and then extract representative terms that are taken to represent a common topic shared by the documents of the same cluster. The LDA model, however, does not crisp-cluster documents into specific clusters but instead treats documents as probability distributions over topics. When using the term “cluster” or “document cluster” in the context of the LDA model, we simply mean the set of documents that share the same specific topic as dominant topic (i.e., the topic with the highest probability in these documents).
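The dominant-topic convention described in this note can be sketched in a few lines. The document-topic matrix below is purely illustrative (not taken from the paper's corpus); any LDA implementation that returns per-document topic probabilities would yield such a matrix.

```python
import numpy as np

# Hypothetical document-topic matrix from a fitted LDA model:
# rows = documents, columns = topics, entries = P(topic | document).
doc_topic = np.array([
    [0.70, 0.20, 0.10],   # document 0: topic 0 dominates
    [0.15, 0.25, 0.60],   # document 1: topic 2 dominates
    [0.05, 0.80, 0.15],   # document 2: topic 1 dominates
    [0.55, 0.30, 0.15],   # document 3: topic 0 dominates
])

# Dominant topic of each document: the topic with the highest probability.
dominant = doc_topic.argmax(axis=1)

# A "document cluster" for topic k is then the set of documents
# whose dominant topic is k.
clusters = {k: np.flatnonzero(dominant == k).tolist()
            for k in range(doc_topic.shape[1])}
print(clusters)  # {0: [0, 3], 1: [2], 2: [1]}
```

Unlike the crisp partitions produced by CFMf or Top2Vec, this assignment discards the remaining probability mass of each document's topic distribution; it is only a convenience for comparing the three methods on equal footing.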
Abbreviations
- CFMf: Neural clustering and feature maximization with F1-measure
- CFMc: Neural clustering and feature maximization with contrast
- DTM: Document-term matrix
- F-Max: Feature maximization
- GNG: Growing neural gas
- HDBSCAN: Hierarchical density-based spatial clustering of applications with noise
- LDA: Latent Dirichlet allocation
- MCMC: Markov chain Monte Carlo
- NPMI: Normalized pointwise mutual information
- t-SNE: t-distributed stochastic neighbor embedding
- UMAP: Uniform manifold approximation and projection
- USE: Universal sentence encoder
References
Angelov, D. (2020). Top2Vec: distributed representations of topics. arXiv Preprint. arXiv:2008.09470
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1), 993–1022.
Börner, K., Silva, F. N., & Milojević, S. (2021). Visualizing big science projects. Nature Reviews Physics, 3(11), Article 11. https://doi.org/10.1038/s42254-021-00374-7
Boyd-Graber, J. L., Hu, Y., & Mimno, D. (2017). Applications of topic models (Vol. 11). Now Publishers Incorporated.
Campello, R. J., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In J. Pei, V. S. Tseng, L. Cao, H. Motoda, & G. Xu (Eds.), Advances in knowledge discovery and data mining (Vol. 7819, pp. 160–172). Berlin: Springer. https://doi.org/10.1007/978-3-642-37456-2_14
Dugué, N., Lamirel, J.-C., & Chen, Y. (2021). Evaluating clustering quality using features salience: A promising approach. Neural Computing and Applications. https://doi.org/10.1007/s00521-021-05942-7
Fritzke, B. (1994). A growing neural gas network learns topologies. In Advances in neural information processing systems (Vol. 7). MIT Press.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101
Lamirel, J.-C., Chen, Y., Cuxac, P., Al Shehabi, S., Dugué, N., & Liu, Z. (2020). An overview of the history of Science of Science in China based on the use of bibliographic and citation data: A new method of analysis based on clustering with feature maximization and contrast graphs. Scientometrics, 125(3), 2971–2999. https://doi.org/10.1007/s11192-020-03503-8
Lamirel, J.-C., Cuxac, P., Chivukula, A. S., & Hajlaoui, K. (2015). Optimizing text classification through efficient feature selection based on quality metric. Journal of Intelligent Information Systems, 45(3), 379–396. https://doi.org/10.1007/s10844-014-0317-4
Lamirel, J.-C., Lareau, F., & Malaterre, C. (2023, 5/7). The CFMf topic-modeling method based on neural clustering with feature maximization: Comparison with LDA. In Proceedings of ISSI 2023. The 19th conference of the international society for scientometrics and informetrics, Bloomington, IN.
Lamirel, J.-C., Mall, R., Cuxac, P., & Safi, G. (2011). Variations to incremental growing neural gas algorithm based on label maximization. In The 2011 International joint conference on neural networks (pp. 956–965). https://doi.org/10.1109/IJCNN.2011.6033326
Malaterre, C., & Lareau, F. (2022). The early days of contemporary philosophy of science: Novel insights from machine translation and topic-modeling of non-parallel multilingual corpora. Synthese, 200(3), 242. https://doi.org/10.1007/s11229-022-03722-x
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.21236/ADA273556
McInnes, L., Healy, J., & Melville, J. (2020). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv Preprint. arXiv:1802.03426
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv Preprint. arXiv:1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (Vol. 26). https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 100–108).
Prouteau, T., Connes, V., Dugué, N., Perez, A., Lamirel, J.-C., Camelin, N., & Meignier, S. (2021). SINr: fast computing of sparse interpretable node representations is not a sin! In P. H. Abreu, P. P. Rodrigues, A. Fernández, & J. Gama (Eds.), Advances in intelligent data analysis XIX (Vol. 12695, pp. 325–337). Cham: Springer. https://doi.org/10.1007/978-3-030-74251-5_26
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the 8th ACM international conference on web search and data mining—WSDM ’15 (pp. 399–408). https://doi.org/10.1145/2684822.2685324
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (pp. 44–49). Manchester: Association for Computational Linguistics.
Talley, E. M., Newman, D., Mimno, D., Herr, B. W., Wallach, H. M., Burns, G. A. P. C., Leenders, A. G. M., & McCallum, A. (2011). Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6), 443–444. https://doi.org/10.1038/nmeth.1619
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605.
Acknowledgements
F.L. acknowledges funding from the Fonds de recherche du Québec Société et culture (FRQSC-276470) and the Canada Research Chair in Philosophy of the Life Sciences at UQAM. C.M. acknowledges funding from Canada Social Sciences and Humanities Research Council (Grant 430-2018-00899) and Canada Research Chairs (CRC-950-230795). The authors thank the audience of ISSI 2023 for most helpful comments on an earlier version of this paper published in the conference proceedings as: Lamirel et al. (2023).
Author information
Authors and Affiliations
Contributions
Conceptualization: JCL, FL, CM; data curation: FL; formal analysis and investigation: JCL, FL, CM; funding acquisition: JCL, CM; Investigation: JCL, FL, CM; Methodology: JCL, FL, CM; project administration: JCL, CM; resources: JCL, FL, CM; software: JCL, FL; supervision: JCL, FL, CM; validation: JCL, FL, CM; Visualization: FL, CM; writing—original draft preparation: JCL, FL, CM; writing—review and editing: JCL, FL, CM. All authors approved the final submitted manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
11192_2024_5017_MOESM1_ESM.xlsx
Supplementary Table ST1 contains the citation references of the top-10 documents per topic for each model (LDA, CFMf, and Top2Vec) computed at K = 25 (XLSX 48 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lamirel, JC., Lareau, F. & Malaterre, C. CFMf topic-model: comparison with LDA and Top2Vec. Scientometrics (2024). https://doi.org/10.1007/s11192-024-05017-z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11192-024-05017-z