
CFMf topic-model: comparison with LDA and Top2Vec


Abstract

Mining the content of scientific publications is increasingly used to investigate the practice of science and the evolution of research domains. Topic models, among which LDA (statistical bag-of-words approach) and Top2Vec (embeddings approach), have notably been shown to provide rich insights into the thematic content of disciplinary fields, their structure and evolution through time. However, improving topic modeling methods remains a major concern. Here we propose an alternative topic-modeling approach based on neural clustering and feature maximization with F1-measure (in short: CFMf). We compare the performance of this approach to LDA and Top2Vec by applying the methods to a reference corpus of full-text philosophy of science articles (N = 16,917). The results reveal significant improvements in terms of coherence measures, independently of the number of topics. Qualitative comparisons show an overall consistency in terms of topical coverage across all three methods, yet with differences: in particular, CFMf appears affected by the presence of a large class while Top2Vec generates some sets of top-words highly difficult to interpret. We discuss these results and highlight upcoming research work.
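As a minimal sketch of the kind of coherence evaluation described here, the snippet below trains LDA models at several topic counts K and scores each with an NPMI-based coherence measure via the gensim package (cited in the notes below). The toy corpus, the choice of K values, and the use of gensim's CoherenceModel are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: compare topic coherence across several topic counts K.
# The two-document corpus is a toy stand-in for the paper's 16,917 full-text
# articles; replace `docs` with the real tokenized documents.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

docs = [
    ["philosophy", "science", "theory", "explanation", "model"],
    ["topic", "model", "corpus", "cluster", "coherence"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

for k in (25, 50, 100):  # hypothetical values of K
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    # "c_npmi" is the NPMI-based coherence of Röder et al. (2015).
    coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence="c_npmi").get_coherence()
    print(f"K={k}: NPMI coherence = {coherence:.3f}")
```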


Notes

  1. https://github.com/cognitivefactory/features-maximization-metric.

  2. https://pypi.org/project/lda/.

  3. https://pypi.org/project/top2vec/.

  4. https://pypi.org/project/gensim/.

  5. In what follows, we use the term “cluster” interchangeably across the three topic-modeling approaches we compare. The usage fits CFMf and Top2Vec directly, since both approaches first crisp-cluster the documents into disjoint clusters and then extract representative terms taken to express a topic shared by the documents of the same cluster. The LDA model, however, does not crisp-cluster documents: it treats each document as a probability distribution over topics. When we speak of a “cluster” or “document cluster” in the context of the LDA model, we simply mean the set of documents that share the same dominant topic (i.e., the topic with the highest probability in those documents).
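To make footnote 5's convention concrete, the sketch below assigns each document to its dominant LDA topic. It reuses the toy `lda` model and `corpus` from the sketch above; it illustrates the convention, and is not the authors' code.

```python
# Group documents by dominant topic: footnote 5's "document cluster" for LDA.
# Assumes `lda` and `corpus` from the earlier gensim sketch (illustration only).
def dominant_topic(model, bow):
    """Return the id of the topic with the highest probability in one document."""
    return max(model.get_document_topics(bow, minimum_probability=0.0),
               key=lambda topic_prob: topic_prob[1])[0]

clusters = {}
for doc_id, bow in enumerate(corpus):
    clusters.setdefault(dominant_topic(lda, bow), []).append(doc_id)
# `clusters` now maps each dominant topic id to the documents that share it.
```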

Abbreviations

CFMf: Neural clustering and feature maximization with F1-measure
CFMc: Neural clustering and feature maximization with contrast
DTM: Document-term matrix
F-Max: Feature maximization
GNG: Growing neural gas
HDBSCAN: Hierarchical density-based spatial clustering of applications with noise
LDA: Latent Dirichlet allocation
MCMC: Markov chain Monte Carlo
NPMI: Normalized pointwise mutual information (see the formula after this list)
t-SNE: t-distributed stochastic neighbor embedding
UMAP: Uniform manifold approximation and projection
USE: Universal sentence encoder
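Two of the abbreviations above carry formulas worth recalling. NPMI, which underlies the coherence measures used in the comparison, is standardly defined for a word pair as below, with probabilities estimated from (windowed) co-occurrence counts; the F1-measure after which CFMf is named is the usual harmonic mean of a precision and a recall, here applied to feature-level quantities (our gloss, not the paper's exact formulation):

```latex
\mathrm{NPMI}(w_i, w_j) \;=\; \frac{\log \dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)},
\qquad
F_1 \;=\; \frac{2\,P\,R}{P + R}
```

NPMI ranges from −1 (the two words never co-occur) through 0 (statistical independence) to 1 (the words always co-occur).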

References

  • Angelov, D. (2020). Top2Vec: distributed representations of topics. arXiv Preprint. arXiv:2008.09470

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1), 993–1022.


  • Börner, K., Silva, F. N., & Milojević, S. (2021). Visualizing big science projects. Nature Reviews Physics, 3(11), Article 11. https://doi.org/10.1038/s42254-021-00374-7

  • Boyd-Graber, J. L., Hu, Y., & Mimno, D. (2017). Applications of topic models (Vol. 11). Now Publishers Incorporated.

  • Campello, R. J., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In J. Pei, V. S. Tseng, L. Cao, H. Motoda, & G. Xu (Eds.), Advances in knowledge discovery and data mining (Vol. 7819, pp. 160–172). Berlin: Springer. https://doi.org/10.1007/978-3-642-37456-2_14


  • Dugué, N., Lamirel, J.-C., & Chen, Y. (2021). Evaluating clustering quality using features salience: A promising approach. Neural Computing and Applications. https://doi.org/10.1007/s00521-021-05942-7


  • Fritzke, B. (1994). A growing neural gas network learns topologies. In Advances in neural information processing systems (Vol. 7). MIT Press.

  • Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101


  • Lamirel, J.-C., Chen, Y., Cuxac, P., Al Shehabi, S., Dugué, N., & Liu, Z. (2020). An overview of the history of Science of Science in China based on the use of bibliographic and citation data: A new method of analysis based on clustering with feature maximization and contrast graphs. Scientometrics, 125(3), 2971–2999. https://doi.org/10.1007/s11192-020-03503-8


  • Lamirel, J.-C., Cuxac, P., Chivukula, A. S., & Hajlaoui, K. (2015). Optimizing text classification through efficient feature selection based on quality metric. Journal of Intelligent Information Systems, 45(3), 379–396. https://doi.org/10.1007/s10844-014-0317-4


  • Lamirel, J.-C., Lareau, F., & Malaterre, C. (2023). The CFMf topic-modeling method based on neural clustering with feature maximization: Comparison with LDA. In Proceedings of ISSI 2023, the 19th conference of the International Society for Scientometrics and Informetrics, Bloomington, IN.

  • Lamirel, J.-C., Mall, R., Cuxac, P., & Safi, G. (2011). Variations to incremental growing neural gas algorithm based on label maximization. In The 2011 International joint conference on neural networks (pp. 956–965). https://doi.org/10.1109/IJCNN.2011.6033326

  • Malaterre, C., & Lareau, F. (2022). The early days of contemporary philosophy of science: Novel insights from machine translation and topic-modeling of non-parallel multilingual corpora. Synthese, 200(3), 242. https://doi.org/10.1007/s11229-022-03722-x


  • Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.21236/ADA273556

  • McInnes, L., Healy, J., & Melville, J. (2020). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv Preprint. arXiv:1802.03426

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv Preprint. arXiv:1301.3781

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (Vol. 26). https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html

  • Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 100–108).

  • Prouteau, T., Connes, V., Dugué, N., Perez, A., Lamirel, J.-C., Camelin, N., & Meignier, S. (2021). SINr: fast computing of sparse interpretable node representations is not a sin! In P. H. Abreu, P. P. Rodrigues, A. Fernández, & J. Gama (Eds.), Advances in intelligent data analysis XIX (Vol. 12695, pp. 325–337). Cham: Springer. https://doi.org/10.1007/978-3-030-74251-5_26


  • Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the 8th ACM international conference on web search and data mining—WSDM ’15 (pp. 399–408). https://doi.org/10.1145/2684822.2685324

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (pp. 44–49). Manchester: Association for Computational Linguistics.

  • Talley, E. M., Newman, D., Mimno, D., Herr, B. W., Wallach, H. M., Burns, G. A. P. C., Leenders, A. G. M., & McCallum, A. (2011). Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6), 443–444. https://doi.org/10.1038/nmeth.1619


  • van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605.



Acknowledgements

F.L. acknowledges funding from the Fonds de recherche du Québec – Société et culture (FRQSC-276470) and the Canada Research Chair in Philosophy of the Life Sciences at UQAM. C.M. acknowledges funding from the Social Sciences and Humanities Research Council of Canada (Grant 430-2018-00899) and the Canada Research Chairs program (CRC-950-230795). The authors thank the audience of ISSI 2023 for their most helpful comments on an earlier version of this paper, published in the conference proceedings as Lamirel et al. (2023).

Author information


Contributions

Conceptualization: JCL, FL, CM; data curation: FL; formal analysis and investigation: JCL, FL, CM; funding acquisition: JCL, CM; methodology: JCL, FL, CM; project administration: JCL, CM; resources: JCL, FL, CM; software: JCL, FL; supervision: JCL, FL, CM; validation: JCL, FL, CM; visualization: FL, CM; writing—original draft preparation: JCL, FL, CM; writing—review and editing: JCL, FL, CM. All authors approved the final submitted manuscript.

Corresponding author

Correspondence to Jean-Charles Lamirel.

Ethics declarations

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

11192_2024_5017_MOESM1_ESM.xlsx

Supplementary Table ST1 contains the citation references of the top-10 documents per topic for each model (LDA, CFMf, and Top2Vec), computed at K = 25 (XLSX 48 KB).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lamirel, JC., Lareau, F. & Malaterre, C. CFMf topic-model: comparison with LDA and Top2Vec. Scientometrics (2024). https://doi.org/10.1007/s11192-024-05017-z

