The early days of contemporary philosophy of science: novel insights from machine translation and topic-modeling of non-parallel multilingual corpora

  • Original Research
Synthese

Abstract

Topic modeling is a well-proven tool for investigating the semantic content of textual corpora. Yet corpora sometimes include texts in several languages, making it impossible to apply language-specific computational approaches over their entire content. This is the problem we encountered when setting out to analyze a philosophy of science corpus spanning eight decades and including original articles in Dutch, German and French, on top of a large majority of articles in English. To circumvent this multilingual problem, we use machine-translation tools to bulk translate non-English documents into English. Though largely imperfect, especially syntactically, these translations nevertheless provide correctly translated terms and preserve the semantic proximity of documents with respect to one another. To assess the quality of this translation step, we develop a “semantic topology preservation test” that relies on estimating the extent to which document-to-document distances have been preserved during translation. We then conduct an LDA topic-model analysis over the entire corpus of translated and English original texts, and compare it to a topic model built over the English original texts only. We thereby identify the specific contribution of the translated texts. These studies reveal a more complete picture of the main topics that can be found in the philosophy of science literature, especially during the early days of the discipline when numerous articles were published in languages other than English.

Supplementary information

A technical appendix with code and data, including data for graphs, is available at https://zenodo.org/record/6484582 (https://doi.org/10.5281/zenodo.6484582). The topic model can be explored at https://philscitopics.uqam.ca/

Notes

  1. For instance, see (Pence & Ramsey, 2018) for an overview in the context of the philosophy of science.

  2. Such multilingual topic-modeling approaches have been used, among other things, to infer semantic similarities between terms of different languages (notably with a view to improving machine translation), to classify documents in different languages, to perform sentiment analysis, and to retrieve information within documents written in different languages.

  3. The topic model can be explored on https://philscitopics.uqam.ca/.

  4. The articles were downloaded from JSTOR and the publishers' Internet platforms (Elsevier, Oxford University Press, Springer, Taylor and Francis and University of Chicago Press) between May and June 2018. Philosophy of science is of course published in many other venues, the entirety of which we cannot hope to cover. These include other journals, be they more general philosophy journals (e.g. Mind), more specialized philosophy of science journals (e.g. Studia Logica, Hyle, Biology and Philosophy), or even science journals (e.g. Bioscience). Philosophy of science is also published in many non-English languages (e.g. Principia, Epistemologia, Philosophia Scientiae, Theoria) and in numerous books and edited volumes. By selecting 8 of the major general English-language philosophy of science journals, including some of the earliest ones, our objective is to provide a representative perspective on the thematic content of philosophy of science and its evolution over the past 8 decades.

  5. To the extent feasible, we only retained articles with research content. We thereby excluded book reviews, editorials, errata, and very short texts such as discussion notes (fewer than 4,000 characters). Some very minor differences in article numbers compared to (Malaterre et al., 2020) are due to language detection methods (15,901 English articles vs 15,899 previously).

  6. https://pypi.org/project/langid/ (Lui & Baldwin, 2012); https://pypi.org/project/langdetect/ (Shuyo, 2010). We found that differences in language prediction between methods could be triggered by articles that were written with significant sections in different languages (for instance, an article in German with large portions of cited text in French and in English). This was notably the case for 5 articles written in several languages, which were removed for steps 2 and 3 of the methodology (but kept for all of stage B): 2 articles that were initially classified as English (from BJPS and JGPS), 1 as French (Erkenntnis), 1 as Dutch (Synthese) and 1 as German (Erkenntnis).

  7. Google Translate requirements; Google API (translate_v2 from google.cloud) accessed on 30 March 2020 at https://cloud.google.com

  8. The Mantel, RV, and Procrustes approaches aim at assessing the similarity between two matrices. The Mantel approach, originally introduced in epidemiology by Mantel (1967) and now widely used in ecology (Legendre & Legendre, 2012), consists in calculating the correlation between two distance matrices. Given two N × N distance matrices D and B, and the means \(\overline{d }\) and \(\overline{b }\) of their respective off-diagonal elements, their Mantel coefficient is: \({r}_{M}= \left[\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left({d}_{i,j}-\overline{d }\right)\left({b}_{i,j}-\overline{b }\right)\right]/\sqrt{\left[\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}{\left({d}_{i,j}-\overline{d }\right)}^{2}\right]\left[\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}{\left({b}_{i,j}-\overline{b }\right)}^{2}\right]}.\) The RV coefficient arises as a generalization of Pearson’s correlation (Escoufier, 1973). It is computed as the ratio of the covariance to the square-rooted product of the variances: \(RV=\text{trace }\left\{{\mathbf{S}}^{\text{T}}{\mathbf{T}}\right\}/\sqrt{\text{trace }\left\{{\mathbf{S}}^{\text{T}}{\mathbf{S}}\right\} \times \text{trace }\left\{{\mathbf{T}}^{\text{T}}{\mathbf{T}}\right\}}\) where S and T are positive semi-definite matrices (i.e., each can be written as the product of a matrix with its own transpose, for instance \(\mathbf{S}=\mathbf{X}{\mathbf{X}}^{\text{T}}\) for some matrix X). Finally, the Procrustes approach consists in applying a Procrustean superimposition that scales and rotates matrices so as to maximize their fit (Mardia et al., 1979, pp. 416–419). The sum of the squared residuals between configurations in their optimal superimposition can then be used as a metric of similarity (Peres-Neto & Jackson, 2001). For each coefficient, statistical tests and comparisons to the null hypothesis were performed by means of random permutations on rows and columns.
Packages: https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/mantel, https://www.rdocumentation.org/packages/FactoMineR/versions/2.4/topics/coeffRV, https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/procrustes

  9. https://pypi.org/project/lda/

  10. LDA is a generative statistical model that estimates optimal probability distributions of words in topics and of topics in documents, provided a number K of topics is chosen beforehand. The choice of K = 25 in the previous topic modeling was made by generating a set of models with different values of K and comparing them based on expert judgment (see (Malaterre et al., 2020, Sect. 2)). LDA-generated topics are sets of words with their probabilities. Once sorted by decreasing order of probability, these terms are usually informative of the semantic content of each topic. Retrieving the texts in which a topic is the most probable helps in formulating a meaningful interpretation of that topic, in particular when its most likely terms may appear ambiguous. Sometimes, depending on the objectives of the study, it can be helpful to increase the number of topics so as to capture contextual variations of terms of interest (for instance, an earlier topic model of the single journal Philosophy of Science resulted in identifying 5 topics on explanation out of 126 meaningful topics; see (Malaterre et al., 2019, p. 225)).

  11. Hellinger distance from Gensim package (https://pypi.org/project/gensim/). We thank an anonymous reviewer for this suggestion.

  12. Regex = "\\? + \\p{L}".

  13. Regex = "\\? + [^\\p{L}\\?]".

  14. The number of question marks inside words in English texts (3.3) appears to be due to the presence of words in foreign languages (e.g., citations in German or Greek), algebraic expressions, OCR errors and encoding errors; it occurs predominantly in Synthese, Erkenntnis and the JGPS.

  15. On the more general role of logic in philosophy, see (Bonino et al., 2020) for a quantitative corpus-based approach. (Noichl, 2019) provides an all-encompassing view of the field of philosophy, also based on quantitative approaches.

  16. Of course, due to different syntactic word order rules in different languages, one should not expect the exact word order to be preserved in translations. Our point here concerns the approximate collocation of words in n-term windows as investigated for instance by co-occurrence analyses.

  17. Note that this is also the case when a term in the original language is translated by different terms in the target language depending on context. In other words, the semantic topology preservation test assesses the contextual consistency of translations.

  18. As suggested by a reviewer, we could conceive of a defective translator which would replace each properly translated term by its following term in a dictionary. This would result in an inadequate translation that would pass the semantic topology preservation test. Hence the importance of manual checks.

  19. We thank an anonymous reviewer for highlighting this point.
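
As an illustration of the Mantel coefficient defined in note 8, the following minimal Python sketch computes r_M over the off-diagonal upper triangle of two symmetric distance matrices and estimates a p-value by jointly permuting the rows and columns of one matrix. It is not the vegan implementation referenced in the note; the function names, the toy matrices, the number of permutations and the one-sided test are assumptions made for the example.

```python
import math
import random


def upper_triangle(matrix):
    """Return the off-diagonal entries with i < j as a flat list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n - 1) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mantel(d, b, permutations=999, seed=0):
    """Return (r_M, p_value) for two symmetric N x N distance matrices."""
    n = len(d)
    observed = pearson(upper_triangle(d), upper_triangle(b))
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of B together; D is left untouched.
        shuffled = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(d), upper_triangle(shuffled)) >= observed:
            at_least_as_large += 1
    # One-sided p-value; the observed arrangement counts as one permutation.
    p_value = (at_least_as_large + 1) / (permutations + 1)
    return observed, p_value


# Hypothetical toy data: B is D with every distance doubled.
toy_d = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
toy_b = [[2 * value for value in row] for row in toy_d]
print(mantel(toy_d, toy_b, permutations=99))
```

Because the coefficient is a correlation, uniformly rescaling all distances leaves it unchanged; what it measures is whether pairs of documents that are close in one matrix are also close in the other.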
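
The interpretation procedure described in note 10 (sorting the terms of a topic by decreasing probability, then retrieving the documents in which the topic is most probable) can be sketched as follows. The two probability tables, the vocabulary and the function names are hypothetical and stand in for the output of an LDA model; nothing here reproduces the study's actual topics.

```python
def top_terms(topic_word_probs, vocabulary, n=5):
    """For each topic, return its n most probable (term, probability) pairs."""
    ranked_topics = []
    for probs in topic_word_probs:
        ranked = sorted(zip(vocabulary, probs), key=lambda pair: pair[1], reverse=True)
        ranked_topics.append(ranked[:n])
    return ranked_topics


def top_documents(doc_topic_probs, topic, n=3):
    """Return the indices of the n documents in which `topic` is most probable."""
    ranked = sorted(
        range(len(doc_topic_probs)),
        key=lambda doc: doc_topic_probs[doc][topic],
        reverse=True,
    )
    return ranked[:n]


# Hypothetical model output: 2 topics over a 5-term vocabulary, 3 documents.
vocabulary = ["theory", "law", "gene", "species", "probability"]
topic_word = [
    [0.40, 0.30, 0.05, 0.05, 0.20],
    [0.05, 0.05, 0.50, 0.35, 0.05],
]
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

for index, terms in enumerate(top_terms(topic_word, vocabulary, n=2)):
    print(index, terms, top_documents(doc_topic, index, n=2))
```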
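
For readers unfamiliar with the Hellinger distance mentioned in note 11, here is a minimal standard-library sketch of its usual definition for two discrete probability distributions. It is written from the textbook formula rather than taken from the Gensim source, and the example distributions are invented.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Both inputs are sequences of non-negative numbers of equal length that
    each sum to 1. The result lies between 0 (identical) and 1 (disjoint).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical topic distributions for two documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
print(hellinger(doc_a, doc_b))
```

Unlike Euclidean distance on raw probabilities, this measure is bounded by 1, which makes distances comparable across pairs of documents.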
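
The limitation raised in note 18 can be made concrete with a small sketch. It assumes documents are represented as bags of words and compared with cosine distance (an assumption for this example, not necessarily the representation used in the study), and it makes the "next term in the dictionary" rule wrap around at the end so that the mapping stays one-to-one. All names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(doc_a, doc_b):
    """Cosine distance between two documents given as lists of terms."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(docs):
    """Pairwise cosine distances between all documents."""
    return [[cosine_distance(a, b) for b in docs] for a in docs]


def shift_translator(dictionary):
    """Map each term to the following term in the sorted dictionary."""
    ordered = sorted(dictionary)
    return {term: ordered[(i + 1) % len(ordered)] for i, term in enumerate(ordered)}


# Hypothetical, correctly translated toy documents.
correct = [
    ["cause", "law", "law"],
    ["law", "theory"],
    ["gene", "species", "cause"],
]
mapping = shift_translator({term for doc in correct for term in doc})
defective = [[mapping[term] for term in doc] for doc in correct]

print(defective)
print(distance_matrix(correct) == distance_matrix(defective))
```

Since the mapping is one-to-one, every term count is carried over to exactly one other term, so all document-to-document distances are unchanged even though every single term is wrong. This is why the test has to be complemented by manual checks.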

References

  • Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In Database Theory — ICDT 2001, edited by Jan Van den Bussche and Victor Vianu, (Vol. 1973, pp. 420–34). Lecture Notes in Computer Science. Berlin: Springer. https://doi.org/10.1007/3-540-44503-X_27.

  • Anellis, I. H. (2005). Smith, Henry Bradford (1882–1938). In J. R. Shook & R. T. Hull (Eds.), The dictionary of modern American philosophers. Thoemmes Continuum.

  • Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. In International AAAI conference on weblogs and social media.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3(March), 993–1022.

  • Bloembergen, S. (1969). Dr. Willem Marius Kruseman 1902–1969. Methodology and Science: Interdisciplinary Journal for the Empirical Study of the Foundations of Science and Their Methodology March. Retrieved from https://achterderug.nl/pageauteurs_libel.php?id=Kruseman&id2=W.M.

  • Bonino, G., Maffezioli, P., & Tripodi, P. (2020). Logic in analytic philosophy: A quantitative analysis. Synthese. https://doi.org/10.1007/s11229-020-02770-5

  • Boyd-Graber, J., & Blei, D. (2012). Multilingual topic models for unaligned text. https://arxiv.org/abs/1205.2657

  • De Smet, W., & Moens, M. F. (2009). Cross-language linking of news stories on the web using interlingual topic modelling. In Proceedings of the 2nd ACM workshop on Social web search and mining (pp. 57–64).

  • de Vries, E., Schoonvelde, M., & Schumacher, G. (2018). No longer lost in translation: Evidence that Google translate works for comparative bag-of-words text applications. Political Analysis, 26(4), 417–430. https://doi.org/10.1017/pan.2018.26

  • Dewulf, F. (2020). The institutional stabilization of philosophy of science and its withdrawal from social concerns after the Second World War. British Journal for the History of Philosophy (pp. 1–19). https://doi.org/10.1080/09608788.2020.1848794.

  • Dombrowski, D. (2020). Charles Hartshorne. In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Winter 2020. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2020/entries/hartshorne/.

  • Escoufier, Y. (1973). Le Traitement Des Variables Vectorielles. Biometrics, 29(4), 751–760. https://doi.org/10.2307/2529140

  • François, D., Wertz, V., & Verleysen, M. (2005). Non-Euclidean metrics for similarity search in noisy datasets. In ESANN (Vol. 2005, pp. 339–344).

  • Giere, R. N. (1996). From Wissenschaftliche Philosophie to Philosophy of Science. In Origins of Logical Empiricism, edited by Ronald N. Giere and Alan W. Richardson, (pp. 335–354). Minnesota Studies in the Philosophy of Science, v. 16. Minneapolis: University of Minnesota Press.

  • Giere, R. N., & Richardson, A. W. (Eds.). (1996). Origins of logical empiricism. Minnesota studies in the philosophy of science (Vol. 16). University of Minnesota Press.

  • Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101

  • Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297.

  • Guillemain, H. (2013). La Méthode Coué: Histoire d’une pratique de guérison au XXe siècle. Média Diffusion.

  • Hardcastle, G. L., & Richardson, A. W. (Eds.). (2003). Logical empiricism in North America. Minnesota studies in the philosophy of science (Vol. 18). University of Minnesota Press.

  • Howard, D. (2003). Two left turns make a right: On the curious political career of North American philosophy of science at midcentury. In Logical Empiricism in North America, edited by Gary L. Hardcastle and Alan W. Richardson, (pp. 25–93). Minnesota Studies in the Philosophy of Science, v. 18. Minneapolis: University of Minnesota Press.

  • Hu, Y., Boyd-Graber, J., Satinoff, B., & Smith, A. (2014). Interactive topic modeling. Machine Learning, 95(3), 423–469. https://doi.org/10.1007/s10994-013-5413-0

  • Hu, Y., Zhai, K., Eidelman, V., & Boyd-Graber, J. (2014b). Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers) (pp. 1166–1176).

  • Jagarlamudi, J., & Daumé, H. (2010). Extracting multilingual topics from unaligned comparable corpora. In European Conference on Information Retrieval (pp. 444–456). Springer.

  • Johnson, M. (2016). De Stijl (1917–1932). Routledge encyclopedia of modernism (1st ed.). London: Routledge.

  • Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4), 377–439.

  • Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284.

  • Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.

  • Lucas, C., Nielsen, R. A., Roberts, M. E., Stewart, B. M., Storer, A., & Tingley, D. (2015). Computer-assisted text analysis for comparative politics. Political Analysis, 23(2), 254–277.

  • Lui, M., & Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations (pp. 25–30).

  • Malaterre, C., Chartier, J.-F., & Pulizzotto, D. (2019). What is this thing called philosophy of science? A computational topic-modeling perspective, 1934–2015. HOPOS: the Journal of the International Society for the History of Philosophy of Science, 9(2), 215–249. https://doi.org/10.1086/704372

  • Malaterre, C., Lareau, F., Pulizzotto, D., & St-Onge, J. (2020). Eight journals over eight decades: A computational topic-modeling approach to contemporary philosophy of science. Synthese. https://doi.org/10.1007/s11229-020-02915-6

  • Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research, 27(2), 209–220.

  • Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.21236/ADA273556

  • Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. Academic Press.

  • Mauri, M., Elli, T., Caviglia, G., Uboldi, G., & Azzi, M. (2017, September). RAWGraphs: a visualisation platform to create open outputs. In Proceedings of the 12th biannual conference on Italian SIGCHI chapter (pp. 1–5). Cagliari, Italy: ACM Press. https://doi.org/10.1145/3125571.3125585.

  • Mimno, D., Wallach, H., Naradowsky, J., Smith, D. A., & McCallum, A. (2009, August). Polylingual topic models. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 880–889). Volume 2 - EMNLP ’09, 2:880. Singapore: Association for Computational Linguistics. https://doi.org/10.3115/1699571.1699627.

  • Noichl, M. (2019). Modeling the structure of recent philosophy. Synthese. https://doi.org/10.1007/s11229-019-02390-8

  • Pence, C. H., & Ramsey, G. (2018). How to do digital philosophy of science. Philosophy of Science, 85(5), 930–941. https://doi.org/10.1086/699697

  • Peres-Neto, P. R., & Jackson, D. A. (2001). How well do multivariate data sets match? The advantages of a Procrustean superimposition approach over the Mantel test. Oecologia, 129(2), 169–178.

  • Pruss, D., Fujinuma, Y., Daughton, A. R., Paul, M. J., Arnot, B., Szafir, D. A., & Boyd-Graber, J. (2019). Zika discourse in the Americas: A multilingual topic analysis of Twitter. PLoS ONE, 14(5), e0216922. https://doi.org/10.1371/journal.pone.0216922

  • Reber, U. (2019). Overcoming language barriers: Assessing the potential of machine translation and topic modeling for the comparative analysis of multilingual text corpora. Communication Methods and Measures, 13(2), 102–125.

  • Reisch, G. A. (2005). How the Cold War transformed philosophy of science: To the icy slopes of logic. Cambridge University Press.

  • Richardson, A., & Uebel, T. (2007). The Cambridge companion to logical empiricism. Cambridge University Press.

  • Ruder, S., Vulić, I., & Søgaard, A. (2019). A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65, 569–631.

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, (pp. 44–49). Manchester.

  • Shuyo, N. (2010). Language Detection Library for Java. http://code.google.com/p/language-detection/.

  • Stegeman, J. H. (1992). Gerrit Mannoury: A bibliography. Tilburg University Press.

  • Vaesen, K., & Katzav, J. (2019). The National Science Foundation and philosophy of science’s withdrawal from social concerns. Studies in History and Philosophy of Science Part A, 78(December), 73–82. https://doi.org/10.1016/j.shpsa.2019.01.001

  • van Berkel, K. (2001). Schoenmaekers, Mathieu Hubertus Josephus. In Biografisch Woordenboek van Nederland V (pp. 462-464). Instituut voor Nederlandse Geschiedenis. Retrieved from http://resources.huygens.knaw.nl/bwn1880-2000/lemmata/bwn5/schoenma.

  • Volk, M., Furrer, L., & Sennrich, R. (2011). Strategies for reducing and correcting OCR errors. In Language Technology for Cultural Heritage, (pp. 3–22). Springer.

  • Windsor, L. C., Cupit, J. G., & Windsor, A. J. (2019). Automated content analysis across six languages. PLoS ONE, 14(11), e0224425. https://doi.org/10.1371/journal.pone.0224425

  • Woleński, J. (2020). Lvov-Warsaw School. In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Summer 2020. Metaphysics Research Lab, Stanford University. Retrieved from https://plato.stanford.edu/archives/sum2020/entries/lvov-warsaw/.

  • Yuan, M., Van Durme, B., & Ying, J. L. (2018). Multilingual anchoring: Interactive topic modeling and alignment across languages. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 11. Montréal.

  • Zhang, D., Mei, Q., & Zhai, C. (2010). Cross-lingual latent topic extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, (pp. 1128–1137).

  • Zhao, B., & Xing, E. (2007). HM-BiTAM: Bilingual topic exploration, word alignment, and translation. Advances in Neural Information Processing Systems, 20, 1689–1696.

Acknowledgements

The authors are grateful to JSTOR, Elsevier, Oxford University Press, Springer, Taylor and Francis, and University of Chicago Press for providing access to journal articles for text-mining purposes. Special thanks are due to Martin Léonard for developing the topic-model web-browser, to Pedro Peres-Neto for providing guidance with matrix similarity measures, to Sari Lemable and Frédérick Deschênes respectively for Dutch and German translation checks, to Rens Strijbos for insights about W. M. Kruseman, and to Charles Pence and Luca Rivelli for their invitation to submit to this special issue. The authors also thank the audiences of a 2020 TEC seminar at UQAM, of the DS2-2021 conference and of the 2021 CSHPS congress for comments on an earlier version of the manuscript. They also thank the reviewers at Synthese for their very valuable comments. C.M. acknowledges funding from Canada Foundation for Innovation (Grant 34555) and Canada Research Chairs (CRC-950-230795). F.L. acknowledges funding from the Fonds de recherche du Québec – Société et culture (FRQSC-276470).

Author information

Contributions

CM and FL jointly conceived the study. CM analyzed the results, wrote and revised the manuscript. FL prepared the corpus, wrote the code and revised the manuscript.

Corresponding author

Correspondence to Christophe Malaterre.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Malaterre, C., Lareau, F. The early days of contemporary philosophy of science: novel insights from machine translation and topic-modeling of non-parallel multilingual corpora. Synthese 200, 242 (2022). https://doi.org/10.1007/s11229-022-03722-x

  • DOI: https://doi.org/10.1007/s11229-022-03722-x
