1 Introduction

For the last several centuries, academic journals and their concomitant processes (e.g., editorial arbitration, peer review) have been the dominant means by which scholarship and research are appraised, validated, and disseminated (Csiszar, 2018; Porter, 1992; Smith, 2006). Indeed, many scientific discoveries are reported, critiqued, and shared within the pages of these publications. Einstein’s “Zur Elektrodynamik bewegter Körper” (“On the Electrodynamics of Moving Bodies”), for example, which was foundational in his subsequent development and refinement of the general theory of relativity, was published in 1905 in the influential German journal Annalen der Physik (Einstein, 1905). This work was later confirmed by Eddington and colleagues (Dyson et al., 1920), whose account and results of the 1919 solar eclipse observations were published in the Philosophical Transactions of the Royal Society. In the century since, publication in academic journals has increased dramatically. Today, the number of articles published per year is estimated to be around 3 million (see Johnson et al., 2018).

Though scholarly discourse has increasingly expanded outside the confines of journals (e.g., through use of preprint servers, personal and academic blogs, podcasts, etc.; see Kupferschmidt, 2020; Mollett et al., 2017; Quintana & Heathers, 2021), journals have retained their dominant role within the scientific process – often being characterized as the “gatekeepers” of scientific and scholarly knowledge (e.g., Caputo, 2019; Demeter, 2020; Siler et al., 2015). However, there is substantial variability across journals (even within a single field) in terms of a journal’s published output (e.g., the number of volumes, issues, and articles published per year, the closed/open access status of articles), the mechanisms and processes by which articles are evaluated (e.g., open/closed publication of reviews, blinding/unblinding of reviewers and authors, data sharing policies), and the quality and content of the work published (e.g., the presence and rates of selective reporting of results, data fabrication). What’s more, individual journals are malleable – their internal processes and external output may change dramatically over time.

Because of this variability and the broader technological and institutional changes occurring across scholarly publishing (see generally Tennant et al., 2017), scientific and philosophical evaluation must advance beyond the traditional institutions and norms used to assess journal quality. A framework is needed to capture how and why some journals review and publish what might be called genuine contributions to the corpus of knowledge more frequently than others – the latter of which, to varying degrees, publish (and thus perhaps tacitly endorse) false or ambiguous claims (Camerer et al., 2018; Dunleavy, 2021; Gambrill, 2018; Open Science Collaboration, 2015). This framework will need to account for both momentary states (i.e., how a journal functions at a given timepoint) and wider trends, which extend across years and decades. Lastly, any such framework must enable direct comparison of a set of journals – since journals often compete with one another at various levels of analysis.

For instance, journals may publish peer-reviewed manuscripts that are based on suboptimal research practices (see generally Chambers, 2017; Dunleavy, 2020), fraud (e.g., Harvey, 2020; Stroebe et al., 2012), and/or plagiarism (e.g., Baždarić et al., 2012), among a host of other issues (e.g., Ioannidis, 2005).Footnote 1 By doing so, these journals may hinder the development of a robust body of knowledge (Dunleavy, 2021) by introducing false, misleading, unverified, or otherwise unjustified claims into the scholarly literature (Akça & Akbulut, 2021; Akerlof & Michaillat, 2019; Ioannidis, 2016; Smaldino & McElreath, 2016). The upshot of these differences, however, is that journals can, in principle, be appraised and contrasted according to how well they function in their role as gatekeepers.Footnote 2

2 “Good” and “bad” journals

A variety of labels, designations, and metrics exist to help gauge a journal’s performance in gatekeeping or otherwise adjudicate between so-called “good” and “bad” journals.Footnote 3 While this is not the place for a comprehensive overview (see Tobin, 2004), a couple of brief examples will help illustrate this effort. Perhaps most notable is the commonly used journal “impact factor” (Garfield, 1972). The impact factor (IF) gives a rough approximation of how often articles from a particular journal are cited – by dividing the total number of citations across some period of time (e.g., 1 year, 5 years, etc.) by the number of citable articlesFootnote 4 in some finite set. Conventional thinking suggests that the higher the impact factor (relative to a respective field), the greater the quality and prestige of a journal – and the lower the impact factor, the lower the quality, prestige, and (perhaps) rigor (see Sternberg, 2018). Journals with the highest impact factor in their respective field are accordingly viewed as containing the strongest scholarly contributions. Though the IF has entrenched itself into the heart of contemporary scholarly publishing, and seemingly provides some utility as a metric (Garfield, 1972, 2006; Hoeffel, 1998), its value as a measure of journal quality is questionable.
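For concreteness, the conventional two-year formulation of the impact factor can be written as follows (the notation is mine, not Garfield’s):

$$
\mathrm{IF}_{Y} = \frac{\text{citations received in year } Y \text{ to items published in years } Y-1 \text{ and } Y-2}{\text{number of citable items published in years } Y-1 \text{ and } Y-2}
$$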

Select criticisms include, but are not limited to: incorrectly imputing value to individual articles or authors based on journal-level metrics (i.e., the journal impact factor; JIF) (Ioannidis & Thombs, 2019), the equating of merit or scientific impact with the IF (Dunleavy, 2022; Eyre-Walker & Stoletzki, 2013), the failure to control for self-citation practices by authors and journals (Larivière & Sugimoto, 2019), skewness in the distribution of citations among articles (Nature, 2005), the manipulation, negotiation, and general opacity of IF calculations (Brembs et al., 2013; Rossner et al., 2007), and English-language bias.Footnote 5 One critic (Brembs, 2018) posits that methodological rigor decreases as IF increases – a point that, if true and generalizable, would seriously undermine use of the IF as a proxy for quality altogether.

In a similar vein, scholars have attempted to differentiate between quality, trustworthy, or “reputable” journals on the one hand, and untrustworthy or so-called “predatory” ones on the other, through the creation of various whitelists and blacklists (Beall, 2010, 2014; Bisaccio, 2018; Grudniewicz et al., 2019; Laine & Winker, 2017; Teixeira da Silva et al., 2021). The underlying aim here is largely the same – journals found on blacklists are often viewed as containing weak, fraudulent, or otherwise flawed scholarship (i.e., they are “bad journals” and to be avoided). For example, blacklists (e.g., Beall’s List and Cabell’s; Bisaccio, 2018; Chen, 2019) have been used to identify journals that purportedly fail to meet professional standardsFootnote 6 and/or are alleged to engage in exploitative or otherwise dishonest behaviors. Whitelists, such as the PubMed journal list, the Directory of Open Access Journals (DOAJ), and the Committee on Publication Ethics (COPE), among others, serve as a means for communicating that a set of journals has been vetted or deemed “trustworthy” according to some set of criteria or individual/group assessment (i.e., they are “good journals”). Surely these tools have helped authors detect and avoid some bad faith actors in the publishing world. However, several limitations hamper these types of lists, including:

  • 1) the inability of scholars to agree upon a precise and objective definition of – or criterion for – the term “predatory” (e.g., Aromataris & Stern, 2020; Cobey et al., 2018; Grudniewicz et al., 2019; Teixeira da Silva et al., 2019),

  • 2) the heterogeneity and somewhat arbitrary nature of the characteristics subsumed under the “predatory” label (e.g., Shamseer & Moher, 2017; Shamseer et al., 2017; Tennant et al., 2019), and

  • 3) the problem of “false positive” and “false negative” cases (i.e., mistakenly labeling a non-predatory journal as “predatory” and vice versa; Teixeira da Silva & Tsigaris, 2018; Tsigaris & Teixeira da Silva, 2021).

These issues may ultimately lead scholars to rethink the use of these labels and ensuing lists altogether (see discussion by Anderson, 2015; Kratochvíl et al., 2020; Shamseer & Moher, 2017).Footnote 7 Neylon (2017) has gone so far as to argue that blacklists, specifically, are “technically impossible to make work” (p. 2).Footnote 8,Footnote 9

What is most interesting about these efforts to appraise journals (and their respective difficulties) is that they resemble and overlap with many of the core problems described by philosophers of science throughout the nineteenth and twentieth centuries – where the quest for a suitable demarcation criterion to distinguish between science and pseudoscience, the nature of scientific progress, and the sociology of science were at the forefront of discussion and debate.Footnote 10 For instance, the attempt to distinguish between predatory and non-predatory (or “good” and “bad”) journals (Siler, 2020) resembles Popper’s (1959/1968) classical formulation of the demarcation problem (i.e., how to distinguish science from pseudoscience). Contemporary questions about the role of journals and publishers in the development and growth of scholastic knowledge (see Dunleavy, 2021; Lock, 1985, especially Chapter 6) mirror investigations into the concept of, and theories about, scientific progress (e.g., Lakatos & Musgrave, 1970; Laudan, 1977). Finally, the self-governance of scholarly communities (including the reform of journal publication practices and standards; e.g., Dunleavy & Hendricks, 2020; Hardwicke et al., 2020) has links to broader inquiries into the workings, policing, and self-correction of scientific communities (e.g., Kuhn, 1962; Merton, 1973; Zuckerman & Merton, 1971). Because of these parallels, it may be fruitful to draw upon the philosophy of science to illuminate aspects of these current debates in journalology (i.e., the study of publication practices) and specific attempts to appraise journal quality.

In this article, I present a novel lens through which we can evaluate scholarly journals – one that helps capture their dynamic nature (i.e., that journal policies, functions, and published output are not fixed, but diverse, malleable, and ever-changing).Footnote 11 Specifically, I draw on the work of Imre Lakatos and his methodology of scientific research programmes (MSRP; Lakatos, 1968b, 1970, 1978b). I argue that his classifications of “progressive” and “degenerative” research programmes – and their general features – can be analogized and repurposed for the evaluation of scholarly journals. In doing so, I argue that this alternative framework resolves some of the flaws of the current approaches discussed above and offers a more considered evaluation of journal quality – one that helps account for the historical evolution of journal-level publication practices and consequent contributions to the growth (or stunting) of scholarly knowledge.Footnote 12

I begin by introducing Lakatos’s MSRP and its associated terminology. Next, I discuss how it can be repurposed for the task of evaluating scholarly journals. I note here that Lakatos’ philosophy need not be the “final word”, or even a correct account (see Larvor, 2006, especially p. 715; and Cohen et al., 1976) of the philosophy and sociology of science, to be a fruitful tool in this respect.Footnote 13 After all, even flawed philosophies can still be useful.

Having introduced the MSRP, I then attempt to operationalize its components – relying on two metrics: the “mistake index” (MI; see Margalida & Colomer, 2015) and the “scite index” (SI; see scite, 2019b) – to help illustrate empirical and theoretical features of the progressive and degenerative classifications. These metrics provide preliminary criteria for assessing whether a journal’s articles – as a function of its internal editorial policies and practices (e.g., peer review) – have caused it to enter a progressive, degenerative, or stagnant phase. Finally, I discuss how such an approach could complement or supplant contemporary (if flawed) evaluative schemes, such as the impact factor and the development of predatory lists, described above, and outline an agenda for future scholarship on this topic.

3 Lakatos and the methodology of scientific research programmes

Imre Lakatos (1922–1974) was a Hungarian-born philosopher of science. After fleeing Hungary in 1956, he emigrated to England, where he completed his PhD in philosophy (1961, King’s College, Cambridge). In 1960, he started teaching at the London School of Economics. As a philosopher, Lakatos had a number of interests, but primarily centered his focus on the philosophies of mathematics and science (see Lakatos, 1976, 1978a, 1978b). Lakatos would go on to publish on topics such as the problem of demarcation, theory-change, and induction, as well as the growth of scientific knowledge, until his sudden death in 1974.Footnote 14

Within the philosophy of science, he is perhaps most noted for his development and articulation of the methodology of scientific research programmes – a framework which he viewed as an advance on the work of his contemporaries Thomas Kuhn and Karl Popper (Lakatos, 2012, p. 23).Footnote 15 Kuhn (1962, 1970), for his part, attempted to capture the process of scientific change and the workings of scientific communities with his concept of the “paradigm” and the historically-driven descriptions of “normal” and “revolutionary” science. Popper (1959/1968, 1963), on the other hand, attempted to justify a rational, deductive approach to scientific discovery, via his method of “conjectures and refutations” – one that embraced a Humean skepticism towards induction and challenged the perceived difficulties inherent in the contemporary tradition of the logical positivists. Though Lakatos was critical of both of these attempts, he was sympathetic to each as well. Consequently, features of both Kuhn and Popper can be found in his work and general philosophical outlook.Footnote 16

The MSRP can be viewed as an attempt to bring aspects of the rationality (or “logic”), practice, and history of science into a comprehensive framework. As will be described below, it touched not only on matters of theory-testing and evidential support (e.g., Popperian falsification and corroboration), but also on the methodological and pragmatic decisions made by scientists (and scientific communities) about how and when to pursue or withdraw from lines of inquiry (e.g., Kuhn’s “normal” and “revolutionary” science and puzzle-solutions) – as well as on the historical track record (successes and failures) of theories and their conceptual rivals.

Under Lakatos’ MSRP (1968b, 1970), a research programme has four key components: 1) an unchanging or irrefutable “hard core”, 2) a “protective belt”, 3) a “negative heuristic”, and 4) a “positive heuristic”. Ideally, research programmes will all have these four components, though in practice, some components have been more clearly articulated than others. The hard core comprises the essential elements of a programme. It is what we might call the “lead idea” (Larvor, 2006) or “central principles” (Hacking, 1983). In Newton’s theory of gravitation, the hard core contains his three laws of motion and his one law of gravitation. Together, these four laws serve as the foundation from which empirical observations are predicted and explained. Accordingly, they are not to be questioned or modified.Footnote 17 The protective belt, in contrast, brings a sort of stability to the programme. It consists of the set of auxiliary hypotheses that further articulate and support the underlying theory (i.e., the hard core) and, most importantly, shield it from potential falsifiers or other empirical anomalies. In other words, these hypotheses serve an instrumental purpose (i.e., to “protect” the hard core). They are disposable, eventually becoming modified or replaced by new auxiliary hypotheses. Again, using Newton’s programme as our example, the protective belt would consist of his theories of atmospheric refraction and geometrical optics (Lakatos & Zahar, 1975; Lakatos, 1971b) – theories which can be tweaked or supplanted in the face of inconsistencies between predictions made from Newton’s four laws and subsequent observations. Finally, there are the negative and positive heuristics. Heuristics are the implicit and explicit methodological rules and beliefs which constrain and guide the behavior of scientists working within the programme. The negative heuristic tells the scientist what not to do (e.g., “…not to tinker with the hard core…”, Chalmers, 1999, p. 133), while the positive heuristic tells the scientist where to focus their attention – providing what Hacking (1983) describes as a “…ranking of [scientific] problems” (p. 117) to work on. In doing so, the positive heuristic prevents the scientist from being sidetracked by empirical anomalies or otherwise unimportant issues. Together, the two heuristics help facilitate research within the programme.

Having sketched out some basic features of a Lakatosian research programme, we can now turn to how programmes are classified and appraised. Lakatos calls programmes which are performing well “progressive” and those that are failing or otherwise languishing “degenerative”.Footnote 18 A programme is considered progressive when it makes novel predictions and has (at least some of) these predictions confirmed/corroborated. These predictions should be consistent with the programme’s positive heuristic. The more a programme achieves these tasks, compared to its rivals, the more progressive it is deemed.Footnote 19 In contrast, a degenerative programme fails to make novel, confirmed predictions – or does so in spite of its own flawed internal logic. Rather, a degenerating programme’s development (e.g., modification of auxiliary hypotheses) occurs not due to its own internal success, but “in response to external criticism” (Larvor, 2006, p. 713) or in reaction to the success(es) of its competitors. For Lakatos, Newton’s and Einstein’s theories were paradigmatic cases of progressive programmes, with socioeconomic Marxism and Freudian psychoanalysis epitomizing degenerative ones (Lakatos, 2012). More contemporary examples of progressive programmes might include investigations into the effects of childhood adversity and social deprivation on the development of later psychopathology (e.g., McLaughlin & Hatzenbuehler, 2009; Varese et al., 2012; Wickham et al., 2014), the HIV-AIDS hypothesis (Barré-Sinoussi et al., 1983; Gallo, 1991; Gallo & Montagnier, 2003), and, more recently, the CRISPR gene editing programme within molecular biology (Hsu et al., 2014). Examples of degenerative programmes might include research investigating genetic or monoamine-deficit causes of depression (e.g., Border et al., 2019; Cai et al., 2020; Healy, 2015; Lacasse & Leo, 2005), the drug-induced or “chemical” hypothesis of AIDS (see generally Ellison & Duesberg, 1994; Duesberg, 1992, 1996a, 1996b; Duesberg et al., 2003; Ellison et al., 1995), and perhaps, more broadly, collections of programmes – such as those found in some areas of nutritional science (e.g., Ioannidis, 2013; Taubes, 2007).

A final point on Lakatos’ MSRP is worth stressing here. In Lakatos’ philosophy, progressive and degenerative appraisals are contextual, contrastive, and impermanent – features which are distinct yet intertwined. A programme is progressive or degenerative with respect to how it fares relative to its rivals (e.g., does Programme A out-predict/out-perform Programme B?) at a given point in time (or across some finite timeframe). Moreover, these designations are not fixed – that is, a progressive programme can begin to fail in time, while a degenerating one may yet recover and flourish. In Lakatos’s (1971a) own words:

“One must realise that one’s opponent [rival programme], even if lagging badly behind, may still stage a comeback. No advantage for one side can ever be regarded as absolutely conclusive. There is never anything inevitable about the triumph of a programme. Also, there is never anything inevitable about its defeat…The scores of the rival sides, however, must be recorded and publicly displayed at all times.” (p. 101)

Accordingly, the MSRP’s strengths reveal its purported limitations. That is, it is often argued that Lakatos’ MSRP permits a contextual and historically-driven assessment of scientific research (appraisal), but does not allow for definitive conclusions or recommendations (e.g., what lines of research to pursue or abandon) to be made.Footnote 20 We will return to these issues in more detail later. For now, we will attempt to motivate the case for applying the MSRP to scholarly journals.

4 Progressive and degenerative journals

Though Lakatos’s MSRP (1968b, 1970) largely dealt with the empirical sciences like physics, chemistry, and astronomy – fields with highly mature theoretical content (e.g., Newton’s theory of gravitation, Bohr’s model of the atom, Copernicus’ heliocentric model of the universe; Lakatos & Zahar, 1975; Musgrave, 1976b; Zahar, 1973a, 1973b) – there is no clear rationale for precluding its use in describing other areas of inquiry. Lakatos himself (1971a) makes this point clear when he states that “[T]he methodology of research programmes may be applied not only to norm-impregnated historical knowledge but to any normative knowledge, including even ethics and aesthetics.” (p. 132, emphasis added). This point is affirmed by Kadvany (2001), who argues that “Lakatos intends the methodology of scientific research programmes as a general theory of criticism…” (p. 216). Such non-empirical, quasi-empirical, and extra-scientific applications of the MSRP are exemplified by Lakatos himself, in his framing of Rudolf Carnap’s system of inductive logic as a degenerating research programme (Lakatos, 1968a; see also Groves, 2016), his reappraisal of the methodology of “proofs and refutations” as a progressive programme within mathematics (Lakatos, 1970, p. 180, footnote 2), and in his discussion of Popper’s philosophical account of the growth of scientific knowledge (Lakatos, 1970, p. 180, footnote 2; see also Lakatos, 1978b, Chapter 3 and Feyerabend, 1974). His contemporaries demonstrate the utility and flexibility of the MSRP as well – for example, D’Amour (1976) in his discussion of act utilitarianism within the field of ethics (see particularly pp. 89–93; and Alfano, 2013), Koertge (1972) in her discussion of the philosophical school known as logical empiricism, and Popper (1974) in his (initial) characterization of evolutionary theory and Darwinism as metaphysical research programmes (pp. 118–121 and 133–143).Footnote 21

Even Feyerabend, Lakatos’s close friend and intellectual antagonist (Motterlini, 1999), floats the possibility of theologicalFootnote 22 research programmes (Feyerabend, 1975, p. 15), if we are by necessity to consider programmes within the context of their contemporary rivals, scientific and otherwise. These examples, along with later applications of the MSRP in fields far removed from philosophy and the hard sciences, such as economics (Latsis, 1976), developmental psychology (Phillips, 1987; Phillips & Nicolayev, 1978), intelligence research (Urbach, 1974a, 1974b), and international relations (Elman & Elman, 2003), demonstrate the flexibility of the MSRP and give precedent for its application (and potential utility) within the broader areas of journalology, meta-research, and meta-science discussed here.Footnote 23 Since the MSRP attempts to capture the rationality of scientific practice and the growth of knowledge, it could conceivably be employed in the context of scholarly journals, where the practice of scholarly publication – vis-à-vis journals – provides an (ostensibly) rational means for assessing new ideas and evidential claims and for validating and disseminating new knowledge.

4.1 Journals as research programmes

Journals are composed of many different parts and processes. There is the obvious published output, namely, journal articles – which may, depending on the journal, include empirical studies, conceptual pieces, editorials, commentaries and opinion pieces, letters to the editor, book reviews, and so on. Journals are also composed of stakeholders, including but not limited to the editor(s)-in-chief, editorial board, reviewers, the authors who submit to and publish in the journal, and (perhaps also to some extent) the journal’s broader readership. Still further are the standards and policies by which the journal performs its activities. These may include its stated “aims and scope”, instructions to authors and reviewers (i.e., the basic standards submitted manuscripts must conform to and the standards to which reviewers are to adhere in evaluating those manuscripts), the internal/external processes by which manuscripts are solicited, evaluated, published, and rejected (e.g., the number of referees utilized, anonymization of authors/reviewers, criteria for determining acceptability, etc.), and the output of those processes (e.g., reviewers’ reports), among other things. These components are brought together and materialized within the journal’s parameters.Footnote 24

Some of the features of journals can be readily redeployed within the four parts of the Lakatosian framework. Recall, these are the 1) hard core, 2) protective belt, 3) negative heuristic, and 4) positive heuristic. The negative and positive heuristics are readily identifiable. In scientific research programmes, these are the rules and beliefs that guide the behavior of scientists and drive experimentation. The analogous features in this context would be the tacit and explicit rules, norms, and policies that guide the journal’s stakeholders (namely, but not exclusively, the editorial board and peer reviewers). Examples abound. The Public Library of Science (PLOS) family of journals, for instance, requires “authors to make all data necessary to replicate their study’s findings publicly available without restriction at the time of publication” (PLOS One, 2019), barring certain ethical and legal exceptions. Another example comes from the field of public health. Kenneth Rothman (1998), former Editor-in-Chief of the prominent journal Epidemiology, notably discouraged authors from using p-values. While acknowledging that they are sometimes productively used, he noted that “…we prefer that p-values be omitted altogether, provided that point and interval estimates, or some equivalent, are available.” (p. 334; see Trafimow, 2014).Footnote 25 More generally, each journal’s “Aims and Scope” description helps guide editors and reviewers in determining the “fit” between a manuscript’s content and the journal’s overarching mission or purpose. Together, these heuristics and rules guide a journal’s stakeholders towards manuscripts that it finds of value, and away from ones it disvalues,Footnote 26 while constraining or nudging authors to submit content which coheres with the form, standards, and objectives of the journal.Footnote 27 Note here that this kind of valuing might ideally be said to refer to manuscripts that are additive to our knowledge base, but this is not necessarily so (see generally Else, 1978, pp. 269–270). Journals can, after all, value “flashy” papers and results or other instrumental and non-epistemic characteristics over, or in addition to, scholarship which promotes substantive research and the search for truth (see generally Brembs, 2018; Nosek et al., 2012; Serra-Garcia & Gneezy, 2021; Smith, 2006).

Having identified how heuristics come into play within journals, we can now focus on the other two components of the MSRP: the hard core and the protective belt. This effort is admittedly less clear-cut, but we can sketch out some possibilities. In traditional scientific research programmes, the hard core consists of the central principles or ideas. The protective belt consists of a (relatively) disposable set of auxiliary hypotheses. The latter support the former and serve to ward off empirical anomalies and other threats. What, then, is the central, fixed “core” of a journal? Certainly, no single or finite (yet ever-changing) number of articles could be said to make up the core of a journal. Articles are malleable. They can be corrected, retracted, and critiqued. Nor is the core represented by the stakeholders and staff of a journal, for they too change with the passing of time. I argue that the hard core could, perhaps, be best represented by something slightly more abstract – what might be understood as the overarching aim (or “ethos”) of the journal.Footnote 28

A few real-world examples will help illustrate what I mean by aim. The New England Journal of Medicine (henceforth NEJM; n.d.) states that “Our mission is to publish the best research and information at the intersection of biomedical science and clinical practice and to present this information in understandable, clinically useful formats that inform health care practice and improve patient outcomes”. Psychological Science (henceforth PS; n.d.), one of the foremost psychology journals, describes itself as publishing “cutting-edge research articles and short reports, spanning the entire spectrum of the science of psychology. This journal is the source for the latest findings in cognitive, social, developmental, and health psychology, as well as behavioral neuroscience and biopsychology”. Social Work Research (henceforth SWR; n.d.) purports to publish “…exemplary research to advance the development of knowledge and inform social work practice”. Each journal is ostensibly a venue in which to publish rigorous and/or novel research, which helps to advance its discipline’s knowledge base and/or improve clinical practice. Regardless of whether these journals are successful in these efforts, we can see that something lies at the center of each journal – something that would fundamentally change the journal itself if altered – regardless of its physical content (i.e., stakeholders, published output).Footnote 29 Put differently, at its core, a journal consists of an overarching scholarly ethos (i.e., the journal’s primary aim), whose internal sorting processes (editorial decision-making and peer review) are guided by a set of positive and negative heuristics. This results in submitted manuscripts being validated and assimilated into the journal (i.e., “published”) or rejected and discarded. When this underlying ethos changes (e.g., in response to external pressures or internal dysfunctions), something fundamental at the journal’s core has changed as well.

A journal’s published articles (as objects) could be said to comprise the protective belt. Articles published in the NEJM, for example, support (in varying ways) the journal’s hard core.Footnote 30 Articles that successfully sustain this core help foster a progressive programme. Insomuch as a journal’s core is concerned with scientific/epistemic progress, this may, among other things, mean that the hypotheses posited within empirical articles are subject to severe tests (e.g., Mayo, 2018), which are subsequently confirmed by the results of the performed analyses, and are later corroborated and/or replicated in future studies. Ones that do not,Footnote 31 contribute to the degeneration of a programme. Degeneration, then, can be characterizedFootnote 32 by increasing rates of corrections, retractions,Footnote 33 and outside critique (e.g., refutation). While these criticisms are very often directed at the article(s) and their authors,Footnote 34 critical attention may inevitably shift towards the journal itself. As these criticisms add up, the journal may fall into disrepute (though this is not always the case; see Fanelli, 2013).

Having described the MSRP and explicated its components in the context of scholarly journals, we can now attempt to describe its potential application. To do so, I will draw on two formalized metrics: the “mistake index” and the “scite index”. I have decided to emphasize these two metrics for illustrative purposes and because I feel they help capture (in their own limited ways) the empirical and theoretical contributions of articles and, more broadly, of the journals in which they are published. These metrics provide preliminary criteria for assessing whether a journal’s articles have caused it to enter a progressive, degenerative, or stagnant phase. However, in practice, the appraisal of research programmes will very likely rely on a number of metrics and assessments (epistemic and non-epistemic; empirical and metaphysical).Footnote 35

4.2 The mistake index (MI)

The mistake index (MI) is a relatively novel tool, put forth by Margalida and Colomer (2015) as a measure of reviewer/editor effectiveness. In its simplest form, the MI is calculated by taking the total number of corrections (i.e., errata/corrigenda) published annually by a journal (or publisher) and dividing by the total number of items published. This provides a standardized index that can be compared from year to year and/or across journals. The lower the MI, the better a journal’s performance (generally speaking). To use a toy example, a journal that published 100 items in a given year, alongside ten corrections [i.e., 10/100], would have an MI of 0.10. Of course, as with other tools such as the IF, a journal’s MI can be calculated for any finite period of time (e.g., a 5-year period, lifetime), and need not be done solely for a single year. Margalida and Colomer (2016) have since extended and refined this proposal, delineating between an index calculated according to total items published (described above), which they call the Mistake Index Total (MIT), and one calculated on the basis of total papers (i.e., articles) published – the Mistake Index Paper (MIP). In the following examples, all references to the MI will refer to this latter calculation.
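A minimal sketch of this calculation, using the toy counts above (the function name and error handling are mine, not Margalida and Colomer’s):

```python
def mistake_index(corrections: int, items_published: int) -> float:
    """Mistake index (MI): corrections (errata/corrigenda) published over a
    given period, divided by the number of items published in that period.
    Lower values indicate fewer corrections per published item."""
    if items_published == 0:
        raise ValueError("No items published in the period.")
    return corrections / items_published


# Toy example from the text: 100 items and ten corrections in a single year.
print(mistake_index(corrections=10, items_published=100))  # 0.1
```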

The MI is used here for two purposes. First, it serves as a proxy of reviewer/editor effectiveness.Footnote 36 I say proxy because, under ideal circumstances, the MI would simply reflect how well a journal catches mistakes, errors, omissions, etc., with higher rates indicating increasingly suboptimal functioning of review processes. However, since journals presently lack incentives to correct published mistakes, more rigorous journals may have higher rates of corrections relative to their (less self-critical) peers. Nevertheless, the MI still serves as a surrogate for quality, because a high MI may reflect nothing more than low quality and poor performance.Footnote 37 Second, the MI captures something about a journal’s empirical and theoretical contributions. A journal whose articles often contain errors (especially severe ones) is not supporting or protecting its hard core (e.g., to publish the best research on ‘x’), nor reliably generating confirming evidence. Consequently, a journal that has a high MI relative to its peers (particularly when examining a set of journals with rigorous self-assessment and post-publication peer review) – or an increasing MI, relative to itself, over time – can be characterized as being stagnant or (in extreme cases) degenerative.

4.3 The scite index (SI)

scite is a deep learning-based platform developed by researchers and academics from across the sciences. scite, among other things, extracts citations from published articles and categorizes them based on how they are discussed within the text of the citing paper (i.e., supporting, contrasting [i.e., disputing], or mentioning – the latter being a neutral descriptor). This gives readers (and authors) insight into how well supported a particular article or claim is within the scholarly corpus. This approach has since been quantified. The scite index (SI) provides a standardized measure by which one can evaluate the quality of scholarly publishing (e.g., at the journal level). To calculate the SI for a given journal, one takes the number of supporting citations across a given time period and divides by the number of supporting and contrasting citations. For example, suppose that, as of the year 2019, the articles of a given journal (say, journal “X”) have received 95 supporting citations and only 5 contrasting citations. This would mean journal X has an SI of 0.95 [i.e., 95/(95 + 5)]. As with the MIP above, an SI can be calculated for a given journal for any time interval (e.g., a 1-year period, 5-year period, lifetime). This permits both internal evaluation of a journal’s quality across time and comparison with its peers (Nicholson, 2021; Nicholson et al., 2021; scite, 2019a).
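A corresponding sketch of the SI calculation, using the hypothetical counts for journal “X” above (again, the function name is mine):

```python
def scite_index(supporting: int, contrasting: int) -> float:
    """scite index (SI): supporting citations divided by the sum of
    supporting and contrasting citations over a given period."""
    total = supporting + contrasting
    if total == 0:
        raise ValueError("No supporting or contrasting citations in the period.")
    return supporting / total


# Toy example from the text: journal "X" with 95 supporting and 5 contrasting citations.
print(scite_index(supporting=95, contrasting=5))  # 0.95
```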

The SI is used here for two purposes. First, it serves as a proxy for the corroboration of findings (i.e., results) and claims (e.g., hypotheses/theory) made within articles. Again, I say proxy because supporting citations may reflect well-confirmed findings, but this is not necessarily so. A paper, author, or journal with a high SI score may merely reflect groupthink among a network or field of scholars, or a strong yet misplaced (or premature) scientific consensus. Nevertheless, the SI serves as an important indicator of journal quality. Findings that are not (or never) supported in the subsequent literature will (likely) not be accepted within the broader scholarly community. Second, the SI captures something about a journal’s empirical and theoretical contributions – that is, ostensibly, well-confirmed and robust findings will (inevitably) have higher SI scores than those which are not, whereas findings that are disconfirmed, inconclusive, or controversial will seemingly have middling or low scores. Consequently, a journal that has a high SI relative to its peers – or an increasing SI, relative to itself, over time – may be characterized as being progressive. Lower or decreasing SI scores may reflect a stagnating or degenerating journal.

Ideally, the MI and SI could be used together, in a complementary fashion – or paired with other proxies for functioning, such as comments posted after publication (i.e., post-publication peer review; see Bordignon, 2020). This would help gauge the reliability of published output, as well as the epistemic and social value of specific journals to the scientific community.

Table 1 (below) is based on the metrics described above and provides preliminary standards for appraising journal functioning. A journal that has a low MI (with few severe mistakes) and a high SI can tentatively be labeled as progressive. Such a journal, under ideal conditions, is validating and disseminating research and scholarship which is both reliable and highly confirmed. A journal which has a high MI (particularly one driven by moderate and severe mistakes), but still produces work that corresponds with a high SI, should be viewed more skeptically (stagnating). Published output from this type of journal may be less reliable (though again, here, context is key – a high MI could reflect greater editorial attention, for instance, when it largely reflects mild or moderate errors). A journal which has a low SI and a high MI may be classified as degenerating (degenerative1). Such a journal is producing work that is unconfirmed or disconfirmed and is of questionable reliability. Even if the high MI reflects greater editorial functioning (e.g., by correcting severe errors), the journal is of questionable utility if its contributions to the scholarly corpus lack epistemic value. Lastly, we have journals which have a low MI and a low SI (degenerative2). A journal with these features is ostensibly reliable, but produces unconfirmed or disconfirmed scholarship – leading one to ask what value the journal has. In the absence of any evidence of self-correction, it is worth questioning whether the journal – through its editorial standards and practices – is unable or unwilling to solicit, appraise, and produce work which generates novel and confirmable findings.

Table 1 Preliminary taxonomy for appraising scholarly journals

  Classification    Mistake index (MI)    scite index (SI)
  Progressive       Low                   High
  Stagnating        High                  High
  Degenerative1     High                  Low
  Degenerative2     Low                   Low
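To make the taxonomy concrete, the following sketch encodes Table 1 as a simple decision rule. The function name and threshold values are illustrative placeholders only; in practice, “high” and “low” would be judged relative to peer journals, over time, and with attention to the severity of the underlying mistakes:

```python
def classify_journal(mi: float, si: float,
                     high_mi: float = 0.05,
                     high_si: float = 0.80) -> str:
    """Tentative classification following the quadrants of Table 1.

    mi: mistake index (corrections per published item); lower is better.
    si: scite index (share of supporting vs. contrasting citations); higher is better.
    The threshold values are illustrative, not values proposed in the text.
    """
    mi_is_high = mi >= high_mi
    si_is_high = si >= high_si
    if not mi_is_high and si_is_high:
        return "progressive"    # reliable and well-confirmed output
    if mi_is_high and si_is_high:
        return "stagnating"     # well-confirmed output, but frequent corrections
    if mi_is_high and not si_is_high:
        return "degenerative1"  # error-prone and poorly confirmed output
    return "degenerative2"      # few corrections, but poorly confirmed output


# Comparing two hypothetical journals over the same period.
print(classify_journal(mi=0.01, si=0.95))  # progressive
print(classify_journal(mi=0.10, si=0.40))  # degenerative1
```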

5 Caveats and limitations

There are several limitations to the proposal discussed here which must be noted. Many of these problems haunt Lakatos’ original formulation (see Feyerabend, 1970, 1980). Foremost, the MSRP does not offer hard and fast rules for pursuit at the level of the individual. Though the progressive and degenerative designations have utility in the appraisal and assessment of theories and journals, they cannot provide definitive guidance (to the individual researcher, scholar, or scientist) about which to support, endorse, avoid, or dismiss in the future. Rather, the MSRP – applied within the context of scholarly journals – is useful for overall appraisal, and this assessment can then be used to inform pursuit within scholarly communities. Put differently, for the individual, it is primarily a descriptive and “backward-looking” (Hacking, 1983) framework – offering a pragmatic set of tools for future consideration, rather than a purely prescriptive one. After all, there are valid reasons that an individual may provide for backing (i.e., publishing within, reviewing for, contributing to) a seemingly degenerating journal or for avoiding a progressive one (Feyerabend, 1970, 1980).Footnote 38 Reflecting on replies from his colleagues, Lakatos (1971b) notes:

"The arguments my critics produce have made me realise that I fail to stress sufficiently forcefully one crucial message of my paper [1971a]. This message is that my … ‘methodological rules’ explain the rationale of the acceptance of Einstein’s theory over Newton’s, but they neither command nor advise the scientist to work in the Einsteinian and not in the Newtonian research programme." (p. 174, emphasis added)

Despite these faults, degenerating programmes have instrumental value – by applying direct or indirect pressure(s) to competing research programmes and by offering guidance for scholarly communities (as well as government agencies, funders, etc.). The latter can readily employ these designations as a rational guide about what to justifiably endorse or withhold resources from (Lakatos, 1970, p. 157, footnote 1; see also elaboration by Musgrave, 1976a, 1978).

The inability of the MSRP to provide absolute certainty (i.e., the fact that past performance is a fallible indicator of future success) is readily addressed. As Motterlini (1995) notes:

“[T]o the question of whether we are able to supply any practical indication on the basis of appraisals which are concerned with past performances, we can now give a positive answer. … Naturally, it is not the lot of humans to know the future – and the question whether past performance is a reliable guide to the future may be addressed to all methods of appraisal –, but this does not mean that any research programme is as "promising" as another”. (pp. 6-7).

In other words, the fallibility of our inferential tools (e.g., p-values, Bayes factors in statistical inference) at the individual level does not preclude their valid use at the broader level of the scientific community.

Other limitations include the difficulty in fully operationalizing the “progressiveness” and “degeneration” labels (concepts that surely exist on a spectrum), the actualization and implementation of the MSRP into journalology and meta-research (i.e., whether the MSRP can be fully refashionedFootnote 39 for this purpose), and the ever-growing set of alternative methods of journal appraisal (e.g., Teixeira da Silva et al., 2021).Footnote 40 These have been highlighted here and will be addressed in future discussions of the proposal.

What is gained by this exercise is an ability to capture journal change over time (see generally Lock, 1989) – something that is largely missingFootnote 41 from current, static designations, such as whitelists and blacklists,Footnote 42 and that is understood too narrowly when applied solely within individual metrics (e.g., the JIF). After all, what is a “high impact”, “good”, or “legitimate” journal today may not be tomorrow – and one that is “low impact”, “bad”, or “predatory” can ultimately change the way it performs, re-establishing (or achieving for the first time) reputability and legitimacy. This is particularly true given the shock of the COVID-19 pandemic to the publication landscape (Dunleavy et al., 2020), which has left consumers of scholarship drowning in a sea of (mis)information (Correia & Segundo, 2020; Dunleavy & Hendricks, 2020) and has raised questions about the quality of published work in even leading journals (Alexander et al., 2020; Jung et al., 2021; Zdravkovic et al., 2020). Paraphrasing Barseghyan and Shaw (2017), we might say that applying the MSRP to journals benefits us, over current alternatives, by providing criteria and a logic for “keeping score” (p. 9) of which journals are performing best (in a given moment or timeframe), thus informing our overall assessment/appraisal and aiding in the comparison of multiple journals in a given field or area of inquiry.

Lastly, there is a broader set of issues that impact the development and functioning of academic journals, which deserve mention. Individual journals and publishers are vulnerable to a variety of internal and external pressures that have thus far only been alluded to. The commercialization of scholarly publishing (Larivière et al., 2015; Tennant, 2018) and research (Resnik, 2007), a skewed incentive structure for researchers (see Ritchie, 2020, Chapter 7), and (in some disciplines) the rampant influence of pharmaceutical companies (Goldacre, 2012), have shaped how science is conducted, and fostered financial and intellectual conflicts of interest among (many) authors and editorial board members (e.g., Abramson & Starfield, 2005; Brody, 2007). This has enabled the manipulation of the scholarly literature (e.g., Angell, 2000; Oreskes & Conway, 2010) through practices such as ghostwriting (McHenry, 2010; Sismondo, 2009). These issues impact not only what scholarship is published, but how it is conducted, funded, and the standards by which it is appraised and critiqued. While these issues continue to warrant greater attention, it can only be noted here that they must be taken into account when assessing the scholarly ecosystem using the MSRP framework. This will be a task required of future work – once the components of the MSRP are fully articulated, debated, and defended.

6 Conclusion

This article has attempted to outline some of the deficiencies of current efforts to evaluate journal quality – specifically highlighting issues with the impact factor and predatory blacklists and whitelists. It was noted that many of these efforts share features found across debates within twentieth-century philosophy of science – particularly the demarcation problem. It was then proposed that Lakatos’ MSRP and his classifications of “progressive” and “degenerative” programmes could be repurposed to evaluate journals and scientific publishing more broadly. As a preliminary effort, the MI and SI were described to help operationalize these classifications. Future theoretical and empirical work will need to be done to further explicate the characteristics of progressive and degenerative journals described here and to determine the utility and applicability of these classifications for scientific and scholarly use.Footnote 43