Introduction

The focus of this opinion essay is a powerful means to shape—and often cloud—people’s judgments: quantifications.Footnote 1 Numbers have been turned into, so we argue in allusion to Karl Marx’s famous statement about religion, the new opium of the people, namely when it comes to making judgments about uncertain things. Numbers affect the public as much as those experts who dedicate their lives to producing, processing, and scrutinizing quantifications: scientists. Statistics-driven thinking is characteristic of modern society (Desrosières, 1998); “...we are living in a world informed by indicators” (Zenker, 2015, p. 103). Give people numbers, and they will have something to hold on to, to be blinded by, or to argue against. In one way or the other, “[n]umbers rule the world” (Gigerenzer et al., 1989, p. 235):

  1. Numbers can dramatically change how we make judgments under uncertainty about multiple aspects of science and society. Particularly p-values (and nowadays increasingly Bayesian statistics, too) have metamorphosed the way psychologists and other social scientists make inferences (for basic readings, see Gigerenzer et al., 1989, and Gigerenzer & Murray, 1987; for a critical discussion of the Bayesian twist, see Gigerenzer & Marewski, 2015). When it comes to building theories of cognition, “…methods of statistical inference have [even] turned into metaphors of mind” (Gigerenzer, 1991, p. 254). Bibliometric statistics transform judgments about scientists, their work, and institutions when it comes to ‘assessments’ of productivity, value, or quality: quantitative science evaluation.

  2. Decision making in science is concerned with uncertainty (Pfeffer et al., 1976; Salancik & Pfeffer, 1978). The routine uses of quantifications to target uncertainty aid governance, fuel bureaucratization, and cement social conventions (e.g., h-indices to evaluate scientists and p-values to evaluate research findings). Bibliometric indicators seemingly help to establish ‘objective’ facts. Indicators are used to “...produce relationships among the things or people that they measure by sorting them according to a shared metric” (Espeland, 2015, p. 59). Those, in turn, serve to justify decisions (e.g., about funding scientific work or hiring senior scientists), and, if need be, also put decision makers (e.g., scientists, administrators, or politicians) in a position to defend themselves (e.g., against being accused of having made arbitrary, biased, nepotistic, or otherwise flawed decisions). The problem with indicators is, however, that they measure but “do not explain” (Porter, 2015, p. 34). For those numbers to become meaningful, it is necessary to fill them with life—as can be done when scientific experts interpret them.

  3. The advance of numbers as a substitute for judgment comes with rather old ideals, including those of ‘rational, analytic decision making’, dating back, at least, to the Enlightenment. The past century has seen a twist of those ideals, with psychological research (e.g., Kahneman et al., 1982) documenting how human judgment deviates from alleged gold standards for rationality, thereby trying to establish intuition’s flawed nature (Hoffrage & Marewski, 2015). Views that human judgment cannot be trusted certainly do not help when it comes to stopping the twentieth century’s Zeitgeist of using seemingly objective (= judgment-free) statistics in research (e.g., p-values) or the more recent digital wave of ‘objectifiers’ in science evaluation (e.g., h-indices).

  4. In a way it is telling that the notion of objectivity in science itself is an object of study and critical reflection (e.g., Douglas, 2004, 2009; Gelman & Hennig, 2017; Gigerenzer, 1993). Yet, there is more: John Q. Public—ourselves included—is prepared to trust and use quantitative data to understand and manage all kinds of objects and phenomena—from our finances to life-expectancy and “human needs” (e.g., Glasman, 2020, p. 2), with the title of Porter’s (1995) classic, “Trust in numbers: The pursuit of objectivity in science and public life”, beautifully reflecting more general trends that engulf academic activities, including their evaluation.

  5. Cohen (1994) titled a paper “The earth is round (p < .05)”. While the “[m]indless” (Gigerenzer, 2004a, p. 587) abuse of p-values and seemingly judgment-free indicators such as the h-index is nowadays prevalent in virtually all branches of academia, the decision sciences, statistics, and their history inspire us both to question this state of affairs and to point to antidotes against the harmful side effects of the increasing quantification of science evaluation.Footnote 2 In our view, the mindlessness can be overcome if science evaluations are actually made and understood as good human judgments under uncertainty, where not everything is known or knowable, and where surprises can disrupt routines and other seeming givens (see e.g., Hafenbrädl et al., 2016). This view suggests that mistakes are inevitable and need management, or that different statistical judgment tools ought to be chosen mindfully, in an informed way, as a function of the context at hand (see e.g., Gigerenzer, 1993, 2018). We believe that science evaluation under uncertainty may be aided if those using numbers (e.g., citation counts) in research evaluation have expertise in bibliometrics and statistics and are ideally active in the to-be-evaluated area of research. Such expertise can aid both (i) understanding when good human judgment ought to be trusted even when numbers speak against that judgment, and (ii) realizing why good human judgment and intuition are what matter—even when there is no number attached to them. Put differently, common sense should rule numbers, not vice versa.

In what follows, we will first sketch out historical contexts and societal trends that come with the increasing quantification of science and society. Second, we will turn to those developments’ latest outgrowth: the exaggerated and uninformed use of bibliometric statistics for research evaluation purposes. Third, we explore how the mindless use of bibliometric numbers can be overcome. We close by calling for bringing common sense back into scientific judgment exercises.

Before we begin, let us add a commentary. One of us once co-edited a special issue on human intuition. The issue’s introduction (Hoffrage & Marewski, 2015) tried to capture the elusive nature of human intuition with contrasting poles, including Enlightenment thinking and the “culture of objectivity” (p. 148) as well as poetry by polymath Johann Wolfgang von Goethe and a painting by Romanticist artist Caspar David Friedrich. Yet, while pictures, poetry, and other artwork (e.g., stories, films, songs) may trigger intuitions about intuition, until about that time, the author had spent little time reflecting on the possibility that there may be things numbers and algorithms cannot capture; to the contrary, in our respective fields, we have both repeatedly argued for approaching judgment quantitatively (e.g., through algorithmic models).Footnote 3 So while we caricature the quest for quantification in this opinion essay, following the proverbial expression “Let any one … who is without sin be the first to throw a stone …” (The Bible, John 8:7), we need to stone ourselves; and as suggested by the essay’s title, we are, perhaps, stoned already.

Significant numbers

Quantifications aid governance

The quest for quantifications is not new. Numbers, written on papyrus, coins, or milestones, have aided in governing societies and their activities—ranging from trade to war—for thousands of years. The Roman Empire offers examples (see e.g., Vindolanda Tablets Online, 2018); and so does, for instance, Prussia later on (e.g., Hacking, 1990). Quantifications became part of the Deoxyribonucleic Acid (DNA) of states (e.g., the German welfare state), enabling the computation of contributions to pension funds and insurance schemes; they served to record commercial and demographic activities as well as military assets. Indeed, the word statistics likely originates in states’ quest for data, with, for instance, the “English Political Arithmetic” (Desrosières, 1998, p. 23) and the German equivalent, “Statistik” (Desrosières, 1998, p. 16), being traceable to the 1600s (for more historical discussion, see e.g., Daston, 1995; Krüger et al., 1987; Porter, 1995).Footnote 4

Big data is one of the latest reflections of the old proverbial insight ‘knowledge is power’—an insight that exists in various forms (e.g., The Bible, Book of Proverbs, 24:5; Hobbes, 1651, Part I-X) and languages (e.g., in German: ‘Wissen ist Macht’),Footnote 5 but that may gain yet other meanings with massive digitalization, possibly implying, in the future, e-governance, digital democracy, or a dictatorship of numbers (e.g., Helbing et al., 2017; Marewski & Hoffrage, 2021; O’Neil, 2016). This is a development one could subsume under a motto commonly (mis)attributed to Galileo Galilei (e.g., by Hoffrage & Marewski, 2015, p. 149): “measure what is measurable, and make measurable what is not so”Footnote 6; or, reformulated in terms of science evaluation: evaluate what is evaluable and make evaluable what is not.

Databases featuring numbers of publications, grants, and other ‘output’ can be used—and, if need be, instrumentalized—for academic governance purposes. Much like outside of science, even the mere ability—expertise—to quantify can be a source of power or of claims thereof. And this holds not only for the science evaluator, but also for researchers themselves. For instance, in the twentieth century, psychologists like Edward Thorndike spearheaded the quantification of their field; and that quantification came with a side effect. As Danziger (1990) points out, “Quantitative data … could be transformed into…power for those who controlled their production and interpreted their meaning to the nonexpert public” (p. 147), equating “[t]he keepers of that [quantitative] knowledge … [with] a new kind of priesthood, which was to replace the traditional philosopher or theologian” (p. 147)—a “religion of numbers” (p. 144). Conflicts such as those between quantitatively and qualitatively working social scientists are probably not alien to some readers of this essay either.

Sir Ronald Fisher (1990a) put it bluntly in his “Statistical methods for research workers”—a bible, originally published in 1925: “Statistical methods are essential to social studies, and it is principally by the aid of such methods that these studies may be raised to the rank of sciences” (p. 2). Numbers create science.

Quantifications offer seemingly universal and automatic means to ends

Analysis and reason as means to understand (and rule) the world are old ideals: traces of them can be found in the Enlightenment, an epoch that brought forth thinkers such as Immanuel Kant and Pierre-Simon de Laplace. Gottfried Wilhelm Leibniz (1677/1951), an Enlightenment pioneer, pointed out that “…most disputes arise from the lack of clarity in things, that is, from the failure to reduce them to numbers” (p. 24). Arguing that “…there is nothing which is not subsumable under number” (p. 17), he proposed to develop a “universal characteristic” (p. 17, in original fully capitalized) that “…will reduce all questions to numbers…” (p. 24). Modern counterparts of such ideals seem to be universalism and automatism (or some variant thereof; see e.g., Gigerenzer, 1987; Gigerenzer & Marewski, 2015; see also Gelman & Hennig, 2017). By automatic, ‘neutral’ measurements and quantitative evaluation procedures we mean those that—independent of the people (e.g., scientists, evaluators, judges) using them—should yield ‘unbiased, objective judgments’, say, for better decision making. One can think of such automatic processes as being “mechanical” (Gigerenzer, 1993, p. 331) input–output relations. Universalism complements mechanical automatism and refers to the pretension that the corresponding judgment procedures are serviceable for all problems. Omnipresent in universal and automatic procedures are numbers. Numbers can be conveniently used across contexts (one can enumerate anything). They seem to lend objectivity (e.g., to observations, inferences, and decisions) that is independent of people (it does not matter who counts the number of citations; everybody should arrive at the same number; see also Porter, 1993).

In scientific research, a prominent example of universalism and automatism is the usage of ‘null hypothesis significance testing’ (NHST) for all statistical inferences (e.g., Gigerenzer, 2004a).Footnote 7 Statistical inferences are judgments under uncertainty. In making those judgments, social science researchers unreflectingly report p < 0.05, as if the p-value did not depend on their own intentions (see Kruschke, 2010, for dependencies of the p-value), or as if that number were equally informative for all problems. As Meehl (1978) pointed out almost half a century ago in an article titled “Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology”: “In the typical Psychological Bulletin article … we see a table showing with asterisks … whether this or that experimenter found a difference in the expected direction at the .05 (one asterisk), .01 (two asterisks!), or .001 (three asterisks!!) levels of significance” (p. 822).

The p-value fetish is not limited to old-school psychology. For instance, Gigerenzer and Marewski (2015) report that an estimated 99 p-values were computed, on average, per article in the Academy of Management Journal in 2012. Why would somebody compute 99 p-values? Habits and other factors may play a role in the sustained usage of significance testing (e.g., Oakes, 1990). But such routine number crunching can also be linked to “…the satisfying illusion of objectivity: [Historically,] [t]he statistical logic of analyzing data seemed to eliminate the subjectivity of eyeballing and wishful distortion” (Gigerenzer, 1993, p. 331; see also Ziliak & McCloskey, 2012). A small number (p < 0.05) translates into a giant leap towards objectivity in scientific decision-making (e.g., what to conclude from data) and editorial decision-making (e.g., what papers to publish). In Danziger’s (1990) words: “By the end of the 1930s … ‘Statistical significance’ had become a widely accepted token of factual status for the products of psychological research” (p. 152).

But not only scientists use automatic, universal statistics to make seemingly objective judgments; science itself is increasingly subjected to context-blind, number-driven inference—“Surrogate science”, as Gigerenzer and Marewski (2015, p. 436) put it, spreads fast and widely. Here, an important example of universalism is the routine reliance on h-indices and journal impact factors (JIFs) by bibliometric laypersons for measuring and making judgments about the ‘productivity’ or ‘quality’ of scientists, institutions, and scientific outlets, independent of context. Context can be the discipline, the research paradigm within a discipline, or the unit investigated (be it a teaching- or research-oriented professor or institution). Automatism in research evaluation comes in the guise of legal procedures, such as those that accompany faculty evaluation exercises. JIFs and h-indices should ‘objectively’ tell, independently of who conducts the evaluation, whether a scientist is worth hiring or merits a promotion.

How numbers seem to replace (bad) human judgement and intuitions

In 2011, a paper in the Journal of Personality and Social Psychology, “… reports 9 experiments, involving more than 1,000 participants, that test for retroactive influence by ‘time-reversing’ well-established psychological effects so that the individual’s responses are obtained before the putatively causal stimulus events occur” (Bem, 2011, p. 407).

Quantifications communicate reassuring objectivity and help to establish ‘facts’—seemingly regardless of whether it comes to the supernatural, the divine, death, the uncertain future, or other more mundane things, and even more so when the numbers come embedded into procedures (e.g., Gigerenzer et al., 1989). Indeed, nowadays, scientists, politicians, doctors, and businesspersons evoke quantifications rather than hunches or gut feelings to motivate and justify their thoughts about scientific ideas, spending policies, diseases, or mergers. h-indices and JIFs serve to infer scholars’ future performance, motivating tenure decisions. We even use the absence of large numbers to make judgments, such as when pointing out that few are known to have died from smoking (a few decades ago), or from living close to nuclear power plants (still today).Footnote 8 Quantifications, cast into statistics and algorithms, seem to allow us to control the uncertainties of the future and to (cold-bloodedly) justify present-day decisions. In short, calculation seems to have, for better or worse, replaced mere hunches, intuitions, feelings, and personal judgment—in science and beyond (Hoffrage & Marewski, 2015).

This was not always so, and not in so many contexts—and one does not have to go back to the times of shamanic rituals for cases in point. For instance, “…the rise of statistics in therapeutics was part of the process of objectivization through which science entered medicine … and diagnosis became increasingly independent of the patient’s own judgment” (Gigerenzer et al., 1989, p. 47). Before that, medicine had for centuries been based on individual judgment rather than on averages and other (e.g., epidemiological) statistics (for more discussion, see e.g., Gigerenzer, 2002b; Porter, 1993).

The link between calculation and intelligence has also changed—including the meaning of intelligence itself, which differs between the eighteenth and twentieth centuries (Daston, 1994). As Daston (1994) points out,

“The history of calculation in the Enlightenment is a chapter in the cultural history of intelligence. Calculation had not yet become mechanical, the paradigmatic example of processes that were mental but not intelligent. Indeed, eighteenth-century philosophers conceived of intelligence and even moral sentiment to be in their essence forms of calculation” (p. 185).

Daston (1994) adds that

“When Pierre-Simon Laplace described probability theory as ‘good sense reduced to a calculus,’ he intended to disparage neither good sense nor probability theory thereby… Yet by the turn of the nineteenth century, calculation was shifting its field of associations, drifting from the neighborhood of intelligence to that of something very like its opposite and sliding from the company of savants and philosophes into that of unskilled laborers” (p. 186).

Actually, what is the meaning of ‘intelligence’ nowadays? And who is intelligent? Seemingly ‘objective’ assessments of aptitude have become almost synonymous with a number: IQ. Yet, intelligence and similar notions, measured by IQ and other indices, are inventions of the twentieth century. They served, for instance, U.S. military recruitment, immigration control, and horrifying policies, with low (e.g., IQ) scores offering arguments (i) for limiting access to education or even (ii) for sterilization (see e.g., Gigerenzer et al., 1989; Severson, 2011; Young, 1922). This example sadly illustrates how numbers can not only displace individual, idiosyncratic judgment but also be turned into easy and universal criteria to single out individuals.

Seemingly irrational behaviors and subjective, biased cognitions, ranging from feelings to intuitions, pose prominent targets for investigation when it comes to documenting, correcting, and even exploiting them. Starting in the 1970s, Kahneman and Tversky’s heuristics-and-biases research program (e.g., Kahneman et al., 1982; Tversky & Kahneman, 1974) brought irrational, error-prone, and faulty judgment and decision making into thousands of journal pages, and onto practitioners’ agendas (Hoffrage & Marewski, 2015; Lopes, 1991). Many of those judgment biases were defined based on numbers, including by experiments showing how people’s judgments violated Bayes’ theorem—a seemingly universal, single benchmark for “…rationality for all contents and contexts” (Gigerenzer & Murray, 1987, p. 179). In other research programs, statistics from behavioral studies fueled similar conclusions, namely that irrational citizens need outside help and steering.

The libertarian paternalist movement’s emphasis on nudging (ignorant) people (Thaler & Sunstein, 2009) is one example (see Gigerenzer, 2015, for a critique). Also, caricatures of homo economicus—an egoistic being by nature, who in the absence of punishment and control will inevitably maximize her/his own interests, measured by utilities—fit the widespread view that people’s subjective judgments cannot be trusted. Of course, utilities can be expressed numerically such that, in principle, everything (e.g., costs and benefits of crime, marriage, fertility, or discrimination) can be modelled with them (see Becker, 1976). The most recent outgrowth of the mistrust in good human judgment could be the view that artificial intelligence—systems based on code and numbers—will soon outperform human intelligence, or the wishful belief that machines—algorithms operating on numbers—will always be more ‘objective’ judges than humans (e.g., avoiding biases in hiring decisions). A potential flip side of the coin: perhaps “Weapons of Math Destruction”—to borrow a beautifully frightening term from O’Neil (2016, p. 3).

Science evaluation with quantifications

How numbers fuel the quest for objective, unbiased, and justifiable judgment

“The ideal of the classical natural sciences was to consider knowledge as independent of the scientist and his measuring instruments”, as Gigerenzer (1987, p. 16) points out. His piece (1987) on the “Fight against Subjectivity” (p. 11) illustrates how the use of statistics became institutionalized in a social science: psychology. Experimenters invoked statistics to make their claims independent (i) of themselves as well as (ii) of the human subjects they studied. The fight against subjectivity is neither unique to psychology, nor have quests for objectivity ended in the social sciences. Across disciplines, scientists use quantifications to make their claims about their findings and the value of their work appear independent of themselves and of context (see also Gelman & Hennig, 2017). Academia, so we argue, has been undergoing a significant transformation for many years: as much as quantifications have contributed to transforming other aspects of science and society, they shape science evaluation (Wilsdon et al., 2015). Indicators are used to strive for unbiased, fair, or legitimate judgments.

What is being evaluated varies, ranging from individual journal articles to different ‘producers of science’, including scientists or competing departments and universities. For instance, one may use the number of citations a paper accumulates on Google Scholar to find out whether that paper is worth reading and citing. Likewise, when it comes to justifying promotions of assistant professors or to allocating limited amounts of funds to competing departments, what frequently counts is the number of papers published in first-quartile journals of the Journal Citation Reports (Clarivate) or in the top 5 journals in economics (American Economic Review, Econometrica, Journal of Political Economy, Quarterly Journal of Economics, and Review of Economic Studies) (Heckman & Moktan, 2020).

Similar statistics serve science policy, with ‘taxpayers’ investments’ in academic institutions, personnel, and research seemingly calling for ‘objective’ indicators of success as justification. Research evaluation is characterized by heterogeneous practices (Hug, 2022). A prominent example (with changing practice over the last years) is the Research Excellence Framework (REF) of the United Kingdom. The REF is the United Kingdom’s “…system for assessing the excellence of research in UK higher education providers...” (REF 2029, 2024). It informs the public about the quality of British science (e.g., for 2014, “[t]he overall quality of submissions was judged … 30% world-leading … 46% internationally excellent…”, REF, 2014). The objectives of the framework are to:

  • “provide accountability for public investment in research and produce evidence of the benefits of this investment

  • provide benchmarking information and establish reputational yardsticks, for use in the higher education sector and for public information.

  • inform the selective allocation of funding for research” (REF 2029, 2024).

One wonders to what extent, in the United Kingdom and other countries (e.g., Australia), those developments feed and are fed by businesspersons.Footnote 9 Companies offer a stream of user-friendly number producers, ranging from search engines (e.g., Google Scholar) and network applications (e.g., ResearchGate) to bibliometric products (e.g., InCites provided by Clarivate, and SciVal from Elsevier). Lawyers and journalists may have their share, with public outcries of Corruption! Nepotism! or the latent threat of court trials (e.g., brought by job candidates) incentivizing academic institutions to implement (e.g., hiring or resource allocation) procedures that are not, first and foremost, sensible, but that are defensible. The rationale of the number-based defenses would be that quantifications seem harder to argue with than subjective judgments and intuitions.

The quantification of science evaluation has numerous consequences (see also e.g., Weingart, 2005). Scientists can get fixated on producing a certain number of publications (e.g., per year) rather than on simply doing research for the sake of the research itself (see also e.g., Gigerenzer & Marewski, 2015; Smaldino & McElreath, 2016). Scholars may worry more about the statistics computed (e.g., the p-value), promising both (a) ‘publishability’ (e.g., in ‘high-impact’ journals) and (b) seeming ‘objectivity’ in conclusions, than about the meaningfulness of the analysis and research question, the precision of the underlying theory and its generalizability, or the quality of the data collected to test the theory. Likewise, when research is evaluated, the focus may be, once more, on seemingly ‘objective’ h-indices and JIFs, but not on the actual content and contribution of scientific work—and much the same holds true when researchers themselves (e.g., job applicants) are under evaluation.

Even when it comes to ‘measuring’ career prospects, there may be parallels (to citation-based numbers such as h-indices and JIFs)—at least historically in disciplines such as psychology. As Rosnow and Rosenthal (1989) point out,

“It may not be an exaggeration to say that for many PhD students, for whom the .05 alpha has acquired almost an ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is less than .05. However, if the p is greater than .05, it can mean ruin, despair, and their advisor's suddenly thinking of a new control condition that should be run” (p. 1277).

By that logic, a few decades ago, the number of p-values < 0.05 a young scholar ‘found’ could have served as an early indicator of professional success, similar to how one can look, nowadays, at the number of high-impact journals she/he publishes in or her/his h-index.

Each indicator suffers from different problems. For example, there is probably no universal way of citing across fields, and reasons for citing differ (Tahamtan & Bornmann, 2018). So do normalization procedures. Yet tools and evaluation guidelines centered on single ‘key performance’ indicators (e.g., JIFs, h-indices, or other ‘significant numbers’) can obliterate such diversity in repertoires of methods and indices. A parallel is provided by publication manuals and textbooks on statistics advocating uniform p-value crunching that brushes over the conflicting assumptions of different statistical frameworks, including diverse meanings of the level of significance (from Fisher and from Neyman and Pearson; see e.g., Gigerenzer, 1993). Arguably, a single ‘key performance’ indicator, mandated to be used in all circumstances, poses little affordance—to borrow a notion from Gibson (1979/1986)—for judgment. Hence, it also poses no affordance for practicing judgment and acquiring expertise, such as knowing when to rely on what statistic.Footnote 10

However, single ‘key performance’ indicators afford being looked up quickly in open literature databases, with online fora and other digital media (e.g., the press) conveniently allowing for pressure and control. This ranges from public investigation (potentially by everybody with an internet connection) to harsh punishment, with the fear of digital pillorying and ‘shit storms’ potentially further inducing defensive logics of the kind ‘better publish too much than too little’. Or would you, as a dean or public funder, like to see ‘your institution’ exposed, on the internet, as consistently producing fewer papers than average, being low in the rankings, hosting scientist X, publicly accused of being guilty of Y, or being non-compliant with guidelines A, B, and C? In the aftermath of all this mess, what matters is

  • producing more than an arbitrarily defined number of papers per year,

  • having an h-index of a certain magnitude (e.g., 15 or whatever a supposedly excellent score is; see Hirsch, 2005), or

  • publishing in a journal in the first quartile of the Journal Citation Reports (= ‘Q1-journal’).

How numbers in quantitative science evaluation parallel those in social science research

On other dimensions, too, there seem to be interrelated parallels between the quantification of research evaluation and social science research more generally (Gigerenzer, 2018; Gigerenzer & Marewski, 2015):

  1. ‘Significant numbers’ simplify life: whether it is 99 p-values spit out by off-the-shelf statistical software or citation counts on Google Scholar, nowadays indicators are easy to obtain. Also, seemingly everybody feels capable of using them: indeed, even a school child who might not understand the contents of a paper, or know how to recognize quality work, can count and grasp larger-smaller relationships, which is what indicators are all about (e.g., 80 > 20 citations; p < 0.05; see e.g., Gigerenzer, 1993, p. 331, on the common practice of hypothesis testing in social science, i.e., psychology: “...a fairly mechanical schema that could be taught, learned, and copied by almost anyone”).

  2. Indicators speed up ‘production’ in global academic factories: in a research world where quantifications matter, time saved by relying on a number (e.g., a p-value or JIF) rather than on more cumbersome activities (e.g., alternative data analyses or reading papers from a job applicant) helps to play the game (e.g., producing more papers or evaluating articles fast; see also Bornmann & Marewski, 2019). As Gigerenzer (1993) points out with respect to the common practice of mechanical hypothesis testing, “[i]t made the administration of the social science research that exploded since World War 2 easier, and it facilitated editors’ decisions” (p. 331). The p-value offers “...a simple, ‘objective’ criterion for...” (Gigerenzer & Marewski, 2015, p. 429) judging findings—NHST as a fast, automatic judgment procedure, applicable to all articles, independent of context and people. Something analogous seems to be happening with h-indices and JIFs when they are (mis)used in fast and automatic ways, brushing over the idiosyncrasies of academic life. Indeed, as noted by Gigerenzer (2018), “…null ritual [NHST] can be seen as an instance of a broader movement toward replacing judgment about the quality of research with quantitative surrogates” (p. 214).

  3. ‘Significant numbers’ are not always understood: in the last century, as textbook writers intermingled Fisher’s and Neyman and Pearson’s competing statistical frameworks, social scientists started to use, with NHST, a “hybrid theory” (Gigerenzer, 2018, p. 212). They made p < 0.05 a magic potion (i.e., a drug), widely consumed, but prone to misattributions about what that potion can actually do and what it cannot (e.g., “Probability of replication = 1 – p”, Gigerenzer, 2018, p. 204; see also Oakes, 1990). The Annual Review of Statistics and Its Application—a journal that aims to “…inform[] statisticians, and users of statistics…”—once advertised on its principal website that it had “…debuted in the 2016 Release of the Journal Citation Report[s] (JCR) with an Impact Factor of 3.045” (Annual Review of Statistics and Its Application, 2020). Do marketing professionals working for scientific journals, science administrators, and other practitioners fully grasp what information h-indices, JIFs, and other indices can convey, and, especially, what they cannot?

  4. Research evaluations and statistical analyses serving research itself (e.g., hypothesis tests) are both often carried out by non-experts, that is, by non-bibliometricians (e.g., administrators) and researchers (e.g., applied psychologists) who are not statisticians by training. Lack of expertise may drive the quest for automatic, mechanical procedures (see also e.g., Oakes, 1990), seemingly obliterating the need for personal judgment: non-experts’ (e.g., statistical) intuitions cannot be trusted, but even a non-expert can follow simple procedures such as computing a p-value or telling which of two h-indices or JIFs is larger. As Gelman and Hennig (2017) point out, “…statistics is sometimes said to be the science of defaults: most applications of statistics are performed by non-statisticians who adapt existing general methods to their particular problems. … It is then seen as desirable that any required data analytic decisions … are performed in an objective manner…” (p. 971). We believe that the same holds true for quantitative research evaluation.

  5. Whether it is the JIF, the h-index, or the p-value, in both research evaluation and applied statistics, ‘significant numbers’ are at the core of critique and controversies (e.g., Benjamin et al., 2018; Callaway, 2016), making their massive consumption a surprising affair. One may speculate to what extent misconceptions, such as false beliefs about the information conveyed by an indicator, could contribute to sustaining its use (see e.g., Gigerenzer, 2004a, for that conjecture with respect to NHST).

To summarize, most importantly, ‘significant numbers’ seem to help base propositions on seemingly ‘objective’ grounds (see also e.g., Porter, 1993). In science evaluation, that job is done by h-indices and JIFs, which enable one to ‘objectively’ judge the research performance of scientists and institutions or the ‘quality of journals’—across fields and other elements of context, in an automatic way.

How quantifications fuel the bureaucratization of science

The quantification of science evaluation has not ended with mere efforts towards making judgments look unbiased, objective, and justifiable. The bureaucratization of science is another (equally worrisome) outcome. Modern science is frequently called post-academic. According to Ziman (2000), bureaucratization is an appropriate term to describe most of the processes in post-academic science:

“It is being hobbled by regulations about laboratory safety and informed consent, engulfed in a sea of project proposals, financial returns and interim reports, monitored for fraud and misconduct, packaged and repackaged into performance indicators, restructured and downsized by management consultants, and generally treated as if it were just another self-seeking professional group” (p. 79).

For example, principal investigators of grants from the European Research Council (ERC) “…should … be able to provide evidence for the calculation of their time involvement…”—time sheets enter academia (European Research Council, 2012, p. 33).

Words such as management, performance, regulation, accountability, and compliance previously had no place in scientific life. The vocabulary was not developed within science but was transferred from the bureaucratized society (Dahler-Larsen, 2012; Ziman, 2000). The bureaucratic virus spreads through the scientific publication process itself: nowadays, many journals require a host of (e.g., web) forms to be signed (or ticked), ranging from conflict-of-interest statements and ethical regulations to assurances that the data to be published are new and original, and copyright transfer agreements. Certain publication guidelines read like instructions one would otherwise find in tax forms in public administration, and the length of those guidelines is sometimes as overwhelming as the endless legal small print of privacy and licensing policies on internet websites. Data protection regulations add to the growing mess.

Does bureaucracy suffocate science? It has not come that far. But arguably, research in post-academic science is characterized by less freedom. Projects are framed by proposals, employment, and supervision of project staff (PhD students, post-doctoral researchers). Explorative studies—which may lead to scientific revolutions—can raise questions that are new and not rooted in the field-specific literature; such studies can, moreover, come with unconventional approaches, and lead to unforeseeable expenditures of time (Holton et al., 1996; Steinle, 2008). There is the risk that these elements are negatively assessed in grant funding and research evaluation processes, because they do not fit into ‘efficient’ project management schemes. The five most important characteristics of post-academic science evaluation can be summarized as follows (Moed & Halevi, 2015):

(1) Performance-based institutional funding In many European countries, the number of enrolled students is decreasingly relevant, and performance criteria are increasingly relevant, for allocating research funds to universities. Today, some institutions favor the exclusive use of bibliometrics or peer review to determine research performance; others favor mixed approaches combining peer review and bibliometrics (Thomas et al., 2020). The performance criteria are used for accountability purposes (Thonon et al., 2015). According to Moed and Halevi (2015), “[i]n the current … [climate] where budgets are strained and funding is difficult to secure, ongoing, diverse and wholesome assessment is of immense importance for the progression of scientific and research programs and institutions” (p. 1988).

(2) International university rankings Universities are confronted with the results of international rankings (Espeland et al., 2016). Although the rankings are heavily criticized (Hazelkorn, 2011), politicians are influenced by ranking numbers in their strategies for funding national science systems. There are even universities incentivizing behavior to influence their positions in rankings, for instance, by institutionally binding highly cited researchers from universities in other countries (Bornmann & Bauer, 2015). Clarivate annually publishes a list of highly cited researchers (https://clarivate.com/hcr/) who have authored the most papers in their disciplines belonging to the 1% most frequently cited papers. This list is constitutive of one of the best-known international rankings, the Academic Ranking of World Universities (ARWU), also known as the Shanghai Ranking. The Financial Times and The Economist undertake rankings of business schools and their programs (e.g., Master, MBA, Executive MBA).

(3) Internal research assessment systems More and more universities and funding agencies are installing research information systems to collect relevant data on research input (e.g., number of researchers) and output (e.g., publications; Biesenbender & Hornbostel, 2016). These numbers are used to continually monitor performance and efficiency. Problems emerge if those monitoring systems change researchers’ goals in an unintended way—for instance, leading them to frame a finding or theory as ‘novel’, rather than closely tying it to previous work, or to dissect ideas into short journal papers to increase the output (Gigerenzer & Marewski, 2015; Weingart, 2005): the 2013 Nobel Prize laureate in physics, Peter W. Higgs—who recently passed away—remarked to The Guardian, “Today I wouldn’t get an academic job … I don’t think I would be regarded as productive enough”; Higgs noted that he came to be “an embarrassment to the department when they did research assessment exercises”—when requested, “Please give a list of your recent publications … I would send back a statement: ‘None.’” (Aitkenhead, 2013). Or, turning to psychology, in the words of Mischel (2008):

“When the science community is working well, it doesn’t re-label, or at least it tries not to reward re-labeling. After the structure of DNA was discovered, nobody renamed and recycled it as QNA (or if they did, it was not published in Science or Nature). But in at least some areas of psychological science, excellent and honorable researchers with the best intentions inadvertently create a QNA or two, sometimes perhaps even a QNA movement.”

(4) Performance-based salaries In certain countries, salaries are sometimes connected to publishing X articles in reputable journals (e.g., Science or Nature) (Reich, 2013). In business schools, academic faculty may see a reduced teaching load, promotions, or other benefits linked to publishing in journals on the Financial Times List—the same newspaper that publishes the MBA and other rankings important to the schools’ prestige and, ultimately, to the profitability of their programs. Such practices open the doors wide to scientific misconduct (Bornmann, 2013). One is not really surprised to read that papers are bought from online brokers or that scientists pay for authorships (Hvistendahl, 2013).

(5) The use of metrics to target ambiguity in peer review processes Peer review is the main quality assurance process in science (Hug, 2022). The meaning of research quality differs between research fields, the context of evaluation, and the policy context (Langfeldt et al., 2020). Reviewers use many different criteria for making judgments in different contexts and integrate the criteria into judgments using complex rules (Hug, 2024; Hug & Aeschbach, 2020). Bibliometrics is one of these criteria frequently used in the context of peer review processes (Cruz-Castro & Sanz-Menendez, 2021; Langfeldt et al., 2021). One reason for the use of metrics is the ambiguity of the peer review process: research is evaluated against some criteria and some level of achievement. As the study by Langfeldt et al. (2021) shows, especially reviewers with high scores on metrics “...find metrics to be a good proxy for the future success of projects and candidates, and rely on metrics in their evaluation procedures despite the concerns in scientific communities on the use and misuse of publication metrics” (p. 112). The issue of choice under ‘ambiguity’ is not specific to research evaluation processes but is characteristic of areas of policy more generally (Dahler-Larsen, 2018; Manski, 2011, 2013).

Why quantifications alone are not sufficient when it comes to research evaluation

According to Wilsdon et al. (2015), today, three broad approaches are mostly used to assess research in post-academic science: the metrics-based model, which relies on quantitative measures (e.g., counts of publications, prizes, or funded projects); peer review (e.g., of journal or grant submissions); and the combination of both approaches. Quality in science, so the rationale of the peer review process goes, can only be established if research designs and results are assessed by peers (from the same or related fields). In the past decades, the quest for comparative and continuous evaluations at higher aggregation levels (e.g., institutions or countries) has fueled preferences for the metrics-based model. Those preferences are also triggered by the overload of the peer review system: the demand for participation in peer review processes exceeds the supply.

In the metrics-based model of research evaluation, bibliometrics has a prominent position (Schatz, 2014; Vinkler, 2010). According to Wildgaard et al. (2014), “[a] researcher’s reputational status or ‘symbolic capital’ is to a large extent derived from his or her ‘publication performance’” (p. 126). Bibliometric information is available in large databases (e.g., Web of Science, Clarivate, or Scopus, Elsevier) and can be used in many disciplines and at different aggregation levels (e.g., single papers, researchers, research groups, institutions, or countries). Whereas the number of publications is used as an indicator of output, the number of citations is relied upon as a proxy for quality.

However, the metrics-based model has several pitfalls (Hicks et al., 2015). Five of those problems stem from the ways in which quantifications are used (Bornmann, 2017).

(1) Skew in bibliometric data Bibliometric data tend to be right-skewed, with only a few highly cited publications and many publications with few or zero citations (Seglen, 1992). There is a tendency for citations to concentrate on a relatively small stratum of publications. Citations are over-dispersed count data (Ajiferuke & Famoye, 2015). Hence, simple arithmetic means—as they are built into mean citation rates or JIFs—should be avoided as a measure of central tendency (Glänzel & Moed, 2013); the sketch below illustrates why.
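
To make the point concrete, here is a minimal sketch in Python; the citation counts are invented for illustration and do not come from any real dataset.

```python
import statistics

# Invented citation counts for a hypothetical journal's papers: many papers
# with few or zero citations plus a couple of highly cited outliers
# (a right-skewed distribution, as is typical of bibliometric data).
citations = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 7, 9, 12, 48, 350]

mean = statistics.mean(citations)      # pulled upward by the two outliers
median = statistics.median(citations)  # closer to the 'typical' paper

print(f"mean citations:   {mean:.1f}")    # 29.6
print(f"median citations: {median:.1f}")  # 3.0
```

A JIF-style average would portray this hypothetical journal as highly cited, although the typical paper in the set gathers only a handful of citations.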

(2) Variability in bibliometric data In line with the ideals of universalism and automatism, the results of bibliometric studies are typically published as if they were independent of context or otherwise invariant (Waltman & van Eck, 2016). However, the results of bibliometric studies on the same unit can vary between different samples (e.g., from different publication periods or literature databases).

(3) Time- and field-dependencies of bibliometric data Many bibliometric studies are based on bare citation counts, although these numbers cannot be used for cross-field and cross-time comparisons (of researchers or universities). Different publication and citation cultures lead to different average citation rates in the fields—independently of the quality of the publications; the sketch below illustrates the basic idea behind field normalization.
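
As an illustration of why bare counts mislead across fields, here is a minimal sketch of field normalization in the spirit of commonly used normalized citation scores; the field baselines, the function name, and the example numbers are invented, and real baselines would be computed from a citation database for a given field and publication year.

```python
# Invented average citation rates per field (for one publication year).
field_baseline = {
    "mathematics": 4.0,
    "cell biology": 30.0,
}

def normalized_citation_score(citations: int, field: str) -> float:
    """Citations relative to the field's expected citation rate (1.0 = field average)."""
    return citations / field_baseline[field]

# The same raw count of 10 citations means very different things across fields:
print(normalized_citation_score(10, "mathematics"))   # 2.5   -> well above the field average
print(normalized_citation_score(10, "cell biology"))  # ~0.33 -> below the field average
```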

(4) Language effect in bibliometric data In bibliometric databases, English publications dominate. Since English is the most frequently used language in science communication, the prevalence of English publications comes as no real surprise. However, the prevalence can influence research evaluation in practice. For example, there is a language effect in citation-based measurements of university rankings, which discriminates, particularly, against German and French institutions (van Raan et al., 2011). Publications not written in English receive—as a rule—fewer citations than publications in English.

(5) Missing and/or incomplete databases in certain disciplines Bibliometric analyses can be poorly applied in certain disciplines (e.g., social sciences, humanities, or computer science). The most important reason is that the literature from these disciplines is insufficiently covered in the major citation databases which focus on international journal publications (Marx & Bornmann, 2015). In other words, “[b]ibliometric assessment of research performance is based on one central [but possibly false] assumption: scientists, who have to say something important, do publish their findings vigorously in the open, international journal literature” (van Raan, 2008, p. 463).

Three additional problems with bibliometric indicators concern their purpose, how they are used, and what information the numbers can convey. Those problems are of a more general nature.

(1) Poorly understood indicators As Cohen (1990) points out with respect to hypothesis testing in psychology, “Mesmerized by a single all-purpose, mechanized, ‘objective’ ritual in which we convert numbers into other numbers and get a yes–no answer, we have come to neglect close scrutiny of where the numbers came from” (p. 1310). Many of those relying on bibliometric indicators do not seem to know where the numbers come from and/or what they really mean either. Today, the JIF is a widely used indicator to infer ‘the impact’ of single publications by a researcher. However, the indicator was originally developed to decide on the importance of holding journals in libraries. Paralleling how the historical roots and purposes of Fisher’s and Neyman and Pearson’s respective statistical frameworks and their bitter controversy are buried in the current “hybrid” (Gigerenzer, 2018, p. 202) practice of seemingly universal and automatic NHST, the JIF was applied, with little conceptual foundation, to new judgment tasks. These tasks are inferences about the quality or relevance of scientific output, made mindlessly across contexts and people. Similarly problematic, the h-index combines the output and citation impact of a researcher in a single number. However, with h papers having at least h citations each, the formula for combining both metrics is arbitrarily chosen: h² citations or 2h citations could have been used as well (see Waltman & van Eck, 2012; a sketch below makes this concrete); just as p < 0.06 or < 0.03 could become a convention instead of 0.05 and 0.01. Rosnow and Rosenthal (1989) put it like this: “...God loves the 0.06 nearly as much as the 0.05” (p. 1277), to which Cohen (1990) added “...amen!” (p. 1311).
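
To make the arbitrariness of the convention tangible, here is a minimal sketch that computes the h-index from a list of citation counts and then swaps in the two alternative thresholds mentioned above (h² and 2h citations); the citation counts are invented.

```python
def h_index(citations, threshold=lambda h: h):
    """Largest h such that h papers each have at least threshold(h) citations.

    threshold(h) = h gives Hirsch's h-index; h**2 or 2*h are equally
    conceivable conventions (see Waltman & van Eck, 2012).
    """
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= threshold(rank):
            h = rank
        else:
            break  # thresholds only grow with rank, so later papers fail too
    return h

papers = [45, 30, 22, 15, 11, 9, 7, 4, 2, 1, 0]  # invented citation counts

print(h_index(papers))                   # 7 -> classic h-index
print(h_index(papers, lambda h: h * h))  # 3 -> 'h papers with at least h^2 citations'
print(h_index(papers, lambda h: 2 * h))  # 5 -> 'h papers with at least 2h citations'
```

The ‘same’ researcher thus scores 7, 3, or 5, depending solely on which of the three equally conceivable conventions one adopts.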

(2) Amateurs playing experts A physician once remarked, exasperated, to one of us that, nowadays, fueled by digital media, certain of his patients pretend to be expert doctors—albeit without caring to know what they don’t know. Until the end of the twentieth century, bibliometric analyses were frequently conducted by expert bibliometricians who knew about the typical problems with bibliometric studies, alongside possible solutions. Since then, “Desktop Scientometrics” (Katz & Hicks, 1997, p. 142) has become more and more popular. Here, research managers, administrators, and scientists from fields other than bibliometrics use “…bibliometric data in a quick, unreliable manner…” (Moed & Halevi, 2015, p. 1989). Digitalized bibliometric applications are available (e.g., InCites or SciVal) that provide ready-to-use bibliometric results, foregoing available expertise and scrutiny from professional bibliometricians (Leydesdorff et al., 2016). As Retzer and Jurasinski (2009) point out—rightly—“…a review of a scientist’s performance based on citation analysis should always be accompanied by a critical evaluation of the analysis itself” (p. 394). Bibliometric applications like SciVal or InCites can deliver bibliometric results, but they cannot replace expert judgment—much like off-the-shelf statistical software can deliver p-values and other statistics automatically, but not judgment.

(3) Impact is not equal to quality Gigerenzer (e.g., 2018) and others (e.g., Cohen, 1994; Oakes, 1990) have touched academic researchers’ sore spots when it comes to wishful but false beliefs about statistical hypothesis tests. In research evaluation, wishful thinking takes place when impact is equated with quality. A citation-based indicator might capture (some) aspects of quality but is not able to accurately measure quality. Indicators are “…largely built on sand” (Macilwain, 2013, p. 255), in the view of some. According to Martin and Irvine (1983), citations reflect scientific impact as just one aspect of quality: correctness and importance are others. Applicability and real-world relevance are further aspects scarcely reflected in citations. Moreover, groundbreaking findings—those, indeed, leading to scientific revolutions—are not necessarily highly cited ones. For example, Marx and Bornmann (2010, 2013) bibliometrically analyzed publications that were decisive in revolutionizing our thinking. They analyzed those (1) that replaced the static view with Big Bang theory in cosmology, or (2) that dispensed with the prevailing fixist point of view (fixism) in favor of a dynamic view of the Earth where the continents move through the Earth’s crust. As those bibliometric analyses show, several publications that propelled the transition from one theory to another are lowly cited.

What is more, in all areas of science, important publications might be recognized as such only many years after publication—these articles have been named “sleeping beauties” (van Raan, 2004, p. 467), only that—unlike in fairy tales—nobody starts to kiss them. The Shockley-Queisser paper (Shockley & Queisser, 1961)—describing the limited efficiency of solar cells based on absorption and reemission processes—is one such sleeping beauty (Marx, 2014). Within the first 40 years after it appeared, this groundbreaking paper was hardly cited. Even worse, sometimes papers that are highly cited perpetuate factual mistakes, misconceptions, or misunderstandings contained in them. For instance, a highly cited paper by Preacher and Hayes (2008) recommends using a certain statistical procedure to test mediation. This procedure is widely used in the disciplines of psychology and management; however, the procedure produces biased statistical estimates: it ignores a key assumption made by the estimator (i.e., that the mediator is not endogenous; see Antonakis et al., 2010; Kline, 2015).

Science evaluation from a statistical point of view: universal and automatic classifiers do not exist

Together with his friend and colleague Allen Newell, the future Nobel laureate Herbert Simon wrote a visionary paper in 1958 (Simon & Newell, 1958). In that paper, the two (see also Simon, 1947/1997; 1973) introduced a distinction between two types of problems decision-makers face. Their distinction was grounded in what was, at the time, an emerging technology for research—a technology that has since become, in addition, an indispensable tool for many quantification processes as well as a metaphor of the workings of the human mind itself (Gigerenzer, 2002a, Chapter 2): the computer. Specifically, Simon and Newell (1958) distinguished between ill-structured and well-structured problems. According to them:

“A problem is well structured to the extent that it satisfies the following criteria:

  1. It can be described in terms of numerical variables, scalar and vector quantities.

  2. The goals to be attained can be specified in terms of a well-defined objective function—for example, the maximization of profit or the minimization of cost.

  3. There exist computational routines (algorithms) that permit the solution to be found and stated in actual numerical terms…

In short, well-structured problems are those that can be formulated explicitly and quantitatively, and that can then be solved by known and feasible computational techniques” (Simon & Newell, 1958, pp. 4–5).

Can the problem of recognizing quality science be conceived of as being well-structured? The difficulty of recognizing quality science is that—statistically speaking—judgments about the quality of research (e.g., of people or institutions) represent classifications (Bornmann & Marewski, 2019).Footnote 11 For instance, the various indicators (e.g., JIFs or h-indices) alluded to above can be thought of as numerical predictor variables to be used in the classification of research output. Yet, regardless of whether it comes to science evaluation or other classification tasks (e.g., medical diagnosis or credit scoring), no classifier will always yield totally accurate results. Instead, false positives (giving ‘poor research’ laudatory evaluations) and false negatives (giving ‘quality research’ disapproving evaluations) will occur (Bornmann & Daniel, 2010). That is, mistakes are inevitable. In hypothesis testing, such mistakes are also known as type I and type II errors.

The fact that mistakes can occur, however, does not necessarily mean that a problem is not well-structured. Rather, to construct classifiers and assess what level of accuracy they can attain, in areas other than science evaluation, one tests their performance empirically out of sample (e.g., in cross-validations), with the testing of different classifiers against each other permitting one to identify what might be the best one, given a learning and test sample, and a precise performance criterion. Performance can be measured in terms of classification accuracy, or in terms of the costs and benefits that come with making correct (i.e., correct-positive, correct-negative) and incorrect (i.e., false-positive, false-negative) classifications. Granted, differences between calibration data and validation data may introduce some weak, or—in the case of predictions out-of-population rather than out-of-sample—stronger uncertaintyFootnote 12 (see also Marewski & Hoffrage, 2021, 2024). Yet as long as one stays within the realm of clearly defined populations, the problem of identifying a classifier that produces the most accurate and/or best cost–benefit ratio seems reasonably well-structured (for a sketch of what that can look like, see below).
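
As a purely illustrative sketch of such a well-structured setting, consider the following Python snippet: the data are invented, the threshold classifier is deliberately simplistic, and the assumption that a false negative costs five times as much as a false positive is arbitrary; nothing in the sketch is specific to, or claimed to be feasible for, research evaluation.

```python
import random

random.seed(1)

# Invented data for a well-structured classification task (say, loan default:
# 1 = default, 0 = repaid), with a single numerical predictor per case.
cases = [(random.gauss(2.0, 1.0), 1) for _ in range(200)] + \
        [(random.gauss(0.0, 1.0), 0) for _ in range(800)]
random.shuffle(cases)
train, test = cases[:700], cases[700:]  # learning sample vs. held-out test sample

def make_threshold_classifier(threshold):
    """Predict 'positive' (1) whenever the predictor exceeds the threshold."""
    return lambda x: 1 if x >= threshold else 0

def mean_cost(classifier, data, cost_fp=1.0, cost_fn=5.0):
    """Average misclassification cost; the 1:5 cost ratio is an arbitrary assumption."""
    total = 0.0
    for x, y in data:
        pred = classifier(x)
        if pred == 1 and y == 0:
            total += cost_fp  # false positive
        elif pred == 0 and y == 1:
            total += cost_fn  # false negative
    return total / len(data)

# Choose the threshold with the lowest cost on the learning sample, then
# report its performance out of sample, on the held-out test data.
candidates = (0.0, 0.5, 1.0, 1.5, 2.0)
best = min(candidates, key=lambda t: mean_cost(make_threshold_classifier(t), train))
print("chosen threshold:", best)
print("out-of-sample cost per case:", round(mean_cost(make_threshold_classifier(best), test), 3))
```

The point of the contrast drawn in the text is that here a criterion variable (default vs. repaid) and a cost structure can at least be stipulated and measured out of sample, whereas in research evaluation, as argued next, neither is readily available.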

In research evaluation, taking that approach is hardly feasible—conceptually, research evaluation does not present itself as a well-structured problem:

  (1) To determine the accuracy of classifications (i.e., of the judgements), one needs at least one meaningful criterion variable, which scarcely exists in research evaluation. That is, there is no valid yardstick telling one for sure how good research really is. Instead, the same indicators that could be used as predictor variables in classifiers must interchangeably serve as the criterion defining, seemingly objectively, what counts as quality. In certain other areas, in contrast, it is relatively straightforward to establish self-standing and meaningful outside criteria. A cancer or an unpaid loan (i.e., a debt) might be present or not. If it is present and the classifier predicts that state to be present, one has a correct positive. If it is not present and the classifier predicts the state to be present, one has a false positive, and so on.

  (2) That said, one may take the view that citations and other numbers are valid criteria. For instance, just as one may be able to use past cancers or debts to classify individuals with respect to how likely they are to develop future cancers or debts, respectively, one could use past citations as a predictor variable for making best guesses about future citations. One may, furthermore, accept the obvious: namely, that the specific set of predictor variables used, how one combines them (i.e., the functional form), and the population at hand will shape classification accuracy. However, when it comes to research evaluation, the criterion one ought to be interested in is not just (likely unmeasurable) classification accuracy, but the costs and benefits associated with correct positives, correct negatives, false negatives, and false positives. What is the price to pay (e.g., by society) if just one brilliant, groundbreaking piece of work in cancer research is not recognized as such (a false negative) or, conversely, if a million papers with weak theory and non-replicable findings are classified, false-positively, as ‘commendable’ (e.g., simply because they received some self-citations)? What is more, counterfactuals may be hard, if not impossible, to observe (e.g., what would the world look like today if major discoveries X, Y, and Z had gained attention?). In science evaluation and many other classification tasks, the real costs and benefits of classifications are hard to estimate or fully unknowable. And even if one worked with fully hypothetical cost–benefit structures, different stakeholders would place more or less importance on different costs and benefits. For example, different scientists, evaluators, or politicians might have diverging agendas when it comes to evaluating the research of colleagues or institutions. Assumptions about cost–benefit structures will also vary across contexts (e.g., a false negative in cancer research is not the same as one in psychology). Finally, time may matter. For instance, as March (1991) points out, discussing trade-offs between exploration and exploitation in organizational learning:

“The certainty, speed, proximity, and clarity of feedback ties exploitation to its consequences more quickly and more precisely than is the case with exploration.… Basic research has less certain outcomes, longer time horizons, and more diffuse effects than does product development. The search for new ideas, markets, or relations has less certain outcomes, longer time horizons, and more diffuse effects than does further development of existing ones” (p. 73).

In short, even if one tries to treat science evaluation as a well-structured problem, it becomes clear that this problem comes with massive uncertainties that go beyond those due to making predictions out of sample or out of population. Before defining more precisely what the term uncertainty actually means, let us now treat science evaluation as an ill-structured problem. Said Simon and Newell (1958):

“Problems are ill-structured when they are not well-structured. In some cases, for example, the essential variables are not numerical at all, but symbolic or verbal. An executive who is drafting a sick-leave policy is searching for words, not numbers. Second, there are many important situations in everyday life where the objective function, the goal, is vague and nonquantitative. How, for example, do we evaluate the quality of an educational system or the effectiveness of a public relations department? Third, there are many practical problems – it would be accurate to say ‘most practical problems’ – for which computational algorithms simply are not available” (p. 5).

That problem description seems, in our view, a better fit for most research-evaluation tasks. But what, then, are tools for tackling ill-structured problems? Simon and Newell (1958) believed that, in particular, simple problem-solving tools called heuristics would permit tackling such problems. They had started to implement heuristics “...similar to those that have been observed in human problem solving activity” (Newell & Simon, 1956, p. 1) in a computer program that modeled scientific discoveries (“proofs for theorems in symbolic logic”; Newell & Simon, 1956, p. 1): the logic theory machine or logic theorist (Newell et al., 1958). As Newell et al. (1958) commented, the “[logic theorist’s] success does not depend on the ‘brute force’ use of a computer's speed, but on the use of heuristic processes like those employed by humans” (p. 156). While the program itself can be considered foundational for modern-day AI and computational cognitive psychology, it also reflects Simon’s research program on bounded rationality (e.g., Simon, 1955a, 1956) and foreshadows a vast body of later neo-Simonian research on heuristics that emerged in Simon’s footsteps (see Gigerenzer, 2002a, Chapter 2; Marewski & Hoffrage, 2024).Footnote 13 Heuristics are models of how ordinary people—who face limits in the knowledge available to them, in information-processing capacity, and in time—make judgments and decisions.

At the close of this essay, let us take a look at how heuristics may represent a key for opening the door for good judgment and decision-making in science evaluation.

What is the remedy?

Before we start, another comment is warranted. In the introduction to this essay, we wrote that quantifications have been turned into the new opium of the people. We did not write that they are opium. Opium is a dangerous drug; quantifications are not drugs, but they can cloud one’s thinking like drugs. And like many drugs, quantifications can not only do harm but also, when used insightfully and diligently, be extremely beneficial.

So let us be clear: in what follows, we do not advocate getting rid of quantifications in judgement. From experimentation to computer simulation, for countless judgments, quantifications are absolutely indispensable. What we do advocate is that those who invoke bibliometric numbers to assess the ‘quality’ of research, people, or institutions do not pretend, at the same time, to also avoid judgment and the uncertainties such judgments entail, particularly when it comes to ill-structured problems. Likewise, we advocate that those who dare to rely on their judgment are not automatically forced to work with some form of quantitative tool (e.g., statistics serving as ‘facts’) even when doing so does not make much sense.

A toolbox for handling uncertainty may aid good judgment

To take a look at how heuristics may represent a key for opening the door for good judgment in science evaluation, let us return to our initial idea of conceptualizing such evaluation as a well-structured problem. As we have pointed out, with each change in the assumed cost–benefit structure, a classifier’s performance can change. As a consequence, there will be no universal and automatic classifier for such problems. In that regard, science evaluation is by no means special: Benjamin et al. (2018), for instance, point out “…that the significance threshold selected for claiming a new discovery should depend on … the relative cost of type I versus type II errors, and other factors that vary by research topic” (p. 8).

This non-universality parallels what Gigerenzer (2004a), in writing about NHST, stresses for inductive inference more generally: “There is no uniformly most powerful test…” (p. 604). And in his “Statistical methods and scientific inference”, first published in 1956, Fisher (1990b) remarked:

“The concept that the scientific worker can regard himself as an inert item in a vast co-operative concern working according to accepted rules, is encouraged by directing attention away from his duty to form correct scientific conclusions, to summarize them and to communicate them to his scientific colleagues, and by stressing his supposed duty mechanically to make a succession of automatic ‘decisions’, deriving spurious authority from the very incomplete mathematics of the Theory of Decision Functions… the Natural Sciences can only be successfully conducted by responsible and independent thinkers applying their minds and their imaginations to the detailed interpretation of verifiable observations. The idea that this responsibility can be delegated to a giant computer programmed with Decision Functions belongs to the phantasy of circles rather remote from scientific research” (pp. 104–105).Footnote 14

But there are more parallels between statistics and research evaluation. To speak with Simon:

“… I have experienced more frustration from the statistical issue [tests of significance] than from almost any other problem I have encountered in my scientific career. (There is a possible competitor – the reaction of economists to suggestions that human beings may be not global optimizers…) To be accurate, the frustration lies not in the statistical issue itself, but in the stubbornness with which psychologists hold to a misapplication of statistical methodology that is periodically and consistently denounced by mathematical statisticians” (Simon, 1979, p. 261; the text in parentheses represents a footnote in the original).

What significance testing is for the psychologists Simon had in mind, citation rates may be for research evaluators: certain administrators concerned with science evaluation, namely those obsessed with some kind of evaluative number-crunching procedure. Following the old ideals of universalism and automatism, many treat bibliometric results as if they could be used as ‘classifiers’ that were informative in all situations, and as if their classifications were independent of how different people used them.

Importantly, those administrators do not only seem to handle science evaluation and other judgment problems as if there were just one type of universal tool available (e.g., h-indices for all assessments of researchers). Also, the idea that simple, common-sense judgments could outwit detailed, seemingly rational analysis may seem counterintuitive to them. Rational decision-making warrants full information: searching for all ‘facts’ and integrating them is the best approach to judgment, so the logic goes. Here we may see a reflection of the ideals of ‘rational, analytic decision making’, captured by optimization models (e.g., expected-utility maximization) as model of, tool for, and norm for rational decision-making in parts of business, economics, psychology, and even biology (e.g., Becker, 1976; Stephens & Krebs, 1986).

Yet, all ‘fact’-considerations notwithstanding, do we know which technology will be invented tomorrow? Is it knowable whether the scientist with an h-index of X will, five, ten, or fifteen years from now, make the discovery that revolutionizes the field? Science evaluation and many other judgment tasks we face in our lives do not resemble gambles where all information—all possible decisional options, their consequences, and the probability that each consequence occurs—can, in theory, be assessed and used to calculate best bets—‘rational expectations’ of sorts. Instead, real-world judgment is characterized by uncertainty. In neo-Simonian research on heuristics, the term uncertainty has come to be employed to refer to such situations where the unexpected can happen and “...where not all alternatives, consequences, and probabilities are known, and/or where the available information is not sufficient to estimate these reliably” (Hafenbrädl et al., 2016, p. 217).Footnote 15

Under uncertainty, so the fast-and-frugal heuristics framework (Gigerenzer, Todd, & the ABC Research Group, 1999) posits, people can make accurate classifications and other judgments because they can adaptively draw from a toolbox of heuristics as a function of their goals and context. That is, clashing with the ideals of automatism and universalism, under uncertainty the performance of a tool is neither independent of the individual using it, nor is any tool useful universally. What is more, clashing with the pretense of optimization, computer-simulation work on heuristics has shown that computational models of heuristics can match or outperform information-greedy statistical tools even in well-structured problems featuring uncertainty due to differences between calibration and validation data (e.g., Czerlinski et al., 1999; Gigerenzer & Brighton, 2009; Katsikopoulos et al., 2020). Many of those information-greedy tools optimize in one way or another: in regressions, for instance, a form of optimization is the computation of beta weights that minimize the distance between the predictions made and the data. A key insight from such work is that knowing when to rely on which tool is at the heart of good decision-making.Footnote 16
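The following minimal sketch, written in Python with synthetic data, mimics the kind of comparison such simulation studies perform: an equal-weight tallying rule is pitted against a linear regression, with both fit on a small learning sample and evaluated on new data. It is not a re-analysis of the cited studies; the environment, the sample sizes, and the noise level are all assumptions chosen purely for illustration, and depending on them either tool may come out ahead.

```python
# Minimal sketch, not a re-analysis of the cited studies: it merely mimics,
# with synthetic data, the kind of out-of-sample comparison between a simple
# unit-weight (tallying) rule and an information-greedy linear regression.
import numpy as np

rng = np.random.default_rng(0)

n_cues, n_train, n_test = 5, 20, 1000   # small learning sample, large test set

true_w = rng.uniform(0.5, 1.5, n_cues)  # hypothetical 'environment'

def make_data(n):
    X = rng.normal(size=(n, n_cues))
    y = X @ true_w + rng.normal(scale=3.0, size=n)   # noisy criterion
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Regression: beta weights minimizing squared distance to the training data.
X1 = np.column_stack([X_train, np.ones(n_train)])
beta, *_ = np.linalg.lstsq(X1, y_train, rcond=None)
pred_reg = np.column_stack([X_test, np.ones(n_test)]) @ beta

# Tallying: ignore the weights and just add the cues, with signs estimated
# from the training correlations -- an equal-weight heuristic.
signs = np.sign([np.corrcoef(X_train[:, j], y_train)[0, 1] for j in range(n_cues)])
pred_tally = X_test @ signs

def out_of_sample_r(pred, y):
    return np.corrcoef(pred, y)[0, 1]

print("regression r =", round(out_of_sample_r(pred_reg, y_test), 3))
print("tallying   r =", round(out_of_sample_r(pred_tally, y_test), 3))
```

Varying the assumed training-sample size or the noise level in such a sketch is one way to see for oneself when the frugal rule catches up with, or falls behind, the optimizing one.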

We think that this simple insight could be key to improving science evaluation. If one thinks of science evaluation in terms of judgments under uncertainty, and if one furthermore understands that many problems of science evaluation—but possibly not all—belong more to the ill-structured than to the well-structured end of the spectrum, then the consequence may be that one needs to be prepared to choose from a repertoire of different tools. That toolbox may include indicator-based as well as non-indicator-based tools. In line with Simon and Newell’s (1958) vision, some of those tools may, moreover, be cast as computational models of heuristics, ready to be implemented in computer programs (see e.g., Bornmann et al., 2022; Bornmann, 2020, for ideas). Elsewhere, we have referred to such tools as bibliometric-based heuristics (BBH) and suggested a corresponding research program (Bornmann & Marewski, 2019). Yet we do believe that those tools are not the only ones. Others may come as qualitative, common-sensical rules of thumb instead (Katsikopoulos et al., 2024; Marewski et al., 2024).

To illustrate our point, imagine a large pile of grayish-colored human skulls, with their dark, empty eye-sockets staring at you, and their black empty mouths motionlessly exhibiting their rotten teeth, all in silence. There is a sign attached to those skulls, with just two sentences written on it: “What you are now, we have been. What we are now, you will turn into”.Footnote 17 Life expectancy and death (e.g., mortality rates) can be quantified, but numbers may not be able to capture the sentiments and the understanding that arise in you as your (inner) gaze switches back and forth between the words and the skulls. It is also not clear whether a computer will ever be able to do so. However, to you as a human, the bones and the words will mean something (and a human may then, in consequence, decide to dare to live her/his life, making use of her/his judgment—or to take opium as a fear-alleviating drug instead).

So what, exactly, do we mean by qualitative rules of thumb? Newell had been introduced to heuristics as “...an undergraduate physics major at Stanford...” (Simon, 1996, p. 199), where he had taken courses with the mathematician Pólya. In his book “How to solve it”, Pólya (1945) conceived of heuristics in terms of qualitative guiding principles and thought of proverbs—such as “Who understands ill, answers ill” (p. 222) or “He thinks not well that thinks not again. Second thoughts are best” (p. 224)—as heuristics for mathematical problem-solving. Those two, for example, so Pólya (1945) points out, prescribe that “The very first thing we must do for our problem is to understand it” (p. 222) and that “Looking back at the completed solution is an important and instructive phase of the work” (p. 224), respectively.

While those proverbs seem applicable as heuristics that could guide the actions of peer reviewers (e.g., As a first step, make sure you understand the paper you review!; Once done with the review, re-review your own review!) and larger-scale science evaluators (e.g., As a first step, understand the discipline and its challenges!), other common-sense rules of thumb could also help. An example of one such rule is to read and reflect upon the research one evaluates: Only use citation-based indicators alongside expert judgements of papers! A result of that simple rule of thumb would be that those evaluating others would have to be experts in the same areas. Generalists can, at best, only judge the quality of work from the outside, by relying on citation and publication numbers or other seemingly universal indicators.

Expertise in a research field and expertise in bibliometrics aid good judgment

The toolbox view on judgment in research evaluation has its equivalent in the toolbox view on statistics: rather than pretending that the statistical toolbox contains just one type of procedure for making statistical inferences (e.g., NHST), statistics is best conceived of in terms of a repertoire of tools (e.g., Fisher’s null hypothesis testing, Bayes’ rule, Neyman and Pearson’s framework) for different situations. Good human judgment is needed to discern when to rely on which tool: as Gigerenzer (2004a) puts it bluntly, “[j]udgment is part of the art of statistics” (p. 604), and “[p]racticing statisticians rely on … their expertise to select a proper tool…” (Gigerenzer & Marewski, 2015, p. 422). For instance, despite all p-value bashing, Neyman-Pearson hypothesis testing can serve model selection, and the bias-variance dilemma can help one understand the performance of this and other quantitative tools (e.g., the Bayesian information criterion, the Akaike information criterion) (Forster, 2000). But tests of statistical significance may nevertheless not be suitable when it comes to “...evaluating the fit of computer programs to data...”, as Simon (1992, p. 159) remarked.
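As a small illustration of that repertoire idea, the following Python sketch fits two candidate models of differing flexibility to the same synthetic data and compares them with two of the tools mentioned above, the Akaike and Bayesian information criteria, using standard Gaussian least-squares formulas. The data, the candidate models, and the noise level are all invented; the point is only that different tools from the statistical toolbox can be brought to bear on the same model-selection question.

```python
# Toy illustration of model selection with AIC and BIC on synthetic data.
# The 'true' data-generating process, the candidate models, and the noise
# level are assumptions made purely for this example.
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.uniform(-2, 2, n)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=n)   # assumed linear process

def fit_polynomial(degree):
    """Least-squares polynomial fit; returns residual sum of squares and #coefficients."""
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return rss, degree + 1

for degree in (1, 5):
    rss, k = fit_polynomial(degree)
    # Common Gaussian approximations (constants dropped); k + 1 parameters
    # counting the error variance. Lower values indicate the preferred model.
    aic = n * np.log(rss / n) + 2 * (k + 1)
    bic = n * np.log(rss / n) + np.log(n) * (k + 1)
    print(f"degree {degree}:  AIC={aic:.1f}  BIC={bic:.1f}")
```

With a small sample like this, both criteria typically penalize the more flexible polynomial; which criterion one trusts, and when, is again a matter of judgment about the situation at hand.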

In our view, at least two kinds of expertise can aid good judgment if metrics are used by administrators. First, in line with the simple rule of thumb to only use bibliometric indicators alongside expert judgements of papers, when conducting a bibliometric evaluation of a scientist, a journal, or an institution, judgments should be made within the field, and not by people outside of that area of research. Judgments made by a field’s experts are necessary to choose the appropriate database for the bibliometric analysis. In certain areas of research, multi-disciplinary databases such as Scopus do not cover the corresponding literature, which is why field-specific databases such as Chemical Abstracts (provided by Chemical Abstracts Service) for chemistry and related areas should be selected. Experts are also necessary to interpret the numbers in a bibliometric report and to place them in the institutional and field-specific context. This is what the notion of informed peer review is about: “…the idea [is] that the judicious application of specific bibliometric data and indicators may inform the process of peer review, depending on the exact goal and context of the assessment. Informed peer review is in principle relevant for all types of peer review and at all levels of aggregation” (Wilsdon et al., 2015, p. 64).

Second, expertise in a field ought to be combined with expertise in bibliometrics. Measurements involve the careful selection of dimensions within a property space (Bailey, 1972). It is clear that a non-physician or a non-pilot should not attempt to diagnose patients or fly airplanes, even if convenient diagnosis tools are sold on the internet or flight simulation software is readily available (Himanen et al., 2024). We think it should similarly be unimaginable that staff with insufficient bibliometric expertise (e.g., administrators) are put in positions where those non-experts must assess units or scientists—even if bibliometric platforms seemingly make those tasks as simple as computing p-values with statistical software. Either professional bibliometricians should be involved in research evaluations, or the people involved in evaluations should be trained in bibliometrics. The trained staff or the bibliometric experts should then be provided with basic information alongside a bibliometric analysis, such as the data sources, definitions of the indicators, and reasons for the selection of a specific indicator set.

Professional bibliometricians not only have access to a repertoire of different databases and indicators and are used to choosing among their tools; they may also advise a client against a bibliometric report in cases where bibliometrics can scarcely be applied (e.g., in the humanities), or point to other problems alongside possible solutions. Hicks et al. (2015) formulated 10 principles—the Leiden Manifesto—guiding experts in the field of scientometrics (see also Bornmann & Haunschild, 2016). For example, performance should be measured against the research missions of the institution, group, or researcher (principle no. 2), and the variation by field in publication and citation practices should be considered by using field-normalized citation scores in cross-field evaluations (principle no. 6). Thus, if a report is commissioned from bibliometricians, the responsible administrator should try to understand the report by discussing it with the bibliometricians and the experts in the field concerned: Seek understanding, rather than only experts’ rubberstamps.

Expertise in statistics (and their history) can aid good judgment

Bringing good judgment into science evaluation calls for more than just expertise in bibliometrics and in the domain of research. Any individual involved with research evaluation should have basic knowledge of statistics, at a level that is typical of social-science research in economics or psychology. It is the comprehension of quantifications (1) that will help administrators to discuss bibliometric reports with the bibliometricians and the scientists from the field, and (2) that will aid administrators in seeing a bibliometric report’s limitations. Similarly, it is statistical knowledge that allows one to play the devil’s advocate on quantifications compiled by others or by oneself. Ideally, science evaluators and scientists know how to program computer simulations to scrutinize their own judgments and classifications. Just imagine what would happen if everybody followed the simple rule of thumb to only rely on numbers (e.g., Bayes factors, p-values, bibliometric indicators) and quantitative models (e.g., regression analysis, classification trees) they are able to fully calculate and program, respectively—from scratch themselves, that is, without any help of off-the-shelf software? Likely, there would be all five: more reading, more thinking, more informed discussions about contents, less uninformed abuse of quantifications, and more experts in bibliometric methods and statistics.

In line with that rule of thumb, scientists and science evaluators ought to be familiar with basic principles of statistical reasoning under uncertainty. One such principle is to test one’s predictions on data differing from the data that served to develop the predictions (that is, out of sample or out of population; Gigerenzer, 2004a; for more explanations, see e.g., Roberts & Pashler, 2000). To give another example, evaluators should be familiar with different techniques of exploring, representing, visualizing, and communicating data and results: Look at the very same numbers presented in various meaningful ways would be the guiding rule of thumb here. Changing the representation format can aid understanding. An example is the powerful heuristic of converting probabilities into natural frequencies (e.g., Hoffrage et al., 2000), which has been found to aid judgment in different fields (e.g., medicine).
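To show what this change of representation format amounts to computationally, here is a short, self-contained Python illustration with made-up, textbook-style numbers (a 1% base rate, a 90% hit rate, a 9% false-alarm rate): the same diagnostic information is expressed once as conditional probabilities fed into Bayes’ rule and once as natural frequencies, that is, as counts out of 1,000 people.

```python
# A small, self-contained illustration (with made-up but textbook-style
# numbers) of the natural-frequency heuristic: the same information is shown
# once as conditional probabilities and once as natural frequencies.
base_rate = 0.01        # share of cases with the condition (assumed)
sensitivity = 0.90      # P(positive test | condition)      (assumed)
false_alarm = 0.09      # P(positive test | no condition)   (assumed)

# Probability format (Bayes' rule):
ppv = (sensitivity * base_rate) / (
    sensitivity * base_rate + false_alarm * (1 - base_rate))
print(f"P(condition | positive) = {ppv:.2f}")

# Natural-frequency format: translate everything into counts per 1,000 people.
n = 1000
with_condition = round(n * base_rate)                         # 10 people
true_positives = round(with_condition * sensitivity)          # 9 test positive
false_positives = round((n - with_condition) * false_alarm)   # ~89 false alarms
print(f"Of {true_positives + false_positives} people who test positive, "
      f"{true_positives} actually have the condition "
      f"({true_positives / (true_positives + false_positives):.2f}).")
```

In the frequency format the answer can be read off by comparing two counts, which is precisely what makes the representation easier to grasp.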

In research on heuristics and statistics, repertoires of different judgmental and statistical tools are not simply invented. Instead, the performance of different tools is investigated in extensive mathematical analyses and/or computer simulations (see e.g., Gigerenzer et al., 1999). By analogy, science evaluators ought to be sufficiently familiar with data analysis techniques to be able to evaluate their own tools for science evaluation. For instance, evaluators ought to understand the parallels between science evaluations and any classification problem, as described above. Correspondingly, they ought to know how to develop and test different classifiers out of sample.

But non-quantitative expertise on statistics and other forms of quantification matters, too. Knowing about the historical origins of a specific form of quantification can help one better understand why we are doing what we are doing today, as well as ask critical questions about what might be widely accepted practices. NHST, the JIF, as well as intelligence testing are cases in point that we touched upon above. But there is much more. For instance, averages unquestionably rule, together with their standard deviations (errors), much research on humans. Yet it was, among others, a scholar coming from astronomy—Adolphe Quetelet (1796–1874)—who paved the way for the ‘average man’ (e.g., Desrosières, 1998; Hacking, 1990). History can teach one to be humble: We do not know what we do not know!

A note on how to aid good judgment in practice

We do not advocate for any of those recommendations to be put into practice in isolation; rather, it is their simultaneous implementation that may aid science evaluation. To illustrate this, imagine that only the recommendation that expert bibliometricians perform evaluations were implemented. The result could be that those experts become too powerful in a research evaluation: decision-making may then be based too much on technical bibliometric criteria rather than on substantive (including qualitative) considerations made by researchers from the field. In contrast, if those who perform the evaluation (1) are experts in bibliometrics, (2) are experts in the field under study, and (3) understand that they are making decisions under uncertainty and that, under uncertainty, simple rules of thumb may help, then there might be room for both quantitative evaluation and substantive (i.e., qualitative and field-specific) considerations. Finally, if all involved have good statistical knowledge, then they might additionally be in a better position to know, for instance, when which of many different quantitative indicators is useful, or how to best analyze bibliometric data for a given field. Knowledge is power—in several ways!

Importantly, when conceiving of science evaluation from a statistical point of view, it becomes clear that error management is needed. The design of institutions, work contracts, and review procedures can aid in dealing with the mistakes (e.g., false negatives) that any judgement under uncertainty can entail. Error management does not only mean buffering the effects of errors; it also means investigating the potential sources of errors in order to avoid them in the future. Failures are opportunities for learning.

We stress the word ‘investigating’, since steady empirical research is required to continuously evaluate the effects of science evaluations. This also includes research on how evaluations change people’s behavior and science as a social system (see e.g., de Rijcke et al., 2016). For instance, as Levinthal and March (1993) remark with respect to organizations more generally, “...exploitation tends to drive out exploration” (p. 107). As a rule of thumb, exploring new ideas (be they, e.g., novel products in business contexts or new research paradigms in academia) may come with more uncertain, more distant rewards than continuing to exploit existing ones. That is, it is easy to imagine why indices counting the number of publications a person or an institution produces could lead to a “success trap” (Levinthal & March, 1993, p. 106), hindering the exploration of new—and potentially risky—avenues for research, and eventually hampering innovation. However, which qualitative rules of thumb for research evaluation could—perhaps counterintuitively—have exactly the same harmful effects as their quantitative, indicator-based counterparts?

In a sense, it is ironic that those who advocate systematic metrics-based science evaluation and monitoring do not advocate, with the same fervor, a ceaseless systematic empirical evaluation of science evaluation itself. We have recently discussed in detail how the fast-and-frugal heuristics program, as it was originally developed in the decision sciences, might lend itself to systematically investigate metric-based science evaluation, notably bibliometric-based heuristics (Bornmann & Marewski, 2019).

In fact, we believe that many of us—as scientists—may operate by our own qualitative rules of thumb—and possibly transmit them to our students. Examples may come with heuristics for discovery (e.g., Katsikopoulos et al., 2024) and scientific writing (e.g., Marewski et al., 2024: Start your paper with a familiar story; see pp. 297–298), citation (e.g., Read what you cite!), or data analysis (e.g., First plot your data in many different ways!). After all, ‘experts’ may develop strategies to manage tasks—and possibly transmit problem-solving skills—that fall within their domain of expertise. For instance, as Marewski et al. (2024) argue, corresponding rules of thumb abound in business contexts (e.g., Bingham & Eisenhardt, 2011). We see no reason why researchers trying to make up their minds about other researchers’ work should not be in a good position to identify, similarly, their own rules of thumb for research evaluation that could then be systematically studied.

What could that mean? “As the wind blows you must set your sail” (Pólya, 1945, p. 223). Broad distinctions between situations—such as that between ill-structured and well-structured problems—can aid in shaping one’s thinking about which broad classes of tools might work, but they are not sufficient for navigation. Rather, the science of heuristics—and Simon’s notion of bounded rationality—prescribes understanding the precise nature of the situation to which each tool applies (e.g., Simon, 1956, 1990). In the context of research on fast-and-frugal heuristics, the term environment is typically used to refer to situations, and the fit between each heuristic and its environment is systematically studied (e.g., Gigerenzer & Brighton, 2009; Martignon & Hoffrage, 1999).

Analogous work would also have to be carried out for the various ‘heuristics’ proposed above, as well as for those others may come up with. Interestingly, here research on heuristics may meet research in scientometrics. Notably, the connection between proverbial expressions and scientometrics is not new. For instance, Ruocco et al. (2017) discuss a common principle, “success breeds success” (p. 2), albeit viewing it not as a heuristic but as a means to characterize bibliometric distributions, as in Schubert and Glänzel’s (1984) work. From the perspective of research on heuristics, this usage of the expression corresponds to a characterization of a specific type of environment in which heuristics (e.g., of scientific authors, reviewers, or administrators) may operate: one in which past success leads to future success. This example also illustrates a point that we have not addressed in this article—namely that the boundary between qualitative (e.g., proverbial) and numerical insights is fluid; one complements the other. For instance, Simon (1955b) himself studied such environmental patterns quantitatively early on.
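For readers who prefer a numerical intuition, the following Python sketch simulates an extremely simplified ‘success breeds success’ environment: each new citation is allocated to an existing paper with probability proportional to the citations that paper has already accumulated (plus a small constant so uncited papers can still be picked). The mechanism, the parameters, and the output are purely illustrative and are not taken from Simon (1955b), Schubert and Glänzel (1984), or Ruocco et al. (2017); the sketch merely shows how such an environment quickly produces the skewed distributions those works analyze.

```python
# A deliberately minimal 'success breeds success' sketch: each new citation
# goes to an existing paper with probability proportional to the citations it
# already has, plus a small smoothing constant. Illustrative only.
import random

random.seed(42)

n_papers, n_citations, smoothing = 1000, 10000, 1.0
citations = [0] * n_papers

for _ in range(n_citations):
    weights = [c + smoothing for c in citations]
    winner = random.choices(range(n_papers), weights=weights, k=1)[0]
    citations[winner] += 1

citations.sort(reverse=True)
top_10_percent = sum(citations[: n_papers // 10])
print(f"share of citations held by top 10% of papers: "
      f"{top_10_percent / n_citations:.2f}")
print("most cited:", citations[:5], " least cited:", citations[-5:])
```

Running such a toy simulation makes it easy to see why heuristics that reward already-visible work operate very differently in this kind of environment than in one where past and future success are unrelated.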

Conclusions

To conclude, we agree that it would be wonderful if universal norms for establishing what one might call ‘rational’ judgments existed! Yet, despite all number crunching, many judgments in science—be it about findings or research institutions—will neither be straightforward, clear, and unequivocal, nor will it be possible to ‘validate’ and ‘objectify’ all judgments by seemingly universal external standards. To speak with Simon and Newell (1958), “The basic fact we have to recognize is that no matter how strongly we wish to treat problems with the tools our science provides us, we can only do so when the situations that confront us lie in the area to which the tools apply” (p. 6). We propose replacing the quest for universal tools with an endorsement of judgment—and the willingness to act as a consequence of that endorsement. This may not be easy, but it will surely be easier than trying to treat the madness of scientific bureaucratization, as well as obsessions with control and accountability, with real opium.

Research evaluation is an activity that can be characterized as judgment—and eventually also decision-making—under uncertainty (and ambiguity). Uncertainty is one attribute of such evaluations, namely when ‘selecting under unknown future conditions’; ambiguity and heterogeneous preferences (biases) regarding evaluation criteria (either individual ones or ones emerging from different fields' practices) are others. Experienced and trained evaluators (with expertise in the evaluated field, in bibliometrics, and in statistics) could be less prone to biased judgements in research evaluation. Wisely and professionally used, numbers (indicator scores) may play the role of ‘cognitive clues’ that help experienced and trained evaluators ‘to think fast’ (e.g., under workload). Our essay may be understood as a wake-up call for a more sensible use of (bibliometric) indicators in the heterogeneous practices of research evaluation.