1 Introduction

1.1 The Epistemic Exceptionality of Mathematics

There is no dispute that mathematicians frequently disagree about many mathematical issues, e.g., whether a mathematical result is interesting, which of two proofs of the same theorem is more natural, or what the most salient features of a proof are.

It is a common belief, however, that on matters of mathematical correctness, mathematicians will be able to reach an agreement. This does not entail that mathematicians cannot disagree about the correctness of a proof; instead, the common belief is that with a sufficient amount of effort and discussion of details, they will eventually reach agreement. This view is implicit in Leibniz’s famous calculemus quote:Footnote 1

Quando orientur controversiae, non magis disputatione opus erit inter duos philosophos, quam inter duos computistas. Sufficiet enim calamos in manus sumere sedereque ad abacos, et sibi mutuo (accito si placet amico) dicere: calculemus. [When controversies arise, there will be no more need of disputation between two philosophers than between two accountants. For it will suffice to take pens in hand, to sit down at the abacuses, and to say to each other (having called in a friend, if one likes): let us calculate.]

Beyond mathematics, for science in general, the ability to resolve disputes and reach a consensus is seen as a fundamentally desirable hallmark of the scientific method. In popular discussions about science, an inability of experts to agree on whether a statement is correct or not is seen as a major failure. This general sentiment underlies debates such as the discussion of the Sokal hoaxFootnote 2 or the public discourse on scientific consensus, e.g., debates about vaccinations or climate change.Footnote 3

We shall refer to the ability to reach consensus in principle as intersubjective stability; whether science is intersubjectively stable is very closely connected to the stronger claim of the objectivity of science (Reiss and Sprenger 2020). Following Hanson (1958), Kuhn (1962, 1977), and Feyerabend (1975), the received view in the philosophy of science community embraces the principle that science is theory-laden and is generally sceptical of strong claims of scientific objectivity. As a consequence, while intersubjective stability of judgments is in practice a very important indicator of scientific quality, the overall expectations of degrees of agreement are modest.

This situation changes drastically if we move from the empirical sciences to mathematics: many mathematicians and philosophers of mathematics alike believe in a very high degree of intersubjective stability in judgments about mathematical correctness.Footnote 4 The standard argument for this belief, sometimes called derivationism, assumes that correct mathematical proofs are warranted by formal derivations corresponding to them;Footnote 5 the latter are entirely surveyable objects whose correctness can be established with certainty. Therefore, a disagreement between two mathematicians about the correctness of a proof can be resolved in principle by calamos in manus sumere sedereque ad abacos.Footnote 6 Derivationism relies on the deductive nature of mathematics; this has been used to claim that mathematics is an epistemic exception among the sciences: its deductive nature gives mathematicians a categorically different epistemic access to judgments about the correctness of mathematical argument.Footnote 7 Hence, the claimed epistemic exceptionality of mathematics has been used to explain the reported intersubjective stability as well as the perceived lack of lasting disagreements and scientific revolutions in mathematics.Footnote 8

1.2 Peer Review

The perceived differences between the sciences in general and mathematics in particular also apply to the discussion about the purpose, nature, and measures of success of peer review.

In the wider field of science, the general view is that the main purpose of peer review is not to serve as a mechanism to check the correctness of claims in a paper: it is not assumed that peer review can guarantee correctness; on the other hand, mere correctness is not enough to make a submission publishable.Footnote 9 Mirroring the contrast concerning intersubjective stability, the discussion of the purpose of peer review in mathematics differs from that in science in general. The London Mathematical Society presents a very different view of peer review from that of the learnèd societies from other disciplines (cf. fn. 9):

Mathematics is distinguished by the fact that the results are not a matter for debate: when an argument is presented, it can be studied by other experts, who will determine whether it is correct and whether it is complete. Although it may take some time for particularly long or difficult arguments, there is no room for disagreement. This gives Peer Review an especially significant role in mathematics. [...] Because of the extreme density of mathematical writing, Editors will usually expect Reviewers to take around two months unless the paper is especially long, difficult or innovative. It is not unusual for assiduous Reviewers to take several times this long. [...] Because it is necessary to invest so much effort in reading a single paper, it is extremely valuable to the community that published papers have been declared correct by experts. [...] The main benefits of Peer Review are that it ensures the correctness and clarity of the content. (House of Commons Science and Technology Committee 2011, p. 100sq)

This highly idealised description by the London Mathematical Society should be contrasted with the very sceptical view held by some mathematicians as to whether mathematical peer review actually provides the mentioned “main benefits”. In an opinion piece published in the Notices of the American Mathematical Society, Nathanson (2008) complains:

Many (I think most) papers in most refereed journals are not refereed. There is a presumptive referee who looks at the paper, reads the introduction and the statement of the results, glances at the proofs, and, if everything seems okay, recommends publication. Some referees check proofs line-by-line, but many do not. When I read a journal article, I often find mistakes. Whether I can fix them is irrelevant. The literature is unreliable.

The discrepancy between the idealised picture of the referee as a correctness checker and the actual practice of mathematical peer review was analysed by Geist et al. (2010). Their data suggests that the mathematical peer review process is not fundamentally different from peer review in other sciences: a referee is asked to answer the questions referred to as Littlewood’s precepts: “(1) Is it new? (2) Is it correct? (3) Is it surprising? (Krantz 1997, p. 125)”. Only the second precept relates to correctness checking, and there is no consensus in the mathematical community about what level of detail of correctness checking is done by the referee, is expected of the referee, or even should be expected of the referee. An identified fundamental error will certainly lead to rejection of the submission, but the judgments on the other two precepts (“Is it new?” and “Is it surprising?”) are much less likely to fall under the scope of the epistemic exceptionality of mathematics. As a consequence, the referee’s verdict is an aggregation of a judgment that possibly has epistemically exceptional status and other judgments that do not.

Since a lack of intersubjective stability in the peer review process is considered to be worrying for the scientific quality of publication decisions, a rich literature on peer review focuses on the stability of reviewer judgments, the effect of bias, and the reproducibility of decisions.Footnote 10 This worry is particularly evident in the subtitle of the seminal paper by Rothwell and Martyn (2000): Is agreement between reviewers any greater than would be expected by chance alone?

Rothwell and Martyn performed an agreement analysis using Cohen’s \(\kappa \), a statistical measure of inter-rater agreement, and found that in papers of their subject (clinical neuroscience) “there was little or no agreement between the reviewers (Rothwell and Martyn 2000, p. 1964)”. A similar study with a similar result was later done for journals in information science by Wood et al. (2004).

These studies were taken up by Geist et al. (2010) who did a similar analysis for peer review processes (for conferences) in the deductive sciences (mathematics and theoretical computer science), comparing the values of Cohen’s \(\kappa \) from clinical neuroscience, information science, and the deductive sciences. They found that the agreement values in the deductive sciences are considerably higher than those in the comparison disciplines.

In the past decade, these findings have been repeatedly discussed at workshops and conferences in the wider community usually called Philosophy of Mathematical PracticeFootnote 11 resulting in the following question:

Can we give an argument for or against epistemic exceptionality of mathematics on the basis of empirical data of peer reviewer agreement in the form of values of Cohen’s \(\kappa \)? \((*)\)

In this paper, we answer \((*)\) negatively. We provide a toy model for the calculation of values of \(\kappa \) which shows that, under realistic assumptions about the practice of mathematical peer review, even perfect epistemic access to the correctness of proofs would not be detectable in the value of \(\kappa \).

In Sect. 2, we shall give a general description of the definition and use of Cohen’s \(\kappa \); in Sect. 3, we define a toy model which we then use in Sect. 4 to answer \((*)\) negatively.

2 Cohen’s \(\kappa \)

2.1 Cohen’s \(\kappa \) as a Proxy for Sensitivity

Cohen’s \(\kappa \) is a statistical measure for inter-rater agreement (Cohen 1960). It compares the degree of agreement of pairs of judgments to what would be expected if the judgments were randomly distributed. Its general formula is

$$\begin{aligned} \kappa = \frac{F-E}{1-E} \end{aligned}$$

where F is the observed frequency of agreement and E is the expected value of the frequency of agreement. A value of \(\kappa = 1\) corresponds to full agreement (\(F = 1\), i.e., every single pair of judgments is in agreement) and a value of \(\kappa = 0\) corresponds to agreement according to the expected value (\(F=E\)).
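
As a purely illustrative numerical example (the numbers are ours, not empirical data): if two raters agree in 90% of all cases while the marginal distribution of their verdicts leads one to expect 80% agreement by chance, then

$$\begin{aligned} \kappa = \frac{F-E}{1-E} = \frac{0.9-0.8}{1-0.8} = 0.5. \end{aligned}$$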

Cohen’s \(\kappa \) can be used as an informative proxy for the quality of a checking mechanism: suppose there is a collection of samples, a property P of samples that we wish to check, and a checking mechanism that tests samples and whose sensitivity, i.e., the probability that a sample with property P tests positive, is not known.Footnote 12

In a situation where the test is the only epistemic access to whether a sample has property P or not, it is difficult to determine the sensitivity of the test. One possible way to do so is to test each sample twice by independent tests and calculate the value of inter-rater agreement. Assuming independence of the tests and a random distribution of property P, the value of Cohen’s \(\kappa \) is a remarkably good proxy for the sensitivity of the test as can be seen in Table 1.Footnote 13 If the probability of a sample having property P is 20% or less, the value of \(\kappa \) is very close to the sensitivity of the test.Footnote 14

Table 1 Values of Cohen’s \(\kappa \) for a test for property P (and no chance of false positives; cf. fn. 14): the rows correspond to the sensitivity of the test, the columns to the probability of a sample having property P
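
To illustrate how entries of the kind shown in Table 1 can be computed, the following Python sketch evaluates \(\kappa \) for the scenario just described: a test with a given sensitivity and no false positives (cf. fn. 14), applied twice independently to each sample. The function name and the chosen parameter values are ours and purely illustrative.

```python
def kappa_double_test(sensitivity: float, prevalence: float) -> float:
    """Cohen's kappa for two independent applications of a test with the
    given sensitivity and no false positives, where a sample has
    property P with probability `prevalence`."""
    # Marginal probability that a single test comes out positive.
    a = prevalence * sensitivity
    # Observed frequency of agreement: samples without P always give two
    # negatives (no false positives); samples with P give two positives
    # or two negatives.
    F = (1 - prevalence) + prevalence * (sensitivity ** 2 + (1 - sensitivity) ** 2)
    # Expected agreement for two independent draws from the marginals.
    E = a ** 2 + (1 - a) ** 2
    return (F - E) / (1 - E)

# For a prevalence of 20%, kappa stays close to the sensitivity of the test.
for s in (0.7, 0.8, 0.9, 1.0):
    print(s, round(kappa_double_test(s, 0.2), 3))
```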

2.2 Interpreting Cohen’s \(\kappa \)

The statistical literature has a rich discussion of the strengths and weaknesses of Cohen’s \(\kappa \): the so-called kappa paradoxes (Feinstein and Cicchetti 1990; Cicchetti and Feinstein 1990) describe undesirable behaviour of the function \(\kappa \) in practice (cf. fn. 14). Numerous variants of Cohen’s \(\kappa \) have been proposed to deal with these unwanted features, e.g., Scott’s \(\pi \) (Scott 1955), the G-index by Holley and Guilford (1964), Bangdiwala’s \(\mathrm {B}\) (Bangdiwala 1985), or Gwet’s \(\mathrm {AC1}\) (Gwet 2010). Many of these alternative measures share the basic mathematical structure of Cohen’s \(\kappa \), and the results for our toy model in Sect. 3 would not be expected to differ substantially had we used a different inter-rater agreement measure.

In spite of the criticism and alternative proposals, Cohen’s \(\kappa \) has developed into a standard measure for inter-rater agreement. However, interpreting the meaning of values of Cohen’s \(\kappa \) is difficult and highly context-sensitive. It is curious that many textbooks and papers refer to the scale provided by Landis and Koch (1977) given in Table 3 even though Landis and Koch (1977, p. 164sq) themselves state very clearly that these descriptors are merely a convention for their paper:

In order to maintain consistent nomenclature when describing the relative strength of agreement associated with kappa statistics, the following labels will be assigned to the corresponding ranges of kappa. Although these divisions are clearly arbitrary, they do provide useful “benchmarks” for the discussion of the specific example.

Table 2 Values of Cohen’s \(\kappa \) for a test for property P with specificity \(s = 0.99\): the rows correspond to the sensitivity of the test, the columns to the probability of a sample having property P
Table 3 Arbitrary descriptors for the strength of agreement by Landis and Koch (1977, p. 165).

In addition to the arbitrary choice of the numerical divisions, the descriptors used (e.g., “substantial agreement” for \(\kappa \) between 0.61 and 0.80) should not be expected to work across application contexts: what counts as “substantial agreement” for a flaw check in a paperclip factory may not meet the standards of a test for infection with a highly contagious deadly disease.
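
For illustration only, the conventional Landis–Koch descriptors (cf. Table 3) can be written as a small lookup function; the cut-offs are, as Landis and Koch stress, arbitrary benchmarks, and the function name is ours.

```python
def landis_koch_label(kappa: float) -> str:
    """Conventional Landis-Koch descriptor for a kappa value (cf. Table 3).
    The cut-offs are arbitrary benchmarks, not context-independent standards."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.65))   # "substantial"
```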

3 A Toy Model

In this section, we describe a toy model for mathematical peer review in which we identify the philosophical claim that mathematics is an epistemic exception as a particular special case. Detecting epistemic exceptionality with the values of Cohen’s \(\kappa \) therefore becomes the question whether knowledge of \(\kappa \) allows us to conclude that we are in that particular special case.

We emphasise that our toy model is by no means intended to be a realistic mathematical model of peer review: we make a number of simplifying assumptions that are not true of mathematical peer review in practice. However, we claim that our simplifications only make it easier to detect epistemic exceptionality, so that if, even under these simplifying assumptions, the value of Cohen’s \(\kappa \) cannot detect epistemic exceptionality, then this will remain true under more realistic assumptions.

We assume that we have a collection of papers submitted for review (submissions) and a population of peer reviewers (referees). The recommendation of a referee reviewing a submission will depend on whether the referee detects a fundamental flaw and, if not, whether the referee considers the paper suitable for publication in the journal to which it was submitted. We formalise this in three parameters e, m, and p:

Correctness. Submissions can either be correct or incorrect. Here, incorrect means that the paper is fundamentally flawed and must not be published. We ignore all of the subtle situations between these two extremes: that errors in the submission might be fixed as part of the peer review process; that papers could be partially correct and publishable after removing the errors; etc. We write e (for “error probability”) for the probability that a submission is incorrect.

Mastery. This parameter is the ability of the referee to detect flaws of the type discussed in the preceding paragraph. We write m (for “mastery”) for the probability that a referee is able to detect that a paper is fundamentally flawed; in that case, we shall say that the referee is masterful. In biomedical terminology, the mastery is the sensitivity of the flaw testing provided by the referees. As with correctness, we ignore many of the subtleties of the refereeing process: that a referee may spot a flaw in parts of the paper, but the rest of the paper is still correct and valuable; that a referee thinks they found a flaw that later turns out not to be one, etc.Footnote 15

In general, it is reasonable to assume in all disciplines that expert referees will have a good chance of spotting fundamental flaws; the claim that mathematics is epistemically exceptional amounts to the assumption that mathematical referees are in principle always able to detect them, i.e., all referees are masterful or, in terms of the toy model, \(m=1\).

Positivity. Assuming that a paper is correct or that a referee has not identified its fundamental flaw, the referee’s decision to recommend acceptance or rejection is based on many non-mathematical factors, e.g., whether the referee likes the research area or whether the referee thinks that this theorem is good enough for this journal. We simplify all of these factors into a single probability that we call the positivity of the referee. The positivity p is the probability that a referee will recommend acceptance, provided she or he has not detected that the submission is incorrect.

Based on our parameters e, m, and p, we can now analyse what the probability of the recommendations “accept” and “reject” are:

Case 1. The paper is incorrect and the referee is masterful. This means that the referee detects the incorrectness and recommends rejection. This case happens with probability \(e\cdot m\).

Case 2. The paper is incorrect and the referee is not masterful. This means that the referee does not detect the incorrectness and the recommendation depends on the positivity of the referee. Case 2 results in “accept” with probability \(e\cdot (1-m) \cdot p\) and in “reject” with probability \(e\cdot (1-m) \cdot (1-p)\).

Case 3. The paper is correct. In this case, the mastery of the referee is irrelevant, so the referee will recommend acceptance with probability \((1-e)\cdot p\) and rejection with probability \((1-e)\cdot (1-p)\).

Based on our three cases, we calculate the probabilities a and r of an acceptance or rejection recommendation and the expected value E of agreement, respectively:

$$\begin{aligned} a&= e\cdot (1-m)\cdot p + (1-e)\cdot p = p-emp,\\ r&= 1+emp-p\text{, } \text{ and }\\ E&= a^2+r^2 = 1+2m^2p^2e^2 - 4mp^2e + 2mpe +2p^2-2p. \end{aligned}$$
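
The case analysis and the derived probabilities can be cross-checked with a small Monte Carlo simulation; the following Python sketch (the function name and the sample parameter values are ours) draws a single referee recommendation according to Cases 1–3 and estimates the acceptance probability, which should be close to \(a = p-emp\).

```python
import random

def referee_recommendation(e: float, m: float, p: float,
                           rng: random.Random) -> str:
    """Draw one referee recommendation in the toy model.

    e: probability that the submission is fundamentally flawed,
    m: mastery, i.e. probability that the referee detects such a flaw,
    p: positivity, i.e. probability of recommending acceptance
       when no flaw has been detected.
    """
    incorrect = rng.random() < e
    masterful = rng.random() < m
    if incorrect and masterful:      # Case 1: flaw detected, reject
        return "reject"
    # Cases 2 and 3: no flaw detected, decision driven by positivity
    return "accept" if rng.random() < p else "reject"

rng = random.Random(0)
e, m, p = 0.2, 0.8, 0.5
n = 100_000
accepts = sum(referee_recommendation(e, m, p, rng) == "accept" for _ in range(n))
print(accepts / n)   # should be close to a = p - e*m*p = 0.42
```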

Assuming we assign two referees to a given submission, we can either have two masterful referees (with probability \(m^2\)), one masterful referee (with probability \(2m(1-m) = 2m-2m^2\)), or no masterful referees (with probability \((1-m)^2 = 1-2m+m^2\)). If a submission is correct, the two referees will agree with probability \(p^2+(1-p)^2 = 1-2p+2p^2\), independent of the mastery of the referees. If a submission is incorrect, masterful referees will always agree, non-masterful referees will agree with probability \(p^2+(1-p)^2 = 1-2p+2p^2\), and a masterful referee will agree with a non-masterful referee with probability \(1-p\). Combining these numbers, we obtain the following formula for the frequency of agreement:

$$\begin{aligned} F:&= (1-e)\big [1-2p+2p^2\big ] + \\&\qquad e\big [ m^2 + 2m(1-m)(1-p) + (1-m)^2(1-2p+2p^2)\big ] \quad \quad (\dag )\\&= 1+2m^2p^2e - 4mp^2e + 2mpe +2p^2-2p. \end{aligned}$$

From (\(\dag \)), we can now calculate the value of Cohen’s \(\kappa \) as

$$\begin{aligned} \kappa = \frac{F-E}{1-E} = \frac{m^2pe-m^2pe^2}{1+2mpe-m^2pe^2-me-p}. \end{aligned}$$
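
The following Python sketch implements both routes to \(\kappa \) in the toy model: the frequency of agreement F as in (\(\dag \)) together with \(E = a^2+r^2\), and the simplified rational expression above; a consistency check confirms that the two agree (the function names and the sample parameter values are ours).

```python
def kappa_from_model(e: float, m: float, p: float) -> float:
    """Cohen's kappa computed from F as in (dagger) and E = a^2 + r^2."""
    a = p - e * m * p                      # probability of "accept"
    r = 1 - a                              # probability of "reject"
    E = a ** 2 + r ** 2                    # expected agreement by chance
    chance_agree = p ** 2 + (1 - p) ** 2   # two positivity-driven verdicts agree
    F = ((1 - e) * chance_agree
         + e * (m ** 2
                + 2 * m * (1 - m) * (1 - p)
                + (1 - m) ** 2 * chance_agree))
    return (F - E) / (1 - E)

def kappa_closed_form(e: float, m: float, p: float) -> float:
    """The simplified rational expression for kappa given above."""
    num = m ** 2 * p * e - m ** 2 * p * e ** 2
    den = 1 + 2 * m * p * e - m ** 2 * p * e ** 2 - m * e - p
    return num / den

for e, m, p in [(0.1, 0.7, 0.5), (0.2, 1.0, 0.45), (0.3, 0.9, 0.8)]:
    assert abs(kappa_from_model(e, m, p) - kappa_closed_form(e, m, p)) < 1e-12
    print(e, m, p, round(kappa_closed_form(e, m, p), 3))
```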

As mentioned, Table 1 shows some values of this function for the special case \(p=1\); in this case, if \(e < 0.5\), the value of Cohen’s \(\kappa \) is a remarkably good proxy for the value of m. Therefore, in this special case, we could check whether all referees are masterful by checking the value of Cohen’s \(\kappa \).
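
Indeed, for the special case \(p=1\), the rational expression above simplifies (a short calculation not spelled out in this form in the text) to

$$\begin{aligned} \kappa \big \vert _{p=1} = \frac{m^2e(1-e)}{me(1-me)} = \frac{m(1-e)}{1-me}, \end{aligned}$$

which tends to m as e tends to 0; this is why \(\kappa \) tracks the mastery m closely when the error probability is small.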

In general, we give the numerical values of \(\kappa \) in Table 4 for the values \(e = 0.1\) (first block; i.e., 10% of all submissions are fundamentally flawed), \(e = 0.2\) (second block; i.e., 20% of all submissions are fundamentally flawed), and \(e = 0.3\) (third block; i.e., 30% of all submissions are fundamentally flawed). An inspection of the tables yields that \(\kappa \) ceases to be a good proxy for m if \(p\ne 1\): the values of \(\kappa \) are below 0.2 unless the positivity is 0.8 or higher or the mastery is 0.7 or higher. Values in the Landis–Koch category of “substantial agreement” (0.61 or higher) require either \(p=1\) or \(p\ge 0.9\) and \(m\ge 0.9\); only values in the column of \(p=1\) reach the Landis–Koch category of “almost perfect agreement”.

Table 4 Three tables of value of Cohen’s \(\kappa \) for \(e=0.1\), \(e=0.2\), and \(e=0.3\) in the toy model: the rows correspond to the value of m and the columns to the value of p

We should like to emphasise that our toy model is not specifically about mathematical peer review or even about peer review processes. It is a general model for any decision making process that aggregates a correctness check (with no false positives) and a subjective judgment into a single binary answer: submissions identified as flawed are to be rejected and all others will receive a verdict on the basis of the subjective judgment. The subjective judgment is represented by our parameter p: if \(p=1\), we revert to the case of a mechanical check of a property P from Sect. 2. Table 4 shows that if a non-trivial amount of subjectivity is injected into the aggregation, the value of Cohen’s \(\kappa \) drops substantially.Footnote 16

4 Interpretation and Discussion

4.1 Masterful, but Resource-Bounded Referees

Before inspecting the numerical details of the values of Cohen’s \(\kappa \), we should like to discuss an interpretation issue with the attempt to argue for epistemic exceptionality from quantitative data of \(\kappa \):

Even in a world of epistemically exceptional referees, referees remain human beings and are resource-bounded. As mentioned in the report of the London Mathematical Society (Sect. 1.2), checking the details of a proof requires a great deal of time and energy, and not every referee is able or willing to invest them. It is therefore conceivable that we live in a world of epistemic exceptionality in which nevertheless not every referee uses their special epistemic powers. In fact, we know that there are referees who do not check the proofs; in the survey of editors of mathematical journals by Geist et al. (2010), only half of the editors reported that they “think the referee should check all of the proofs in detail” and one of them was realistic enough to comment “to be reasonable, I am happy when I find a referee [who checks some proofs in detail] (Geist et al. 2010, p. 163sq)”.Footnote 17

The methodological issue is that the values of Cohen’s \(\kappa \) are unable to distinguish between a situation where all referees are masterful and can in principle detect errors with absolute certainty, but only 70% of them invest the time and energy to do so, and a situation where mathematics is not epistemically exceptional and referees just have an average mastery level of 0.7. This presents a structural problem with any argument for or against epistemic exceptionality on the basis of values of Cohen’s \(\kappa \).

4.2 Realistic Ranges of the Parameters for Correctness and Positivity

In our argument concerning question \((*)\) (cf. Sect. 4.3), it would be methodologically satisfying to use good estimates for the actual values of the parameters e and p in the practice of mathematical peer review based on solid quantitative data. However, such data does not exist, so we need to give realistic ranges for p and e on the basis of personal experience.

Correctness. Based on many years of experience as an editor and managing editor of mathematical journals, the present author expects that only a small proportion of submissions are fundamentally flawed in a way that makes the publication of any version of the submission impossible on the grounds of correctness alone. The chosen values \(e=0.1\), \(e=0.2\), and \(e=0.3\) in Table 4 reflect this.

Positivity. If we fix a particular mathematical research journal and have access to a collection of its editorial decisions, we could estimate the value of p for that journal by counting the number of rejections based on criteria that do not involve correctness, e.g., the scope of the journal, the level of difficulty, novelty, or relevance of the result. We expect that this empirically determined value of p depends significantly on the choice of the journal. While this data could be obtained in principle, as far as we know, no robust quantitative study has been done to date. Personal experience with mathematical research journals suggests that a substantial number of submissions are rejected and, given that fundamental flaws are rare (see above), a value of \(p = 0.5\) or below is a realistic assumption.

4.3 Negative Answer to Question \((*)\)

With the realistic ranges for e and p, we can now make question \((*)\) precise in terms of the toy model: if e and p are in the realistic range, is it possible to determine from the values of Cohen’s \(\kappa \) whether the toy model is epistemically exceptional (i.e., whether \(m=1\))?

The answer is “no”: inspecting the values in Table 4 in the columns for \(p=0.5\) or below, we see that the values are all below 0.2. There is no categorical difference between the values at \(m=1\) and the values at \(m<1\). For example, in an epistemically exceptional world (\(m=1\)) with \(p=0.45\) and \(e=0.2\), the value of Cohen’s \(\kappa \) is 0.141; this is the same as the value in a world with \(m=0.8\), \(p=0.5\), and \(e= 0.295\). Similarly, in an epistemically exceptional world (\(m=1\)) with \(p=0.3\) and \(e=0.1\), the value of Cohen’s \(\kappa \) is 0.041; this is the same as the value in a world with \(m=0.8\), \(p=0.35\), and \(e=0.13\) or \(m=0.6\), \(p=0.34\), and \(e=0.28\).
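
The numerical coincidences used in this argument can be recomputed directly from the closed-form expression of Sect. 3; the following sketch (the function name is ours) rounds to three decimals as in the text.

```python
def kappa(e: float, m: float, p: float) -> float:
    """Closed-form Cohen's kappa of the toy model (Sect. 3)."""
    return ((m ** 2 * p * e * (1 - e))
            / (1 + 2 * m * p * e - m ** 2 * p * e ** 2 - m * e - p))

# An epistemically exceptional world (m = 1) ...
print(round(kappa(0.2, 1.0, 0.45), 3))    # 0.141
# ... is numerically indistinguishable from non-exceptional ones:
print(round(kappa(0.295, 0.8, 0.5), 3))   # 0.141
print(round(kappa(0.1, 1.0, 0.3), 3))     # 0.041
print(round(kappa(0.13, 0.8, 0.35), 3))   # 0.041
print(round(kappa(0.28, 0.6, 0.34), 3))   # 0.041
```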

4.4 Cohesion

Table 5 Values of Cohen’s \(\kappa \) for \(e=0.2\) and \(p=0.5\) with additional cohesion parameter c, using the formula (\({\ddag }\)): the rows correspond to the value of m and the columns to the value of c

One of the implicit assumptions in our toy model that could be criticised is that referee judgments on correct papers are made randomly according to the positivity p. In the calculation of the value of F in (\(\dag \)), this is reflected in the two factors \(p^2+(1-p)^2 = 1-2p+2p^2\) corresponding to the probability of agreement of two referees on a paper with no detected flaw. One could argue that, as part of a social community of mathematicians sharing similar values, agreement about judgments of novelty or relevance is more likely than expected if it were a pure chance decision. We can model this in our toy model by replacing the two occurrences of \(p^2+(1-p)^2 = 1-2p+2p^2\) with a new parameter \(c \ge p^2 + (1-p)^2\) for cohesion, i.e.,

$$\begin{aligned} F := (1-e)c + e\big [ m^2 + 2m(1-m)(1-p) + (1-m)^2c\big ].\qquad ({\ddag }) \end{aligned}$$

The results, for \(e = 0.2\) and \(p=0.5\), are given in Table 5: the light grey column is \(c = p^2+(1-p)^2 = 0.5\) and matches the column \(p=0.5\) in the second block of Table 4 (also marked in light grey).
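
A minimal Python sketch of the cohesion variant, under the assumption (implicit above) that the expected agreement E is still computed from the unchanged marginal probabilities: formula (\({\ddag }\)) replaces the chance-agreement factor \(p^2+(1-p)^2\) by c, and for \(c = p^2+(1-p)^2\) the values of Sect. 3 are recovered, matching the light grey column of Table 5. The function name and the grid of c values are ours.

```python
def kappa_with_cohesion(e: float, m: float, p: float, c: float) -> float:
    """Cohen's kappa in the toy model with cohesion parameter c, using
    (double dagger) for F; E is still computed from the unchanged
    marginal probabilities (an assumption of this sketch)."""
    a = p - e * m * p                  # marginal probability of "accept"
    E = a ** 2 + (1 - a) ** 2          # expected agreement by chance
    F = (1 - e) * c + e * (m ** 2
                           + 2 * m * (1 - m) * (1 - p)
                           + (1 - m) ** 2 * c)
    return (F - E) / (1 - E)

e, p = 0.2, 0.5
baseline = p ** 2 + (1 - p) ** 2       # = 0.5, recovers the model of Sect. 3
for m in (0.6, 0.8, 1.0):
    print(m, [round(kappa_with_cohesion(e, m, p, c), 3)
              for c in (baseline, 0.7, 0.9)])
```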

Not surprisingly, the values of Cohen’s \(\kappa \) increase quickly as c increases, but we also observe that a high degree of cohesion diminishes the effect that the parameter m has on the value of \(\kappa \); as a consequence, it becomes even harder to detect epistemic exceptionality.

Cohesion effects like this are expected in general in most academic disciplines; unpublished work of Greiffenhagen (2021) suggests that these effects are stronger in mathematics than in other disciplines.

The numerical values that were obtained by Geist et al. (2010) for conferences in the deductive sciences were considerably higher than the values in clinical neuroscience and information science and also considerably higher than would be predicted by our toy model. One possible explanation for these findings could be that we observe the strong(er) cohesion effect in mathematics at work. Making this precise and giving an argument that this is the case would require a detailed and comparative empirical study of the mathematical peer review process.