Arbitrariness in the peer review process

The purpose of this paper is to analyze the causes and effects of arbitrariness in the peer review process. We focus on two main sources of arbitrariness. The first is that referees are not homogeneous and display homophily in their taste for, and perception of, innovative ideas. The second is that reviewers differ in the time they allocate to peer review. Our model replicates the 2014 NIPS experiment, showing that peer review ratings are not robust and that changing the reviewers has a dramatic impact on the ranking of the papers. The paper also shows that innovative works are not highly ranked in the existing peer review process and, in consequence, are often rejected.


Introduction
The process of peer review is one of the main practices underlying the publication of research. The quality of published articles depends on the efficiency and competence of the peer review process. Lately, many studies have emphasized the problems inherent in peer review (for a summary, see Squazzoni et al. 2017). Moreover, Ragone et al. (2013) have shown that there is a low correlation between peer review outcomes and future impact as measured by citations. One of the most severe problems of the peer review process is emphasized by the results of the NIPS experiment, which took place in 2014: a fraction of submissions went through the review process twice, by two independent committees.

Facts related to peer review
The peer review process is used in three different channels of science. First, journals use it to decide which papers to publish. Second, governments and NGOs that provide grants choose projects through peer review. Third, conference organizers use peer review to choose the papers to be presented at conferences. There are some differences between these three channels, but our model is general enough to apply to all of them. In all three channels, the criteria for ranking are quite similar; this is the topic of the next section.

Criteria of peer review and acceptance rates
In each peer review process, the committee publishes the criteria by which the reviewers should judge the papers or projects. The criteria used in ranking grant proposals are very similar to those used in ranking papers for conferences. More specifically, we have focused on the criteria chosen in computer science conferences, in subfields such as artificial intelligence, cryptology and computer vision.
We have found that funding committees and conference organizers propose many criteria, such as 'presentation quality', 'clarity', 'reproducibility', 'correctness', 'novelty' and 'value added'. Criteria such as 'presentation quality' and 'clarity' are chosen very often.
We have found a total of 12 criteria used for ranking papers. These 12 criteria can be regrouped into three main categories: (1) soundness, dealing with presentational and scientific validity; (2) contribution, capturing the importance of the results; and (3) innovation, showing how novel the results or ideas are. The criteria are presented in Table 1.
These three categories of criteria affect the ranking of papers and projects. The weight given to each category is also an important element of the peer review process. In consequence, the model we develop in the next section takes these three categories into account. As for acceptance rates, they vary for conferences as well as for projects to be financed. Computer science conferences have a median acceptance rate of 37% (see Malicki et al. 2017); in the case of NIPS (see below), it is 22%.
For projects, acceptance rates are small, between 1 and 20%, with an average of 10%. In the European H2020 calls, the acceptance rate is 1.8%.

The NIPS experiment
In December 2014, at the conference on Neural Information Processing Systems (NIPS) in Montreal, the main committee split the program committee in two, forming two independent committees, and sent 10% of the submissions to both. The two committees received the same papers.
The acceptance rate was pre-defined at 22%: 166 papers underwent this "duplicate" peer review, and 37 papers were to be accepted.
Of the 37 papers to be accepted, only 16 were accepted by both committees (43%), while the committees disagreed on 21 (57%); recall that these results were obtained at a 22% acceptance rate.
There are two main conclusions to be drawn from this ex-post experiment: (1) there is arbitrariness in the peer review process, since the two committees chose very different sets of papers to be presented at the conference (the solution adopted was to accept all papers and add more sessions to the conference); (2) the arbitrariness concerned more than 50% of the papers.
These facts call for a thorough analysis of their underlying causes. The following model analyzes the reasons for this arbitrariness and simulates the results of the NIPS experiment.

Introduction
The purpose of this paper is to underline possible channels for arbitrariness in the peer review process. The paper focuses on two main elements, both crucial for obtaining our results. The first is the concept of homophily, implying that reviewers have a personal bias. More specifically, we assume that reviewers differ in their taste for innovation and, in consequence, grade the projects assigned to them according to how close these projects are to their own taste for innovation. The second element incorporated in this model is that reviewers differ in the time they allocate to peer review. Another element included in the model is the possible correlation between homophily and time devoted to peer review: do reviewers with less taste for innovation devote more or less time to reviewing? In this paper we assume that innovative taste and time devoted to reviewing are independent. It could be that more innovative people devote more time to refereeing, but it could also be the opposite.
We now turn to present our model. Our model will allow us to explain the results of the NIPS experiment. Moreover, it shows that good but innovative projects are often rejected.

Criteria for papers and projects valuation
Let us assume that we have k projects of which only h can be funded, or equivalently k papers of which only h can be published. Since our model applies both to peer review of papers to be presented at conferences and to peer review of projects to be funded, we will use the term "papers" to cover projects as well, for brevity, rather than writing "projects/papers".
We first describe the criteria by which we define the true value of a paper. As shown in Sect. 2, we can group all these criteria under three main categories: 'soundness', 'contribution' and 'innovation'.
More specifically, we denote by S the criteria linked to soundness, such as clarity, reproducibility, correctness, and the absence of misconduct. In the contribution criteria, C, we include elements linked to the impact and value added of the paper, while novelty-related criteria are denoted by I. In other words, each paper is defined by three criteria, S, C and I, so that the true value of a project is:

V_i = α S_i + β C_i + γ I_i    (1)

where V_i is the value of project i, S_i represents the scientific soundness of the project, C_i its scientific contribution, and I_i its innovative element. I is similar to the degree of disruption the paper introduces, as described by Wu et al. (2019). As we will discuss later on, adding more criteria leads to more arbitrariness, so keeping a single 'soundness' category rather than three separate ones (correctness, reproducibility and clarity) is preferable. (One reason why reviewers differ in the time they want to invest in reviewing is that they may have different utility functions or face different time constraints.)

The weights given to these three criteria are not the same in all fields or editing committees. Some prefer to put the emphasis on 'soundness', others on the 'contribution' of a paper. We therefore analyze in this model the effects of different weights on these criteria. In the first case, we set α = β = γ = 1/3. Later on, we also check other sets of weights.
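As a minimal sketch, the true-value formula can be written as a weighted sum; the function name and default weights below are ours, chosen to match the equal-weight case:

```python
# Minimal sketch of the true-value formula (function name and defaults are ours).
def true_value(S, C, I, alpha=1/3, beta=1/3, gamma=1/3):
    """True value of a paper from soundness S, contribution C, innovation I."""
    return alpha * S + beta * C + gamma * I

# Equal weights ("coeff. 1"): a paper with S=60, C=70, I=40 scores (60+70+40)/3.
v_equal = true_value(60, 70, 40)
# Coeff. 2 of the paper, (1/4, 3/8, 3/8), re-weights contribution and innovation.
v_coeff2 = true_value(60, 70, 40, alpha=1/4, beta=3/8, gamma=3/8)
```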

Referees valuation of the projects and papers
Each referee receives a different subset of projects, usually selected based on expertise. The referees try to estimate the 'true' values of the projects. We denote by U_ij the value given by referee j to project i. Note that referees differ in their subjective value of time, as well as in their degree of homophily with the project. Thus, U_ij is a function of the time the referee spends analyzing the project, and of the referee's opinion of how innovative the project is, which is influenced by homophily.
We now present in more detail the way referees value projects. First, we assume that referees evaluate S_i without error, since committees report that there are no big debates about the 'soundness' of a project.

Contribution criteria, C
The contribution and value added of a project, by contrast, are usually debated among referees: the scientific contribution of a paper, C_i, is not easily evaluated.
We define T ij as the time that referee j takes to investigate the project i, and assume that if the time invested is higher than the contribution value, i.e., C i ≤ T ij , then the referee can correctly estimate the true value of the project. However, if C i > T ij , then he/she does not appreciate the true value.
In other words, we assume that the more time a referee spends analyzing the project, the closer he gets to the true value C i ; and the greater the difference between C i and T ij , the larger the error in valuation is.
Without loss of generality, we assume that T_ij depends on both the reviewer and the project, so that it can be represented as:

T_ij = T_j + ε_ij    (2)

where T_j represents the average time referee j spends on a review and ε_ij represents the project-dependent fraction of time. In the following, we set ε_ij = 0 for the sake of simplicity.

Innovation criteria, I
Regarding the innovative value of a project and the effect of homophily on its valuation, we assume that some referees are more innovative and have a tendency toward innovative ideas, while other referees are more orthodox by nature and do not like unorthodox projects.
We call I_ij the homophilic index of scientist j with respect to project i, which is distributed normally on the range [0, Z]. Homophily between the referee and the project can be computed as the similarity between their sets of innovation-related traits. In general, similarly to T_ij, we can split I_ij into two components:

I_ij = I_j + γ_ij    (3)

where γ_ij represents the homophily effect, while I_j represents conformity, i.e., how receptive the referee is to innovative ideas.
When considering the inventive element, I_ij, we assume that γ_ij = 0, i.e., homophily affects the referee's valuation in the following manner: (1) the more creative (or receptive to unorthodox ideas) the referee is, the better he estimates the inventive element; (2) if the referee is more creative than the project proposed, he makes no error on its value; and (3) the error is an increasing function of the difference between the true value and his creative capacity.
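Under these assumptions, the referee's estimates of C and I reduce to simple caps. The sketch below is one illustrative reading, with min() as the simplest error form consistent with the description:

```python
# Illustrative reading of the error structure (with eps_ij = gamma_ij = 0):
# min() is the simplest form consistent with points (1)-(3) above.
def estimate_contribution(C_i, T_j):
    # Exact when C_i <= T_j; otherwise the shortfall C_i - T_j grows with the gap.
    return min(C_i, T_j)

def estimate_innovation(I_i, I_j):
    # Exact when the referee is at least as receptive as the project is novel.
    return min(I_i, I_j)
```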

The total valuation of referees
Taking into account the various elements described above, the valuation given by referee j to project i (with ε_ij = γ_ij = 0) is:

U_ij = α S_i + β min(C_i, T_j) + γ min(I_i, I_j)    (4)

The results of the model

This simple model permits us to show some of the implications of the peer review process, based on Eq. (4). The two main results are the following.

Proposition 1
The peer review process leads to arbitrariness: for the same set of papers, different reviewers produce a different ranking.
The model reproduces the results of the NIPS experiment.
Proposition 2

Innovative projects are not highly ranked in the existing peer review process, mainly due to the homophilic trait of reviewers.
Instead of presenting formal proofs, we will show the results of more than 200 simulations. We present the simulations in the next section, but let us start with a simple example which is more intuitive and allows us to understand the various claims of the propositions.

An example
There is one committee and two referees who have to choose 3 papers out of 10 (k = 10 and h = 3), an acceptance rate of 30%. This acceptance rate is consistent with the rates for computer science conferences shown above. We assume that all criteria have equal weight (α = β = γ = 1). We order the papers by increasing value, such that:

V_1 ≤ V_2 ≤ … ≤ V_10

The two referees differ in their preferences. The first referee spends much time on each paper (his T_1 is 70, so that all papers with C lower than 70 will be reviewed accurately), but his homophilic index for unorthodox views is low (his I_1 is 40, so that all papers with an innovation index higher than 40 will not be judged accurately).
The second referee does not spend much time (T_2 = 40), but her homophilic index is high (I_2 = 120). Because these two referees are quite dissimilar in the time spent refereeing and in their homophilic index, they will differ in their choice of papers. In Table 2, we present the 10 papers, their 'true' values, and how they were ranked by the two referees. Table 2 allows us to compare the papers chosen by each referee, as well as their average. First, we see that reviewer 1 will choose papers 7, 8 and 10, while reviewer 2 will choose papers 5, 9 and 10 (recall that each referee has to pick 3 papers out of 10).
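To make the mechanics concrete, here is a small re-creation of this two-referee setup. The paper values below are hypothetical (not those of Table 2); the scoring uses the capped valuation described above with unit weights:

```python
# Hypothetical re-creation of the two-referee example. The paper values below
# are illustrative (not those of Table 2); the weights are 1, as in the text.
papers = {  # id: (S, C, I)
    1: (30, 30, 5),   2: (40, 35, 10),  3: (50, 45, 10), 4: (55, 50, 15),
    5: (60, 65, 20),  6: (60, 80, 25),  7: (65, 90, 30), 8: (70, 60, 80),
    9: (60, 50, 110), 10: (75, 70, 96),
}

def referee_score(S, C, I, T, Ihom):
    # Soundness is seen exactly; contribution is capped by the time budget T,
    # innovation by the homophilic index Ihom.
    return S + min(C, T) + min(I, Ihom)

def top3(T, Ihom):
    scores = {i: referee_score(*v, T, Ihom) for i, v in papers.items()}
    return set(sorted(scores, key=scores.get, reverse=True)[:3])

true_top3 = set(sorted(papers, key=lambda i: sum(papers[i]), reverse=True)[:3])
picks1 = top3(T=70, Ihom=40)    # patient but conservative referee
picks2 = top3(T=40, Ihom=120)   # hasty but receptive referee
```

With these numbers, referee 1 misses the highly innovative paper #9 and picks #7 instead, so the two short-lists overlap only partially, mirroring the disagreement in the example.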

Results of the example
What is striking is that the referees agree on only 1 paper out of 3: the only paper the two reviewers have in common is paper #10.
If the committee chooses the papers by averaging the grades, the final choice will include papers 5, 7 and 10, while the best 3 papers are 8, 9 and 10. The referees should have chosen papers 8, 9 and 10; in fact they chose 5, 7 and 10: an error on 2 papers out of 3.
As stressed in Proposition 1, there is arbitrariness in the peer review process. This is exactly what happened in the NIPS experiment, and our example replicates its results. Moreover, the three innovative papers are #8, 9 and 10, yet only paper #10 is chosen. As stated in Proposition 2, peer review leads to a bias against innovative papers and projects.
This example highlights the bias in the peer review system. We turn now to present the 200 simulations performed.

Introduction
We now present the simulations performed. The number of papers is 100 (k = 100). The number of papers accepted, h, is either 5, 10 or 15 (i.e., 5%, 10% or 15% acceptance rate). We present different acceptance rates in order to check whether the acceptance rate has an impact on arbitrariness.
The committee chooses 30 referees to review the 100 papers and sends each referee 10 papers, so that each paper is read by 3 referees. Recall that Eq. (1) gives the true value of the paper, where S is soundness, C the contribution of the paper and I its innovativeness. We first explain the size of the coefficients, and then how S, C and I are generated.

The coefficients
The coefficients given to the various criteria can differ. They account for varying emphasis on certain aspects of the paper (innovation, contribution, and soundness): different reviewers, heads of project financing and conference committees each put the emphasis on different elements. Some may care more about innovation than contribution, or vice versa. Hence, we test five different sets of coefficients, presented in Table 3 and labelled coeff. 1-5. Below, we show that despite big differences in the values of the coefficients, the results are almost the same.

The value of the papers
There are three elements to be generated: the soundness, S; the scientific contribution, C; and the inventive part, I. We have generated these elements randomly. The elements S and C are generated from a normal distribution on the range (0, 100). The element I is generated from a 1/x distribution, reflecting the fact that there are many "below average" and "average" papers and only a few very innovative ones. The density plots of the values for the 100 projects, along with an explanation of why we chose these distributions, are presented in "Appendix 1".
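The value generation can be sketched as follows. The specific mean, standard deviation and supports below are illustrative assumptions, since the text only names the distribution families:

```python
import random

# Sketch of the value generation. The paper states that S and C follow a normal
# distribution on (0, 100) and I a 1/x density; the mean and standard deviation
# below are illustrative assumptions.
def draw_truncated_normal(mu=50.0, sigma=20.0, lo=0.0, hi=100.0):
    """Rejection-sample a normal draw restricted to the interval (lo, hi)."""
    while True:
        x = random.gauss(mu, sigma)
        if lo < x < hi:
            return x

def draw_one_over_x(lo=1.0, hi=100.0):
    """Inverse-CDF sample from a density proportional to 1/x on [lo, hi):
    most mass sits near lo, so very innovative papers are rare."""
    u = random.random()
    return lo * (hi / lo) ** u

random.seed(0)
papers = [(draw_truncated_normal(),   # S: soundness
           draw_truncated_normal(),   # C: contribution
           draw_one_over_x())         # I: innovation
          for _ in range(100)]
```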

The referees
The referees are picked at random from a long list of people. We picked 30 referees with different T_j's (time spent refereeing a paper) and I_j's (level of innovation homophily).
Note from Eq. (4) that a referee cannot accurately assess a paper whose degree of innovation exceeds his/her own; the same holds for the time spent by a referee relative to the contribution of the paper.

The selection of papers by the referees
We divide the 30 referees into 10 groups of 3 and allocate each group to 10 papers, so that each paper is refereed by 3 referees and each referee grades 10 papers. We repeat this exercise 10 times, to analyze how the choices of the referees differ across the 10 iterations. We simulate these 10 iterations for the 5 different coefficient sets of Table 3. So, in total, we had 200 different rankings of these 100 papers. We start by presenting the results of the 10 iterations for the coefficients equal to 1/3, which we have coined "coeff. 1".
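One simulation round of this allocation can be sketched as follows. This is a simplified illustration: the grouping scheme and parameter names are ours, and grading uses the capped valuation of Eq. (4):

```python
import random

# Sketch of one simulation round (an illustration of the allocation described
# above; parameter names are ours). papers: 100 tuples (S, C, I);
# referees: 30 tuples (T_j, Ihom_j).
def run_iteration(papers, referees, weights=(1/3, 1/3, 1/3), seed=None):
    """Shuffle the 30 referees into 10 groups of 3; group g grades papers
    10*g .. 10*g+9. Returns each paper's average grade over its 3 referees."""
    a, b, g = weights
    rng = random.Random(seed)
    order = list(range(len(referees)))
    rng.shuffle(order)
    avg_grades = []
    for block in range(10):
        group = order[3 * block: 3 * block + 3]   # the 3 referees of this block
        for i in range(10 * block, 10 * block + 10):
            S, C, I = papers[i]
            grades = [a * S + b * min(C, referees[j][0]) + g * min(I, referees[j][1])
                      for j in group]
            avg_grades.append(sum(grades) / len(grades))
    return avg_grades
```

Repeating `run_iteration` with different seeds reproduces the different iterations; ranking the average grades and keeping the top h gives each committee's selection.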
The results of the 10 iterations for coeff. 1: (1/3, 1/3, 1/3)

We present the results of iterations 1 and 2 in Tables 5 and 6, while the other iterations are presented succinctly in Table 7.

Table 5 reports the results of iteration #1. In the table, the top 5% papers are marked with *, the top 10% with ", and the top 15% with >. Column (2) lists the best papers, with their true values in parentheses. The best paper for α = 1/3 is #14, with a value of 71; the second best is #65. Note that in this draw there are no great papers, since #14 has a value of only 71 (see column 2); in reality, this might happen. For draws with papers of high quality, the results were similar: there is arbitrariness. Columns (3)-(5) list the papers chosen by the three referees.

The draw we obtained was, for instance, that in iteration 1, papers 1-10 are read by referees #19, #13 and #10, while in iteration 2 they are sent to referees #16, #22 and #29; referee #1 grades papers 21-30 in iteration 1 and papers 61-70 in iteration 2. Table 6 reports the results of iteration #2, and Table 7 summarizes the results of the 10 iterations for coeff. 1 based on the three referees' average ranking (see the notes of Table 5).

Table 5 presents the top 15 papers chosen by each of the three referees, as well as their average. Column (1) gives the ranking; column (2) lists the 15 best papers of the draw, with their values in parentheses.

Iteration #1
The papers chosen by the three referees in iteration #1 are presented in columns (3) to (5). In iteration 1, only paper #90 was recognized by all three referees as a top 5% paper. So, when the committee requires 100% agreement among the three referees, only one paper is accepted: a success rate of 1/5 in recognizing the best papers.
When the committee requires total agreement on the 10% best papers, there is agreement on papers #90, #48 and #85, i.e., 3/10. For a 15% acceptance rate, they agree on 6 out of 15 papers (#90, 48, 85, 14, 52, 21).
This means that total agreement is very difficult to obtain, exactly as in the example of the previous section. Therefore, most committees do not ask for total agreement, but rank the papers by the average of the referees' grades.
When the committee uses the average of the three referees' grades, three papers are recognized as top 5% (#90, 14, 21): a success of 60% (see column 6). For the top 10%, the committee chooses 8 of the right papers, a success of 80%; for the top 15%, the success is 87%. This seems to be a positive result.
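The two committee rules compared here, unanimity on the top-k short-list versus ranking by the average grade, can be sketched as small helpers. This is an illustrative implementation, not the paper's exact code:

```python
# Illustrative helpers for the two committee rules discussed above: requiring
# unanimity on the top-k short-list versus ranking papers by the average grade.
def consensus_picks(rankings, k):
    """Papers appearing in every referee's top-k list.
    rankings: one list of paper ids per referee, best first."""
    return set.intersection(*(set(r[:k]) for r in rankings))

def average_picks(grades_by_referee, k):
    """Top-k paper ids by mean grade over referees.
    grades_by_referee: one list of grades per referee, indexed by paper id."""
    n = len(grades_by_referee[0])
    mean = [sum(g[i] for g in grades_by_referee) / len(grades_by_referee)
            for i in range(n)]
    return set(sorted(range(n), key=mean.__getitem__, reverse=True)[:k])
```

Consensus short-lists shrink quickly as referees disagree, which is why averaging recovers more of the top papers in the iterations reported here.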
What happens when the committee sends the papers to different referees? Let us check iteration 2.

Iteration #2
In iteration 2, only paper #14 was recognized by all three referees as a top 5% paper (see Table 6, columns 3-5). So, when the committee asks for 100% agreement, there is only one paper they can agree on (#14).
What is striking is that the paper chosen in iteration 2 differs from the one chosen in iteration 1, where paper #90 was selected. In other words, there is arbitrariness.
When the committee uses the average of the grades given by the three referees, three papers are recognized among the top 5% (#21, 65, 14), a success of 60%. Strikingly, the success rate in iteration 1 was also 60%, but with different papers (#90, 14, 21), so that the two iterations fully agree on only two papers.
For the top 10%, the committee chose 7 of the right papers (versus 8 in the first iteration), a success of 70%; for the top 15%, the success is 80%.

The 10 iterations
The 10 iterations are summarized in Table 7, which reports the average of the grades given by the three referees. This average succeeds in finding the good papers with a probability of about 60%, but each iteration picks different papers.
To conclude and summarize our results: there is substantial arbitrariness in the peer review process. The simulations and iterations lead to the following results:

1. Not even one of the top 5 papers is accepted by all the committees across the 10 iterations. This means there is no robustness at all in the choice of papers, confirming the result of Pier et al. (2018), whose replication study of the NIH peer review process showed a very low level of agreement among reviewers in both their written critiques and their ratings.
2. The best paper (#14) is chosen among the top 5% by only 4 of the 10 committees.
3. Averaging the referees' grades is better than asking for consensus, but does not eliminate arbitrariness.
4. Each iteration (committee) succeeds in picking between 1 and 3 of the top 5% papers, which means that 2 to 4 'not-top' papers are selected as top: a mistake of 40-80%.
5. Increasing the acceptance rate (moving from the top 5% to the top 10% or top 15%) leads to accepting more papers. While reducing the tightness of selection for conferences is not too costly (one has to add more sessions at the same time, though admittedly big conferences are not easy to handle), for projects to be financed, increasing the number of funded projects could be almost impossible.
6. We have also checked the results for the 5 sets of coefficients presented in Table 3. Table 8 presents the results of the 10 iterations for coeff. 2, (1/4, 3/8, 3/8), based on the average of the 3 referees (see the notes of Table 5). As can be seen, the committees pick between 1 and 4 top papers, and again none of the top papers is chosen by all. We obtain similar results for coefficients 3-5.
7. The papers and projects with more innovation are the ones with the highest variance across the 10 iterations. In consequence, their probability of being accepted is low.
8. In conclusion, arbitrariness is a robust result of this paper.

Conclusions and policy remarks
Peer review has come under scrutiny in the last few years, and it is now acknowledged that the system is not optimal. This paper has focused on one of the problems: the arbitrariness of the selection of projects and papers through peer review.
The problem of arbitrariness has been raised in the past: the NIPS experiment raised the alert about the arbitrariness of the peer review process, underlining that changing reviewers leads to choosing different projects. Moreover, several previous studies have shown that reviewers' ratings do not correlate with the subsequent citations of the paper. This paper focuses on the reasons for the robustness of arbitrariness, by modeling the phenomenon and emphasizing that the heterogeneity of reviewers is its main cause. There are two main types of heterogeneity leading to arbitrariness: the first is homophily in the trait related to innovation, and the second is the time reviewers devote to peer review. We have stressed that heterogeneity in these two elements is sufficient to generate arbitrariness.
We have shown that with 10 different committees formed from the same 30 referees, but with a different draw of papers sent to each, and for an acceptance rate of 5%, there is agreement on only one paper out of five, i.e., 20% agreement. We have also underlined that changing the weights of the various criteria does not change the results: arbitrariness is a phenomenon inherent to peer review.
Can the problem be even more acute than arbitrariness? Unfortunately, yes. The second result emphasized by our paper is that the probability of accepting innovative papers is low. The peer review process leads to conformity, i.e., the selection of less controversial projects and papers. This may even influence the type of proposals scholars put forward, since scholars need to find financing for their research, as discussed by Martin (1997): "a common informal view is that it is easier to obtain funds for conventional projects. Those who are eager to get funding are not likely to propose radical or unorthodox projects. Since you don't know who the referees are going to be, it is best to assume that they are middle-of-the-road. Therefore, a middle of the road application is safer".

On the lack of correlation between reviewers' ratings and subsequent citations, see Ragone et al. (2013), Bartneck (2017), and Shah et al. (2018). On acceptance rates, note that with a 10% rate the share of not-top papers published is between 10 and 40%, while with a 15% rate it is between 13 and 33%. Obviously, when we increase the acceptance rate, we increase the number of "top" papers accepted, and the errors are tautologically reduced; when we accept all papers, the error is nil. So the decision about the acceptance rate is crucial, given the trade-off between arbitrariness and tightness.
Can we reduce arbitrariness and the bias against innovative projects? There are some alternative models, in particular Kovanis et al. (2017), Birukou et al. (2011) and Brezis (2007), which try to reduce the bias against innovative projects by introducing some randomness into the peer review process. More recent approaches suggest using a modified lottery to partially eliminate bias (see Avin 2015; Gross and Bergstrom 2019; Roumbanis 2019). Still, the problem persists.
Regarding arbitrariness, our model does not propose a panacea to the problems raised in this paper. Yet it can pinpoint solutions that would make things worse, such as increasing the number of criteria, which would increase the variance among reviewers.
In conclusion, it is not easy to improve the peer review process. But, to end on an optimistic note, it may be that artificial intelligence, which has been expanding in recent years, could revolutionize the peer review process. Maybe the revolution is at our gate.