Peer review at the center of the research process
Science has developed as a specific human activity, with its own rules and procedures, since the birth of the scientific method. The method itself has brought unique and abundant rewards, giving science a special place, distinct from other areas of human thinking. This was epitomized in the 1959 “two cultures” lecture, whose continued relevance is attested by countless reprints (see Snow 2012), which presented the natural sciences and the humanities as conflicting opposites, evidenced by their misunderstandings and the resulting animosity in the world of academia.
Famous studies of science (the first example that comes to mind is Kuhn 1996) have been carried out with the tools of sociology and statistics. In the meantime science, co-evolving with society, has developed and refined a set of procedures, mechanisms and traditions. The most prominent among them is the mechanism of paper selection through evaluation by colleagues and associates—peers, whence the name peer review—which is so ingrained in the everyday research process that scientists tend to forget its historical, immanent nature (Spier 2002).
Nowadays, simulation techniques, and especially social simulations, have been proposed as a new method to study and understand societal constructs. The time is ripe to apply them to science in general and to peer review in particular. Simulations of peer review are starting to blossom, with several groups working on them, while the whole field of scientific publication is undergoing remarkable changes due, for instance, to the diffusion of non-reviewed shared works, and to the open access policies enacted after the 2012 approval of the Finch report by the UK government. These are only symptoms of the widespread changes brought about by the information revolution, which has transformed the access to and dissemination of content in general, and of science in particular. In the future, the paper as we know it might be superseded by new forms of scientific communication (Marcus and Oransky 2011). The measures we use to evaluate scientific works will also keep changing. Consider for instance the impact factor, not yet fully superseded by the h-index, which in turn is likely to be substituted by alternatives such as the P_top 10 % index (Bornmann 2013), or complemented by usage metrics. The simplest of the latter, paper downloads, has already been shown to exhibit unique patterns; reading and citing, apparently, are not the same (Bollen et al. 2009).
In the meantime, collective filtering (e.g. reddit) and communal editing (e.g. Wikipedia) have found a way to operate without the need for the authority figures that constitute the backbone of peer review. Academia, on the contrary, still shows a traditional structure for the management of paper acceptance, maintaining a “peer evaluation” mechanism that is the focus of the present work. We note in passing that the peer review mechanism is only one of the many aspects of collective scientific work, which has been little studied as such (see Antonijevic et al. 2012 for a recent exception); it is only one of a number of intersecting feedback loops, serving, among other things, the purpose of quality assurance.
Peer review, as a generic mechanism for quality assurance, is contextualized by a multiplicity of journals, conferences and workshops with different quality requirements, a kind of multilevel selection mechanism where the pressure to improve quality is sustained through the singling out of a container—for now we will consider only journals as representative of this—and the consequent pressure to defend its public image (see Camussone et al. 2010, for a similar analysis).
Journals as aggregates of papers, then, play two other important roles. First, they are one of the most important (although indirect) elements in deciding researchers’ careers, at least for the current generation of researchers, who are often evaluated on the basis of their publications in selected journals. Second, journals play a role in the economy of science: their economic sustainability—or profit, as shown by the recent campaign on subscription prices—also depends on quality, but only indirectly; profits would be secured just as well in a self-referential system, based not on quality but on co-option.
At the core of the above intersecting feedback loops lies the mechanism of peer evaluation, which is based on custom and tradition, leveraging shared values and culture. The operation of this mechanism has been the target of much criticism, including accusations of poor reliability, low fairness and lack of predictive validity (Bornmann and Daniel 2005)—even if such assessments often lack a good term of comparison. Declaring a benchmark poor or low with respect to perfect efficiency is an idealistic perspective; a meaningful comparison should target other realistic social systems. All the same, we must mention that some studies have demonstrated the unreliability of the journal peer review process, in which the levels of inter-reviewer agreement are often low and decisions can hinge on procedural details. An example is reported by Bornmann and Daniel (2009), who show how late reviews—that is, reviews that arrive after the editors have made a decision—would have changed the evaluation result in more than 20 % of the cases. Although a high level of agreement among reviewers is usually seen as an indicator of the high quality of the process, many scientists see disagreement as a way of evaluating a contribution from a number of different perspectives, thus allowing decision makers to base their decisions on much broader information (Eckberg 1991; Kostoff 1995). Yet, in the current absence of widely accepted models, the scientific community has not hitherto reached a consensus on the value of disagreement (Lamont and Huutoniemi 2011; Grimaldo and Paolucci 2013).
Bias, instead, is agreed to be much more dangerous than disagreement, because of its directional nature. Several sources of bias that can compromise the fairness of the peer review process have been pointed out in the literature (Hojat et al. 2003; Jefferson and Godlee 2003). These have been divided into sources closely related to the research (e.g. the reputation of the scientific institution an applicant belongs to) and sources irrelevant to the research (e.g. the author’s gender or nationality); in both cases the bias can be positive or negative. Fairness and lack of bias are indeed paramount for the acceptance of the peer review process by both the community and the stakeholders, especially with regard to grant evaluation, as testified by empirical surveys.
Finally, the very validity of judgments in peer review has often been questioned against other measures of evaluation. For instance, evaluations from peer review show little or no predictive validity for the number of citations (Schultz 2010; Ragone et al. 2013). Also, anecdotal evidence of peer review failures abounds in the literature and in the infosphere; since mentioning them would feed a basic cognitive bias—readers are bound to remember the occasional failures more vividly than any aggregated data—we will not dwell on them, striking as they might be. Some of the most hair-raising recent failures (Wicherts 2011) have drawn attention to the emerging issue of research integrity, which converged in documents such as the Singapore statement. These documents highlight the hazard of cheating in science by exploiting the loopholes of peer review, a behavior that goes under the name of rational cheating.
Rational cheating is defined in Callahan (2004) as a powerful force, often reinforced by rational incentives for unethical conduct. Factors like large rewards for winners, the perception of diffused cheating, and limited or non-existent punishments for breaking the rules put pressure on researchers to work against the integrity and fairness required of them. Rational cheating is currently believed to be an important factor, potentially contributing to the failure of peer review and, with that, to the tainting of science by cronyism. In this work, we aim to contribute to the understanding of rational cheating in peer review. Before moving on, however, we point out that the peer review mechanism has never been proved—or, for that matter, disproved—to work better than chance. Indeed, at present, little empirical evidence is available to support the use of editorial peer review as a mechanism to ensure quality; but the lack of evidence on efficacy and effectiveness cannot be interpreted as evidence of their absence (Jefferson et al. 2002).
This might sound strange, given the central position of peer review in the scientific research process as it is today. Why this lack of attention to such a crucial component? The answer, besides habituation, is probably to be found in the nature of peer review as a complex system, based on internal feedback, whose rules are determined by tradition in a closed community. In this paper, we aim to contribute to an ongoing collective effort to understand peer review in order to improve it, preparing the ground for its evolution by reform.
Simulation of peer review
To model the dynamics of science, a broad range of quantitative techniques have been proposed, such as: stochastic and statistical models, system-dynamics approaches, agent-based simulations, game theoretical models, and complex-network models (see Scharnhorst et al. 2012, for a recent overview).
Depending on the aim and on the tool—that is, the type of modeling being used—a number of conceptualizations of science have been proposed that: explain statistical regularities (Egghe and Rousseau 1990); model the spreading of ideas (Goffman 1966), scientific paradigms (Sterman 1985) or fields (Bruckner et al. 1990); interrelate publishing, referencing, and the emergence of new topics (Gilbert 1997); and study the evolution of co-author and paper-citation networks (Börner 2010). In this paper we focus on a specific aspect of science: the peer review process applied to assess the quality of papers produced by scientists, aimed at the publication of textual documents.
The peer review process can be generally conceptualized as a social judgment process of individuals in a small group (Lamont 2009). Aside from the selection of manuscripts for publication in journals, the most common contemporary application of peer review in scientific research is the selection of fellowship and grant applications (Bornmann 2011). In the peer review process, reviewers sought by the selection committee (e.g. the editor of a journal or the chair of a conference) normally provide a written review and an overall publication recommendation.
Simulations and analysis of actual data on the review process are the ingredients that have been proposed to improve our understanding of this complex system (Squazzoni and Takács 2011). In this work, following an agent-based approach, we develop a computational model as a heuristic device to represent, discuss and compare theoretical statements and their consequences. Instead of using the classic “data-model-validation” cycle, we take advantage of the social simulation approach that, following Axelrod (1997), could be conceptualized as “model-data-interpretation.” In this approach, we use the computer to draw conclusions from theoretical statements that, once inserted in a complex system, have consequences that are not immediately predictable.
Agent-based modeling and simulation has been proposed as an alternative to the traditional equation-based approach for computational modeling of social systems, allowing the representation of heterogeneous individuals, and a focus on (algorithmic) process representation as opposed to state representation (Payette 2011). Research agendas and manifestos have appeared arguing that agent-based models should become part of the policy-maker’s toolbox, as they enable us to challenge some idealizations and to capture a kind of complexity that is not easily tackled using analytical models (Scharnhorst et al. 2012; Conte and Paolucci 2011; Paolucci et al. 2013).
An already classic example of the agent-based simulation approach to the study of science is presented in Gilbert (1997), where the author succeeds in finding a small number of simple, local assumptions (the model) with the power to generate computational results (the data) which, at the aggregate level, show some interesting characteristics of the target phenomenon—namely, the specialty structure of science and the power-law distribution of citations among authors (the interpretation).
A simulation with simple agents is performed in Thurner and Hanel (2011), where the authors propose a view of the reviewer as optimizing his or her own advantage. To this purpose, they define a submission/review process that can be exploited by a rational cheating (Callahan 2004) strategy in which the cheaters, acting as reviewers, reject papers whose quality would be better than their own. In that model, the score range for reviews is very limited (accept or reject) and in case of disagreement (not unlikely, since they allow only two reviewers per paper) the result is completely random. They find that a small number of rational cheaters quickly reduces the process to random selection. The same model is expanded by Roebber and Schultz (2011), focusing not on peer review of papers but on funding requests. Only a limited amount of funding is available, and the main focus is to find conditions in which a flooding strategy is ineffective. Unlike in this study and in Thurner and Hanel (2011), the number of cheaters is not explored as an independent variable. The main result obtained is a strong dependence of the outcomes on the mechanism chosen (e.g. number of reviews, unanimity, etc.). In Grimaldo et al. (2012), the authors introduce a larger set of scores and use three reviewers per paper; they analyze the effect of several left-skewed distributions of reviewing skill on the quality of the review process. They also use a disagreement control method for programme committee update in order to improve the quality of papers as resulting from the review process (Grimaldo and Paolucci 2013).
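To make the cheating dynamics concrete, the following minimal sketch in Python illustrates the decision rule just described: honest referees judge a paper against a quality threshold, rational cheaters reject any paper better than their own work, and a two-referee disagreement is resolved at random. The quality scale, the Gaussian parameters and the threshold value are our own illustrative assumptions, not the settings used by Thurner and Hanel (2011).

```python
import random

# Illustrative sketch of the review rule described above; the quality scale,
# the honest-referee threshold and the Gaussian parameters are assumptions.

def referee_decision(reviewer_quality, paper_quality, cheater, threshold=0.5):
    """Return True to accept the paper, False to reject it."""
    if cheater:
        # Rational cheater: reject any paper better than the reviewer's own work.
        return paper_quality <= reviewer_quality
    # Honest referee: accept papers above a fixed quality threshold (assumed).
    return paper_quality >= threshold

def editorial_outcome(decisions):
    """Two referees: unanimity decides; disagreement is resolved at random."""
    if all(decisions):
        return True
    if not any(decisions):
        return False
    return random.random() < 0.5

# Toy run: one paper, two referees, one of whom is a rational cheater.
paper_quality = random.gauss(0.5, 0.15)
referees = [(random.gauss(0.5, 0.15), False), (random.gauss(0.5, 0.15), True)]
decisions = [referee_decision(q, paper_quality, cheats) for q, cheats in referees]
print("accepted" if editorial_outcome(decisions) else "rejected")
```

Even in this toy form it is easy to see the mechanism behind their result: as the share of cheating referees grows, acceptance depends less on paper quality and more on coin flips triggered by disagreement.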
A similar approach has shown that there is a quantitative, model-based way to select among candidate peer-review systems. Allesina (2012) uses agent-based modeling to quantitatively study the effects of different alternatives on metrics such as speed of publication, quality control, reviewers’ effort and authors’ impact. As a proof of concept, the study contrasts an implementation of the classical peer review system adopted by most journals, in which authors decide the journal for their submissions, with a variation in which editors can reject manuscripts without review, and with a radically different system in which journals bid on manuscripts for publication. It then shows that even small modifications to the system can have large effects on these metrics, thus highlighting the complexity of the model.
Other researchers have considered reviewer effort as an important factor and have studied its impact on referee reliability. Their results emphasize the importance of homogeneity of the scientific community and equal distribution of the reviewing effort. They have also shown that offering material rewards to referees tends to decrease the quality and efficiency of the reviewing process, since these might undermine moral motives which guide referees’ behavior (Squazzoni et al. 2013).
Agent-based simulation and mechanisms
Agent-based simulation, even if its application to this class of problems (modeling science, and peer review in particular) is starting to spread, is still a controversial approach. As its practice grows, the chinks in its armor begin to show. These include a tendency to build ad hoc models (Conte and Paolucci 2011), the inability to perform proper sensitivity analysis for changes in the process as opposed to changes in parameters, the risk of overfitting, and oversimplification. All these issues are amplified by the self-referentiality of the simulation community and by the difficulty of defining what actually constitutes a result in silico.
To frame the problem, we can consider the practice of simulation as a discipline under pressure from two opposing forces. The first is the tendency to simplify to the extreme, which can produce models that are, at best, only incremental in complexity with respect to an equation-based model and, at worst, models that could easily be converted into equation-based ones by applying game-theoretical approaches or mathematical techniques such as the master equation of mean field theory (see Helbing 2010, for a review). The second force is the pressure for cognitive modeling, coming from researchers interested in sociology and social psychology, and from system designers interested in cognitively inspired architectures. They have an interest in devising complex architectures that, while plausible at the micro level as inspired by complete representations of cognitive artefacts and folk psychology, are prone to overfitting and difficult to verify.
Our view is that the first force is destructive for the agent-based field, as it simply pushes towards its reabsorption into the sphere of equation-based treatment. This would not be harmful in itself, especially as it would expand the technical and explanatory power of the equation-based field, but it could lead to failure by oversimplification. Moreover, discretization often produces results that differ, for example, from those produced by mean field approximation (see Edwards et al. 2003), even for simple models. The path of simplification is attractive and has long been promoted by researchers in the field under the Keep it simple, stupid! (KISS) slogan (Axelrod 1997, p. 5). We see a twofold problem here. First, a KISS approach contrasts with a descriptive approach, as argued in Edmonds and Moss (2005). But beyond losing part of the descriptive approach, a simplification mindset tends to drive out of simulations the part that characterizes them most: the ability to play with mechanisms (Bunge 2004), in the form of processes or algorithms, instead of playing just with distributions and parameters. We are hardly in a position to propose a solution to this tension between the simple and the complicated in the field of simulation. We can only suggest that a focus on mechanismic (Bunge 2004) explanation could counterbalance the attraction towards simple models, and propose—also thanks to tools that support that focus—an example of how different the results can be when the mechanisms employed differ.
A mechanism is originally defined as a process in a concrete (physical, but also social) system. Focusing on the mechanism implies paying attention to how the components of a system interact and evolve. A mechanism in a system is called an essential mechanism when it is peculiar to the functioning of that system, in contrast both to mechanisms that are common to other systems and to mechanisms that do not contribute to the system’s definition, that is, that could be modified or replaced without losing the specificity of the system. As the modeler will immediately recognize, an essential mechanism is the one that should be preserved through the process of abstraction. A mechanism, as Bunge contends, is essential for understanding; it is also essential for control.
In the approach that we propose, modeling for agent-based simulation is a matter of finding out, by abstraction and by conjecture, both the variables and the processes that characterize the individual agents. Thus, agent-based simulation is the ideal ground for experimentation with mechanisms, which we are going to perform in the rest of this paper. A warning, however, must be issued on our algorithmic interpretation of mechanisms, something that Bunge would not appreciate: he states that “algorithms are ... not natural and lawful processes ... can only imitate, never replicate, in silico some of what goes on in vivo” (Bunge 2004, p. 205). Nevertheless, in the context of the modeling activity, we maintain that our proposal improves on what could be seen as mere algorithmic variation: what we propose is indeed a set of algorithmic variations, but within a modeling approach that leans on individual-oriented, plausible micro foundations.
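As an illustration of what we mean by experimenting with mechanisms rather than with parameters only, the following Python sketch (all names and rules are hypothetical, not the model presented later in the paper) treats the review rule and the editorial decision rule as interchangeable procedures passed into the process; swapping one of them changes the process itself rather than a numeric value.

```python
import random
from typing import Callable, Sequence

# Schematic illustration of mechanism-oriented modeling: the review process takes
# interchangeable procedures (mechanisms), not only numeric parameters.
# All names and the specific rules below are illustrative assumptions.

ReviewMechanism = Callable[[float, float], bool]      # (reviewer_q, paper_q) -> vote
DecisionMechanism = Callable[[Sequence[bool]], bool]  # referee votes -> accept?

def honest_review(reviewer_q: float, paper_q: float) -> bool:
    return paper_q >= 0.5                 # accept above a fixed threshold

def cheating_review(reviewer_q: float, paper_q: float) -> bool:
    return paper_q <= reviewer_q          # reject papers better than one's own

def unanimity(votes: Sequence[bool]) -> bool:
    return all(votes)

def majority_random_ties(votes: Sequence[bool]) -> bool:
    yes = sum(votes)
    return random.random() < 0.5 if 2 * yes == len(votes) else 2 * yes > len(votes)

def run_review(paper_q: float, reviewer_qs: Sequence[float],
               review: ReviewMechanism, decide: DecisionMechanism) -> bool:
    votes = [review(rq, paper_q) for rq in reviewer_qs]
    return decide(votes)

# Swapping a mechanism changes the process, not merely a parameter value.
print(run_review(0.7, [0.4, 0.6], honest_review, unanimity))
print(run_review(0.7, [0.4, 0.6], cheating_review, majority_random_ties))
```

In this style, a sensitivity analysis can range over procedures as well as over numbers, which is precisely the kind of variation explored in the rest of the paper.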
A replication of peer review simulation
The tension between simple models, which lie open to mathematical modeling of some sort, and richer models, which reflect a larger part of the reality they want to describe but do not lend themselves to mathematical formalization and thus lose generality, has characterized the simulative approach since its inception. In this work, we wish to add another dimension to this tension. We show how a simple model, whose results we reproduce, is stable enough to be replicated with a completely different approach; however, when we add freedom in the mechanisms, rather than in the parameters only, the results may change qualitatively.
Specifically, we compare the results of a simple simulation of peer review with those obtained from a belief–desire–intention (BDI)-based one. The BDI model of rational agency has been widely influential in the context of artificial intelligence, and particularly in the study of multi-agent systems (MASs). This is because the model rests on strong philosophical foundations such as the intentional stance (Dennett 1987), the theory of plans, intentions and practical reasoning (Bratman 1999) and speech act theory (Searle 1979). These three notions of intentionality (Lyons 1997) provide suitable tools to describe agents at an appropriate level of abstraction and, at the level of design, they invite experimentation with different mechanisms.
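For readers unfamiliar with the BDI scheme, the following toy skeleton (our own illustration in Python, not the agent architecture actually used later in the paper) shows the three components and the perceive–deliberate–act cycle in which they interact.

```python
from dataclasses import dataclass, field

# A minimal, illustrative BDI-style skeleton (not the implementation used in this
# paper): beliefs are facts about the world, desires are goals, and intentions
# are the plans the agent commits to during deliberation.

@dataclass
class ReviewerAgent:
    beliefs: dict = field(default_factory=dict)     # e.g. {"paper_quality": 0.6}
    desires: list = field(default_factory=list)     # e.g. ["maximize_own_standing"]
    intentions: list = field(default_factory=list)  # plans the agent commits to

    def perceive(self, paper_quality: float) -> None:
        self.beliefs["paper_quality"] = paper_quality

    def deliberate(self) -> None:
        # Choose which desire to pursue given current beliefs (toy rule).
        if "maximize_own_standing" in self.desires:
            self.intentions = ["review_strategically"]
        else:
            self.intentions = ["review_honestly"]

    def act(self, own_quality: float) -> bool:
        # Execute the committed plan; return an accept/reject recommendation.
        if "review_strategically" in self.intentions:
            return self.beliefs["paper_quality"] <= own_quality
        return self.beliefs["paper_quality"] >= 0.5

agent = ReviewerAgent(desires=["maximize_own_standing"])
agent.perceive(0.8)
agent.deliberate()
print(agent.act(own_quality=0.55))  # the strategic reviewer rejects a better paper
```

The point of the architecture, for our purposes, is that the decision rule is not hard-coded into the simulation loop but emerges from the agent's own deliberation, which makes it natural to swap mechanisms at the level of plans.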
The simple-agent simulation we take as our starting point is the one proposed in Thurner and Hanel (2011). After replicating it with our BDI model, we explore some algorithmic variations inspired by the models examined above in the state of the art. Fig. 1 presents, in schematic form, the sources that influence our model in the rest of the work.
Replication is considered at once a neglected and an indispensable research activity for the progress of social simulation (Wilensky and Rand 2007). This is because models and their implementations cannot be trusted in the same way as we trust mathematical laws. On the contrary, trust in models can only be built by showing the convergence of different approaches and implementations. Replication is the best way to obtain cumulative scientific findings (Squazzoni 2012, chap. 4.2).
Here, we pursue a kind of qualitative matching that goes beyond both replication and docking (i.e. the attempt to produce similar results from models developed for different purposes; Wilensky and Rand 2007). Once qualitative congruence is obtained, a successful replication should demonstrate that the results of the original simulation are not driven by the particularities of the algorithms chosen. Our work shows that the original result is replicable but fragile.
The rest of the paper is organized as follows: the next section presents a new agent-based model of peer review that allows one to flexibly exchange the mechanisms performed by the entities involved in this process; "The model in operation" section shows the model in execution and describes the metrics we obtain from it regarding the number and quality of accepted papers; the "Results and comparison" section presents a qualitative replication of the peer review simulation by Thurner and Hanel (2011) and analyzes how the original results appear to be fragile against small changes in mechanisms. The "Discussion" section compares the findings obtained in the different scenarios. Finally, the "Conclusion" section summarizes the general lessons learned from the proposed redesign and identifies directions for future work.