Introduction

Peer review serves a gatekeeping function for the scientific community. It is the process through which research is selected for funding by subject matter expert review panels and communicated through publication in refereed journal articles. Peer reviewers judge proposals on the basis of multiple criteria. For example, when performing reviews for US federal scientific research organizations such as the National Institutes of Health (NIH 2016a) and the National Science Foundation (NSF 2017), reviewers typically determine an overall score, as well as scores for component criteria that usually include significance, innovation, methodology, investigators and the research environment. In a study of NIH proposals, Eblen et al. (2016) found that reviewers’ component scores for approach and methodology were most strongly related to the overall impact scores, followed by component scores for significance and innovation. In assessing the approach, it has been emphasized that reviewers focus on the “overall strategy, methodology, and analyses” to be “well-reasoned and appropriate to accomplish the specific aims of the project,” that “potential problems, alternative strategies, and benchmarks for success” are presented and that the strategy “establishes feasibility” and manages “particularly risky aspects” (NIH 2016a). Innovation, on the other hand, has been defined as the extent to which “the proposed activities suggest and explore creative, original or potentially transformative concepts” (NSF 2017) and to which the application “challenges and seeks to shift current research or clinical practice paradigms by utilizing novel theoretical concepts, approaches or methodologies, instrumentation, or interventions,” as well as whether the application is “a refinement, improvement, or new application of theoretical concepts, approaches or methodologies, instrumentation, or interventions proposed” (NIH 2016a). 
These sample definitions of innovation and approach/methodology suggest tension and perhaps competing, somewhat opposing values between these review criteria, particularly with regard to the tolerance of the riskiness of the project, i.e., whether the goals will likely be accomplished based on what has been proposed. This potential competition is reflected in the research community, e.g., between emphasizing innovative research which may have inherent risk of failure but potentially high scientific payoff (e.g., Sorlie et al. 2012) and the needs for replication, reproducibility, and smaller, incremental advances (Ioannidis 2005). Moreover, reviewers associate the overall score with approach more than they do innovation (Eblen et al. 2016), which may also result in the overall scores assigned to proposals correlating poorly with the citation impact of the funded projects, presuming highly innovative work yields greater citation success (Danthi et al. 2014; Gallo et al. 2014; Li and Agha 2015).

In addition to the theoretical tension between the definitions and potentially opposing values of an application’s level of innovation and the quality and riskiness of its approach/methodology, it may be that there is considerable variation in the interpretation of these two criteria by reviewers. Some literature has suggested significant variance in reviewer preference and even biases for different types of research (Lamont 2009; Lee et al. 2013; Travis and Collins 1991). Indeed, there is evidence to suggest a large degree of subjectivity in reviewer decision-making, as inter-reviewer reliability for the same grant has been low (Cole et al. 1981; Cicchetti 1991; Bornmann and Daniel 2008). While many scientists would agree that the most exciting, innovative research should be identified through peer review and subsequently funded, it is not at all clear that there is a consensus on how, in practice, reviewers decide what that innovation should look like. Some have suggested that reviewers may judge innovative research as being less rigorous due to its novelty; therefore, reviewers are less inclined to tolerate risks and evaluate innovative proposals favorably, which is supported by recent criteria score analyses (Luukkonen 2012; Eblen et al. 2016). While reviewers likely use their subject matter expertise to identify innovative projects, some explorations have suggested that even when expertise is controlled, highly novel proposals are routinely penalized in score (Boudreau et al. 2016). Thus, if one of the goals of funding agencies is to “foster fundamental creative discoveries, innovative research strategies, and their applications” (NIH 2013), this may not be achieved, potentially due to reviewer risk aversion, although it is unclear if reviewers perceive themselves as avoiding risk.

There is a paucity of data on how reviewers evaluate risk in peer review (or how they perceive their evaluations) as well as a dearth of general descriptions of the decision-making processes used by reviewers; what data does exist is largely on journal peer review, which is different in scope and purpose than grant peer review. Moreover, while criteria like innovation are standard in the review of most NIH and NSF grants, many applicants report bias against innovative projects, suggesting there is a discrepancy between reviewer and applicant perceptions, yet factors like reviewer and applicant attitudes toward risk have not been explored relative to peer review (Lee et al. 2013). In fact, only recently has there been a widespread call for a more robust and complete evidence base upon which to drive policy regarding the peer review of grant applications (Lauer and Nakamura 2015; Bohannon 2013; Rennie 2016). While there has been a steady increase in empirical literature on grant peer review in recent years, reviewer decision-making has been minimally explored and characterized.

The examination of reviewers’ perceptions of their evaluative practices would be particularly valuable for informing grant review processes, especially compared to applicants’ perceptions of reviewer feedback. The most obvious way to assess these perceptions is through a survey of scientists’ experience with peer review; however, surveys that currently exist in the literature center around the motivations and levels of participation of scientists in journal peer review, and not on their perceptions of their evaluations (Sense About Science 2009; Taylor and Francis 2016; Ware 2016; Ware and Monkman 2008). While some surveys have been conducted by funding agencies, they have only very generally examined reviewer/applicant differences in perceptions of review outcomes; there has not yet been a more in-depth exploration of perceptions of review content (NIH 2012; NSF 2014), particularly in terms of the evaluation of specific criteria. These data would be important to shed light on decision-making processes occurring within peer review and possibly provide a basis for enhanced reviewer training to align reviewer decisions with funding agency goals and to provide feedback to applicants that would enable improved future submissions.

In an effort to better understand reviewer motivations and experiences with grant peer review, a comprehensive peer review survey was developed to address these areas. The survey was sent to individuals in the American Institute of Biological Sciences’ (AIBS) database who had either participated on an AIBS peer review panel convened by the institute or served as a PI, investigator, collaborator, or consultant on an application that had been reviewed by AIBS. The goals of the survey were to gather much needed data on several elements related to applicant and reviewer perceptions about peer review in order to inform and refine future peer review practices.

Methodology

Survey

A 60-item questionnaire was developed for the peer review survey. The survey was divided into five sections, three of which are relevant for this manuscript: (1) demographics; (2) investigator attitudes toward grant review; and (3) reviewer attitudes toward grant review. The other sections largely dealt with levels and types of participation and motivations for reviewing, and thus were not analyzed as part of this study.

The questions were associated with dichotomous (yes/no), multiple choice, or rating scale (1–5) answer choices. Respondents were given the option to skip any question. In addition, text boxes were provided at the end of each section to allow respondents to elaborate on their responses. Based on beta testing, it was estimated that the survey would take 15 min to complete. A full copy of the peer review survey is available in “Appendix.”

One of the key areas addressed in this questionnaire was asking former applicants to identify the key areas of focus in the feedback they received from reviewers. Applicants were asked to indicate presence/absence from a list of standard criteria that were derived from those routinely present in NIH and NSF reviews as well as in previous reviews AIBS has conducted, including potential impact of research, the quality of the hypothesis, the research methodology, the innovation potential, the quality of the preliminary data, the responsiveness to the funding mechanism, statistical issues, the qualifications of the research team, and the appropriateness of the budget (NIH 2016a; NSF 2017).

Participants and procedures

The survey was administered through Limesurvey, a commercial, web-based survey package, and was disseminated to 16,875 scientists in AIBS’s proprietary database. AIBS has cultivated this database over many years to use as a crucial resource for recruiting for reviews it conducts on biomedical and other life sciences research proposals for a wide variety of federal funding agencies, private research institutes and non-profit research funders. Most of these reviews are ad-hoc, not recurring; a new set of scientists is recruited each peer review round to match the specific expertise needed for the given proposal set. AIBS review topic areas (and thus respondent expertise) vary from basic biological and biophysical explorations of mechanisms of disease and injury to more translational and clinical studies, including but not limited to psychology, neuroscience, microbiology, pharmacology, engineering, physical rehabilitation, social work, and clinical trials.

To assure the anonymity of the respondents, Limesurvey uses two separate databases that prevent identifying responses being linked back to participants. The survey was open for 2 months: the initial invitation to participate in the survey was sent on September 7, 2016, a reminder was sent a month and a half later (October 24/25, 2016), and the survey was closed on November 7, 2016. Once closed, the survey was no longer accessible to invitees or respondents.

The survey responses were exported through Limesurvey and analyzed using basic statistical packages. Descriptive statistics were used to characterize the results; tests for differences in proportions were used to examine differences between applicants’ and reviewers’ perceptions for parallel questions.

As some of these data are dichotomous in nature, phi correlation coefficients were calculated to measure correlations between such variables, as were the standard errors and significance levels of these relationships (Yule 1912). Chi-square tests of applicant/reviewer samples were used to compare the proportions of reviewers and applicants who perceived each criterion to have been used, and standard errors for these differences were calculated.
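To make the phi calculation described above concrete, the following is a minimal Python sketch, not the analysis code actually used for this study. It computes phi for a 2 × 2 table of two dichotomous variables, the equivalent chi-square statistic (χ² = Nφ² with 1 degree of freedom), a two-sided p value, and the approximate standard error of phi under the null hypothesis of no association (1/√N). The function name, the cell labels `a`, `b`, `c`, `d`, and the use of the 1/√N approximation are our assumptions for illustration; the cell counts in the usage lines are hypothetical and are not taken from the survey data.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]] of two yes/no variables.

    a = both present, b = first only, c = second only, d = neither.
    Returns (phi, chi2, p_value, se_null).
    """
    n = a + b + c + d
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    phi = (a * d - b * c) / denom
    # For a 2x2 table, the Pearson chi-square (no continuity correction) is N * phi^2.
    chi2 = n * phi ** 2
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Approximate standard error of phi under the null hypothesis (assumption).
    se_null = 1 / math.sqrt(n)
    return phi, chi2, p_value, se_null


# Hypothetical cell counts for two feedback areas among 677 applicants.
phi, chi2, p_value, se_null = phi_coefficient(80, 84, 143, 370)
print(f"phi={phi:.3f}, chi2={chi2:.2f}, p={p_value:.2g}, SE(null)={se_null:.3f}")
```

Under this approximation the standard error depends only on the sample size, which would explain a single standard error being shared by every pairwise correlation computed on the same set of applicants.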

Results

Survey response rate

Initially, 16,875 scientists in AIBS’s proprietary database were invited via email to participate in the survey. Of those 16,875 invitations, 2737 bounced back, leaving a total of 14,138 sent invitations, which comprised 13,781 unique individuals. These individuals had either participated on a peer review panel that had been convened by the institute (36%; N = 4986) or had been listed as a PI/investigator/collaborator/consultant/other on an application that had been reviewed through the institute (71%; N = 9716). About 12% (N = 1611) were both institute reviewers and applicants and 5% (N = 690) were institutional officials, administrative staff, etc. We therefore limited our base of invitees to the 13,091 individuals that received an invitation to participate in the survey and were not administrative officials. Of those individuals, 1231 responded, giving a response rate of 9.4%. As respondents were allowed to skip questions, 381 respondents had left at least one of the answers completely blank and were therefore removed from this analysis. A total of 850 had complete responses; those responses are analyzed below. It should be noted that for some questions, respondents were given the option “prefer not to answer,” or “not applicable” which was counted as a response.
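The invitation and response counts above can be checked with a short Python sketch. All counts are taken from the preceding paragraph; the variable names are ours and purely illustrative, and the share of respondents with complete answers is a derived quantity that is not reported in the text.

```python
# Counts reported in the text.
invited = 16_875            # invitations initially sent
bounced = 2_737             # invitations that bounced back
unique_individuals = 13_781 # unique people among delivered invitations
administrative = 690        # institutional officials, administrative staff, etc.
responded = 1_231           # individuals who responded
complete = 850              # respondents with complete responses

delivered = invited - bounced                    # invitations actually sent
eligible = unique_individuals - administrative   # base of invitees for the rate
response_rate = responded / eligible
complete_share = complete / responded            # derived, not reported in the text

print(f"delivered invitations: {delivered}")
print(f"eligible invitees:     {eligible}")
print(f"response rate:         {response_rate:.1%}")
print(f"complete responses:    {complete_share:.1%} of respondents")
```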

Survey respondent demographics

Overall, most respondents were Caucasian males over 50 years old with PhDs, working in academia at a mid to late career stage. Most reported working over 40 h per week. The AIBS database, from which the invited population came, does not have complete demographic information; therefore, we were unable to compare the demographics of the respondents to those of the invited population. These results are summarized in Table 1.

Table 1 Demographics of respondents

Grant submission levels

The majority (80%; N = 677) of respondents had actively submitted a grant in the last 3 years while 20% (N = 166) had not (7 chose not applicable). These proportions are fairly representative of the total invited population (71%). Respondents reported submitting a total of 3167 grants over the last 3 years. A histogram of the number of grants submitted per respondent (Gs) yielded a fairly clear multimodal distribution of grant submission frequencies, where nearly half of actively submitting applicants (44%) reported submitting on average 2 grants a year or more (Fig. 1).

Fig. 1

% Respondents versus frequency of Gs (grants submitted) and Gr (grants reviewed). Relative proportions of respondents are represented on the y-axis while numbers of grants submitted (Gs) or reviewed (Gr) in the last 3 years are represented on the x-axis. A total of 3167 grants were reported as submitted and 2523 grants as reviewed in this timeframe by 850 respondents

Nearly all of those who reported they submitted proposals had received feedback (91%; N = 615) while only 9% (N = 58) specified they had not (3 stated non-applicable). Of those who reported submitting proposals, 37% (N = 250) indicated that their last grant submission was funded while 60% (N = 407) indicated that it was not (19 stated non-applicable). It should be noted that unsuccessful applicants reported submitting slightly more applications on average (4.9 ± 0.10 applications) over the 3 year period as compared to successful applicants (4.4 ± 0.13 applications; t[655] = 2.52, p = 0.01).

Peer review participation

The majority of respondents (75%; N = 637) had actively served on a peer review panel in the last 3 years while 22% (N = 184) had not (29 chose not applicable). This participation rate is significantly higher than that of the total invited population (36%) and is likely due to the fact that the database only contains information relative to reviews conducted by AIBS, not those of funding agencies such as NIH or NSF. Thus, a reviewer could have reviewed for NIH but not for AIBS in this timeframe.

A total of 2523 reviews from the past 3 years were reported by reviewer respondents. In terms of the number of distinct grant reviews per respondent (Gr), a multimodal distribution existed whereby 30% of active reviewers reviewed on average twice a year or more (Fig. 1). Eighty-eight percent (N = 555) of active reviewers had submitted a grant (39% success rate) versus 65% (N = 119) of inactive reviewers (34% success rate). Thus, while those who review are more likely to have submitted a grant, the success rates are comparable.

Applicant perceptions of review criteria and the usage of innovation and risk

Applicant respondents were asked to indicate all of the areas addressed in the critique of their last grant application. It should be noted that 16% [N = 109] did not indicate that any criteria were addressed and 35% [N = 240] indicated only one criterion, with the remaining 48% indicating more than one criterion [N = 328]. These results are summarized in Table 2; the three most frequently addressed criteria were methodology, impact and innovation, respectively. However, methodology was much more frequently addressed than impact or innovation, with only 24% (N = 164) of applicant respondents indicating that recent review feedback addressed innovation potential. Importantly, both unsuccessful and successful applicants indicated that innovation was mentioned with similar frequency, suggesting funding success did not affect commentary on this criterion (Table 3). Similarly, age, gender, and degree did not seem to affect commentary on innovation (Table 3, proportions and N are shown, * indicates statistically significant correlation [p < 0.01], standard error ranged from 7.0 to 9.5%).

Table 2 Areas of feedback (N = 677)
Table 3 Applicant perceptions of criteria (demographics)

In addition, the majority (56%; N = 379) of applicants indicated that the reviewers did not comment on the riskiness of the research while only 27% [N = 185] stated that they did (113 stated the question was not applicable; Table 5). Again, the likelihood of comments about risk did not differ significantly by success, age, or degree (Table 3). However, a significant difference was observed by gender (χ²[1] = 8.44; p = 0.004), where female applicants were less likely to report comments on risk than male applicants.

Phi coefficients were calculated between paired feedback areas in order to determine which areas co-occurred among applicant respondents (Table 4; * indicates statistically significant correlation [p < 0.01], standard error for all correlations is 0.04). In general, correlation coefficients were small but significant between some areas. Feedback regarding innovation was significantly correlated with impact but neither impact nor innovation was correlated with methodology. Interestingly, feedback regarding the research team was correlated with receiving feedback in many other areas, including impact, preliminary data, hypothesis, statistical issues, funding mechanism and budget. Feedback regarding methodology was only correlated with feedback regarding preliminary data.

Table 4 Phi associations among areas of feedback

Reviewer perceptions of review criteria and the usage of innovation and risk

Active reviewers, when asked whether the evaluation criteria used at their last panel meeting were appropriate to judge the best science and move the field forward, provided an average rating of 2.3 ± 0.04 (1 = most appropriate to 5 = least appropriate) (27 indicated the question was not applicable). No statistically significant differences in perceived criteria appropriateness were found across reviewer gender or degree; however, reviewers aged 50 and older rated the review criteria as more appropriate than did reviewers less than 50 years of age (t[609] = 2.18, p = 0.03) (Table 6; * indicates statistically significant correlation [p < 0.01], standard error is reported).

Eighty-one percent of active reviewers (N = 519) indicated that they factored innovation into selecting the best science (14% [N = 92] did not and 26 indicated the question was not applicable; Table 5). Moreover, 70% of reviewers viewed innovation as an essential component of scientific excellence when evaluating grant applications, while only 27% (N = 212) did not (22 indicated the question was not applicable). Negligible differences in reviewer inclusion of innovation were found across gender, age and degree (Table 6; proportions and N are reported, standard error ranged from 3.4 to 6.6%).

Table 5 Views of review criteria (applicant vs. reviewer)
Table 6 Reviewer perceptions of criteria (demographics)

When asked whether the risks associated with innovative research impacted the scores they assigned to grant applications, 58% of reviewers (N = 372) indicated that they did while 35% (N = 223) indicated that they did not (42 indicated the question was not applicable; Table 5). Negligible differences in reviewer inclusion of risk were found across gender, age and degree (Table 6).

Fifty-seven percent (N = 365) of reviewers indicated that the PI’s track record tempered their assessment of any detected methodological weaknesses, while 38% (N = 243) indicated that it did not (29 indicated the question was not applicable). Similar to innovation and risk, negligible differences were found across gender, age and degree (Table 6).

Overall, these results suggest that reviewers were significantly more likely to report that they addressed components of innovation, risk, and the mitigation of risk through the PI’s track record in their reviews than applicants perceived these components to have been addressed in their critiques.
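As a rough illustration of this kind of applicant/reviewer comparison, the Python sketch below applies a two-proportion z-test to the risk counts reported above: 185 of 677 applicants reported comments on riskiness, and 372 of 637 active reviewers reported that risk affected their scores. It is a sketch under stated assumptions, not the analysis performed for this study. We assume that "not applicable" responses stay in the denominators (which reproduces the 27% and 58% quoted in the text) and that the two groups can be treated as independent samples, which is only approximate because some respondents were both applicants and reviewers. The function name is hypothetical. For a 2 × 2 comparison, the square of this z statistic equals the chi-square statistic without continuity correction.

```python
import math


def two_proportion_test(x1, n1, x2, n2):
    """Compare proportions x1/n1 and x2/n2 from two independent samples.

    Returns (difference p2 - p1, unpooled SE of the difference, z, two-sided p).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Pooled standard error, used for the test statistic under H0: p1 == p2.
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se_pooled == 0:
        raise ValueError("test is undefined when the pooled proportion is 0 or 1")
    z = (p2 - p1) / se_pooled
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Unpooled standard error, appropriate for reporting the difference itself.
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p2 - p1, se_diff, z, p_value


# Applicants reporting comments on risk vs. reviewers reporting that risk affected scores.
diff, se_diff, z, p_value = two_proportion_test(185, 677, 372, 637)
print(f"difference={diff:.3f}, SE={se_diff:.3f}, z={z:.2f}, p={p_value:.2g}")
```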

Discussion

Survey generalizability

Our response rate of 9.4% was low but comparable to those of similar surveys on journal peer review (7.7%: Ware and Monkman 2008; 10%: Sense About Science 2009; 5–10%: Taylor and Francis 2016) and exceeds the 2.2% rate reported in the recent PRC survey (Ware 2016), although none of these other reports were published in the peer-reviewed literature. In terms of demographics, the majority of our 1231 survey respondents were white males with PhDs who were 50 years or older, worked in academia, were largely tenured, and worked more than 50 h per week (Table 1). These demographics compare favorably to those reported in previous journal peer review surveys, such as the 2009 Sense About Science (SAS 2009) and the 2016 PRC survey (Ware 2016), in terms of gender (74% of the SAS and 70% of the PRC respondents were male) and place of work (66% of the SAS respondents worked in academia). In addition, they reflect the general gender distribution of publishing authors (Sugimoto et al. 2013). However, in terms of respondent age, the 2008 Ware and Monkman and 2016 PRC journal peer review surveys had more even distributions. The age distribution of our survey respondents is somewhat younger than the age distribution of NIH grant reviewers with NIH funding; however, it should be noted that funded reviewers are only a subsample of the NIH reviewer population (Rockey 2015).

Nevertheless, the majority of our survey respondents had submitted a grant in the last 3 years, and 38% of those who had submitted a grant reported funding success, which is higher than the 13–23% funding success rates reported for fiscal years 2013, 2014, and 2015 by NIH (Rockey 2015). Thus, it may be that our sample is representative of more senior researchers who have success with NIH funding and of those who comprise most NIH research panels (Etcheberrigaray 2014). Also, it should be noted that our sample of respondents had similar levels of grant submission to those of the total invited population, and while the respondents reported higher levels of peer review participation than the total invited population, it is likely that most of the reviews they refer to are non-AIBS related (e.g., NIH and NSF). Thus, we feel our respondent sample is generalizable to the larger grant reviewing population.

Attitudes toward innovation and risk

Respondents were asked to indicate the areas represented in the review feedback they received as applicants (it should be noted that reviewers were likely charged to address all of these criteria by the funding agency). Applicants indicated the three most frequent areas of reviewer feedback were methodology (53%), potential impact of research (33%), and innovation potential (24%) (Table 2). While those who received feedback on innovation were likely to also receive feedback on impact, feedback about innovation and methodology was unrelated (Table 4). These results very much align with the results from Eblen et al. (2016) as well as Rockey (2010), who also found the approach criterion to be a better predictor of overall score than the impact or innovation criteria. Yet, in spite of applicant reports, 81% of reviewers indicated that they factored innovation into selecting the best science (Table 5) and 70% viewed innovation as an essential component of scientific excellence, underscoring that applicant and reviewer respondents had significantly divergent perceptions of the use of review criteria.

Similar significant differences were also seen for the use of riskiness in review decisions, as only 27% of applicants received comments on the riskiness of their grant applications while 58% of reviewers indicated they took riskiness into account in their scores (Table 5). The focus on methods and lack of consideration of innovation and/or riskiness adds further evidence that reviewers may be (or at least be perceived as) risk-averse and also suggests that they may not be aware of this potential bias. While it is possible that reviewers may have taken riskiness into account in their scores, they do not seem to be clearly expressing these concerns in their review comments. It is more likely that the bias is real and that reviewers are simply unaware of their risk aversion, as previous studies have documented the penalization of highly innovative work in a well-controlled peer review system (Boudreau et al. 2016).

Some have suggested that highly innovative research may be associated with more reviewer uncertainty about their judgments of the methodology, which may lead to lower scores (Luukkonen 2012; Boudreau et al. 2016). While reviewers may consider innovation in their decisions, they likely give this less weight than methodological weaknesses, which is reflected in the content of the reviewer’s critique. Also, methodological weaknesses are usually more concrete, while estimations of innovativeness are more subjective and subject to individual perception, which likely influences the focus and length of the evaluation for each of these criteria. In addition, our results indicate the lack of a relationship between feedback concerning innovation and methodology (Table 4). Thus, the reviewer’s written critique will appear to applicants to focus on methodological issues. This differential weighting of criteria by reviewers to synthesize an overall score for the scientific merit of an application has been identified as commensuration bias (Lee 2015) and may explain why the vast majority of reviewers indicate they factored innovation into their critiques, despite the low frequency reported by applicants (Table 5). Such a bias suggests that it may not be possible to reliably focus on both innovation and methodological rigor; some have suggested distinct funding mechanisms to address this challenge (Gewin 2012; Ioannidis 2011).

However, it may also be that concerns about innovation and riskiness influence reviewer perceptions of other criteria. For instance, grant reviews are not double blinded (reviewers know applicants’ identities and are usually charged to evaluate the research team) and our results show that 57% of reviewers indicated that the PI’s track record tempered their assessment of any detected methodological weaknesses (Table 5). In fact, our analysis has indicated a significant relationship between the research team and many of the other criteria factors, suggesting that PI track records strongly influence reviewers’ perceptions of the value of the application (Table 4). Thus, while methodological weaknesses may be weighed more heavily than innovative ideas, reviewers’ risk tolerance may be enhanced by applicants’ track records of innovation and impact. Indeed, some studies have suggested that professional connections and knowledge of applicants’ work affect reviewer scoring (Gallo et al. 2016; Li 2015), although it is unclear if PI familiarity results in the funding of more innovative proposals. However, it should be noted that only 11% of applicants reported comments about the research team (Table 2), lending further support to a significant discrepancy between applicant and reviewer perceptions, which may establish a perception of prestige bias and cronyism among applicants.

In general, reviewer and applicant demographics and even funding success had little effect on respondent perceptions of review feedback, suggesting the pervasiveness of these discrepancies in the review process (Tables 3, 6). The exception seems to be with female applicants and their low level of reported comments on riskiness (Table 3). Some have suggested that a subtle gender bias exists in grant peer review where reviewers more often tend to see male applicants as leaders and score their applications more generously (Magua et al. 2017). If this is the case, it may be that reviewers temper their assessments of methodological weaknesses to a lesser degree for female applicants. More work is needed in this area to confirm whether this is true.

If reviewers are in fact biased, how might those who conduct peer review address biases? One suggestion has been to train reviewers to properly interpret and weigh the different review criteria to achieve the goals of the funding agency, journal, etc. (Lee 2015). However, when reviewers are simultaneously asked to assess feasibility and innovation, implicit risk aversion may result in resistance to achieving equitable weighing of component criteria. Also, there is likely considerable variation in risk preferences between individuals (Weber and Hsee 1999; Slovic 1999) and across different teams of peers (Gardner and Steinberg 2005). These risk preferences may be difficult to change with training, although to date, neither measurements of reviewers’ and review panels’ risk preferences nor the effects of training have been assessed.

Another important consideration is the lack of consistency in reporting of any of the review criteria areas in the feedback (Table 2), where despite an expectation that nearly every reviewer critique might include a focus on methodology, it is only reported 53% of the time. This variance may contribute to the low inter-rater reliability found in many peer reviews (Cole et al. 1981; Cicchetti 1991; Bornmann and Daniel 2008). Moreover, reviewers gave an average rating of 2.3 ± 0.04 (1 = most appropriate to 5 = least appropriate) with regard to the appropriateness of the criteria, with younger respondents finding the criteria less appropriate. However, it is not clear that reviewers have a consensus on which criteria are most appropriate and/or should be weighted the highest. There is clearly a tension between criteria like high risk–high reward innovation and sound methodology. While groups like NIH have created special funding mechanisms to support risky innovation (NIH 2016b), the review criteria for R01 grants still include both innovation and methodology as criteria (NIH 2016a). Thus, it may be that more discussion needs to take place among the scientific community as to whether current review criteria are appropriate for most reviews, if other criteria need to be considered, and if criteria should be prioritized to promote reviewer reliability.

One important absence in our analysis is the linkage of reviewer/applicant responses to the same proposal. This type of analysis would be very interesting but can only be accomplished through the willingness of the funding agency to allow access to the data to make such comparisons. Also, the lack of correlation between feedback concerning preliminary data and innovation is particularly surprising (Table 4), as often there is a lack of sufficient preliminary data to support highly innovative research approaches, so much so that many funding mechanisms to support innovative research lack the requirement for preliminary data. It may be that consideration of preliminary data does not weigh as heavily in reviewers’ minds as other criteria or, more likely, that it influences feasibility concerns in the methodology comments. More research should be conducted exploring reviewer consideration of preliminary data relative to other criteria.

In conclusion, it is apparent that more studies are needed to tease out reviewer decision-making processes in grant peer review, particularly with regard to risk tolerance. The differences seen in this study between the perceptions of grant reviewers and applicants have unearthed an interesting area that warrants further investigation if we are to better understand and optimize the peer review process.