Introduction

Science funding is predominantly issued by national governments, science agencies, and philanthropic institutes. Seeking to fund the best science and achieve the highest marginal return on investment, funding organizations often organize competitions to allocate a limited number of grants. In many of these competitions, scientists are invited to write and submit a proposal describing a future research endeavor along with a CV or summary of academic accomplishments. The funding organization then reviews these submissions and selects those deemed most worthy of funding (Wahls, 2019).

Participation in funding competitions comes with some benefits to the individual researcher. Writing a detailed research proposal forces one to critically reflect on one’s ideas and develop rigorous research plans that may be of value even if no funding is obtained (Barnett et al., 2017). In addition, the applicant receives valuable peer feedback that may lead to an improved research design. Science funding based on research proposals may also reduce organizations’ reliance on prior accomplishments when selecting awardees, and thus dampen Matthew effects in scientific careers (Bol et al., 2018; Merton, 1968).

These potential benefits notwithstanding, competing for funding through grant proposal writing is time-consuming (Ioannidis, 2011). A survey among scientists at top U.S. universities found that faculty spend about 8% of their total time on writing grant proposals and about 19% of the time available for research (Gross & Bergstrom, 2019). These percentages are likely to be higher at universities with lower endowments and in disciplines that require investments in expensive equipment or complex data collection efforts. Moreover, the costs of writing grant proposals are exacerbated by the low average funding rates in science funding competitions worldwide (Herbert et al., 2013). As budgets of funding agencies fail to keep up with the growth of science, the rate at which applications are funded keeps dropping (Lauer & Nakamura, 2015). The effort that goes into unfunded research proposals has been estimated to equal the total scientific value of funded research (Gross & Bergstrom, 2019). Proposal-based grant competitions are not only taxing on the applicant, but also on reviewers. A submitted proposal is typically reviewed by several panelists as well as multiple external reviewers (Bol et al., 2018), each sacrificing many hours of research time.

The high cost of proposal-based funding practices naturally raises the question of whether, under this status quo, funding agencies make better decisions than under a less demanding alternative regime that does not require detailed research proposals. A number of funding agencies are currently experimenting with less taxing decision systems, including lotteries (Adam, 2019; Avin, 2015; Fang et al., 2016; Ioannidis, 2011). Yet evidence on the returns to the use of detailed proposals is lacking. Some research has examined reviewer agreement and finds only low to moderate levels of agreement in assessments of grant applications (Cicchetti, 1991; Cole et al., 1981; Jayasinghe et al., 2003; Marsh & Ball, 1991; Mutz et al., 2012; Pier et al., 2018). While this suggests that proposal quality is not something academics readily agree upon, the funding decisions reached by diverse crowds may nonetheless be wise (Becker et al., 2017; Hong & Page, 2004; Lorenz et al., 2011). Another strand of research correlates aggregate evaluation scores with measures of scientific impact, netting out the impact of funding. Results are mixed: some studies find sizable correlations (Li & Agha, 2015), while others find them to be moderate to weak (Bol et al., 2018; Fang et al., 2016; Jacob & Lefgren, 2011; Wang et al., 2019). Moreover, the impact measures used in these studies may themselves be questioned on validity grounds (Bollen et al., 2009; Bornmann & Leydesdorff, 2013; Radicchi et al., 2008; Wang et al., 2013) and exclude forms of non-academic, societal impact (Eysenbach, 2011).

We circumvent the problem of measuring the quality of realized funding allocations by avoiding the direct assessment of decisions reached through proposal review. Instead, we ask whether the use of proposals makes reviewers evaluate grant applications differently compared to the scenario in which reviewers have no access to the research proposal. A necessary condition for proposals to lead to superior funding decisions that could not have been reached without them is that these decisions are at least different from the decisions that would have been made in their absence. We refer to such a difference as a proposal effect.

It is not obvious that proposals should have a substantial impact on how an application is evaluated. First, applicants with stronger CVs may write stronger proposals, making the variation in proposal quality redundant when reviewers have access to CVs. Second, research suggests that when quality is ambiguous or difficult to observe, evaluators base their judgments on status markers (Manzo & Baldassarri, 2015; Merton, 1968; Simcoe & Waguespack, 2011). Some controlled studies indeed confirm that in merit review evaluators rely on applicant seniority, past citations, and publication record (Waguespack & Sorenson, 2011). If the quality of grant proposals is ambiguous and reviewers fall back on quality signals from the CV, then again funding decisions with and without the proposal should be similar.

The procedures of many funding agencies nonetheless continue to rely heavily on proposal writing and review, under the implicit assumption of a substantial proposal effect. To evaluate the presence of a proposal effect, we first develop a model that derives the prediction of a proposal effect from explicit assumptions. We then discuss our empirical setting and the field experiment that we designed. The field experiment builds on the idea introduced above: if a proposal effect is present, an application should be evaluated differently with and without a full proposal. With the data from the field experiment we then test the hypothesis that two panelists will disagree more on the merit of an application when only one of them has access to the proposal than when both have access.¹

We investigate this question drawing on novel data from a field experiment conducted by the Dutch Research Council (NWO), the premier science funding organization in the Netherlands.² The experiment involves the first round of NWO’s 2018 Vidi competition for investigator awards of 800,000 euros, in which panelists make a preselection of promising applications. For the purpose of the experiment, NWO recruited duplicate “shadow” panelists from its Scientific Advisory Board (https://www.nwo.nl/en/scientific-advisory-board). Proposal texts were withheld from a random subset of shadow panelists, who rated applications only on the basis of the applicant’s CV and a one-paragraph proposal summary. This created two treatment conditions: a proposal condition and a no-proposal condition. We compare the extent to which evaluations of the applications in these conditions were aligned with the evaluations of the regular panelists.

In a series of tests, we find that withholding proposal texts from panelists did not substantially impact the evaluation of an application, as measured by comparing rankings and scores from the experimental conditions to those of the regular panelists. These results suggest that the resources devoted to writing and evaluating grant proposals may not have their intended effect of facilitating the selection of the most promising science.

Theory

Consider a sample of applications that are reviewed by panelists who either have access to a full proposal and CV (the proposal (P) condition) or only to a CV (the no-proposal (N) condition). Comparing these applications to the same set of applications reviewed by regular panelists creates two groups: (1) applications for which both panelists can read the proposal text (P–P) and (2) applications for which the proposal text is accessible to one panelist but not the other (P–N). We argue that when both panelists have access to the proposal text (P–P), there should be more agreement on the quality of the application than when only one has access (P–N).

The theoretical basis for our argument that agreement should be higher for an application evaluated in the P–P group than for one evaluated in the P–N group can be articulated in terms of two panelists j = 1, 2 who evaluate applications i = 1, …, I. Each application consists of a CV and a proposal text, with qualities Ci and Ti respectively, each normally distributed with zero mean.³ CV quality and proposal quality are measured on the same scale and therefore have the same variance. The quality of the CV and the proposal may be correlated, but not perfectly, as otherwise the CV is trivially a perfect substitute for the proposal and the omission of the proposal cannot be consequential.

In the P condition, a panelist j provides an evaluation XijP of application i that equally weighs the quality of the CV and that of the proposal, plus a normally distributed error EijP with zero mean:

$$X_{ij}^{P} = C_{i} + T_{i} + E_{ij}^{P}$$
(1)

In the N condition, a panelist j arrives at an evaluation XijN in the same way, except that they use the quality of the CV as their best guess of the quality of the proposal, again with a normally distributed error EijN with zero mean:

$$X_{ij}^{N} = \, 2C_{i} + E_{ij}^{N}$$
(2)

The Pearson correlation in panelists’ evaluations of applications from the P–P and P–N groups respectively then equals:

$$\text{Corr}\bigl(X_{i1}^{P}, X_{i2}^{P}\bigr) = \frac{2\,\text{Var}(C_{i}) + 2\,\text{Cov}(C_{i},T_{i})}{2\,\text{Var}(C_{i}) + \text{Var}\bigl(E_{i1}^{P}\bigr) + 2\,\text{Cov}(C_{i},T_{i})}$$
(3)
$$\text{Corr}\bigl(X_{i1}^{P}, X_{i2}^{N}\bigr) = \frac{2\,\text{Var}(C_{i}) + 2\,\text{Cov}(C_{i},T_{i})}{\Bigl(\bigl[2\,\text{Var}(C_{i}) + \text{Var}\bigl(E_{i1}^{P}\bigr) + 2\,\text{Cov}(C_{i},T_{i})\bigr]\bigl[4\,\text{Var}(C_{i}) + \text{Var}\bigl(E_{i2}^{N}\bigr)\bigr]\Bigr)^{1/2}}$$
(4)

The correlation for applications in the P–P group (3) will exceed that for applications in the P–N group (4) if⁴:

$$2\left[ {{\text{Var}}\left( {C_{i} } \right) \, - \, {\text{Cov}}\left( {C_{i} ,T_{i} } \right)} \right] \, > \, {\text{Var}}\left( {E_{i1}^{P} } \right) \, - \, {\text{Var}}\left( {E_{i2}^{N} } \right)$$
(5)

Inequality (5) will be met under the assumption that proposal evaluation is reasonably informative, which is the implicit rationale for the continued use of proposal writing and evaluation in many leading funding competitions. Proposal evaluation is informative if it measures something distinct from CV quality (a lower Cov(Ci, Ti), which increases the left side of inequality (5)) and if proposal quality is not in the eye of the beholder (a lower Var(Ei1P), which decreases the right side of inequality (5)). Panelist agreement on application evaluation will then be greater when both panelists evaluate applications in the P condition (P–P) than when only one does (P–N):

Hypothesis

Panelists’ evaluations of grant applications agree more when both have access to the proposal text than when only one has access.
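
To make the mechanics of the model concrete, the following sketch simulates Eqs. (1) and (2) and compares the resulting P–P and P–N correlations. It is purely illustrative: the variances, covariance, and sample size are assumed values chosen so that inequality (5) holds; they are not estimates from the NWO data.

```python
# Illustrative simulation of the model in Eqs. (1)-(5); all parameter values
# are assumptions for this sketch, not estimates from the NWO data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # simulated applications
var_c = var_t = 1.0              # CV and proposal quality on the same scale
cov_ct = 0.5                     # imperfectly correlated CV and proposal quality
var_e_p = var_e_n = 1.0          # panelist error variance in both conditions

# Draw correlated CV quality C and proposal quality T
cov = np.array([[var_c, cov_ct], [cov_ct, var_t]])
C, T = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Evaluations: Eq. (1) in the P condition, Eq. (2) in the N condition
x1_p = C + T + rng.normal(0, np.sqrt(var_e_p), n)   # panelist 1, reads the proposal
x2_p = C + T + rng.normal(0, np.sqrt(var_e_p), n)   # panelist 2, reads the proposal
x2_n = 2 * C + rng.normal(0, np.sqrt(var_e_n), n)   # panelist 2, no proposal

corr_pp = np.corrcoef(x1_p, x2_p)[0, 1]
corr_pn = np.corrcoef(x1_p, x2_n)[0, 1]
print(f"P-P correlation: {corr_pp:.3f}, P-N correlation: {corr_pn:.3f}")
# With these values 2*[Var(C) - Cov(C,T)] = 1.0 > Var(E^P) - Var(E^N) = 0,
# so inequality (5) holds and the P-P correlation exceeds the P-N correlation.
```

With these assumed values the P–P correlation is about 0.75 and the P–N correlation about 0.67, in line with the prediction of the hypothesis.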

In our statistical analysis we use two related measures of panelist agreement. Our first measure is the probability that two applications evaluated by two panelists are ranked concordantly, which is monotonically related to Kendall’s Tau statistic. Because the correlations in question pertain to bivariate normally distributed quantities, Kendall’s Tau increases monotonically in the correlation following 2arcsin(Corr(·))/π. It follows that any two applications are more likely to be ranked concordantly by two panelists when both have access to the proposal texts (P–P) than when only one does (P–N).

The second measure of agreement is the absolute difference in evaluations, | Xi1P − Xi2P | or | Xi1P − Xi2N |. For normally distributed variables, the mean absolute deviation equals √(2/π) times the standard deviation of the difference, which in turn decreases monotonically in Corr(Xi1P, Xi2P) and Corr(Xi1P, Xi2N), respectively. The expected absolute difference must therefore be smaller when both panelists have access to the same proposal than when only one has access.
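
Both agreement measures can be expressed as simple functions of the correlation between two panelists’ evaluations. The sketch below (with hypothetical function names, not the study’s code) illustrates the two mappings used above: the concordance probability implied by Kendall’s Tau for bivariate normal scores, and the expected absolute score difference for jointly normal evaluations.

```python
# Sketch of how the two agreement measures map onto a panelist correlation rho
# for (standardized) normally distributed evaluations; purely illustrative.
import numpy as np

def kendall_tau_from_corr(rho: float) -> float:
    """Kendall's tau implied by a bivariate-normal correlation rho: 2*arcsin(rho)/pi."""
    return 2 * np.arcsin(rho) / np.pi

def concordance_pct_from_corr(rho: float) -> float:
    """Expected percentage of concordantly ranked application pairs: 100*(tau + 1)/2."""
    return 100 * (kendall_tau_from_corr(rho) + 1) / 2

def mean_abs_difference(rho: float, sd1: float = 1.0, sd2: float = 1.0) -> float:
    """Mean |X1 - X2| for jointly normal scores: sqrt(2/pi) times the sd of the difference."""
    sd_diff = np.sqrt(sd1**2 + sd2**2 - 2 * rho * sd1 * sd2)
    return np.sqrt(2 / np.pi) * sd_diff

# Both measures move monotonically with rho: a higher correlation implies more
# concordant rankings and a smaller expected absolute score difference.
for rho in (0.3, 0.5, 0.7):
    print(rho, round(concordance_pct_from_corr(rho), 1), round(mean_abs_difference(rho), 2))
```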

Experimental design

The experiment was conducted in the Social Science & Humanities domain of NWO’s 2018 Vidi competition which consists of eight panels representing different disciplines (see Supplementary Information for further details). NWO duplicated these eight panels for the experiment. Each submitted application (N = 182) was assigned to two out of 58 regular panelists as part of the regular evaluation process as well as to two out of 41 shadow panelists from the corresponding shadow panel. Funding decisions were based only on regular panelist evaluations.

NWO matched both regular and shadow panelists to applications based on the similarity between proposal content and panelist expertise, panelist preferences, and conflicts of interest. NWO gave regular and shadow panelists guidelines and standard evaluation sheets and asked them to provide three scores on a scale of 1 (excellent) to 9 (bad): one for the quality of the researcher (the CV score), one for the quality, innovative character, and academic impact of the proposed research (the proposal score), and one for the potential for societal and economic utilization of the knowledge (the knowledge utilization score). NWO calculates an overall score as a weighted sum of the CV score (weight 0.4), the proposal score (weight 0.4), and the knowledge utilization score (weight 0.2).
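
For concreteness, the weighted overall score described above can be written as a one-line function. The weights come from the text; the function name and example values are our own illustration.

```python
# Minimal sketch of the overall score described above (weights 0.4/0.4/0.2);
# the function name and example inputs are illustrative assumptions.
def overall_score(cv: float, proposal: float, knowledge_utilization: float) -> float:
    """Weighted overall score on NWO's 1 (excellent) to 9 (bad) scale."""
    return 0.4 * cv + 0.4 * proposal + 0.2 * knowledge_utilization

# Example: a strong CV (2), a good proposal (3), an average knowledge utilization score (5)
print(overall_score(2, 3, 5))   # 0.8 + 1.2 + 1.0 = 3.0
```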

After shadow panelists were assigned to proposals, they were randomly assigned to an experimental condition using a randomized block design: within each shadow panel, half of the panelists were assigned to the proposal condition (P) and the other half to the no-proposal condition (N). The randomized block design ensures that applications are balanced across conditions within each panel, so that the treatment is orthogonal to panels. In line with our hypothesis, our analysis considers applications belonging to one of two groups: (a) applications assessed only in the proposal condition (the P–P group) and (b) applications assessed once in the proposal condition and once in the no-proposal condition (the P–N group). To ensure perfect balance in the composition of these two groups, for each application one evaluation always comes from a regular panelist and one from a shadow panelist. Table 1 provides a breakdown of applications and panelists by panel and condition. For example, the table shows that the CW panel had 6 regular panelists, who all naturally reviewed in the P condition, and 4 shadow panelists, of whom 2 were assigned to the P condition and 2 to the N condition. There are 21 cases in which an application in the CW panel was reviewed by at least one regular panelist in the P condition and at least one shadow panelist in the P condition, and 19 cases in which it was reviewed by at least one regular panelist in the P condition and at least one shadow panelist in the N condition.

Table 1 Number of panelists and applications per panel and condition/group
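
A minimal sketch of the within-panel random assignment described above is given below. The actual assignment was carried out by NWO; the panel names other than CW and the panelist counts are placeholders for illustration.

```python
# Sketch of the randomized block design: within each shadow panel, (roughly) half
# of the panelists are assigned to P and half to N. Panel names/sizes other than
# the CW example from the text are placeholders.
import numpy as np

def assign_conditions(panel_sizes: dict[str, int], seed: int = 0) -> dict[str, list[str]]:
    """Return a per-panel list of condition labels, balanced within each panel."""
    rng = np.random.default_rng(seed)
    assignments = {}
    for panel, n_panelists in panel_sizes.items():
        # For odd panel sizes the split is as even as possible
        labels = ["P"] * (n_panelists // 2) + ["N"] * (n_panelists - n_panelists // 2)
        assignments[panel] = list(rng.permutation(labels))
    return assignments

print(assign_conditions({"CW": 4, "PANEL-X": 6}))   # e.g. {'CW': ['N', 'P', 'P', 'N'], ...}
```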

Analytical strategy

We take two approaches to test our hypothesis. First, we evaluate panelist agreement on rankings. To this end, for each pair of applications in the P–N group reviewed by the same shadow panelist in the no-proposal condition, we determined which of the two applications received the better score⁵ and then took two evaluations of the same two applications by a panelist from the regular panel and determined whether the order of the scores was the same. Analogously, for each pair of applications in the P–P group evaluated by the same shadow panelist in the proposal condition, we determined which was evaluated better and computed how often panelists in the regular panel agreed with this ranking. Together there were 722 such comparisons. Ties were broken randomly. We measure agreement on rankings as the percentage of cases in which the rank orders in the shadow and regular panel agree. Our estimand for this approach is the difference in this agreement percentage between applications in the P–N group and applications in the P–P group. The rationale for this analysis is that, in the presence of a proposal effect and in line with our hypothesis, rankings of applications in the proposal condition should align more closely with rankings in the regular panel than rankings of applications in the no-proposal condition.
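
The ranking-agreement computation can be sketched as follows, assuming a long-format table with one row per application–shadow-panelist evaluation that also carries the matched regular panelist’s score. The column names and data layout are assumptions for illustration, not the study’s actual code, and tie-breaking is assumed to happen upstream.

```python
# Sketch of the concordance computation described above; data layout is assumed.
import itertools
import pandas as pd

def concordance_percentage(evals: pd.DataFrame) -> float:
    """Percentage of application pairs ranked in the same order by the shadow
    and the regular panelist.

    Expected columns: application, shadow_panelist, shadow_score, regular_score.
    Lower scores are better on NWO's 1-9 scale, but the order comparison is symmetric.
    """
    concordant, total = 0, 0
    for _, group in evals.groupby("shadow_panelist"):
        # All pairs of applications scored by the same shadow panelist
        for (_, a), (_, b) in itertools.combinations(group.iterrows(), 2):
            shadow_order = a["shadow_score"] < b["shadow_score"]
            regular_order = a["regular_score"] < b["regular_score"]
            concordant += int(shadow_order == regular_order)
            total += 1
    return 100 * concordant / total if total else float("nan")
```

Applying this function separately to the P–N and the P–P rows and taking the difference yields the first estimand described above.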

Second, we compare panelist disagreement on scores – which we measure as the absolute difference in scores between two panelists reviewing the same application – between applications in the P–N group and applications in the P–P group. Our estimand for this second approach is the difference in mean disagreement between applications in the P–N group and applications in the P–P group. In line with our hypothesis, we predict that average disagreement among panelists regarding the quality of an application will be more pronounced when only one of the two panelists has read the proposal compared to when both have read the proposal. The existence of such a difference in disagreement across the two groups would indicate a proposal effect in panelist judgment.

In evaluating panelist agreement on rankings and panelist disagreement on scores, we used nonparametric randomization tests. Panelists evaluated multiple applications, so we cannot assume independence of observations in any test for group differences across applications. Accordingly, we generated the sampling distribution of each of our estimands under the null hypothesis, i.e. the permutation distribution, by randomly reassigning the condition labels to panelists 1000 times. Specifically, we took the P and N labels in the shadow panels and randomly reassigned those labels to panelists. We only reshuffled the condition labels within panels, so that the block design was preserved. At each permutation we recalculated the estimand. We then calculated the two-sided p-value as the fraction of the 1000 permuted panelist assignments for which the estimand exceeded, in absolute value, its value in the non-permuted data.
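
The randomization test can be sketched as follows. The data layout, column names, and the estimand callable are assumptions for illustration; the estimand function is assumed to close over the evaluation data and to recompute the group difference given a relabelled panelist table.

```python
# Sketch of the within-panel permutation test described above; illustrative only.
import numpy as np
import pandas as pd

def permutation_p_value(panelists: pd.DataFrame,
                        estimand,               # callable: relabelled panelist table -> statistic
                        n_permutations: int = 1000,
                        seed: int = 0) -> float:
    """Two-sided permutation p-value, reshuffling P/N labels within shadow panels.

    Expected columns in `panelists`: panel, shadow_panelist, condition ('P' or 'N').
    """
    rng = np.random.default_rng(seed)
    observed = estimand(panelists)
    exceed = 0
    for _ in range(n_permutations):
        permuted = panelists.copy()
        # Reshuffle condition labels within each panel to preserve the block design
        permuted["condition"] = (
            permuted.groupby("panel")["condition"]
            .transform(lambda s: rng.permutation(s.values))
        )
        # Two-sided test: compare absolute magnitudes of the estimand
        if abs(estimand(permuted)) >= abs(observed):
            exceed += 1
    return exceed / n_permutations
```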

Results

First, we examined whether not being able to access the full proposal text altered the way a panelist ranked the applications they reviewed. A concordance percentage of 50% corresponds to random scoring and 100% to perfect agreement. We find that the percentage of concordant pairs in the P–N group (55.2%) is 3.7 points lower than that in the P–P group (58.9%) (see Table 2). The results of the randomization test, shown in Fig. 1, indicate that this difference in agreement between applications evaluated in the P–N group and those in the P–P group is not significant at the 5% level (two-sided p-value = 0.43). Table 2 shows that the rankings calculated separately for CV, proposal, and knowledge utilization scores also yield only small differences, none of which are significant (see Supplementary Fig. 1). Noteworthy is that when both panelists can read both proposals (P–P), they agree on which proposal is better only 53.4% of the time. This provides an explanation for the rejection of the hypothesis: it was derived under the assumption of informative proposal evaluations, and this assumption is not supported by the data.

Table 2 Panelist agreement on rankings
Fig. 1

The vertical line represents the observed difference (− 3.7%) between the percentage of concordant pairs in the P–N group (only regular panelists can read the proposal) and the percentage of concordant pairs in the P–P group (both shadow and regular panelists can read the proposal). White bars display the distribution of differences obtained from hypothetical re-randomized assignments of panelists to conditions. Because the difference in agreement in the unpermuted data is closer to zero than the 5% most extreme cases of the permutations, the analysis finds no statistically significant difference in agreement between groups at the 5% level

In the subsequent analysis, we examined the average disagreement levels in overall scores between the P–P and P–N groups. Table 3 shows disagreement among pairs of panelists in the evaluation of different elements of an application (rows) by proposal group (columns). Overall, panelist disagreement varied little between groups. As seen in column 1 of Table 3, the mean level of disagreement on the overall scores was 0.04 lower in the P–N group than in the P–P group. This difference is small compared to the standard deviations of the two groups (0.67 and 0.63, respectively). Comparing the actual group difference with the distribution of differences generated from reshuffled samples showed no significant difference between the two groups at the 5% level (two-sided p-value = 0.36) (Fig. 2). We conducted similar tests for disagreement on CV scores, proposal scores, and knowledge utilization scores, all of which yielded consistent results (see Supplementary Fig. 2).

Table 3 Mean levels of disagreement on scores by group (standard deviations in parentheses)

Overall, we conclude from these results in combination with the results of the ranking analysis that one panelist not being able to read a proposal does not lead that panelist to disagree more with the other panelist on the application’s merit. The main hypothesis is rejected.

Fig. 2

The vertical line represents the observed difference (− 0.04) in mean disagreement between the P–N group (only regular panelists can read the proposal) and the P–P group (both shadow and regular panelists can read the proposal). Disagreement is measured as the absolute difference in standardized overall scores between two panelists reviewing the same application. White bars display the permutation distribution of the difference in mean disagreement between the two groups, obtained from hypothetical re-randomized assignments of panelists to conditions. Because the difference in mean disagreement is closer to zero than the 5% most extreme cases of the permutations, the analysis finds no statistically significant difference in disagreement between groups at the 5% level

Discussion

We conclude that panelist assessment of an application changes little when the proposal text is omitted from it. Writing and evaluating proposals accounts for the lion’s share of the costs of grant peer review (Graves et al., 2011). Our findings suggest that funding agencies using single-blind panel review, at least in a pre-selection stage prior to external review, can expect to achieve similar candidate selections by screening on the basis of CV and proposal abstract only. We hasten to reiterate that writing a proposal may have intrinsic value to applicants even when it is not funded and, together with reviewer input, may improve the quality of the work ultimately done once funded.

Studies of Matthew effects in science funding suggest that an emphasis on CV in merit assessment will strengthen the self-reinforcing character of winning grants (Bol et al., 2018; Wang et al., 2019). However, our results indicate that the presence of a full proposal text may not substantially alter evaluative outcomes. In a system that preselects on CV and proposal abstract only, then, the Matthew effect would likely not be much stronger despite there being little to go on besides applicant reputation.

Several limitations of the present investigation deserve consideration. First, limited statistical power leaves open the possibility that writing a strong proposal does mildly increase an applicant’s chances of advancing to the next round. Our best estimate is that being able to read two proposals raises the chances by about four percentage points that a panelist will agree with another panelist who read both proposals on which of the two applications is the stronger one. This effect is small compared to the dominant role of chance associated with one’s application being assigned to two favorable panelists (Cole et al., 1981).

Second, one may wonder whether shadow panelists assessed applications less meticulously or were less committed to the appraisal process. While our analysis revealed no systematic differences along any scoring dimensions between regular and shadow panelists evaluating the same proposals, we cannot rule out that there are differences we were not able to detect.

Third, our investigation was limited to peer review in a funding competition for individual investigators. In such competitions the applicant’s CV may play a more dominant role than in other competition types. One may speculate that in competitions with collaborative proposals the proposal effect may be stronger, so that omitting the full proposal text would have a larger impact.

Finally, the experiment was limited to the initial scoring of candidates by panelists, preventing us from assessing a proposal effect in later stages of evaluation that involve expert reviewers. Nonetheless, even if a strong proposal effect exists in later rounds, most applications are already discarded in the preselection stage before the detailed description of the proposed research on which so much time was spent gets a chance to make a difference.