Critical mass and the dependency of research quality on group size
DOI: 10.1007/s11192-010-0282-9
- Cite this article as:
- Kenna, R. & Berche, B. Scientometrics (2011) 86: 527. doi:10.1007/s11192-010-0282-9
Abstract
Academic research groups are treated as complex systems and their cooperative behaviour is analysed from a mathematical and statistical viewpoint. Contrary to the naive expectation that the quality of a research group is simply given by the mean calibre of its individual scientists, we show that intra-group interactions play a dominant role. Our model manifests phenomena akin to phase transitions which are brought about by these interactions, and which facilitate the quantification of the notion of critical mass for research groups. We present these critical masses for many academic areas. A consequence of our analysis is that overall research performance of a given discipline is improved by supporting medium-sized groups over large ones, while small groups must strive to achieve critical mass.
Keywords
Critical mass in research, Research quality, Research policy, Research assessment exercise, Agence d’évaluation de la recherche et de l’enseignement supérieur, Research excellence framework, Research funding
Introduction
The capacity to assess the relative strengths of research groups is important for research institutes, funding councils and governments that must decide on where to focus investment. In recent years, there have been pressures to concentrate funding on institutions which already have significant resources, in terms of finances and staff numbers, due to an expectation that these produce higher quality research (Harrison 2009). On the other hand, advocates of competition argue for a more even spread of resources to also support pockets of excellence found in smaller universities. A central question in this debate (Harrison 2009) is whether there exists a critical mass in research and, if so, what is it? Here we show that research quality is indeed correlated with group size and that there are, in fact, two significant, related masses, which are discipline dependent. The critical mass marks the size below which a group is vulnerable to extinction and there is also a higher value at which the correlation between research quality and group size reduces. We present a model based on group interactions and determine the critical masses for many academic disciplines.
The notion of relative size of research groups varies significantly across subject areas. Larger research groups are common in experimental disciplines while theorists tend to work in smaller teams or even individually. Critical mass may be loosely described as the minimum size a research team must attain for it to be viable in the longer term. This is subject dependent. For example, life would clearly be very difficult for a single experimental physicist, while a pure mathematician may be quite successful working in isolation. Once critical mass is achieved, a research team has enhanced opportunities for cooperation as well as improved access to more resources. Compared to pure mathematicians, one expects that a greater critical mass of experimental physicists is required to form a viable research team.
Indeed, in recent years there has been a tendency in some countries to concentrate resources into larger research groups at the expense of smaller ones and to encourage smaller teams to merge, both within and between institutions (Harrison 2009). The question arises as to what extent conglomeration of research teams influences research quality. Larger teams may have an advantage in terms of environmental dynamism (collaboration fosters discussion and vice versa) and reduced individual non-research workloads (such as teaching and administration). Here, we quantify these intuitive notions and the effectiveness of conglomeration of research teams.
This work is primarily based upon the measures of research quality as determined in the UK’s Research Assessment Exercise (RAE). Although the data are primarily UK based, this analysis should be of widespread interest (King 2004). For example, the French assessment system, which is performed by the Agence d’Évaluation de la Recherche et de l’Enseignement Supérieur (AERES), is attempting to move towards a more accurate methodology, similar to the UK system. Therefore, to check the generality of our analysis, we compare the results of the UK’s RAE with those of the French equivalent.
4*: Quality that is world-leading in terms of originality, significance and rigour
3*: Quality that is internationally excellent in terms of originality, significance and rigour but which nonetheless falls short of the highest standards of excellence
2*: Quality that is recognised internationally in terms of originality, significance and rigour
1*: Quality that is recognised nationally in terms of originality, significance and rigour
Unclassified: Quality that falls below the standard of nationally recognised work
In the AERES evaluation system France is geographically divided into four parts, one part being evaluated each year. The 2008 evaluation, which is the most recent for which data are available, is considered more precise than the previous exercise and facilitates comparison with the British approach. However, since only 41 different institutions were evaluated (and of them, only 10 were traditional universities), the amount of data available for the French system is lower than for the UK equivalent. Furthermore, only a global mark is attributed to cumulated research groupings which can include several teams with heterogeneous levels. As a consequence, we lose the fine-grained analysis at the level of the research teams. This is clearly a weak point compared to the British system of evaluation and the AERES intends to change it in the near future. Nonetheless, in order to make the comparison with the British system, we translate the AERES grades A+, A, B, C into 4*, 3*, 2*, and 1*.
From the outset, we mention that there are obvious assumptions underlying our analysis and limits to what it can achieve. We assume that the RAE scores are reasonably robust and reliable (an assumption which is borne out by our analysis). We cannot account for collaborations between academia and industry, and we have omitted the engineering disciplines from this report as no clear patterns were discernible. Nor can we account for managerial tactics whereby assessed researchers are relieved from other duties such as teaching and administration. Factors such as these amount to (sometimes considerable) noise in the systems and it is remarkable that, despite them, reasonably clear and quantitative conclusions can be drawn.
The relationship between quality and quantity
It is reasonable to expect that both the size and the quality of a research group are affected by a multitude of factors: the calibre of individual researchers, the strength of communication links between them, their teaching and administrative loads, the quality of management, the extent of interdisciplinarity, the equipment used, whether the work is mainly experimental, theoretical or computational, the methodologies and traditions of the field, library facilities, journal access, extramural collaboration, and even previous successes and prestige factors. We will show that of all these and other factors, the dominant driver of research quality is the quantity of researchers that an individual is able to communicate with. Here, we develop a microscopic model for the relationship between quality and quantity. It will be seen through rigorous statistical analyses that this model captures the relationship well and that quantity of intra-group communication links is the dominant driver of group quality. Other factors then contribute to deviations of the qualities of individual groups from their size-dependent averages or expected values.
Denote the strength of the ith member of group g by \({a_g}_i.\) This parameter encapsulates not only the calibre of the individual and prestige of the group, but the added strength that individual enjoys due to factors such as extramural collaboration, access to resources, teaching loads, etc. It does not encapsulate cooperation between individual i and individual j, say. We address such factors below.
That the impression coming from Fig. 1a is not the full picture is demonstrated in Fig. 1b, where the quality s for applied mathematics is plotted against group size N. Clearly the distribution is not entirely random and there is a positive correlation between quality s and the team size N: larger teams tend to have higher quality. To understand the reason for the correlation between quality and group size, we must consider research groups as complex systems and take intra-group interactions into account.
The current view within the physics community is that a complex system is either one whose behaviour crucially depends on the details of the system (Parisi 1999) or one comprising many interacting entities of non-physical origin (http://194.44.208.227/~hol/research.html). Here, the second definition is apt. In recent years, statistical physicists have turned their attention to the analysis of such systems and found applications in many academic disciplines outside the traditional confines of physics. These include sociology (Galam 2004, Kondrat and Sznajd-Weron 2010), economics (Biely et al. 2009), complex networks (Dorogovtsev and Mendes 2003) as well as more exotic areas (Bittner et al. 2007, Nußbaumer et al. 2009). Each of these disciplines involves cooperative phenomena emerging from the interactions between individual units. Microscopic physical models, mostly of a rather simple nature, help explain how the properties of such complex systems arise from the properties of their individual parts.
However, since two-way communication can only be carried out effectively between a limited number of individuals, one may further expect that, upon increasing the group size, a saturation or breakpoint is eventually reached, beyond which the group fragments into sub-groups. We denote this breakpoint by N_{c}.
In Eqs. 5 and 6, the parameters \(\bar{a}_g, \bar{b}_g, \alpha_g\) and β_{g} represent features of individual groups labelled by g. Averaging these values gives a representation of the expected or average behaviour of the groups in the discipline to which group g belongs. We denote these averages as a, b, α and β, respectively.
Having defined large teams as those whose size N exceeds the breakpoint N_{c}, we next attempt to quantify the meaning of the hitherto vague term “critical mass” (Harrison 2009). This is described as the value N_{k} of N beneath which research groups are vulnerable in the longer term. We refer to teams of size N < N_{k} as small and groups with N_{k} < N < N_{c} as medium. To determine N_{k}, one may ask: if funding is available to support new staff in a certain discipline, is it more globally beneficial to allocate them to a small, medium or large team?
Here we have considered the question of where to allocate M new staff, should they become available to a particular research discipline. We may also ask the complementary question: if the total number of staff in a given area is fixed, what is the best strategy, on average, for transferring them between small/medium and large groups? It turns out that incremental transfer of staff from a large group to a medium one (one which already exceeds critical mass) increases the overall strength of the discipline (Kenna and Berche 2010). A sub-critical group must, however, achieve critical mass before such a move is globally beneficial.
To summarize, based upon the notion that two-way collaborative links are the main drivers of quality, we have developed a model which classifies research groups as small, medium and large. If external funding becomes available, it is most beneficial to support groups which are medium in size. Likewise, the incremental transfer of staff from large groups to medium ones promotes overall research quality (Kenna and Berche 2010). On the other hand, small groups must strive to achieve critical mass to avoid extinction. Our approach involves piecewise linear models since these are easily interpreted as representing the effects of collaboration and of pooling of resources. We consider other models (such as higher-degree polynomial fits) less appropriate as they are not so easily interpreted. Similarly, although the opposite causal direction (quality leads to increase in group size) no doubt plays an important role in the evolution of research teams, this is less straightforward to interpret in a linear manner. (Indeed, on the basis of the empirics we shortly present, we shall see that this is not the dominant driver of group evolution.) With these caveats in mind, we proceed to report on the analysis of various subject areas as evaluated in the UK’s RAE.
Analysis of various subject areas
We begin the analysis with the subject area with which we are most familiar, namely applied mathematics, which includes some theoretical physics groups (Fig. 1b). For applied mathematics, the smallest group consisted of one individual and the largest had 80.3, with the mean group size being 18.9. (Fractional staff numbers are associated with part-time staff.) As stated, the first observation from the figure is that the quality s indeed tends to increase with group size N. The solid curve in Fig. 1b is a piecewise linear regression fit to the data and the dashed curves represent the resulting 95% confidence belt for the normally distributed data. The fitted breakpoint N_{c} = 12.5 ± 1.8 splits the 45 research teams into 16 small/medium groups and 29 large ones. The value for the critical mass is calculated from (10) to be N_{k} = 6.2 ± 0.9 which, based on experience, we consider to be a reasonable value for this subject area.
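The piecewise linear regression described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual fitting procedure: it assumes a continuous two-segment model s = c0 + c1·N + c2·max(N − N_c, 0), scans a user-supplied grid of candidate breakpoints, and keeps the one with the smallest sum of squared residuals. The function names and the synthetic data are hypothetical and are not taken from the RAE.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, size))
        solution[i] = (aug[i][size] - tail) / aug[i][i]
    return solution


def fit_with_breakpoint(sizes, qualities, breakpoint):
    """Least-squares fit of s = c0 + c1*N + c2*max(N - breakpoint, 0)."""
    rows = [[1.0, n, max(n - breakpoint, 0.0)] for n in sizes]
    # Normal equations (X^T X) c = X^T s for the three basis functions.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xts = [sum(r[i] * s for r, s in zip(rows, qualities)) for i in range(3)]
    coeffs = solve_linear_system(xtx, xts)
    residual = sum(
        (s - sum(c * b for c, b in zip(coeffs, r))) ** 2
        for r, s in zip(rows, qualities)
    )
    return coeffs, residual


def fit_piecewise_linear(sizes, qualities, candidates):
    """Scan candidate breakpoints; return (breakpoint, coeffs, residual) of the best."""
    best = None
    for nc in candidates:
        below = {n for n in sizes if n <= nc}
        above = {n for n in sizes if n > nc}
        # Skip candidates that leave too few distinct sizes on either side,
        # since the three-parameter fit would then be ill-defined.
        if len(below) < 2 or len(above) < 2:
            continue
        coeffs, residual = fit_with_breakpoint(sizes, qualities, nc)
        if best is None or residual < best[2]:
            best = (nc, coeffs, residual)
    return best


# Synthetic, noise-free data (hypothetical): steep rise up to N = 10, shallow beyond.
sizes = [float(n) for n in range(1, 21)]
qualities = [0.1 + 0.04 * n if n <= 10 else 0.5 + 0.005 * (n - 10) for n in sizes]
candidates = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

best_nc, best_coeffs, best_residual = fit_piecewise_linear(sizes, qualities, candidates)
slope_below = best_coeffs[1]                   # slope for N <= breakpoint
slope_above = best_coeffs[1] + best_coeffs[2]  # slope for N > breakpoint
print(best_nc, slope_below, slope_above)
```

With real data one would also want error estimates on the breakpoint and the confidence belt mentioned above; this sketch omits both and only locates the best-fitting breakpoint on the chosen grid.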
Table 1. The results of the analysis of the RAE for a variety of academic disciplines. Here n represents the number of research groups in each area, the critical mass for which is estimated to be N_{k}.
Subject | n | N_{k} = N_{c}/2 | R^{2} | P_{m} | \(P_{b_1-b_2}\) | \(P_{b_2}\) |
---|---|---|---|---|---|---|
Applied mathematics | 45 | 6.2 ± 0.9 | 74.3 | <0.001 | <0.001 | 0.001 |
Physics | 42 | 12.7 ± 2.4 | 53.0 | <0.001 | 0.003 | 0.098 |
Geography, Earth & environment | 90 | 15.2 ± 1.4 | 65.9 | <0.001 | <0.001 | 0.627 |
Biology^{b} | 51 | 10.4 ± 1.6 | 53.7 | <0.001 | <0.001 | 0.096 |
Chemistry | 31 | 18.1 ± 6.4 | 62.1 | <0.001 | 0.206 | 0.026 |
Agriculture, veterinary, etc.^{b} | 30 | 4.9 ± 1.4 | 52.3 | <0.001 | 0.115 | 0.045 |
Law^{a} | 67 | 15.4 ± 1.9 | 70.8 | <0.001 | <0.001 | 0.113 |
Architecture & planning | 59 | 7.1 ± 1.4 | 33.4 | <0.001 | 0.014 | 0.261 |
French, German, Dutch & Scandinavian | 62 | 3.2 ± 0.4 | 49.8 | <0.001 | 0.004 | 0.008 |
English language and literature | 87 | 15.9 ± 1.4 | 73.6 | <0.001 | <0.001 | 0.407 |
Pure mathematics^{c} | 37 | ≤2 | 29.1 | <0.001 | | |
Medical sciences | 82 | 20.4 ± 4.0 | 27.5 | <0.001 | 0.006 | 0.118 |
Nursing, midwifery etc. | 103 | 9.2 ± 2.2 | 19.7 | <0.001 | 0.017 | 0.364 |
Computer science 1 | 81 | 24.6 ± 5.0 | 45.0 | <0.001 | 0.007 | 0.954 |
Computer science 2 | 81 | 16.3 ± 4.3 | 43.3 | <0.001 | 0.014 | 0.065 |
Computer science 3 | 81 | 5.6 ± 2.4 | 40.9 | <0.001 | 0.252 | <0.001 |
Archaeology 1 | 26 | 12.7 ± 1.6 | 74.9 | <0.001 | <0.001 | 0.816 |
Archaeology 2 | 26 | 8.5 ± 1.2 | 74.7 | <0.001 | <0.001 | 0.154 |
Economics & econometrics | 35 | 5.3 ± 1.4 | 59.1 | <0.001 | 0.091 | <0.001 |
Business & management | 90 | 23.8 ± 3.8 | 60.4 | <0.001 | <0.001 | 0.042 |
Politics & international studies^{b} | 59 | 12.5 ± 2.1 | 53.8 | <0.001 | 0.001 | 0.115 |
Sociology | 39 | 7.0 ± 1.6 | 50.7 | <0.001 | 0.086 | 0.010 |
Education | 81 | 14.5 ± 2.2 | 55.7 | <0.001 | <0.001 | 0.336 |
History^{b} | 83 | 12.4 ± 2.3 | 49.7 | <0.001 | <0.001 | 0.054 |
Philosophy & theology | 80 | 9.5 ± 1.5 | 49.7 | <0.001 | <0.001 | 0.574 |
Art & design^{b} | 71 | 12.5 ± 3.7 | 19.6 | 0.002 | 0.019 | 0.616 |
History of art, performing arts, communication studies and music | 172 | 4.5 ± 0.8 | 27.5 | <0.001 | 0.011 | 0.006 |
In Table 1, various statistical indicators are listed. The P_{m} are the P values for the null hypothesis that there is no underlying correlation between s and N. Also, \(P_{b_1-b_2}\) are the P values for the hypothesis that the slopes coincide on either side of the breakpoint. Small values of these indicators (below 0.05, say) indicate that we may reject these null hypotheses. The large value of \(P_{b_1-b_2}\) for chemistry (Fig. 2d) suggests that the null hypothesis of coinciding slopes cannot be rejected, resulting in the large error bar reported for the critical mass in Table 1. Indeed, a single-line fit to all chemistry data yields a P value for the model of less than 0.001 and R^{2} = 59.6. In this case the size of the smallest group submitted to RAE was N = 10, which is larger than for the other subjects considered (the minimum group size in physics was N = 1, and for biology it was N = 3) and indicates that chemistry in the UK is dominated by medium/large groups, most small ones having already petered out.
The results for pure mathematics are presented in Fig. 3f and merit special comment. In pure mathematics no breakpoint was detected and the data are best fitted by a single line s = a + bN. The relatively small slope and large intercept of this fit are similar to the corresponding values for large groups in applied mathematics. This suggests that the set of the pure mathematics data may be interpreted as also corresponding to large groups. In this case N_{c} may be interpreted as being less than or equal to the size of the smallest group submitted to the RAE, which was 4, so that the critical mass N_{k} is 2, or less. This indicates that local cooperation is less significant in pure mathematics, where the work pattern is more individualised. This is consistent with experience: in pure mathematics publications tend to be authored by one or two individuals, rather than by larger collaborations.
In computer science, three competing candidates for the breakpoint were found, each with similar coefficients of determination. These are N_{c} = 49.1 ± 10.0 with R^{2} = 45.0, N_{c} = 32.5 ± 8.5 with R^{2} = 43.3, and N_{c} = 11.3 ± 4.7 with R^{2} = 40.9. We interpret these as signaling that this discipline is in fact an amalgam of several sub-disciplines, each with its own work patterns and its own critical mass. These are listed in the table as computer science 1, 2, and 3, respectively. Similarly, for archaeology, which with 26 data points is the smallest data set for which we present results, besides the peak in R^{2} corresponding to a breakpoint at N_{c} = 25.4 ± 3.2 where R^{2} = 74.9, there is also a local maximum at N_{c} = 17.0 ± 2.4 where R^{2} = 74.7. No other discipline displays this feature of multiple breakpoints, which we interpret as signaling that these other fields are quite homogeneous in their work patterns.
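The breakpoints quoted here map onto the critical masses in Table 1 through the relation N_{k} = N_{c}/2 given in the table header, with the error bar halved in the same way. The following sketch shows that arithmetic and the resulting small/medium/large classification. The function names are hypothetical, and the treatment of groups sitting exactly on a boundary (N = N_{k} or N = N_{c}) is an assumption, since the text only defines the classes with strict inequalities.

```python
def critical_mass(breakpoint, breakpoint_error):
    """Return (N_k, error on N_k) using N_k = N_c / 2 from the table header."""
    return breakpoint / 2.0, breakpoint_error / 2.0


def classify_group(size, breakpoint):
    """Classify a group as small, medium or large relative to N_k and N_c.

    Boundary handling is an assumption: N == N_k and N == N_c both count
    as medium here.
    """
    n_k = breakpoint / 2.0
    if size < n_k:
        return "small"
    if size <= breakpoint:
        return "medium"
    return "large"


# Breakpoints N_c and their errors as quoted in the text.
breakpoints = {
    "computer science 1": (49.1, 10.0),
    "computer science 2": (32.5, 8.5),
    "computer science 3": (11.3, 4.7),
    "archaeology 1": (25.4, 3.2),
    "archaeology 2": (17.0, 2.4),
}

for subject, (n_c, n_c_error) in breakpoints.items():
    n_k, n_k_error = critical_mass(n_c, n_c_error)
    print(f"{subject}: N_k = {n_k:.2f} +/- {n_k_error:.2f}")
```

The halved values agree with the tabulated critical masses up to rounding in the last digit.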
In Table 1, \(P_{b_2}\) are the P values for the hypothesis that the large-group data are uncorrelated. From the model developed in the section “The relationship between quality and quantity”, one expects this saturation to be triggered for sufficiently large values of N_{c}. Indeed, this is the case for 16 subject areas to the right of the breakpoint, with N_{c} > 14. On the other hand, the data for 8 subject areas do not appear to completely flatten to the right of N_{c}. Six of these have a relatively small value of the breakpoint, and, according to the same section, we interpret the continued rise of s(N) to the right of N_{c} as being due to inter-subgroup cooperation (Kenna and Berche 2010). (The cases of chemistry and business/management buck this trend in that the data rise to the right despite having a comparably large breakpoint.) Although we have insufficient large-N data to test, one may expect that these cases will also ultimately saturate for sufficiently large N (Kenna and Berche 2010).
The English funding formula was adjusted in 2010 in such a way that 4* research receives a greater proportion of funds. The corresponding formula is s = (9p_{4*} + 3p_{3*} + p_{2*})/9. We have tested the robustness of our analysis by checking that this change over the 2009 formula (Eq. 1) does not alter our conclusions within the quoted errors. For example, for applied mathematics, N_{k} changes from 6.2 ± 0.9 with R^{2} = 74.3 to N_{k} = 6.3 ± 1.0 with R^{2} = 73.5. Similarly, the values for physics change from N_{k} = 12.7 ± 2.4 with R^{2} = 53.0 to N_{k} = 12.6 ± 2.3 with R^{2} = 54.4. Another test of robustness in the applied mathematics case is to remove the two points with largest N and s values (in Fig. 2b) from the analysis. The resulting values are N_{k} = 6.1 ± 1.0, with R^{2} = 71.5, while the estimates for a_{1}, ...a_{2} remain stable. Thus, even the biggest and best groups follow the same patterns as the other large groups.
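As a small worked illustration of the 2010 formula quoted above, the sketch below computes s from a quality profile. It assumes that p_{4*}, p_{3*} and p_{2*} are given as fractions of a group's research rated at each level (so that a group rated entirely 4* has s = 1); the function name and the sample profile are hypothetical.

```python
def quality_2010(p4, p3, p2):
    """Quality s = (9*p4 + 3*p3 + p2) / 9 for a profile given as fractions.

    Assumption: p4, p3 and p2 are non-negative fractions whose sum does not
    exceed 1; any remainder is 1* or unclassified work, which carries no weight.
    """
    if min(p4, p3, p2) < 0:
        raise ValueError("profile fractions must be non-negative")
    if p4 + p3 + p2 > 1.0 + 1e-9:
        raise ValueError("profile fractions must not sum to more than 1")
    return (9.0 * p4 + 3.0 * p3 + p2) / 9.0


# Hypothetical profile: 20% at 4*, 40% at 3*, 30% at 2*, the rest lower.
sample_quality = quality_2010(0.2, 0.4, 0.3)
print(sample_quality)
```

Because the 4* weight equals the normalising factor, s ranges from 0 (no work at 2* or above) to 1 (all work at 4*).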
Discussion
At the end of the section “The relationship between quality and quantity”, we mentioned that the opposite causal mechanism, in which increasing quality drives increasing group size, may be dismissed as the dominant mechanism on the basis of empirical evidence. Such a model may be considered at the microscopic level by preferential attachment of quality to quality (a “success-breeds-success” mechanism). In such a system, a high quality group may gain more funding and find it easier to attract more high quality researchers, and hence may grow while maintaining quality. The converse may hold for a lower quality group. However, if this were the primary mechanism, one would expect a monotonic increase of N with s to plateau only when the maximum possible quality s = 1 is achieved, and the absence of a phase transition prior to this point. Since none of the disciplines analysed exhibits such behaviour, we may dismiss it as the primary causal link between quality and quantity, at least for the subject areas presented. However, such a feedback mechanism may well be at work at a sub-dominant level.
Finally, we discuss an important consequence of our analysis regarding perceptions of individual research calibre resulting from exercises such as the RAE and AERES. As discussed in the section “The relationship between quality and quantity”, Fig. 1a invites a comparison of group performance, from which it is tempting to deduce information on the average calibre of individuals forming those groups. The correspondingly naive interpretation associated with Eq. 3 leads to the erroneous conclusion that \(\bar{a}_g = s_g,\) i.e., that the average strength of individuals in group g is given by the measured quality of that group.
We have seen that this argumentation is incorrect because it fails to take intra-group interactions into account—i.e., it fails to take account of the complex nature of research groups, the importance of which we have demonstrated.
Conclusions
By taking seriously the notion that research groups are complex interacting systems, our study explains why the average quality of bigger groups appears to exceed that of smaller groups. It also shows that it is unwise to judge groups, or individuals within groups, solely on the basis of quality profiles, precisely because of this strong size dependency. Because of the overwhelming importance of two-way communication links, small and medium sized groups should not be expected to yield the same quality profiles as large ones, and to compare small/medium groups to the average quality over all research groups in a given discipline would be misleading. An analysis of the type presented here may therefore assist in the determination of which groups are, to use a boxing analogy, punching above or below their weight within a given research arena. This type of analysis should be taken into account by decision makers when comparing research groups and when formulating strategy. Indeed we have shown that to optimize overall research performance in a given discipline, medium sized groups should be promoted while small ones must endeavour to attain critical mass.
Furthermore, we have compared the French and UK research evaluation systems and found them to be consistent, although the RAE gives information at a finer level than does the AERES.
Finally, we have quantified the hitherto intuitive notion of critical mass in research and determined its values for a variety of different academic disciplines.
Acknowledgments
We are grateful to Neville Hunt for inspiring discussions as well as for help with the statistical analyses. We also thank Arnaldo Donoso, Christian von Ferber, Housh Mashhoudy and Andrew Snowdon for comments and discussions, as well as Claude Lecomte, Scientific Delegate at the AERES, for discussions on the work of that agency.