1 Introduction

The amount of funding allocated to universities and their organizational units is increasingly based on multidimensional performance indicators, which are, for example, the subject of target agreements (e.g., De Fraja et al. 2019). To compare the units under consideration, these performance indicators must be aggregated into one-dimensional measures, such as effectiveness and efficiency degrees. The selection of performance indicators used to quantify university performance has a decisive influence on these degrees. In the following, we address this aspect of indicator selection, focusing on measuring the research efficiency of university research fields.

A special aggregation method is necessary to determine research efficiency since relevant university performance indicators cannot be aggregated into financial ratios due to the lack of market prices. For this purpose, empirical studies often use methods based on production theory. The advantage of these methods is that evaluators do not have to specify weights for the performance indicators since the methods determine them endogenously.

The production-theoretical relation is reflected, on the one hand, in the fact that the performance indicators are usually categorized as input factors to be minimized and output factors to be maximized. On the other hand, the methods determine a best-practice frontier in the form of a production function based on the input-output data. The distance between this function and the data points determines the resulting efficiency degree. The methods differ in how this distance is interpreted: deterministic methods interpret it entirely as inefficiency, whereas stochastic methods assume that part of the distance is caused by stochastic noise.

Concerning deterministic methods, the Data Envelopment Analysis (DEA) established by Charnes et al. (1978) is predominantly applied in the literature for efficiency measurement in the university context. Furthermore, the deterministic Normalized Additive Analysis (NAA) introduced by Ahn et al. (2007) also seems to be an interesting option. Its advantage is the easy calculation of efficiency degrees, making it particularly suitable for practical use without expert support.

Regarding stochastic methods, the Stochastic Frontier Analysis (SFA) introduced by Aigner et al. (1977) and Meeusen and Van den Broeck (1977) predominantly attracts attention in the literature. An interesting alternative is the Stochastic Non-Smooth Envelopment of Data (StoNED) developed by Kuosmanen and Kortelainen (2012), although, to our knowledge, it has not yet been used in a university context. It combines ideas from DEA and SFA and thus may retain their advantages without significant disadvantages.

Various studies show that the results of efficiency analyses can strongly depend on the choice of method (e.g., Andor and Hesse 2014). Therefore, parallel analyses with varying methods are appropriate. Accordingly, we compare the methods mentioned above in our study.

Although the selection of input and output factors also obviously influences the efficiency degrees, it rarely seems to be systematic. To date, no standard has been established for input and output factors to determine research efficiency in the university context. It is important to emphasize that these factors should be selected sparingly since too many factors can lead to very high, hardly discriminating efficiency degrees when the methods mentioned above are applied (Dyson et al. 2001).

Therefore, our objective is to investigate the influence of the inclusion of selected input and output factors on the efficiency measurement results, depending on the purpose of the analysis, the analyzed research field, and the efficiency measurement method. Based on a standard scenario with common input and output factors, we analyze the extent to which the extension of this scenario by expenditures, research grants, and bibliometric indicators leads to a shift in the efficiency measurement results. The resulting contributions of our study can be summarized as follows:

  1. We provide guidance for the choice of input and output factors to measure the efficiency of university research.

  2. Among other things, to cover the dissemination and reception of research results, we include four citation measures that have not yet been evaluated in direct comparison with each other: citations, citations per publication, h-index, and J-factor.

  3. We differentiate between two purposes of analysis, the accurate determination of the ranking and the accurate estimation of the efficiency degrees. It is shown that the two purposes react with different sensitivity to additional input/output factors.

  4. We consider six research fields at German universities to account for field-specific differences, e.g., publication habits.

  5. In addition to DEA and SFA, for the first time, we include StoNED and NAA in analyses of the impact of the selected input and output factors in the university context.

Our paper is structured as follows: Sect. 2 presents the state of the art and derives the research gaps. Section 3 presents the study’s research design. In Sects. 4, 5 and 6, we analyze the impact of including the additional input and output factors in our standard scenario. Section 7 discusses the results of this study. The paper concludes with a critical evaluation of the study and an outlook in Sect. 8.

2 State of the art and research gaps

We used a systematic literature search to identify studies that analyze the influence of additional input and output factors on the efficiency degrees of university research or of individual research fields. Sixteen publications proved to be relevant to our study. For each of these, Table 1 shows the input and output factors that were used in all analyzed scenarios as well as the input and output factors that were varied throughout the respective analysis. The publications’ results that are of interest in our context are briefly presented below.

Table 1 Input and output factors used in the relevant publications from the literature review

Most of the mentioned studies used DEA as the method for measuring efficiency. Johnes and Johnes (1992, 1993) and Johnes (1995) compared the efficiency degrees of departments of economics in the U.K. Since they mostly varied more than one input or output factor at a time, their results changed to differing extents depending on the variant. Based on these studies, Johnes and Johnes (1995) changed the research setting slightly and found a rather high correlation of 91.9% between the estimated efficiency degrees of the different variants. Similarly, McMillan and Datta (1998) found moderate deviations when they investigated the influence of nine different input-output combinations on the estimated efficiency degrees of Canadian universities. While McMillan and Datta varied at least two input or output factors at a time, Johnes (2006) analyzed the effects of excluding one to three of the input and output factors on the efficiency degrees of British universities. The exclusion of staff and of expenditures on library and information services had almost no effect on the estimated efficiency degrees, with a rank correlation of 92% between the results; this was not observed for the other factors. Johnes and Yu (2008) investigated the efficiency of Chinese universities. Depending on the input factors considered, the rank correlations between the efficiency degrees with and without including publications were at least 98.2%.

Rayeni and Saljooghi (2010) examined Iranian universities by excluding one input or output factor in each scenario. They found that the exclusion of staff led to the biggest changes in the efficiency degrees and a correlation of only 18.3%, while the other variations resulted in clearly smaller changes and higher correlations. The analysis of the departments of a Greek university by Kounetas et al. (2011) showed that the changes in efficiency degrees varied depending on the variant under consideration, especially when publications were excluded. Agasisti et al. (2012) simultaneously varied several output factors for Italian science departments. Depending on the variant, quite different efficiency degrees with rather low correlations between 3.9 and 55.8% were found. Again, several output factors were varied simultaneously, so the effect of a single factor was not considered.

Bielecki and Albers (2012) examined German business schools. Large changes in efficiency degrees emerged; e.g., the inclusion of weighted nationally visible publications increased the efficiency degrees. The effect of habilitations per professor, for which different years were taken as a basis, could not be determined. The rank correlations were low with a maximum of 26%, often even negative, but rarely significant. The efficiency of Spanish universities was investigated by De la Torre et al. (2017), who found rank correlations between 70 and 97%. Also concerning Spanish universities, Expósito-García and Velasco-Morente (2018) observed minimal differences in efficiency degrees when varying publications as an output factor.

In addition to DEA, Gralka et al. (2019) used SFA to analyze the efficiency of German universities. The average efficiency degrees for the output variations were similar, although the deviations were somewhat higher for SFA. Compared to the analysis including research grants, the rank correlations for the different analyses ranged from 84.9 to 86.3% for DEA and from 68.5 to 83.3% for SFA. Also comparing German universities by applying SFA, Geissler et al. (2021) investigated the impact of varying the output factor on the estimated efficiency degrees. They found that the consideration of normalized measures led to a significant reduction in the average efficiency degree and stronger discrimination of the top group. Comparing the three normalized measures did not yield any considerable differences concerning either efficiency degrees or ranking. Finally, Ghimire et al. (2021) developed a DEA model to evaluate the efficiency of Canadian universities by using diverse weights to model the variation of input and output factors. Their study showed that the selection of inputs and outputs plays a crucial role in determining efficiencies and rankings.

Regarding the state of the art, various research gaps can be identified concerning the research questions outlined in the introduction. First, DEA was the dominant aggregation method used in the identified studies. SFA was used in only two studies. Novel methods, such as StoNED and NAA, have not been discussed. Thus, to date, there is no comprehensive study that compares the effects of including input and output factors across the above-mentioned efficiency measurement methods.

Furthermore, a systematic analysis of the impact of including single input and output factors is lacking, as several factors are often varied simultaneously. Such approaches have the disadvantage that the strength of the influence of the individual input or output factors is not visible. Moreover, such effects have not been examined across different research fields at once since the mentioned studies only analyzed single departments or universities as a whole.

Concerning the possible purposes of a performance analysis, the studies mainly compared the estimated efficiency degrees. Such a comparison is only of limited usefulness due to method-inherent changes in the average level of efficiency degrees for different numbers of considered input and output factors (see Sect. 3.2). Moreover, the purpose of accurately determining the ranking has been addressed in only a few publications.

3 Research design

Based on the identified research gaps, we present our research design: Sect. 3.1 explains which specifications we consider for the respective methods to measure efficiency. Section 3.2 presents the evaluation criteria used for the analyses. In Sect. 3.3, we describe the underlying standard scenario and data set.

3.1 Model specification

As stated in the introduction, we use DEA, NAA, SFA, and StoNED to measure efficiency. Different models have been developed for each of these methods. We choose those models that are predominantly applied in empirical studies. In general, we assume an output orientation since universities can influence outputs rather than inputs in the short term.

For DEA, we use the two basic models DEABCC (Banker et al. 1984) and DEACCR (Charnes et al. 1978); the former assumes variable, the latter constant returns to scale. For NAA, we use the model with the arithmetic mean, according to Ahn et al. (2007). SFA is used under the assumption of a Cobb-Douglas function. Since SFA includes only one output factor by default, ray production functions are used to model multiple output factors (see for SFA, e.g., Löthgren 1997, and for StoNED, e.g., Schaefer and Clermont 2018). According to Schaefer and Clermont (2018), these have an advantage over distance functions; however, the limitations addressed by Henningsen et al. (2017) have to be taken into account. To circumvent some of these limitations, the angles for SFA are exponentiated before including them in the function, as suggested by these authors. Owing to the theoretically and empirically established dominance of the maximum likelihood method over the method of moments for the partitioning of the error term, the maximum likelihood method is used (Olson et al. 1980; for an empirical comparison, see, e.g., Andor and Hesse 2014). The error term is modeled multiplicatively. A half-normal distribution is assumed for the inefficiency term and a normal distribution for the noise term. The Battese-Coelli estimator is chosen as the point estimator (Battese and Coelli 1988, p. 390) since it minimizes the mean squared error compared to the estimators proposed by Jondrow et al. (1982) (Bogetoft and Otto 2011). We also use the described SFA specifications for StoNED; only the partitioning of the error term is, owing to the model, done with the pseudo-likelihood method rather than the maximum likelihood method (Kuosmanen and Kortelainen 2012).
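To make the DEA specification concrete, the following minimal Python sketch illustrates how output-oriented DEABCC efficiency degrees could be computed as envelopment linear programs. It is an illustrative sketch using SciPy, not the implementation underlying this study, and the NAA, SFA, and StoNED estimations are not reproduced here.

import numpy as np
from scipy.optimize import linprog

def dea_bcc_output(X, Y):
    """Output-oriented DEA-BCC efficiency degrees.
    X: (n, m) array of inputs, Y: (n, s) array of outputs; returns values in (0, 1]."""
    n, m = X.shape
    s = Y.shape[1]
    eff = np.empty(n)
    for o in range(n):
        # Decision variables: [phi, lambda_1, ..., lambda_n]; maximize phi.
        c = np.zeros(n + 1)
        c[0] = -1.0
        # Input constraints: sum_j lambda_j * x_ij <= x_io
        A_in = np.hstack([np.zeros((m, 1)), X.T])
        # Output constraints: phi * y_ro - sum_j lambda_j * y_rj <= 0
        A_out = np.hstack([Y[o].reshape(-1, 1), -Y.T])
        # Convexity constraint sum_j lambda_j = 1 (variable returns to scale)
        A_eq = np.concatenate([[0.0], np.ones(n)]).reshape(1, -1)
        res = linprog(c,
                      A_ub=np.vstack([A_in, A_out]),
                      b_ub=np.concatenate([X[o], np.zeros(s)]),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (n + 1), method="highs")
        eff[o] = 1.0 / res.x[0]  # reciprocal of phi, so efficient units score 1
    return eff

Dropping the convexity constraint would yield the constant-returns DEACCR variant of the same model.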

Since the stochastic methods are not unit invariant when using ray production functions, all empirical data presented below are normalized based on their arithmetic mean, following the proposal of Henningsen et al. (2017). That way, dependencies on the unit are excluded (Dyson et al. 2001).

3.2 Evaluation criteria

Efficiency measurement serves mainly two purposes: accurate determination of the ranking and accurate estimation of the efficiency degrees. To evaluate the impact of the input and output factors on the respective efficiency measurement results, quantitative criteria are required that allow statements on the similarity of the results. Thus, with respect to ranking, the similarity between ranking positions is of interest. In the following, we evaluate the extent to which the rankings resulting from different input and output factors coincide by using the rank correlation (RC) according to Spearman (1904). It is determined by

$$ RC = \frac{{Cov\left( {r\left( {TE_{1} } \right), r\left( {TE_{2} } \right)} \right)}}{{\sqrt {Var\left( {r\left( {TE_{1} } \right)} \right) Var\left( {r\left( {TE_{2} } \right)} \right)} }}, $$
(1)

where \(r\left( {TE_{1} } \right)\) and \(r\left( {TE_{2} } \right)\) represent the corresponding rankings of the efficiency degrees for two analyses with different input or output factors.
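For illustration, RC in Eq. (1) is simply the Pearson correlation of the rank vectors. A minimal sketch, assuming the efficiency degrees of two variants are available as NumPy arrays, could look as follows.

import numpy as np
from scipy.stats import rankdata, spearmanr

def rank_correlation(te1, te2):
    """Spearman rank correlation between two vectors of efficiency degrees (Eq. 1)."""
    r1, r2 = rankdata(te1), rankdata(te2)       # ranks r(TE_1) and r(TE_2)
    cov = np.cov(r1, r2, ddof=0)[0, 1]          # Cov(r(TE_1), r(TE_2))
    return cov / np.sqrt(r1.var() * r2.var())   # Eq. (1)

# Cross-check against the library implementation (illustrative data):
te1 = np.array([0.91, 0.75, 0.60, 1.00, 0.82])
te2 = np.array([0.88, 0.70, 0.65, 0.97, 0.90])
assert np.isclose(rank_correlation(te1, te2), spearmanr(te1, te2)[0])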

If, conversely, the focus is on estimating the efficiency degrees as accurately as possible, it should be noted that increasing the number of input and output factors considered can inherently lead to an increase in the estimated efficiency degrees (for DEA, see, e.g., Sexton et al. 1986, and Dyson et al. 2001). Due to this effect, considering the change in the efficiency degrees themselves, for example, via the mean absolute deviation, does not seem reasonable. To level out this effect, we consider the standard deviation of the mean deviation (SDMD), which provides information about the spread of the changes in the efficiency degrees. It is calculated by

$$ SDMD = \sqrt {\frac{1}{n}\mathop \sum \limits_{j = 1}^{n} \left( {\left( {TE_{1,j} - TE_{2,j} } \right) - \frac{1}{n}\mathop \sum \limits_{j = 1}^{n} \left( {TE_{1,j} - TE_{2,j} } \right)} \right)^{2} } , $$
(2)

where \(TE_{1,j}\) and \(TE_{2,j}\) represent the efficiency degree of university j for two analyses with varying input or output factors, and n indicates the number of universities.

Assume, for example, that taking additional input or output factors into account shifts all efficiency degrees by five percentage points. This would result in an SDMD of 0. If, conversely, the shifts in the efficiency degrees differ strongly between the units, this leads to a high SDMD.
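A minimal sketch of Eq. (2), assuming the efficiency degrees of both variants are given as NumPy arrays, shows that the SDMD is simply the population standard deviation of the pairwise differences; the illustrative values reproduce the uniform-shift example above.

import numpy as np

def sdmd(te1, te2):
    """Standard deviation of mean deviation between two vectors of efficiency degrees (Eq. 2)."""
    diff = np.asarray(te1) - np.asarray(te2)
    return np.std(diff)   # population standard deviation, i.e., Eq. (2)

te1 = np.array([0.90, 0.75, 0.60, 1.00])
print(sdmd(te1, te1 - 0.05))                            # uniform 5-point shift: SDMD = 0
print(sdmd(te1, np.array([0.80, 0.75, 0.70, 0.95])))    # heterogeneous shifts: SDMD > 0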

3.3 Standard scenario and selection of data

To analyze the impact of selected input and output factors on measuring research efficiency, we use a standard scenario as a benchmark. In this regard, we consider two input factors and one output factor frequently applied in studies (see also the reviews by Berbegal-Mirabent and Solé Parellada 2012; De Witte and López-Torres 2017; Gralka 2018). As staff-related input factors, we use the number of professors and the number of research assistants. The data were obtained from the German Federal Statistical Office in a special evaluation. This institution subdivided the data concerning German universities into 89 areas of research, which we grouped into eleven research fields.

As the output factor of the standard scenario, we use the number of journal publications to indicate the generation of public knowledge. The publication data were collected using the Web of Science database hosted by the Competence Centre for Bibliometrics, enhanced with institution encoding. The publications were allocated to the eleven research fields based on the journal in which the article was published. In doing so, we draw on the science classification system of Archambault et al. (2011) to avoid overlap in the assignment of journals to research fields. An extension of the original classification system was used that considers some journals not included before (Tunger et al. 2022). This classification system allocates each journal to precisely one subfield, which we aggregated to our research fields.

Due to the big differences in the publication and citation habits, as well as in the available resources of different research fields (Linton et al. 2011), we conduct separate investigations by research field. From the eleven research fields mentioned above, we selected the following six for our analyses based on data availability and sufficient publication activity: business & economics, engineering, humanities, mathematics & computer science, natural sciences, and social sciences.

For the data collection, the year 2010 was chosen to ensure a sufficiently long citation period for the bibliometric indicators (Clermont et al. 2021). It should be noted that the focus here is not on empirical statements about a specific year and individual universities but on conceptual analyses. A plausibility analysis was conducted for the data sets of the standard scenario and the other input and output factors so that only complete and consistent data sets were included in the analyses.

Table 2 shows the descriptive data of the standard scenario for the six research fields investigated. Depending on the research field, between 40 and 74 universities are included in the analyses. The differences in habits between the research fields can be seen. Specifically, engineering and natural sciences are characterized by relatively high numbers of professors, research assistants, and publications.

Table 2 Descriptive statistics on the standard scenario in 2010

To give a brief general impression: for the standard scenario, the efficiency degrees for NAA are clearly lower than for the other methods. Additionally, for the deterministic methods (DEA, NAA), the median of the efficiency degrees is mostly lower than the mean, indicating that the corresponding distribution of estimated efficiency degrees is right-skewed. For the stochastic methods (SFA, StoNED), the opposite is true. Regarding the research fields, no general difference is visible.

4 Influence of expenditures

In addition to staff, universities also spend financial resources on research, which is another input factor to be minimized. In contrast to studies that exclusively consider staff (e.g., Lee and Worthington 2016; Wohlrabe and Friedrich 2017; Ibrahim and Fadhli 2021), some studies include financial resources in addition to staff (e.g., Wang and Hu 2017; Duan 2019; Jiang et al. 2020).

In the following, we examine how the efficiency degrees change if we integrate expenditures as an additional input factor in the standard scenario. In our data set, the research fields’ expenditures are divided into equipment and personnel expenditures. In the latter case, it can be assumed that these are already to some extent covered by the staff-related input factors (professors and research assistants), which raises the issue of double counting. Therefore, we differentiate according to the type of expenditures and analyze five variants of taking expenditures into account, as listed in Table 3.

Table 3 Considered inputs and outputs for variations of expenditures

4.1 Descriptive analysis

Table 4 presents the descriptive statistics for the expenditure data, which were also extracted from the special evaluation of the German Federal Statistical Office. Again, there are clear differences between the research fields, with the highest expenditures being made in engineering and natural sciences. The last rows of Table 4 show the average shares of personnel and equipment expenditures in total expenditures. As expected, personnel expenditures account for the largest share of expenditures in all research fields. Engineering and natural sciences also exhibit a high share of equipment expenditures and, thus, a higher relevance of these expenditures. This could be because research projects in these research fields tend to require more technical equipment.

Table 4 Descriptive statistics on expenditures in 2010

Table 5 presents the Pearson correlations among the input factors considered. Very high correlations of 90% or more are highlighted in gray here and in all subsequent tables. As can be seen from Table 5, the correlations are mostly above 80% and often even above 90%. The correlations for equipment expenditures are often the lowest, whereas the correlations between staff (especially research assistants), personnel expenditures, and total expenditures are very high. This confirms the assumption that personnel expenditures are largely captured by the staff factors. When comparing research fields, it is noticeable that the correlations are particularly high for engineering, whereas on average, they are the lowest for business & economics. In addition to the question of how much the results change depending on the factors included, the question arises as to whether these changes are related to the correlations in the initial data.

Table 5 Pearson correlations between expenditures and staff

4.2 Results concerning the purpose “determination of the ranking”

We consider the influence of including the various expenditures as an additional input factor on the determination of the ranking based on the RC. Since the results of the methods show the same basic trend, they are presented in the following only for DEABCC as an example. The corresponding RCs between the results of the different variants are presented in Table 6.

Table 6 Rank correlations for DEABCC when including expenditures

All research fields show very high RCs of (mostly clearly) more than 90%. Especially engineering stands out with RCs of at least 98.9%. This is analogous to the correlations of the initial data. Overall, only marginal differences can be observed between research fields. Furthermore, the RCs are only slightly lower for the variants where the equipment expenditures vary, although these have the lowest correlations in the initial data. Thus, there is no direct relationship between the correlations of the input data in Table 5 and the RCs in Table 6.

Overall, the very high RCs show that the inclusion of the various expenditures has little to no impact on the resulting rankings from DEABCC. As indicated earlier, this finding also applies to the other methods. Since the resulting rankings hardly change, the expenditures can thus be neglected with regard to this purpose.

4.3 Results concerning the purpose “estimation of efficiency degrees”

We now analyze the influence of expenditures on the estimated efficiency degrees using the SDMD. In contrast to the RC, a stronger dependence on the applied method is evident. In particular, the results of DEABCC and DEACCR differ from those of SFA, StoNED, and NAA, which is why we present the results for DEABCC and SFA as examples in the following. Table 7 presents the corresponding SDMDs of the different variants. Here and in all the following tables, low SDMDs not greater than five percentage points are marked in gray.

Table 7 Standard deviations of mean deviation for DEABCC and SFA when including expenditures

For DEABCC, in contrast to the RCs, bigger differences can be observed between the variants and research fields. Since the SDMDs reach up to 14.9 percentage points, there are sometimes clear shifts in the ratios of the efficiency degrees to one another. This effect is particularly evident for business & economics and humanities, whereas the SDMDs are rather low for engineering. Additionally, the spread tends to be particularly high when equipment expenditures are taken into account.

When applying SFA, the resulting changes are considerably smaller than those with DEABCC. For example, the SDMDs for business & economics, engineering, and mathematics & computer science are always less than five percentage points. For StoNED and NAA, this is also partly the case for other research fields, e.g., humanities and natural sciences. Overall, the highest spread tends to be found again for the inclusion of equipment expenditures.

In summary, in contrast to the accurate determination of the ranking, the expenditures have a stronger impact on the accurate estimation of the efficiency degrees. The absolute values of the SDMDs differ more strongly between the methods and the research fields considered. However, there are no differences in the basic statements, e.g., the highest SDMDs are generally found for the additional inclusion of equipment expenditures. Accordingly, for this purpose, the inclusion of expenditures, especially equipment expenditures, should be examined more closely, particularly when using DEABCC or DEACCR.

5 Influence of research grants

An aspect much discussed in research evaluations concerns the extent to which grants should be considered an input factor of research to be minimized or an output factor to be maximized. Generally, the input and output factors selected for a specific efficiency analysis should be derived from the goals pursued (Ahn and Le 2015; Dyckhoff 2018). From such a goal-oriented view, contradictions between the use of a parameter as an input or output factor can be explained, e.g., depending on the stakeholders involved. For example, a grant donor may pursue the goal of using these grants as economically as possible, which supports their consideration as an input. This corresponds with the production-theoretical point of view, from which grants represent a special form of financial resources. On the other hand, universities aim to increase their performance, competitiveness, and reputation, which can be measured, e.g., by grants since their acquisition largely depends on previous outstanding research (Johnes 1997). From this perspective, grants are interpreted as an output factor.

In addition to various studies that do not consider grants (e.g., Abramo and D'Angelo 2009; Zhang et al. 2016; Ibrahim and Fadhli 2021), some include them as an input factor (e.g., Furková 2013; Jauhar et al. 2018; Wang 2019), whereas others include them as an output factor (e.g., Johnes 1995; Clermont 2016; Gralka et al. 2019). Besides, some authors discuss models with dual-role factors, in which, e.g., grants can be considered as an input and output factor simultaneously (e.g., Beasley 1990; Cook et al. 2006). However, to ensure consistent investigations across all methods and sections, we only consider standard models and compare the variants listed in Table 8 to analyze the effects of considering grants as input or output.

Table 8 Considered inputs and outputs for variations of grants

5.1 Descriptive analysis

First, we look at the descriptive statistics in Table 9, where the data are again extracted from the special evaluation of the German Federal Statistical Office. As before, the highest values are found for engineering and natural sciences. If we look at the share of grants in total expenditures, we see that mathematics & computer science also has a very high share and thus a high relevance of grants. The other three research fields differ, with a maximum share of 24%. This is basically in line with expectations since grant-financed projects with high volumes often finance technical equipment, among other things.

Table 9 Descriptive statistics on grants (in mio. €) in 2010

The Pearson correlations between grants and the input/output factors of the standard scenario are presented in Table 10. There are quite clear differences visible for the various research fields. For example, engineering and natural sciences show the highest correlations with at least 80% and for some variants, even more than 90%, whereas those for business & economics are the lowest, with a maximum of 68.2%. Furthermore, the correlations between grants and research assistants are the highest across all research fields. Regarding the correlation between grants and publications, natural sciences stand out from the other research fields with 92.4%.

Table 10 Pearson correlations between grants, staff, and publications

5.2 Results concerning the purpose “determination of the ranking”

Regarding the RCs, the methods show very similar results, which is why, in analogy to Sect. 4, they are again presented in Table 11 for DEABCC as an example.

Table 11 Rank correlations for DEABCC when including grants

For all research fields, comparing the standard scenario with the inclusion of grants as input leads to the highest RCs. Except for business & economics, these values are above 90%. Thus, considering grants as input has little effect on the ranking. When grants are included as output, the results differ more, with RCs between 73.5% and 91.7%. Despite the high correlation between grants and publications in the initial data, the RC for natural sciences, e.g., is only 76.9%. This confirms that a high correlation of the factors does not necessarily imply a high RC of the results. Comparing grants as input with grants as output shows considerably lower correlations with RCs starting at 47.5%, which is to be expected given the opposing perspectives. Concerning the research fields, social sciences show the highest RCs on average, whereas business & economics show the lowest RCs.

For the other research fields and methods, the RCs exhibit the same basic trend, and the same findings can be obtained. Thus, overall, it can be concluded that the inclusion of grants as an output factor has a notable impact on the ranking. Meanwhile, considering grants as an input factor is less relevant since the resulting rankings hardly change, with RCs of mostly over 90%.

5.3 Results concerning the purpose “estimation of efficiency degrees”

Concerning the accurate estimation of the efficiency degrees, there are slight differences between the methods’ results. For example, the DEABCC, DEACCR, and SFA results show higher SDMDs than those of StoNED and NAA. Therefore, the results for DEABCC and StoNED are presented as examples in Table 12.

Table 12 Standard deviations of mean deviation for DEABCC and StoNED when including grants

First, we look at the SDMDs for DEABCC. Some of them are very high, reaching up to 18.6 percentage points when compared with the standard scenario. The lowest SDMDs result from the comparison between the standard scenario and the inclusion of grants as input; however, only for one variant are they not greater than five percentage points. Overall, there are clear shifts in the ratios of efficiency degrees to one another. In line with the DEABCC results in Sect. 5.2, social sciences show the lowest average changes and business & economics the highest. Again, there is no relation to the correlations in the initial data.

In contrast, as outlined at the beginning, the spread tends to be considerably lower for StoNED. The SDMDs amount to a maximum of 13.1 percentage points when comparing with the standard scenario. Three comparisons yield SDMDs of less than five percentage points. Unlike DEABCC, engineering and natural sciences have the lowest average SDMDs here. Thus, the research fields with the smallest changes vary with the method used. Overall, the StoNED results differ in the specific values of the SDMDs, but not in the resulting basic statements. For example, considering grants as input again leads to the smallest changes here.

In summary, for this purpose, bigger changes in the ratios of efficiency degrees result from the consideration of grants as an output rather than as an input. However, even though the inclusion as input yields lower SDMDs, these are still quite high in absolute terms. Although StoNED and NAA show smaller changes than the other methods, the results have no structural differences. This is also valid regarding the research fields. Hence, regarding the considered purpose, both variants of the inclusion of grants can be reasonable.

6 Influence of bibliometric indicators

A variety of bibliometric indicators exists that aim to capture the dissemination and reception of the research results of academic institutions in the scientific community. Although such indicators have been discussed for some time with regard to their use in evaluations (Clarke 2009; Bornmann et al. 2013; Ketzler and Zimmermann 2013), their wide range is not yet reflected in efficiency analyses (Geissler et al. 2021). Most commonly, the number of publications (cf. the standard scenario) and/or the number of citations (e.g., Bonaccorsi et al. 2006; Wohlrabe and Friedrich 2017) are used. Occasionally, average citations per paper (CPP) are also used (e.g., Agasisti et al. 2012; Mammadov and Aypay 2020). In contrast, more complex bibliometric indicators are rarely applied. For example, the h-index gives the number of publications h with at least h citations each (Hirsch 2005) and is thus more robust to outliers. To our knowledge, in efficiency analyses that solely refer to university research, the h-index has so far only been used by Agasisti et al. (2012).
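To illustrate the simpler measures, the following sketch computes the citation count, CPP, and h-index from a hypothetical list of per-publication citation counts for one university and research field; the journal-normalized J-factor is omitted here since it additionally requires journal-level reference values.

def citation_measures(citations):
    """citations: iterable of citation counts, one entry per publication."""
    counts = sorted(citations, reverse=True)
    total = sum(counts)
    cpp = total / len(counts) if counts else 0.0   # average citations per paper
    # h-index: largest h such that at least h publications have >= h citations each
    h = sum(1 for rank, c in enumerate(counts, start=1) if c >= rank)
    return {"publications": len(counts), "citations": total, "cpp": cpp, "h_index": h}

print(citation_measures([12, 9, 7, 7, 3, 1, 0]))   # h-index = 4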

Differences in publication and citation customs between research fields and journals are taken into account by normalized index measures intended to counteract comparability problems. The journal-normalized J-factor (Ball et al. 2009), for example, relates publications to their respective publication venue. In this way, it considers journal-specific differences in scientific communication. To the best of our knowledge, normalized measures have only been included in efficiency analyses of university research by Geissler et al. (2021). The authors found that the three normalized measures considered yielded almost identical results in efficiency analyses. Against this background, we focus on the J-factor as a representative of normalized index measures. Given the explanations above, we extend the standard scenario by seven variants of including bibliometric indicators as output factors, as Table 13 shows.

Table 13 Considered inputs and outputs for variations of bibliometric indicators

6.1 Descriptive analysis

Based on the selected publication year 2010 and the allocation of journals to research fields as explained in Sect. 3.3, the respective citations were gathered by means of the databases of the Competence Centre for Bibliometrics. We collected the citations for the period 2010–2018. For such a period of nine years, Clermont et al. (2021) found sufficient validity for the citation measures considered here, i.e., for CPP, h-index, and J-factor. We calculated these measures for all universities in each of the six research fields.

Table 14 presents the descriptive statistics for the measures. We again observe research field-specific differences, with many citations for engineering and natural sciences, corresponding to their publication numbers. Besides, the average CPP and h-index are the highest there. The J-factors show similar average values due to the normalization across the research fields, but their ranges are quite different, with the largest for humanities.

Table 14 Descriptive statistics on the bibliometric indicators*

Table 15 presents the corresponding Pearson correlations. Compared to the factors considered so far, some of the correlations are clearly lower. High correlations of over 70% and even over 90% are found almost exclusively between publications, citations, and the h-index. In contrast, the correlations with CPP and J-factor are considerably lower; in fact, they are often below 40% and not significant. Only in some cases are there higher correlations. For example, the correlations between CPP and J-factor for mathematics & computer science, natural sciences, and business & economics are over 70%. When comparing research fields, engineering and natural sciences again show the highest correlations on average.

Table 15 Pearson correlations between bibliometric indicators

6.2 Results concerning the purpose “determination of the ranking”

Regarding the purpose of accurately determining the ranking, DEABCC, DEACCR, and NAA show very similar results with rather high RCs. In the following, the results for DEACCR are presented, as this method leads to the highest RCs. When estimating efficiency using SFA in this subsection, infeasibilities occur in many variants, which may be due to a lack of skewness in the initial data (see Ruggiero 2007 for a discussion). Therefore, no reasonable statements can be made for SFA here, and we additionally present the results for StoNED in the following. The RCs of the DEACCR and StoNED results are presented in Table 16.
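As a purely illustrative diagnostic (not part of this study's procedure), the skewness issue mentioned above can be checked before an SFA estimation by inspecting the OLS residuals of a log-linear frontier regression: for a production frontier with composed error v - u, these residuals should be negatively skewed, and a non-negative skew is the classic symptom associated with such infeasibilities.

import numpy as np
from scipy.stats import skew

def residual_skewness(log_x, log_y):
    """Skewness of OLS residuals of a log-linear (Cobb-Douglas-type) regression.
    log_x: (n, k) matrix of log inputs, log_y: (n,) vector of log output."""
    X = np.column_stack([np.ones(len(log_y)), log_x])   # add intercept
    beta, *_ = np.linalg.lstsq(X, log_y, rcond=None)
    return skew(log_y - X @ beta)

A residual skewness of zero or above suggests that the SFA estimates for the variant at hand should be treated with caution.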

Table 16 Rank correlations for DEACCR and StoNED when including bibliometric indicators

For DEACCR, across all research fields, it can be seen that the RCs are almost always above 70% and often above 90%. In contrast, the RCs for the J-factor are comparatively low in the case of two and three outputs. Thus, the J-factor leads to the biggest changes in the rankings overall. For the CPP, the RCs are clearly higher, although the CPP has similarly low correlations in the initial data as the J-factor. If CPP, h-index, or J-factor is already included in the efficiency analysis, additionally adding citations almost always leads to very high RCs of over 95%, so the ranking hardly changes. Even when extending the standard scenario only by citations, the RCs are above 90%.

In terms of research fields, the results for social sciences are, on average, the most highly correlated, whereas business & economics show the lowest values. Overall, DEABCC and NAA largely lead to the same conclusions. For these methods, the inclusion of the J-factor also induces the biggest changes, whereas including citations results in very high RCs. The highest average RCs partly occur for other research fields here, e.g., for mathematics & computer science in the case of DEABCC.

Regarding StoNED, there are considerably lower RCs of often less than 70%, especially when including the J-factor. However, the inclusion of the other bibliometric indicators should also be examined because of the rather low RCs. Only for citations does their additional inclusion again lead to very high RCs of more than 90%. Regarding the research fields with the highest RCs, there are differences compared to DEACCR. This shows that, in our study, there is no structural pattern regarding the research fields.

For the deterministic methods, the inclusion of further bibliometric indicators often leads to rather minor changes in the ranking. An exception is the J-factor, which is why its use as an additional indicator should be examined. For StoNED, considering further bibliometric indicators also seems relevant since the conclusions are similar, but the RCs are lower overall. However, for all methods, considering the frequently used citations does not seem necessary, especially if CPP, h-index, or J-factor is considered as well. For all methods, there are hardly any fundamental differences between the research fields.

6.3 Results concerning the purpose “estimation of efficiency degrees”

For this purpose, again similarities in the results between the methods can be observed. For example, DEABCC and StoNED show comparatively high SDMDs, whereas these are lower for NAA and DEACCR. In Table 17, the results of DEABCC and NAA are, therefore, considered as examples.

Table 17 Standard deviations of mean deviation for DEABCC and NAA when including bibliometric indicators

Once more, we first look at the SDMDs for DEABCC. With SDMDs of up to 20.7 percentage points, significant shifts in the ratios of the results are evident. Again, the findings are consistent in their basic statements. Thus, the J-factor tends to lead to the highest SDMDs. However, the difference is not as clear as in the case of the RCs since the CPP, in particular, also partially leads to high SDMDs. Therefore, the inclusion of all bibliometric indicators should be examined here.

As in the case of the RCs, the SDMDs are relatively low (mostly below five percentage points) if citations are included as the third output factor alongside CPP, h-index, or J-factor. Meanwhile, considering citations as the only additional output factor sometimes leads to SDMDs clearly greater than five percentage points. If we calculate the research field-specific average across all SDMDs, the changes are lowest on average for mathematics & computer science, whereas they are somewhat higher for social sciences and business & economics. This is analogous to the RCs.

For NAA, the SDMDs reach up to eleven percentage points and are thus considerably lower than for DEABCC. Moreover, they are often below five percentage points. The other conclusions drawn for DEABCC are also valid for NAA, except that the research fields with the lowest average SDMDs vary with the method used. As can be seen, high SDMDs are observed when the J-factor is used; therefore, for NAA and StoNED, its inclusion should be examined in particular.

A strong dependence on the bibliometric indicators can be stated for the accurate estimation of the efficiency degrees, with the J-factor again leading to the biggest changes. As in the case of the RCs, the inclusion of citations seems to be negligible, at least when CPP, h-index, or J-factor is considered as well. The results here also suggest a dependence on the considered research field and the method used. However, apart from the basic differences in level, no pattern is discernible, e.g., between the research fields.

7 Discussion and conclusions

We investigated the impact of the inclusion of selected input and output factors based on a standard scenario depending on the purpose of the analysis, the research field under consideration, and the method used. We addressed the effects of taking into account different types of expenditures, grants, and bibliometric indicators that have hardly been included in efficiency analyses so far. In the following, we will discuss which fundamental implications and guidance can be derived from our results for evaluators as well as university and political decision-makers.

To avoid misinterpretation of our results, it must be clarified that we do not know the true efficiency of the research fields and universities under consideration. Therefore, we cannot provide any information about which input and output factors or methods have to be selected to model the actual efficiency as accurately as possible. Our conceptually oriented analyses rather answer the question of the extent to which various extensions of the standard scenario lead to a change in the results of the efficiency measurement. As mentioned in Sect. 5, the reasonability of including input and output factors not only depends on the RCs and SDMDs but especially on the goals pursued in the respective situation. Generally, an analysis should always follow the pursued goals, whereas the input and output factors used are more or less suitable indicators to quantify these goals (Ahn and Clermont 2018). In this context, the collection of data for input and output factors usually requires considerable effort, which is only justified when the additional input and output factors lead to substantial changes in the results. Our analyses show that this is not always the case.

If the addition of a certain input or output factor leads to small changes, represented by high RCs or low SDMDs, the results show strong similarities with respect to the purpose considered. This indicates that the additional inclusion of the corresponding factor is not necessary from a result-driven point of view. The reverse is true in the case of strong changes between the results of the considered variants, i.e., low RCs and high SDMDs. In this case, the additional inclusion of the input or output factor has a stronger effect on the results, i.e., the inclusion of such factors should be examined.

The question arises whether there is a relation between the correlation in the initial data and the similarity of the results of the respective variants considered. Concerning DEA, Dyson et al. (2001) stated that such a conclusion is not possible due to the lack of translation invariance for additive shifts. Our investigations confirm this finding for the other methods considered here, independent of the underlying purpose of the analysis and the factors included. For example, the initial data showed comparatively low correlations between equipment expenditures and other expenditure types. However, this did not mean that considering equipment expenditures would have led to notably lower RCs. Furthermore, for all methods, there are comparisons in which one factor is less correlated than another in the initial data, but leads to higher RCs or lower SDMDs in the results.

Our analyses naturally yield different evaluation criteria values in detail, depending on the research field and method considered and the variants compared. Moreover, there are also differences in the average level of values. However, it is not found that one research field or method leads to higher or lower RCs/SDMDs than the others for all investigations. Regarding the input and output factors considered, largely consistent tendencies are identified, i.e., the resulting core statements regarding the compared variants apply to all research fields and methods. This applies, for example, to the question of which input or output factor leads to the greatest changes. Consequently, statements can be derived below that are valid regardless of the considered research field and selected method for measuring efficiency.

The resulting key findings of our investigations are summarized in Table 18, which can serve as guidance for efficiency analyses of research at universities. The table provides an overview of the input and output factors for which an examination of their use in efficiency analyses seems particularly expedient since their consideration leads to big changes in the results. It is differentiated according to the two purposes under consideration, as there are differences in the results and thus also in the implications. The results in the second column correspond to the purpose of accurately determining the ranking (Sects. 4.2, 5.2, and 6.2), while the results in the third column refer to the accurate estimation of efficiency degrees (Sects. 4.3, 5.3, and 6.3). A further differentiation is made between the methods whenever the average levels of the RCs and SDMDs varied strongly between them.

Table 18 Central implications of the investigations

In summary, there are differences regarding the purposes pursued by the efficiency measurement, especially in their sensitivities to additional input/output factors, which illustrates the relevance of factor selection. Thus, examining the inclusion of additional input and output factors appears to be particularly appropriate for accurately estimating the efficiency degrees. For this purpose, it can be stated that the focus should be on considering equipment expenditures, grants (both as input and output), and the J-factor. Our results indicate that the inclusion of grants as an output and the J-factor should be examined mainly for the purpose of accurately determining the ranking.

8 Limitations and outlook

The findings of our study are limited to the chosen focus. In this respect, various extensions are possible to develop additional guidance for efficiency analyses of university research. For example, an extension to other research fields, methods, and input or output factors is obvious. Time-series analyses could also be of interest to examine the robustness of our findings over time. Moreover, considering further purposes and evaluation criteria seems fruitful. For example, it should be noted that the two evaluation criteria we use, RC and SDMD, only provide information on average similarity; individual units may nevertheless deviate more strongly.

In general, there are no overall differences in the methods’ levels of RCs and SDMDs. In detail, however, the results differ strongly across the methods. While this was to be expected in accordance with the respective literature (e.g., Ahn et al. 2020), it does not provide information on which method performs better in estimating efficiency. This emphasizes the importance of the choice of method for the situation at hand, which is a separate research topic. As this has been addressed in only a few comprehensive studies that use Monte Carlo simulations to evaluate a method’s performance in different scenarios (e.g., Andor and Hesse 2014), it seems to be a fruitful path for future research.

Regarding the bibliometric indicators considered, various further investigations can be carried out. For example, instead of the journal-normalized J-factor, the inclusion of the field-normalized Mean Normalized Citation Score (Waltman et al. 2011) could be analyzed. In contrast to journal normalization, field normalization allocates publications based on defined categories (e.g., university departments). One aspect that has been often discussed in this regard is the allocation of publications to research fields. Ultimately, this can only be ensured through the standardization of data collection, which is currently not foreseeable. Furthermore, questions arise about the relationship between input factors and bibliometric indicators. It could be investigated, e.g., to what extent an increase in staff of institutes increases their attractiveness for professors who are particularly strong in research. In addition to an expected increase in publications, this could also increase CPP, h-index, and J-factor.

Altmetrics represent a relatively new category of indicators for measuring the perception of research performance, especially by the interested public. Here, e.g., tweets, news articles, blog posts, and Mendeley readership are counted. Analyzing the inclusion of such indicators could also lead to interesting insights. Additionally, output factors that reflect the third mission could be considered. One such aspect is the transfer of scientific achievements into practice, measured, for example, by the number of patents, licenses, and start-ups.