Introduction

Count data are part of the daily bread and butter of scientometric researchers. Perhaps most prominently, numbers of publications and numbers of citations are examined. A default statistical model for count data relies on the Poisson distribution. For example, a Poisson model was used to describe the productivity of mathematicians, physicists, and inventors (Huber 2000; Huber and Wagner-Döbler 2001a, 2001b). However, citation count distributions in particular are known to have a variance that exceeds the mean (Didegah and Thelwall 2013; Ketzler and Zimmermann 2013). This phenomenon is known as overdispersion, and it violates equidispersion (i.e., the assumption that mean and variance coincide), which is an important assumption for valid statistical inference based on the Poisson model (Hilbe 2011). This issue has been acknowledged in numerous scientometric studies (Didegah and Thelwall 2013; Ketzler and Zimmermann 2013; Sun and Xia 2016), some of them dating back to the early 1980s (Cohen 1981). Underdispersion, however, has been less of an issue in the scientometric literature. This is illustrated by a simple search in Google Scholar (https://scholar.google.com/) that reveals 69 hits for the search string “source:scientometrics overdispersion” and 4 hits for the search string “source:scientometrics underdispersion” (retrieved on August 4, 2020). This huge difference can most likely be explained by the fact that conservative inference based on overestimated standard errors (Faddy and Bosch 2001) is perceived as less problematic than liberal inference.

Moreover, Forthmann et al. (2019a) have recently shown that such dispersion issues also affect the measurement precision of individual capacity estimates in item-response theory (IRT) models. In a large simulation study, they demonstrated that reliability estimates of person capacity derived from the Rasch Poisson count model (RPCM; i.e., an IRT model relying on the Poisson distribution; Rasch 1960) can be huge overestimates under an overdispersed population model. Analogously, RPCM-based reliability was found to be underestimated when data were simulated according to an underdispersed population model. In addition, more complex population models in which some items were overdispersed and others underdispersed also resulted in misestimated RPCM-based reliabilities. These observations are critically important for applications of IRT count data models to quantify research performance. For example, the RPCM has been proposed as an elegant approach to measure researcher capacity based on a variety of bibliometric indicators (Mutz and Daniel 2018b). However, analogous to the known problems related to statistical inference, reliability for researcher capacity estimates derived from the RPCM can be heavily overestimated in the presence of overdispersion. Mutz and Daniel (2018b) found an impressive reliability estimate of 0.96, which might be misestimated. Hence, the aim of the current study is to investigate potentially misestimated measurement precision of researcher capacities by applying the RPCM, IRT count models based on the negative binomial distribution, and the more flexible Conway-Maxwell-Poisson count model in a large dataset of inventors and in a reanalysis of Mutz and Daniel’s dataset.

Estimating researcher capacity by item response theory models

The measurement of research performance is critically important for many practical purposes such as the allotment of funding, candidate selection for academic jobs, interim evaluations of researchers after starting a new position, and the management of research strategies at the institutional level (Mutz and Daniel 2018b; Sahel 2011; Zhang et al. 2019). Most often, single indicators such as the h index (Hirsch 2005) are used to quantify performance at the level of individual researchers. But a merely deterministic use of bibliometric indicators has shortcomings that can be overcome: probabilistic accounts of bibliometric measures allow, for example, the quantification of measurement precision (Glänzel and Moed 2013). In this vein, it should be noted that stochastic variants of bibliometric indicators such as the h index exist (Burrell 2007). The contrast between deterministic and probabilistic measurement accounts is paralleled in the psychometric literature. IRT models are probabilistic in nature (Lord 1980) and are used to link properties of test items (i.e., item parameters such as difficulty) and the capacity of individuals (i.e., their ability to solve the items) with observed behavior (i.e., whether an item has been solved by a person or not). Indeed, an IRT model from the Rasch family was used by Alvarez and Pulgarín (1996) to scale journal impact based on citation and publication counts. However, this approach did not resonate well in the scientometric literature (Glänzel and Moed 2013).

More recently, Mutz and Daniel (2018b) revived the idea of using IRT scaling for the assessment of researcher capacity and highlighted five problems with commonly used approaches to assess researcher capacity [for more details see Mutz and Daniel (2018b) and the references therein]: (a) indicators are mere observed performance and do not represent an abstract latent variable reflecting a researcher’s competency, (b) measurement error (or precision) is not appropriately taken into account, (c) the count data nature of most performance indicators is not appropriately taken into account, (d) multiple indicators can be reduced to rather few underlying latent dimensions (suggesting that looking at many indicators for evaluation purposes might be unnecessary), and (e) indicators are potentially not comparable between different scientific fields. Mutz and Daniel (2018b) suggested using Doebler et al.'s (2014) extension of the RPCM to overcome these problems.

From the Rasch Poisson counts model to the Conway-Maxwell-Poisson counts model: gaining modeling flexibility

In the psychological literature, the RPCM has been used to scale many different cognitive abilities such as reading competency (Jansen 1995; Jansen and van Duijn 1992; Rasch 1960), intelligence (Ogasawara 1996), mental speed (Baghaei et al. 2019; Doebler and Holling 2016; Holling et al. 2015), and divergent thinking (Forthmann et al. 2016, 2018). The model has further been used for migraine attacks (Fischer 1987) and sports exercises such as sit-ups (Zhu and Safrit 1993). All these examples have in common with most bibliometric indicators that they provide count data as responses for each of the respective test items (e.g., the number of correctly read words within an allotted test time). In this work, the RPCM and the other IRT models are introduced and used within a Generalized Linear Mixed Model (GLMM) framework that links the expected value with a log-linear term including a researcher capacity parameter and an item easiness parameter (De Boeck et al. 2011). In addition, a variance function exists for each of the models; that is, the variance depends on an (item-specific) dispersion parameter and the person- and item-specific conditional mean. For example, in the RPCM the expected value μji of count yji for person j and item i is modeled as a multiplicative function of the capacity parameter εj > 0 and the item easiness parameter σi > 0

$${\mu }_{ji}={\varepsilon }_{j}{\sigma }_{i}$$
(1)

Then, the mean in the RPCM is in fact log-linear:

$$\mu_{ji} = \varepsilon_{j}\sigma_{i} = \mu\left(\theta_{j}, \beta_{i}\right) = \exp\left(\beta_{i} + \theta_{j}\right)$$
(2)

for the log-transformed capacity parameter θj = ln(εj) and item easiness parameter βi = ln(σi). The higher βi, the easier it is to obtain a comparably large item score. The count Yji is assumed to follow a Poisson distribution, so the probability mass function is

$$P\left({Y}_{ji}={y}_{ji}|{\varepsilon }_{j},{\sigma }_{i}\right)=\frac{{{\mu }_{ji}}^{{y}_{ji}}}{{y}_{ji}!}\mathrm{exp}(-{\mu }_{ji})$$
(3)

This distributional assumption implies equidispersion, i.e., Var(Yji) = E(Yji). The dispersion parameter φi equals 1 for all items, and the variance function for the RPCM is Vji(μji, φi) = μjiφi = μji.
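To make the data-generating process of the RPCM concrete, the following minimal R sketch simulates counts according to Eqs. 2 and 3 (all parameter values are hypothetical and chosen for illustration only):

```r
# Hypothetical RPCM simulation: log-linear mean (Eq. 2) and Poisson counts (Eq. 3)
set.seed(1)
n_persons <- 500
theta <- rnorm(n_persons, mean = 0, sd = 0.5)   # log-capacities theta_j
beta  <- c(-0.5, 0, 0.5)                        # log-easiness beta_i for three items
mu    <- exp(outer(theta, beta, "+"))           # mu_ji = exp(beta_i + theta_j)
y     <- matrix(rpois(length(mu), lambda = mu), nrow = n_persons)  # observed counts
# Conditional on theta_j, mean and variance of the counts coincide (equidispersion).
```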

Mutz and Daniel (2018b) used Doebler et al.'s (2014) extension of the RPCM in which an upper asymptote for performance is proposed, intended for speed-of-processing measures in which simple cognitive tasks have to be solved under time constraints. The Doebler et al. (2014) model is coined the Item Characteristic Curve Poisson counts model (ICCPCM) because it borrows the sigmoid shape of item characteristic curves in binary IRT to describe the conditional means. Because of the known dispersion issues with bibliometric indicators (Didegah and Thelwall 2013; Ketzler and Zimmermann 2013; Sun and Xia 2016), Mutz and Daniel (2018b) used a Poisson-Gamma mixture extension of Doebler et al.’s speed model (i.e., a negative binomial model). They performed a thorough simulation study on the parameter recovery of this model and recommended using it only with large sample sizes (N = 400 or higher). Hence, to reduce sample size requirements for the purpose of the current work, we use the comparably less complex negative binomial model (Hung 2012) in which the expected value is modeled following Eq. 2, but Eq. 3 is replaced by a negative binomial assumption. The variance function for this model is Vji(μji, φi) = μji + μji²/φi = μji + μji²/exp(τi), which allows item-specific overdispersion, but not underdispersion. It should further be noted that this variant of the negative binomial distribution has been coined NB2 in the literature because of the quadratic nature of its variance parameterization (Hilbe 2011). This model will be referred to as the NB2 counts model (NB2CM) in this work. For completeness, we will also consider an NB1 counts model (NB1CM; i.e., a model with a linear variance parameterization; Hilbe 2011) with μji = exp(βi + θj) and Vji(μji, φi) = μji(1 + φi) = μji(1 + exp(τi)). The RPCM results as a limiting case of the NB1CM: when τi approaches -∞, exp(τi) approaches 0 for all items and hence Vji(μji, φi) goes to μji.
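For ease of comparison, the variance functions of the models introduced so far can be summarized as

$$V_{ji}\left(\mu_{ji}, \varphi_i\right) = \begin{cases} \mu_{ji} & \text{(RPCM)} \\ \mu_{ji}\left(1 + \varphi_i\right) & \text{(NB1CM)} \\ \mu_{ji} + \mu_{ji}^{2}/\varphi_i & \text{(NB2CM)} \end{cases}$$

with φi = exp(τi) in both negative binomial models.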

The third and final model considered in this work is the Conway-Maxwell-Poisson counts model (CMPCM; Forthmann et al. 2019a). The CMPCM extends the RPCM to allow for both overdispersion and underdispersion at the level of items. Hence, with respect to dispersion modeling, the CMPCM can be considered the most flexible approach examined in this work. Importantly, the CMPCM is based on Huang's (2017) mean parameterization of the Conway-Maxwell-Poisson distribution, which leads to a log-linear model for the conditional mean as in Eq. 2. The conditional variance function of the CMPCM cannot be written as a simple formula. As is the case for the other count models above, however, the variance is a function of the mean and the dispersion parameter (see Huang 2017). Importantly, this parameterization is different from other suggested regression models based on the Conway-Maxwell-Poisson distribution (e.g., Guikema and Goffelt 2008; Sellers and Shmueli 2010), and it constitutes a bona fide generalized linear model (GLM; Huang 2017). Finally, in all four models (RPCM, NB2CM, NB1CM, and CMPCM) it is assumed that the capacity variable θ follows a marginal normal distribution with mean zero and variance \({\sigma }_{\theta }^{2}\). This distributional assumption for θ implies that the item easiness parameter βi can be interpreted as the expected value of an item on the log scale for researchers with average capacity (i.e., a capacity of zero on the log scale). In addition, the dispersion parameters τi affect error variance and hence local reliability in the models based on the negative binomial and CMP distributions (their concrete interpretation depends on the respective distribution). All model parameters are estimated by means of a marginal maximum likelihood (MML) approach, which is common in the IRT literature (e.g., De Boeck et al. 2011).
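For reference, a sketch of Huang's (2017) mean-parameterized Conway-Maxwell-Poisson distribution can be written as follows, where νi denotes the item-specific dispersion parameter and λ(μji, νi) is the rate parameter chosen such that the distribution has mean μji:

$$P\left(Y_{ji}=y_{ji} \mid \mu_{ji}, \nu_i\right) = \frac{\lambda\left(\mu_{ji},\nu_i\right)^{y_{ji}}}{\left(y_{ji}!\right)^{\nu_i}\, Z\!\left(\lambda\left(\mu_{ji},\nu_i\right), \nu_i\right)}, \qquad Z\left(\lambda,\nu\right)=\sum_{k=0}^{\infty}\frac{\lambda^{k}}{\left(k!\right)^{\nu}}$$

so that νi = 1 recovers the Poisson case, νi < 1 implies overdispersion, and νi > 1 implies underdispersion.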

Forthmann et al. (2019a) found in a simulation study that the item easiness parameters and the standard deviation of the latent capacity distribution in a CMPCM were already consistently estimated with sample sizes as small as N = 100. In addition, item-specific dispersion parameter estimates were found to be only slightly underestimated (but increasing the number of items seemed to prevent this bias). Hence, sample size requirements for this flexible IRT model are less demanding than those of Mutz and Daniel's (2018b) complex NB2 extension of the ICCPCM (Doebler et al. 2014). This renders the CMPCM a potentially useful candidate model that needs to be examined in relation to the negative binomial models that are perhaps most often applied in the scientometric literature to account for dispersion issues (e.g., Didegah and Thelwall 2013; Ketzler and Zimmermann 2013).

Reliability of researcher capacity estimates

Reliability has been defined within the framework of classical test theory (e.g., Gulliksen 1950). The basic equation of classical test theory proposes that an observed test score X results as the sum of a true score T and an error term E (i.e., X = T + E). Reliability refers to the ratio of true score variance to observed score variance or, alternatively, one minus the ratio of error variance to observed score variance. Hence, reliability estimates quantify the measurement precision of test scores and have an intuitive metric ranging from zero to one. In the tradition of IRT, measurement precision of ability estimates typically refers to standard errors and confidence intervals conditional on the ability level; however, summary indices of measurement precision based on the information available conditional on ability have also been developed for IRT. In this vein, such an index has been coined reliability of person separation, marginal reliability, or empirical reliability as early as the 1980s (Green et al. 1984; Wright and Masters 1982), and it is defined as the ratio of the estimated ability variance adjusted for measurement error (i.e., reduced by the error variance as quantified by the average of the ability-specific squared standard errors) to the unadjusted estimated ability variance. Hence, reliability here displays some conceptual similarity with reliability as defined within classical test theory (but it should not be confused with it). For example, error variance is constant in classical test theory, but it varies as a function of persons within IRT, which requires averaging across error variances to yield a reliability estimate (Wang 1999). Alternatively, it can be understood as the squared correlation (again analogous to classical test theory) between estimated ability parameters and the true ability parameters (Brown and Croudace 2015; Brown 2018). This quantification of reliability is easy to calculate for a variety of available estimators of ability parameters (Brown and Croudace 2015; Brown 2018) and provides an established, intuitive metric.

Notably, empirical reliability estimates are based on the estimate of the variance of the capacity distribution and the standard errors of the capacity estimates (more details are provided in the method section below). Hence, biases in these estimates would directly result in misestimated empirical reliability. In a previous simulation study that focused on the CMPCM (Forthmann et al. 2019a), we found accurate estimates of the ability variance and accurate standard errors even for sample sizes as small as N = 100. Parameter recovery for the CMPCM has not yet been examined for sample sizes smaller than N = 100; hence, for situations in which only smaller samples are available, new simulations should be run to examine potential bias of reliability estimates. Furthermore, whether these simulation findings generalize to the negative binomial models is also an open question and requires attention for situations with smaller sample sizes. As a more general remark, it should be noted that reliability is population specific and invariance across samples is not guaranteed (e.g., samples that are affected by range restriction may not result in an accurate estimate of the capacity variance). Finally, the measurement precision of reliability estimates can be assumed to be a function of the value of reliability, with wider confidence intervals for reliability around 0.50 and quite narrow confidence intervals for excellent reliability above 0.90. Analytically, this is guaranteed by results of Feldt et al. (1987) and is valid for cases in which Cronbach’s α and reliability coincide.
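A purely hypothetical numerical example (the values are illustrative and not taken from either dataset) shows how such biases propagate: suppose the capacity variance is 0.40 and the average squared standard error is 0.12, which yields an empirical reliability of 0.70, whereas a misspecified model that overestimates the variance as 0.50 and underestimates the average squared standard error as 0.08 suggests a reliability of 0.84:

$$1-\frac{0.12}{0.40}=0.70 \qquad \text{vs.} \qquad 1-\frac{0.08}{0.50}=0.84$$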

Aim of the current study

The goal of this study is a thorough comparison of various available count data IRT models (i.e., RPCM, NB2CM, NB1CM, and CMPCM) based on two scientometric datasets. All of these models allow modeling of the expected value in a log-linear fashion. The models differ with respect to their capability to model dispersion in the data. The RPCM is the least flexible model, as it is based on the assumption of equidispersion. The NB2CM and NB1CM allow global or item-specific overdispersion modeling, whereas the CMPCM is the most flexible model, allowing global or item-specific overdispersion and/or underdispersion. Hence, the first goal is to compare these models in terms of their relative fit to the data based on information criteria that also take model parsimony into account. In addition, a series of likelihood ratio tests is used to compare nested models of increasing complexity. The second and main aim of this work is to examine the reliability of the researcher capacity estimates for the best fitting model and the other competing candidate models. Standard errors are known to be too small (too large) when data are overdispersed (underdispersed), which biases statistical inference based on the simple Poisson model toward liberal (conservative) conclusions. However, Forthmann et al. (2019a) demonstrated that these known problems further affect the reliability of capacity estimates. The RPCM was found to overestimate reliability under an overdispersed population model, whereas an underdispersed population model resulted in underestimated capacity reliability estimates. Hence, this work extends the promising work by Mutz and Daniel (2018b) in three important ways: (a) comparing a greater variety of distributional models, (b) examining less complex models with lower demands in terms of sample size, and (c) providing a detailed check of the inter-relatedness of capacity reliability estimates and whether or how dispersion was taken into account.

Method

Data sources

Patent dataset

The first dataset is a subset of the patent dataset provided by the National Bureau of Economic Research (https://data.nber.org/patents/). The full dataset is described in Hall et al. (2001), and we use the file in which inventors were identified by a disambiguation algorithm (Li et al. 2014). The disambiguated data are openly available at the Harvard Dataverse (https://dataverse.harvard.edu/dataverse/patent). Here we use the same subset of N = 3055 inventors that was used by Forthmann et al. (2019b). To control for issues arising from data truncation, only inventors whose careers fell within the years 1980 to 2003 were included. Moreover, inventors were required to have at least one patent in each of the six four-year intervals into which the data were split. In this study, we use the number of patents granted as an indicator of inventive capacity. Hence, the number of patents in each of the six four-year intervals was used as one item in this study (i.e., six items in total). It has been argued that productivity within a period of time (annual productivity is perhaps most often used in this regard) is highly relevant as a measure of scientific excellence (Yair and Goldstein 2020).

Mutz and Daniel’s dataset

The second dataset comprises N = 254 German social sciences researchers who were listed in a membership directory of a quantitative methodology division of an unspecified academic society (Mutz and Daniel 2018b). The following six bibliometric indicators measure researcher capacity (Mutz and Daniel 2018b): (a) TOTCIT: total number of citations received (excluding the highest cited paper), (b) SHORTCIT: number of citations received within a 3-year citation window, (c) NUMPUB: total number of published articles, (d) TOP10%: number of publications in the top 10% of the researcher’s scientific field, (e) PUBINT: the number of papers published together with international co-authors, and (f) NUMCIT: the number of papers that received at least one citation. The dataset is openly available for reanalysis (Mutz and Daniel 2018a).

Analytical approach

All models were fitted with the statistical software R (R Core Team 2019) by means of the glmmTMB package (Brooks et al. 2017). All R scripts and links for downloading the datasets are provided in the online repository of this work (https://osf.io/em642/). All models were fitted with the same log-linear model to predict the expected value (see Eq. 2) and a normally distributed capacity parameter θ on the log scale. The mean of the capacity parameter distribution was fixed to a value of zero and the variance \({\sigma }_{\theta }^{2}\) was estimated (see Forthmann et al. 2019a). The glmmTMB package provides empirical Bayes estimates for each θj by means of the maximum a posteriori (MAP) estimator. The NB2CM, NB1CM, and CMPCM were fit in two variants. First, more parsimonious models with only one dispersion parameter for all items were estimated. Then, models with item-specific dispersion parameters were fit. Dispersion parameters in glmmTMB are modeled with a log link. For the NB1CM this implies that large negative estimates indicate the absence of overdispersion (for the NB2CM, in contrast, overdispersion vanishes as the log-dispersion estimate becomes large and positive; see the variance functions above). For the CMPCMs, negative values imply underdispersion, whereas a value of zero implies equidispersion and positive values imply overdispersion (Forthmann et al. 2019a). The tables in which model results are reported include a note on these different interpretations of the dispersion parameters to facilitate interpretation (see Tables 1 and 2). Models of increasing complexity were compared based on likelihood ratio tests (denoted by Δχ²), the Akaike information criterion (AIC; Akaike 1973), and the Bayesian information criterion (BIC; Schwarz 1978). The information criteria also take model parsimony into account, with the BIC imposing a stronger penalty on complex models. We also calculated Akaike weights for multi-model inference as implemented in the R package MuMIn (Barton 2019). For the NB2CMs and the NB1CMs we first looked at model comparison statistics and, for the sake of brevity, report here only the respective better fitting variants of the negative binomial models. Results for the models not reported here can be found in the online repository for this work (https://osf.io/em642/).
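The following R sketch illustrates this model-fitting setup (the data object `dat` and its column names `y`, `item`, and `person` are hypothetical placeholders for a long-format dataset with one row per researcher-indicator count; `item` and `person` are assumed to be factors):

```r
library(glmmTMB)

# RPCM: log-linear mean (Eq. 2) with one easiness parameter per item and a
# random researcher capacity; Poisson responses (Eq. 3)
rpcm  <- glmmTMB(y ~ 0 + item + (1 | person), family = poisson, data = dat)

# NB1CM and NB2CM with a single (global) dispersion parameter
nb1_g <- glmmTMB(y ~ 0 + item + (1 | person), family = nbinom1, data = dat)
nb2_g <- glmmTMB(y ~ 0 + item + (1 | person), family = nbinom2, data = dat)

# CMPCM based on the mean-parameterized Conway-Maxwell-Poisson distribution
cmp_g <- glmmTMB(y ~ 0 + item + (1 | person), family = compois, data = dat)

# Item-specific dispersion via the dispersion formula (log link; parameterized
# here as an intercept plus item contrasts)
cmp_i <- glmmTMB(y ~ 0 + item + (1 | person), dispformula = ~ item,
                 family = compois, data = dat)

# Model comparison: likelihood ratio test, information criteria, Akaike weights
anova(cmp_g, cmp_i)
AIC(rpcm, nb1_g, nb2_g, cmp_g, cmp_i)
BIC(rpcm, nb1_g, nb2_g, cmp_g, cmp_i)
MuMIn::Weights(AIC(rpcm, nb1_g, nb2_g, cmp_g, cmp_i)$AIC)
```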

Table 1 Patent data: Model estimation results for RPCM and CMPCMs
Table 2 Data from Mutz and Daniel (2018a, b): Model estimation results for RPCM and CMPCMs

Reliability of the researcher capacity estimates was globally determined based on empirical reliability (Brown and Croudace 2015; Green et al. 1984):

$$Rel\left(\theta \right)=1- {\overline{SE}}_{\theta }^{2}/{\widehat{\sigma }}_{\theta }^{2}$$
(4)

with \({\overline{SE}}_{\theta }^{2}\) being the average squared standard error of the researcher capacity estimates and \({\widehat{\sigma }}_{\theta }^{2}\) the estimated variance of the researcher capacity distribution. In addition, conditional reliability (i.e., the reliability for a specific capacity level) can be calculated analogously to Eq. 4 (Green et al. 1984):

$$Rel\left({\theta }_{j}\right)=1- {SE}_{{\theta }_{j}}^{2}/{\widehat{\sigma }}_{\theta }^{2}$$
(5)

with \({SE}_{{\theta }_{j}}^{2}\) being the squared standard error of the specific capacity estimate θj. We compared conditional reliability estimates to explore the dependence of reliability on the capacity level. Our main aim in this regard was to compare both empirical and conditional reliability estimates between the respective best fitting model and the alternative models to reveal how model selection might influence the evaluation of reliability and important related interpretations (e.g., deciding that estimates are accurate enough for a high-stakes decision).
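In R, Eqs. 4 and 5 can be computed from a fitted glmmTMB model along the following lines (a sketch assuming a fitted model object such as `cmp_i` from the sketch above, with random intercepts for `person`):

```r
# Empirical Bayes (MAP) capacity estimates and their conditional standard errors
# (glmmTMB loaded as above)
re        <- as.data.frame(ranef(cmp_i))          # includes condval and condsd columns
theta_hat <- re$condval[re$grpvar == "person"]    # capacity estimates theta_j
se_theta  <- re$condsd[re$grpvar == "person"]     # standard errors of theta_j

# Estimated variance of the capacity distribution
var_theta <- VarCorr(cmp_i)$cond$person[1, 1]

rel_empirical   <- 1 - mean(se_theta^2) / var_theta   # Eq. 4
rel_conditional <- 1 - se_theta^2 / var_theta         # Eq. 5 (one value per researcher)
```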

Results

Patent dataset

The parameter estimates of all models can be found in Table 1 [with the exception of the NB1CMs (global dispersion: AIC = 86,680.06, BIC = 86,742.59; item-specific dispersion: AIC = 86,247.58, BIC = 86,349.19), which were found to fit less well than the NB2CMs (global dispersion: AIC = 84,818.82, BIC = 84,881.36; item-specific dispersion: AIC = 84,769.92, BIC = 84,871.54)]. The item easiness parameter estimates were highly comparable across all fitted models. These parameters indicate an increase in productivity up to the interval from 1992 to 1995. Then, productivity seemed to decrease slightly up to the final intervals of inventors’ careers, but it did not fall back to the level of the first career interval. All models that take deviations from the Poisson assumption of equidispersion into account fitted better than the RPCM (as indicated by likelihood ratio tests and information criteria; see Table 1). All models (i.e., those with general and item-specific dispersion parameters) clearly indicated the presence of overdispersion.

The overall best fitting model was the NB2CM with item-specific dispersion parameters. The estimated dispersion parameters indicated that overdispersion was strongest for the first career interval. In addition, the amount of overdispersion was found to decrease monotonically over inventors’ careers (see Table 1). It is noteworthy that this pattern of dispersion parameters was not paralleled by the CMPCM with item-specific dispersion. To examine whether this observation was masked by the complex interplay between conditional mean and dispersion, we checked the item-specific dispersion index [Var(Yi)/E(Yi); Bonat et al. 2018; Consul and Famoye 1992] with θj = 0 (i.e., average capacity). Sellers and Shmueli's (2010) approximate variance formula was used for the calculation of the item-specific dispersion indices (which was justified because all ν were < 1). However, the dispersion indices also yielded a different pattern than the NB2CM item-specific dispersion parameters. That is, the dispersion index increased from 1.18 to 2.58 at the fourth time interval (1992–1995) and then decreased to a value of 1.90 for the last interval (2000–2003).
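For reference, the approximate Conway-Maxwell-Poisson moments in question are commonly given as follows (a sketch; λ denotes the rate parameter of the standard CMP parameterization and ν the dispersion parameter), from which the dispersion index follows directly:

$$\mathrm{E}(Y) \approx \lambda^{1/\nu} - \frac{\nu - 1}{2\nu}, \qquad \mathrm{Var}(Y) \approx \frac{\lambda^{1/\nu}}{\nu}, \qquad \text{dispersion index} = \frac{\mathrm{Var}(Y)}{\mathrm{E}(Y)}.$$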

Furthermore, as expected when overdispersion is present, the RPCM led to a huge overestimation of the empirical reliability of the capacity estimates (0.837) as compared to any of the other models that take overdispersion into account (range of empirical reliability estimates: 0.682–0.694). This overestimation resulted from both an overestimation of the variance of the capacity estimate distribution and an underestimation of the average standard errors of the capacity estimates (see Table 1). While empirical reliability estimates were found to be highly comparable across all models with dispersion parameters (see Table 1), this was not the case for conditional reliability (see Fig. 1). In Fig. 1, conditional reliability estimates are plotted against the z-transformed capacity estimates for all count models. Conditional reliability based on the RPCM was clearly overestimated across the full range of capacity estimates. In addition, conditional reliability estimates for the NB2CMs and CMPCMs were quite comparable up to a capacity of 1.5 SDs above the mean. For values greater than 1.5 SDs above the mean, it is clearly visible in Fig. 1 that the CMPCMs overestimate conditional reliability for highly productive inventors (this interpretation is made relative to the NB2CM with item-specific dispersion as the best fitting model for this dataset). The NB2CM with item-specific dispersion also displayed slightly lower conditional reliabilities in the top capacity range as compared to the NB2CM with a global dispersion parameter. Hence, particularly for the most productive inventors in this sample, the choice of the distributional model and the approach to dispersion modeling were crucial for an accurate assessment of the reliability of capacity estimates.

Fig. 1

Bivariate scatterplot of the conditional reliability estimates for the respective researcher capacity estimates (y axis) against the z-standardized capacity estimates (x axis) for the patent data

Mutz and Daniel’s (2018a, b) dataset

All parameter estimates of the fitted models (with the exception of the NB2CMs) can be found in Table 2. The NB2CM with a global dispersion parameter (AIC = 10,350.75; BIC = 10,393.38) fitted better than the NB1CM with a global dispersion parameter (AIC = 11,479.16; BIC = 11,521.80). However, for this dataset we chose the NB1CMs over the NB2CMs because the NB1CM with item-specific dispersion (AIC = 10,004.21; BIC = 10,068.16) fitted the data better than the NB2CM with item-specific dispersion (whose estimation had so many technical problems that information criteria could not be calculated) and, thus, provided stronger competition for the overall model comparison procedure. In addition, we had to fix the item-specific dispersion parameters for NUMPUB and NUMCIT to the same value to deal with the technical problems generally observed for the negative binomial models with item-specific dispersion. Notably, these problems were not unexpected because Mutz and Daniel (2018b) found that the dispersion parameters for these items adhered to the Poisson model.

The absolute values of the item easiness parameter estimates (see Table 2) differed considerably more strongly across the fitted models than in the patent dataset (see Table 1). Both models with a general dispersion parameter displayed overdispersion and fitted better than the RPCM (see Table 2). The order of the dispersion parameter estimates in the two models with item-specific dispersion parameters was highly comparable. However, the CMPCM with item-specific dispersion displayed underdispersion for NUMPUB and NUMCIT, whereas the large negative values for the log-dispersion in the NB1CM for these two items indicated that dispersion was at the lower limit of the parameter space (i.e., these items adhered to the Poisson model; see also Mutz and Daniel 2018b). The CMPCM with item-specific dispersion was unambiguously the best fitting model across all criteria. This observation is crucial because it highlights the inability of negative binomial models to take underdispersion into account. Clearly, the presence of underdispersion here caused technical problems for the estimation of the negative binomial models. In addition, it is important to consider that IRT models deal with conditional distributions, which can display underdispersion even when the unconditional distribution of, for example, the number of publications does not.

Empirical reliability estimates for this dataset were much more comparable across the fitted models (range from 0.942 to 0.981; see Table 2) than for the patent data (see Table 1). Again, the highest reliability estimate resulted for the RPCM. However, the best fitting model here produced a reliability estimate of almost the same size (i.e., 0.973). Hence, even in situations in which the data require item-specific dispersion modeling, RPCM-based reliability can appear to be rather accurately estimated because some items may display overdispersion (here: TOTCIT, SHORTCIT, TOP10%, and PUBINT) and some items may display underdispersion (here: NUMPUB and NUMCIT), so that the resulting biases partially offset each other. Nonetheless, the variance of the capacity distribution was clearly overestimated in the RPCM as compared to the CMPCM with item-specific dispersion, and the average standard error was also slightly overestimated. All other models resulted in higher average standard errors because these models all modeled overdispersion (i.e., the negative binomial models can only model overdispersion, and the CMPCM with global dispersion empirically demonstrated overdispersion).

Figure 2 shows the conditional reliability plot for Mutz and Daniel’s dataset. The plot clearly shows that both NB1CMs and the CMPCM with a global dispersion parameter would have led to strong underestimates of conditional reliability across almost the full range of capacity. In addition, the differences between the best fitting CMPCM with item-specific dispersion and all other dispersion models decrease with increasing capacity. Compared to the best fitting CMPCM with item-specific dispersion, the RPCM tends to overestimate conditional reliability somewhat more strongly towards the lower tail of capacity (see Fig. 2).

Fig. 2

Bivariate scatterplot of the conditional reliability estimates for the respective researcher capacity estimates (y axis) against the z-standardized capacity estimates (x axis) for Mutz and Daniel’s (2018b) dataset

Discussion

Accurate evaluations of researcher capacity are of high practical value in several contexts requiring selection decisions (e.g., for academic jobs, funding allotment, or research awards). The attractiveness of IRT models in this context has been highlighted in recent research (Mutz and Daniel 2018b), and the current work extends this idea in important ways. First, less complex models for the expected values were considered to reduce requirements with respect to sample size. Second, a model based on the Conway-Maxwell-Poisson distribution was added to the pool of candidate distributions. This distribution is more flexible than the frequently used negative binomial distribution because it can handle equidispersion, overdispersion, and underdispersion at the item level. Finally, this study focused on the reliability of researcher capacity estimates and the complex interplay between the chosen distributional model and conditional reliability (i.e., measurement precision at specific capacity levels).

This study further amplifies the call for item-specific dispersion modeling because, in both studied datasets, models with item-specific dispersion fitted best. Moreover, it has been demonstrated that CMPCMs are a useful alternative model for bibliometric indicators. Bibliometric indicators are commonly known to have unconditional distributions that display overdispersion. However, IRT count data models deal with conditional distributions, and the findings for Mutz and Daniel’s dataset convincingly demonstrate that underdispersion at the item level can be present in the data and needs to be taken into account appropriately. Otherwise, technical problems and inaccurate estimation of the reliability of capacity estimates are to be expected. It is therefore highly recommended that a careful comparison of various competing IRT count data models precede any examination of reliability.

Beyond the reliability concept as examined in this work (i.e., person separation of capacity as a latent variable), one might be interested in the reliability of a single observed indicator (see Allison 1978). Dispersion parameters in the IRT models used in this work were found to be indicator-specific across both studied datasets, and they were estimated jointly with all other model parameters. However, when only a subset of indicators is available, with the others missing at random, the derived model parameters can be used to calculate ability based on that subset, along with the conditional variance and reliability. This can also be done in the extreme case of a single indicator (e.g., total number of published articles). This is conceptually similar to, but mathematically distinct from, Allison's (1978) approach, which is based on a conditional Poisson assumption for each indicator and does not require parameter estimates from a whole set of indicators. These reliability considerations are beyond the scope of the current work, and we strongly recommend future research that closely examines varying conceptions of measurement precision.

This study is clearly limited to models that are less complex than those used by Mutz and Daniel (2018b). They used variants of the ICCPCM (Doebler et al. 2014), including a negative binomial extension of the model, within a Bayesian estimation framework. We decided to use more parsimonious models in this work for several reasons. First, the focus was on the interplay between the reliability of researcher capacity estimates and the chosen distributional model, which is already quite complex. Hence, we chose simpler models to keep the focus on our research question. Second, an ICCPCM extension of the CMPCM (i.e., an ICCCMPCM) seems possible and straightforward in terms of theory, but estimation routines that allow the application of such an extension are, to the best of our knowledge, currently not available. For example, Bayesian estimation of the CMPCM, as the basis for an ICCCMPCM, is currently not available in ready-to-use software packages for GLMMs. Should this become an available alternative in the future, potential extensions of the CMPCM such as an ICCCMPCM will deserve close examination.

Moreover, there are calls in the literature to use a variety of indicators (Moed and Halevi 2015), whereas others consider mere productivity the most important indicator for funding and promotion (Yair and Goldstein 2020). In this work, productivity across the careers of inventors and multiple bibliometric indices for a sample of social science researchers were studied, but the intention of this work was not to take a position on which approach to measurement (i.e., mere productivity vs. multi-faceted indicators) might be better. In fact, any evaluation of researchers’ performance is best guided by the concrete goals and consequences related to the respective selection decision (or other reason for such an evaluation).

Finally, it should be noted that recent research has extended the CMP distribution, especially with the goal of making the NB distribution one of its limiting cases (Chakraborty and Ong 2016; Chakraborty and Imoto 2016; Imoto 2014). This is a potential avenue for a unification of the presented models, although regression modeling software implementing one of these extensions together with random effects is currently not available. We also caution that none of the mentioned distributions lends itself naturally to a mean parameterization in the sense of Huang (2017), which complicates model interpretation.

Conclusion

The current work has added important points to consider when such evaluations are based on IRT-based estimates of researcher capacity. A sufficiently large set of alternative models has been examined, with lower requirements with respect to sample size as compared to previous studies (Mutz and Daniel 2018b). In particular, the CMPCM was introduced in this work as a model that balances the needed flexibility with sample size requirements. Most importantly, this study clearly demonstrated that model choice must precede any analysis of the reliability of researcher capacity estimates. Similarly, when standard errors of researcher capacity estimates are to be used, say to construct confidence intervals, model choice is equally important. Finally, count data models in general were shown to be well suited for contexts in which decisions about the best researchers are to be made, because all count data models displayed the highest level of conditional reliability in the top ranges of researcher capacity.