# The productivity of top researchers: a semi-nonparametric approach

## Abstract

Research productivity distributions exhibit heavy tails because it is common for a few researchers to accumulate the majority of the top publications and their corresponding citations. Measurements of this productivity are very sensitive to the field being analyzed and the distribution used. In particular, distributions such as the lognormal distribution seem to systematically underestimate the productivity of the top researchers. In this article, we propose the use of a (log)semi-nonparametric distribution (log-SNP) that nests the lognormal and captures the heavy tail of the productivity distribution through the introduction of new parameters linked to high-order moments. The application uses scientific production data on 140,971 researchers who have produced 253,634 publications in 18 fields of knowledge (O’Boyle and Aguinis in Pers Psychol 65(1):79–119, 2012) and publications in the field of finance of 330 academic institutions (Borokhovich et al. in J Finance 50(5):1691–1717, 1995), and shows that the log-SNP distribution outperforms the lognormal and provides more accurate measures for the high quantiles of the productivity distribution.

## Notes

1. Different weight functions $$w(x)$$ can be used; for details, see Abramowitz and Stegun (1972, pp. 774–775). We will consider $$P_{0}(x) = 1$$.

2. For more details about the Edgeworth and Gram–Charlier series, see Kendall and Stuart (1977, pp. 167–172).

3. Note that, for a given truncation order, the resulting distribution is purely parametric; the truncation order can, however, be chosen flexibly to achieve a more accurate approximation to a given distribution. Without loss of generality, we will assume that $$d_{0} = 1$$.

4. Log-SNP’s moments can be directly derived as $$E\left[ {z^{t} } \right] = e^{{\mu t + \frac{1}{2}t^{2} \sigma^{2} }} \left[ {1 + \sum\nolimits_{s = 1}^{n} {d_{s} \left( {\sigma t} \right)^{s} } } \right]$$ (see Ñíguez et al. 2013).

5. It should be noted that the different sizes of the journals in the JCR categories are a shortcoming of the selection procedure. Nevertheless, it is not clear whether any other arbitrary selection method would yield better results and, in any case, this issue does not affect the advantages of the methodology proposed in this paper.

6. For details about the data treatment, see O’Boyle and Aguinis (2012), p. 86.

7. We took the JCR of the year 2007 to be consistent with O’Boyle and Aguinis (2012), as that was the year used by the authors to select the five main journals within each field of knowledge.

8. The R code implementing the maximum likelihood estimation algorithm is available upon request.

9. Note that we did not include the $$d_{s}$$ parameters for odd s, since tests showed that they were not significantly different from zero. This result supports the view that the parameter σ captures all the relevant features of the skewness. It does not contradict the fact that the $$d_{s}$$ parameters for even s are highly significant, which means that productivity distributions have very thick tails and thus require different parameters to provide accurate measures of the “probability of being a very top researcher” in every field.

10. The quantiles of the log-SNP distribution are obtained from the cdf displayed in Eq. (15) and the Inverse Transform Method (ITM).
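
Notes 4 and 10 can be illustrated with a minimal Python sketch (not the authors' R code). It assumes that the log-SNP variable is z = exp(μ + σx), with x following the SNP distribution whose cdf is derived in Appendix 2, so that the log-SNP cdf at z is the SNP cdf at (ln z − μ)/σ; Eq. (15) is not reproduced here, so this mapping is an assumption. The function names and the parameter values are hypothetical, and the quantile is obtained by inverting the cdf numerically with a root finder.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm


def snp_cdf(a, d):
    """SNP cdf: Phi(a) - phi(a) * sum_{s>=1} d_s He_{s-1}(a), with d = [d_1, ..., d_n]."""
    # hermeval(a, c) evaluates sum_k c[k] * He_k(a); setting c[k] = d_{k+1} gives He_{s-1}.
    return norm.cdf(a) - norm.pdf(a) * hermeval(a, np.asarray(d, dtype=float))


def logsnp_quantile(p, mu, sigma, d, bracket=(-12.0, 12.0)):
    """Solve G(a) = p for a, then map back through z = exp(mu + sigma * a)."""
    a = brentq(lambda x: snp_cdf(x, d) - p, *bracket)
    return math.exp(mu + sigma * a)


def logsnp_moment(t, mu, sigma, d):
    """Closed-form E[z^t] from Note 4."""
    series = sum(d_s * (sigma * t) ** s for s, d_s in enumerate(d, start=1))
    return math.exp(mu * t + 0.5 * t**2 * sigma**2) * (1.0 + series)


def logsnp_moment_numeric(t, mu, sigma, d):
    """Cross-check of E[z^t] by numerical integration over x."""
    c = np.concatenate(([1.0], np.asarray(d, dtype=float)))  # d_0 = 1

    def integrand(x):
        # exp(t * (mu + sigma * x)) * phi(x), combined in logs to avoid overflow
        log_weight = t * (mu + sigma * x) - 0.5 * x * x - 0.5 * math.log(2.0 * math.pi)
        return math.exp(log_weight) * hermeval(x, c)

    return quad(integrand, -np.inf, np.inf)[0]


# Illustrative values only (not estimates from the paper).
mu, sigma = 0.3, 0.5
d = [0.0, 0.0, 0.0, 0.1]  # only d_4 is non-zero

q_lognormal = math.exp(mu + sigma * norm.ppf(0.999))
q_logsnp = logsnp_quantile(0.999, mu, sigma, d)
print(q_lognormal, q_logsnp)
print(logsnp_moment(2, mu, sigma, d), logsnp_moment_numeric(2, mu, sigma, d))
```

With all $$d_{s} = 0$$ the quantile reduces to the lognormal one. The root search assumes that the cdf is increasing on the bracket, which holds for this choice of $$d_{4}$$ but is not guaranteed for arbitrary parameter values.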

## References

• Abramo, G., & D’Angelo, C. A. (2014). Assessing national strengths and weaknesses in research fields. Journal of Informetrics, 8(3), 766–775.

• Abramo, G., D’Angelo, C. A., & Pugini, F. (2008). The measurement of Italian universities’ research productivity by a non parametric-bibliometric methodology. Scientometrics, 76(2), 225–244.

• Abramowitz, M., & Stegun, I. A. (1972). Handbook of mathematical functions with formulas, graphs, and mathematical tables. New York: Dover Publications.

• Aguinis, H., O’Boyle, E., Gonzalez-Mulé, E., & Joo, H. (2015). Cumulative advantage: Conductors and insulators of heavy-tailed productivity distributions and productivity stars. Personnel Psychology. doi:10.1111/peps.12095.

• Albarrán, P., Crespo, J. A., Ortuño, I., & Ruiz-Castillo, J. (2011). The skewness of science in 219 sub-fields and a number of aggregates. Scientometrics, 88(2), 385–397.

• Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2015). Bibliometric evaluation vs. informed peer review: Evidence from Italy. Research Policy, 44(2), 451–466.

• Birkmaier, D., & Wohlrabe, K. (2014). The Matthew effect in economics reconsidered. Journal of Informetrics, 8(4), 880–889.

• Blinnikov, S., & Moessner, R. (1998). Expansions for nearly Gaussian distributions. Astronomy and Astrophysics, Supplement Series, 130(1), 193–205.

• Bornmann, L. (2011). Scientific peer review. Annual Review of Information Science and Technology, 45(1), 199–245.

• Borokhovich, K. A., Bricker, R. J., Brunarski, K. R., & Simkins, B. J. (1995). Finance research productivity and influence. The Journal of Finance, 50(5), 1691–1717.

• Broadus, R. N. (1987). Toward a definition of ‘bibliometrics’. Scientometrics, 12(5–6), 373–379.

• Campanario, J. M. (2015). Providing impact: The distribution of JCR journals according to references they contribute to the 2-year and 5-year journal impact factors. Journal of Informetrics, 9(2), 398–407.

• Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In J. Heckman & E. Leamer (Eds.), Handbook of econometrics, Ch. 76, Part B (Vol. 6, pp. 5549–5632). Amsterdam: Elsevier.

• Chung, K. H., & Cox, R. A. (1990). Patterns of productivity in the finance literature: A study of the bibliometric distributions. The Journal of Finance, 45(1), 301–309.

• Coupé, T. (2003). Revealed performances. Worldwide rankings of economists and economics departments. Journal of the European Economic Association, 1(6), 1309–1345.

• Cramér, H. (1925). On some classes of series used in mathematical statistics. In Sixth Scandinavian congress of mathematicians (pp. 399–425). Copenhagen.

• Crespo, J. A., Ortuño-Ortín, I., & Ruiz-Castillo, J. (2012). The citation merit of scientific publications. PLoS ONE, 7(11), e49156.

• Da Silva, R., Kalil, F., De Oliveira, J. M., & Martinez, A. S. (2012). Universality in bibliometrics. Physica A: Statistical Mechanics and its Applications, 391(5), 2119–2128.

• Day, T. E. (2015). The big consequences of small biases: A simulation of peer review. Research Policy, 44(6), 1266–1270.

• Del Brio, E. B., & Perote, J. (2012). Gram–Charlier densities: Maximum likelihood versus the method of moments. Insurance: Mathematics and Economics, 51(3), 531–537.

• Duch, J., Zeng, X. T., Sales-Pardo, M., Radicchi, F., Otis, S., Woodruff, T. K., et al. (2012). The possible role of resource requirements and academic career-choice risk on gender differences in publication rate and impact. PLoS ONE, 7(12), e51332.

• Dundar, H., & Lewis, D. (1998). Determinants of research productivity in higher education. Research in Higher Education, 39(6), 607–631.

• Egghe, L. (2005). Power laws in the information production process: Lotkaian informetrics. Kidlington: Elsevier Academic Press.

• Ellison, G. (2013). How does the market use citation data? The Hirsch index in economics. American Economic Journal: Applied Economics, 5(3), 63–90.

• Eom, Y. H., & Fortunato, S. (2011). Characterizing and modeling citation dynamics. PLoS ONE, 6(9), e24926.

• Finardi, U. (2013). Correlation between journal impact factor and citation performance: An experimental study. Journal of Informetrics, 7(2), 357–370.

• Frandsen, T. F. (2005). Geographical concentration. The case of economics journals. Scientometrics, 63(1), 69–85.

• Gallant, A. R., & Nychka, D. W. (1987). Seminonparametric maximum likelihood estimation. Econometrica, 55(2), 363–390.

• Garfield, E. (1980). Bradford’s Law and related statistical pattern. Essays of an Information Scientist, 4(19), 476–483.

• Genest, C. (1997). Statistics on statistics: Measuring research productivity by journal publications between 1985 and 1995. The Canadian Journal of Statistics, 25(4), 427–443.

• Guerrero-Bote, V. P., Zapico-Alonso, F., Espinosa-Calvo, M. E., Gomez-Crisostomo, R., & Moya-Anegon, F. (2007). Import–export of knowledge between scientific subject categories: The iceberg hypothesis. Scientometrics, 71(3), 423–441.

• Harzing, A. (2008). Publish or Perish: A citation analysis software program. http://www.harzing.com/resources.htm.

• Harzing, A. W. (2014). A longitudinal study of Google Scholar coverage between 2012 and 2013. Scientometrics, 98(1), 565–575.

• Harzing, A. W., & Alakangas, S. (2016). Google Scholar, Scopus and the Web of Science: A longitudinal and cross-disciplinary comparison. Scientometrics, 106(2), 787–804.

• Harzing, A. W., & Van der Wal, R. (2008). Google Scholar as a new source for citation analysis? Ethics in Science and Environmental Politics, 8(1), 61–73.

• Heberger, A. E., Christie, C. A., & Alkin, M. C. (2010). A bibliometric analysis of the academic influences of and on evaluation theorists’ published works. American Journal of Evaluation, 31(1), 24–44.

• Hodgson, G. M., & Rothman, H. (1999). The editors and authors of economics journals: A case of institutional oligopoly? The Economic Journal, 109(453), 165–186.

• Kaur, J., Ferrara, E., Menczer, F., Flammini, A., & Radicchi, F. (2015). Quality versus quantity in scientific impact. Journal of Informetrics, 9(4), 800–808.

• Kaur, J., Radicchi, F., & Menczer, F. (2013). Universality of scholarly impact metrics. Journal of Informetrics, 7(4), 924–932.

• Kendall, M., & Stuart, A. (1977). The advanced theory of statistics, vol. I (4th ed.). London: C. Griffin.

• Kocher, M. G., Luptacik, M., & Sutter, M. (2006). Measuring productivity of research in economics: A cross-country study using DEA. Socio-Economic Planning Sciences, 40(4), 314–332.

• Kretschmer, H., & Kretschmer, T. (2007). Lotka’s distribution and distribution of co-author pairs’ frequencies. Journal of Informetrics, 1(4), 308–337.

• Kumar, S., Sharma, P., & Garg, K. C. (1998). Lotka’s law and institutional productivity. Information Processing and Management, 34(6), 775–783.

• Lancho-Barrantes, B. S., Guerrero-Bote, V. P., & Moya-Anegón, F. (2010). The iceberg hypothesis revisited. Scientometrics, 85(2), 443–461.

• Lotka, A. J. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Science, 16(12), 317–323.

• Martínez-Mekler, G., Martínez, R. A., del Río, M. B., Mansilla, R., Miramontes, P., & Cocho, G. (2009). Universality of rank-ordering distributions in the arts and sciences. PLoS ONE, 4(3), e4791.

• Mauleón, I., & Perote, J. (2000). Testing densities with financial data: an empirical comparison of the Edgeworth–Sargan density to the Student’s t. European Journal of Finance, 6(2), 225–239.

• Mingers, J., & Leydesdorff, L. (2015). A review of theory and practice in scientometrics. European Journal of Operational Research, 246(1), 1–19.

• Momeni, F., & Mayr, P. (2016). Evaluating co-authorship networks in author name disambiguation for common names. arXiv:1606.03857.

• Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351.

• Nicholls, P. T. (1986). Empirical validation of Lotka’s law. Information Processing and Management, 22(5), 417–419.

• Nicholls, P. T. (1989). Bibliometric modelling processes and the empirical validity of Lotka’s law. Journal of the American Society for Information Science, 40(6), 379–385.

• Nicolaisen, J., & Hjørland, B. (2007). Practical potentials of Bradford’s law: A critical examination of the received view. Journal of Documentation, 63(3), 359–377.

• Ñíguez, T.-M., Paya, I., Peel, D., & Perote, J. (2012). On the stability of the constant relative risk aversion (CRRA) utility under high degrees of uncertainty. Economics Letters, 115(2), 244–248.

• Ñíguez, T.-M., Paya, I., Peel, D., & Perote, J. (2013). Higher-order moments in the theory of diversification and portfolio composition. Economics Working Paper Series 2013/003. Lancaster University.

• O’Boyle, E., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65(1), 79–119.

• Perc, M. (2010). Zipf’s law and log-normal distributions in measures of scientific output across fields and institutions: 40 years of Slovenia’s research as an example. Journal of Informetrics, 4(2), 358–364.

• Phillips, P. C. B. (1977). A general theorem in the theory of asymptotic expansions as approximations to the finite sample distributions of econometric estimators. Econometrica, 45(6), 1517–1534.

• Price, D. S. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5), 292–306.

• Radicchi, F., Fortunato, S., & Castellano, C. (2008). Universality of citation distribution: Towards an objective measure of scientific impact. Proceedings of the National Academy of Sciences of the United States of America, 105(45), 17268–17272.

• Redner, S. (1998). How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B-Condensed Matter and Complex Systems, 4(2), 131–134.

• Rousseau, R. (1994). Bradford curves. Information Processing and Management, 30(2), 267–277.

• Ruiz-Castillo, J., & Costas, R. (2014). The skewness of scientific productivity. Journal of Informetrics, 8(4), 917–934.

• Sabharwal, M. (2013). Comparing research productivity across disciplines and career stages. Journal of Comparative Policy Analysis: Research and Practice, 15(2), 141–163.

• Sargan, D. (1975). Gram–Charlier approximations applied to t ratios of k-class estimators. Econometrica, 43(2), 327–346.

• Seggie, S. H., & Griffith, D. A. (2009). What does it take to get promoted in marketing academia? Understanding exceptional publication productivity in the leading marketing journals. Journal of Marketing, 73(1), 122–132.

• Van den Besselaar, P., & Sandström, U. (2016). What is the required level of data cleaning? A research evaluation case. Journal of Scientometric Research, 5(1), 7–12.

• Wallace, D. L. (1958). Asymptotic approximations to distributions. Annals of Mathematical Statistics, 29(3), 635–654.

• Williamson, I. O., & Cable, D. M. (2003). Predicting early career research productivity: The case of management faculty. Journal of Organizational Behavior, 24(1), 25–44.

• Yang, K., & Meho, L. I. (2006). Citation analysis: A comparison of Google Scholar, Scopus, and Web of Science. Proceedings of the American Society for Information Science and Technology, 43(1), 1–15.

## Acknowledgments

We thank Herman Aguinis and Ernest O’Boyle for allowing us to use their database on academic productivity compiled in O’Boyle and Aguinis (2012). We also thank two anonymous referees for their constructive and valuable suggestions. Financial support from the Spanish Ministry of Economics and Competitiveness (project ECO2013-44483-P), FAPA-Uniandes (project PR.3.2016.2807), and Universidad EAFIT is also gratefully acknowledged.

## Author information

### Corresponding author

Correspondence to Lina M. Cortés.

## Appendices

### Appendix 1

This appendix lists the first eight $$d_{s}$$ parameters in terms of the central moments of the SNP distribution. For more information, see Del Brio and Perote (2012).

$$d_{1} = \mu_{1}$$
(18)
$$d_{2} = \frac{1}{2}\left( {\mu_{2} - 1} \right)$$
(19)
$$d_{3} = \frac{1}{6}\left( {\mu_{3} - 3\mu_{1} } \right)$$
(20)
$$d_{4} = \frac{1}{24}\left( {\mu_{4} - 6\mu_{2} + 3} \right)$$
(21)
$$d_{5} = \frac{1}{120}\left( {\mu_{5} - 10\mu_{3} + 15\mu_{1} } \right)$$
(22)
$$d_{6} = \frac{1}{720}\left( {\mu_{6} - 15\mu_{4} + 45\mu_{2} - 15} \right)$$
(23)
$$d_{7} = \frac{1}{5040}\left( {\mu_{7} - 21\mu_{5} + 105\mu_{3} - 105\mu_{1} } \right)$$
(24)
$$d_{8} = \frac{1}{40320}\left( {\mu_{8} - 28\mu_{6} + 210\mu_{4} - 420\mu_{2} + 105} \right)$$
(25)
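
The coefficients inside the brackets of Eqs. (18)–(25) coincide with those of the probabilists' Hermite polynomials, so the eight expressions are consistent with the pattern $$d_{s} = E\left[ H_{s}(x) \right]/s!$$. The Python sketch below relies on that pattern, which is stated here as an assumption rather than taken from the appendix, and on $$\mu_{0} = 1$$; the function name `d_from_moments` is hypothetical.

```python
from math import factorial

import numpy as np
from numpy.polynomial.hermite_e import herme2poly


def d_from_moments(moments):
    """Return [d_1, ..., d_n] from the moments [mu_1, ..., mu_n].

    Assumes d_s = E[He_s(x)] / s!, which reproduces Eqs. (18)-(25) for s <= 8.
    """
    mu = np.concatenate(([1.0], np.asarray(moments, dtype=float)))  # prepend mu_0 = 1
    d = []
    for s in range(1, len(mu)):
        basis = np.zeros(s + 1)
        basis[s] = 1.0               # selects He_s in the Hermite basis
        coeffs = herme2poly(basis)   # power-basis coefficients, lowest degree first
        d.append(float(coeffs @ mu[: s + 1]) / factorial(s))
    return d


# Moments of the standard normal: every d_s should vanish.
d_normal = d_from_moments([0, 1, 0, 3, 0, 15, 0, 105])
print(d_normal)
```

Feeding in the moments of the standard normal returns zeros, in line with the SNP density nesting the normal when all $$d_{s} = 0$$.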

### Appendix 2

This appendix derives the cdf of the SNP distribution.

$$\begin{aligned} G_{x}\left( a \right) &= \int_{-\infty}^{a} g\left( x;\boldsymbol{d} \right)dx = \int_{-\infty}^{a} \phi\left( x \right)dx + \sum_{s = 1}^{n} d_{s} \int_{-\infty}^{a} H_{s}\left( x \right)\phi\left( x \right)dx \\ &= \int_{-\infty}^{a} \phi\left( x \right)dx - \left. \sum_{s = 1}^{n} d_{s} H_{s - 1}\left( x \right)\phi\left( x \right) \right|_{-\infty}^{a} \\ &= \int_{-\infty}^{a} \phi\left( x \right)dx - \phi\left( a \right)\sum_{s = 1}^{n} d_{s} H_{s - 1}\left( a \right) \end{aligned}$$

The second line of the derivation uses the antiderivative

$$\begin{aligned} \int H_{s}\left( x \right)\phi\left( x \right)dx &= \int \left( -1 \right)^{s} \frac{d^{s}\phi\left( x \right)}{dx^{s}}dx = \left( -1 \right)^{s} \frac{d^{s - 1}\phi\left( x \right)}{dx^{s - 1}} \\ &= \left( -1 \right)^{s}\left( -1 \right)^{s - 1} H_{s - 1}\left( x \right)\phi\left( x \right) = -H_{s - 1}\left( x \right)\phi\left( x \right), \end{aligned}$$

and the third line uses the fact that $$\lim_{x \to \pm\infty} H_{s}\left( x \right)\phi\left( x \right) = 0 \quad \forall s \ge 0.$$
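
The closed form derived in this appendix can be checked numerically. The Python sketch below compares it with direct numerical integration of the SNP density, assuming the density $$g(x;\boldsymbol{d}) = \phi(x)\left[ 1 + \sum\nolimits_{s = 1}^{n} d_{s} H_{s}(x) \right]$$ with probabilists' Hermite polynomials; the function names and the values of $$d_{s}$$ are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.integrate import quad
from scipy.stats import norm


def snp_pdf(x, d):
    """g(x; d) = phi(x) * (1 + sum_{s=1}^n d_s He_s(x)), with d = [d_1, ..., d_n]."""
    c = np.concatenate(([1.0], np.asarray(d, dtype=float)))  # d_0 = 1
    return norm.pdf(x) * hermeval(x, c)


def snp_cdf_closed(a, d):
    """Closed form: Phi(a) - phi(a) * sum_{s=1}^n d_s He_{s-1}(a)."""
    # Passing d directly as coefficients pairs d_s with He_{s-1}.
    return norm.cdf(a) - norm.pdf(a) * hermeval(a, np.asarray(d, dtype=float))


def snp_cdf_numeric(a, d):
    """Numerical integral of the density from -infinity to a."""
    return quad(snp_pdf, -np.inf, a, args=(d,))[0]


d = [0.0, 0.0, 0.0, 0.1]  # illustrative: only d_4 is non-zero
for a in (-2.0, 0.0, 1.5, 3.0):
    print(a, snp_cdf_closed(a, d), snp_cdf_numeric(a, d))
```

The two columns should agree up to the tolerance of the numerical integrator.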

## Rights and permissions

Cortés, L.M., Mora-Valencia, A. & Perote, J. The productivity of top researchers: a semi-nonparametric approach. Scientometrics 109, 891–915 (2016). https://doi.org/10.1007/s11192-016-2072-5