Abstract
Research output, measured as the amount of published peer-reviewed literature, varies greatly across countries. Besides identifying the causal determinants of these differences, an important task of scientometric research is to make accurate predictions of countries’ future research output. Building on previous research on the key drivers of differences in countries’ research output, this study develops a model that includes sixteen macro-level predictors representing aspects of the research and economic system, the political conditions, and the structural and cultural attributes of countries. Applying a machine learning procedure called boosted regression trees, the study demonstrates that these predictors are sufficient for making highly accurate forecasts of countries’ research output across scientific disciplines. The study also shows that a functionally flexible procedure like boosted regression trees can substantially increase the predictive power of the model compared to traditional regression. Finally, the results offer a different perspective on the functional forms of the relations between the predictors and the response variable.
Notes
Data on all predictors from 2004 to 2012 were used for fitting the model (Step 1), predictor data from 2013 were used for model validation (Step 2), and predictor data from 2014 were used for forecasting research output in 2015. (See “Analytical strategy” section).
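The three-step use of the data described in this note can be sketched as a simple split by predictor year. The row layout and the function name `split_by_year` are illustrative assumptions, not the study's actual code, and the toy panel contains made-up countries.

```python
def split_by_year(rows, fit_years=range(2004, 2013),
                  validation_year=2013, forecast_year=2014):
    """Partition country-year rows by the year of the predictor data.

    Hypothetical layout: each row is a dict with at least a "year" key.
    """
    fit = [r for r in rows if r["year"] in fit_years]            # Step 1: model fitting
    validate = [r for r in rows if r["year"] == validation_year]  # Step 2: validation
    forecast = [r for r in rows if r["year"] == forecast_year]    # Step 3: forecasting
    return fit, validate, forecast


# Toy panel: two made-up countries, one row per predictor year (2004 to 2014).
rows = [{"country": c, "year": y} for c in ("A", "B") for y in range(2004, 2015)]
fit, validate, forecast = split_by_year(rows)
print(len(fit), len(validate), len(forecast))
```

Splitting strictly by year, rather than at random, keeps all validation and forecasting data later in time than the data used for fitting.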
Since sufficient data could only be retrieved for 2015, these were used for the complete period. Consequently, the number of universities was treated as a constant.
The automatic identification of the optimal tree number is implemented in the Stata plugin boost (Schonlau 2005). This plugin was used for all the predictions made in this study.
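As a rough illustration of how an optimal tree number can be identified automatically, the following sketch grows boosted one-split trees (stumps) on toy data and keeps the iteration count with the lowest error on held-out observations. It is a generic holdout-based selection written for illustration and is not claimed to reproduce the internals of the boost plugin. All function names, the shrinkage value, and the data are assumptions.

```python
import random


def rmse(actual, predicted):
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


def fit_stump(x, residuals):
    """Find the single split on x that minimizes squared error of the residuals."""
    values = sorted(set(x))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost_with_best_iteration(x_tr, y_tr, x_va, y_va, max_trees=150, shrinkage=0.1):
    """Boost stumps on squared error; return the tree count with the lowest holdout RMSE."""
    base = sum(y_tr) / len(y_tr)
    pred_tr = [base] * len(y_tr)
    pred_va = [base] * len(y_va)
    best_n, best_err = 0, rmse(y_va, pred_va)
    for n in range(1, max_trees + 1):
        residuals = [a - p for a, p in zip(y_tr, pred_tr)]
        stump = fit_stump(x_tr, residuals)
        pred_tr = [p + shrinkage * predict_stump(stump, xi) for p, xi in zip(pred_tr, x_tr)]
        pred_va = [p + shrinkage * predict_stump(stump, xi) for p, xi in zip(pred_va, x_va)]
        err = rmse(y_va, pred_va)
        if err < best_err:
            best_n, best_err = n, err
    return best_n, best_err


def make_data(n):
    # Made-up data: a noisy step function of a single predictor.
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [(2.0 if xi > 5 else 0.0) + random.gauss(0, 0.3) for xi in x]
    return x, y


random.seed(0)
x_train, y_train = make_data(60)
x_valid, y_valid = make_data(40)
baseline_err = rmse(y_valid, [sum(y_train) / len(y_train)] * len(y_valid))
best_n, best_err = boost_with_best_iteration(x_train, y_train, x_valid, y_valid)
print(best_n, round(best_err, 3), round(baseline_err, 3))
```

Tracking the holdout error after every added tree makes the choice of tree number a by-product of a single boosting run, so no separate model has to be fitted for each candidate number.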
To account for the increasing trends of many variables, a time trend term was included in the FE and RE models.
Additionally, a negative binomial regression was estimated without log-transforming the dependent variable. Since the results of the negative binomial regression were even less accurate than those from OLS, this approach was not pursued any further. Such a result is not unusual, given that employing log-transformed dependent variables within a linear approach may be more appropriate in some cases than using a count data model (Thelwall and Wilson 2014).
To test whether the accuracy of the BRT model depends on the number of observations used to train it, and whether it loses predictive accuracy when applied to data from a more distant future, I performed another sensitivity test: I trained the model with data from 2004 to 2009 and validated it with data from 2014. In other words, the model trained with predictor data up to 2009 was used to predict ln(docs) in 2014, and the predicted values were then compared with the actual 2014 values. Although the prediction error increased slightly (RMSE = 0.29), the loss in predictive accuracy is moderate. Thus, even when considerably fewer observations are used for training and the trained model is validated with data further in the future, BRT still outperforms traditional regression approaches in terms of predictive accuracy.
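The comparison of predicted and actual ln(docs) values rests on the root mean squared error. The sketch below computes it for made-up values and, as a rough aid to interpretation, converts a log-scale error of 0.29 into the corresponding multiplicative factor on the document count. The numbers in the two lists are invented for illustration only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Invented ln(docs) values for four hypothetical countries.
actual_ln_docs = [6.2, 8.9, 10.4, 4.7]
predicted_ln_docs = [6.0, 9.2, 10.1, 5.0]

error = rmse(actual_ln_docs, predicted_ln_docs)

# An error of 0.29 on the log scale corresponds to a predicted document
# count that is off by a factor of exp(0.29), i.e. roughly one third.
factor = math.exp(0.29)

print(round(error, 3), round(factor, 2))
```

Because the response is log-transformed, errors are relative rather than absolute: the same RMSE implies the same proportional deviation for countries with small and large publication counts.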
References
Abramo, G., & D’Angelo, C. A. (2014). How do you define and measure research productivity? Scientometrics, 101, 1129–1144.
Basu, A. (2010). Does a country’s scientific ‘productivity’ depend critically on the number of country journals indexed? Scientometrics, 82, 507–516.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont: Wadsworth.
Canagarajah, A. S. (2002). A geopolitics of academic writing. Pittsburgh: University of Pittsburgh Press.
Diaz-Puente, J. M., Cazorla, A., & Dorrego, A. (2007). Crossing national, continental, and linguistic boundaries: Toward a worldwide evaluation research community in journals of evaluation. American Journal of Evaluation, 28, 399–415.
Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77, 802–813.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378.
Friedman, J. H., & Meulman, J. J. (2003). Multiple additive regression trees with application in epidemiology. Statistics in Medicine, 22, 1365–1381.
Gantman, E. R. (2009). International differences of productivity in scholarly management knowledge. Scientometrics, 80, 155–167.
Gantman, E. R. (2012). Economic, linguistic, and political factors in the scientific productivity of countries. Scientometrics, 93, 967–985.
Gul, S., Nisa, N. T., Shah, T. A., Gupta, S., Jan, A., & Ahmad, S. (2015). Middle East: Research productivity and performance across nations. Scientometrics, 105, 1157–1166.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
Hsieh, P.-N., & Chang, P.-L. (2009). An assessment of world-wide research productivity in production and operations management. International Journal of Production Economics, 120, 540–551.
Jamjoom, B. A., & Jamjoom, A. B. (2016). Impact of country-specific characteristics on scientific productivity in clinical neurology research. eNeurologicalSci, 4, 1–3.
Kaufmann, D., Kraay, A., & Mastruzzi, M. (2011). The worldwide governance indicators: Methodology and analytical issues. Hague Journal on the Rule of Law, 3, 220–246.
King, D. A. (2004). The scientific impact of nations. Nature, 430, 311–316.
Koljatic, M. M., & Silva, M. R. (2001). The international publication productivity of Latin American countries in the economics and business administration fields. Scientometrics, 51, 381–394.
Lee, L.-C., Lin, P.-H., Chuang, Y.-W., & Lee, Y.-Y. (2011). Research output and economic productivity: A Granger causality test. Scientometrics, 89, 465–478.
Makridakis, S. G., Wheelwright, S. C., & Hyndman, R. J. (1998). Forecasting: Methods and applications (3rd ed.). New York: Wiley.
Man, J. P., Weinkauf, J. G., Tsang, M., & Sin, D. D. (2004). Why do some countries publish more than others? An international comparison of research funding, English proficiency and publication output in highly ranked general medical journals. European Journal of Epidemiology, 19, 811–817.
Meo, S. A., Al Masri, A. A., Usmani, A. M., Memon, A. N., Zaidi, S. Z., et al. (2013). Correction: Impact of GDP, spending on R&D, number of universities and scientific journals on research publications among Asian countries. PLoS One, 8, e66449.
Ntuli, H., Inglesi-Lotz, R., Chang, T., & Pouris, A. (2015). Does research output cause economic growth or vice versa? Evidence from 34 OECD countries. Journal of the Association for Information Science and Technology, 66, 1709–1716.
Origgi, G., & Ramello, G. B. (2015). Current dynamics of scholarly publishing. Evaluation Review, 39, 3–18.
Rahman, M., & Fukui, T. (2003). Biomedical research productivity: Factors across the countries. International Journal of Technology Assessment in Health Care, 19, 249–260.
Research Trends (2008). Geographical trends of research output. http://www.researchtrends.com/issue8-november-2008/geographical-trends-of-research-output. Accessed 10 Apr 2016.
Rodriguez, V., & Soeparwata, A. (2012). ASEAN benchmarking in terms of science, technology, and innovation from 1999 to 2009. Scientometrics, 92, 549–573.
Sarwar, R., & Hassan, S.-U. (2015). A bibliometric assessment of scientific productivity and international collaboration of the Islamic World in science and technology (S&T) areas. Scientometrics, 105, 1059–1077.
Schonlau, M. (2005). Boosted regression (boosting): An introductory tutorial and a Stata plugin. The Stata Journal, 5, 330–354.
Short, J. R., Boniche, A., Kim, Y., & Li, P. L. (2001). Cultural globalization, global English, and geography journals. The Professional Geographer, 53, 1–11.
Thelwall, M., & Wilson, P. (2014). Regression for citation data: An evaluation of different methods. Journal of Informetrics, 8, 963–971.
Trivedi, P. (1993). An analysis of publication lags in econometrics. Journal of Applied Econometrics, 8, 93–100.
Vinkler, P. (2008). Correlation between the structure of scientific research, scientometric indicators and GDP in EU and non-EU countries. Scientometrics, 74, 237–254.
Vinluan, L. R. (2012). Research productivity in education and psychology in the Philippines and comparison with ASEAN countries. Scientometrics, 91, 277–294.
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques. Burlington: Elsevier.
Acknowledgments
The author wishes to thank the anonymous reviewer for her/his invaluably helpful comments on an earlier version of this article.
Cite this article
Mueller, C.E. Accurate forecast of countries’ research output by macro-level indicators. Scientometrics 109, 1307–1328 (2016). https://doi.org/10.1007/s11192-016-2084-1
DOI: https://doi.org/10.1007/s11192-016-2084-1