Abstract
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual developer activity may be random and hard to predict, project-level development behavior can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes: algorithms such as k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. That error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted, using recent data, for predicting multiple health indicators of open-source projects. To facilitate open science (and replications and extensions of this work), all our materials are available online at https://github.com/arennax/Health_Indicator_Prediction.
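As a rough illustration of the kind of comparison the abstract describes, the sketch below fits the five named regressors, each with scikit-learn defaults, and reports a relative-error score. The synthetic data, feature names, and error metric are stand-ins, not the paper's actual dataset or evaluation pipeline.

```python
# Illustrative sketch only: five default-setting regressors on synthetic
# "monthly project" data. The features and metric are hypothetical.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))            # stand-in for monthly commits, PRs, issues, ...
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "KNN": KNeighborsRegressor(),
    "SVR": SVR(),
    "RFT": RandomForestRegressor(random_state=0),
    "LNR": LinearRegression(),
    "CART": DecisionTreeRegressor(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    # Mean relative error; small epsilon guards against division by zero.
    mre = np.mean(np.abs(pred - y_te) / (np.abs(y_te) + 1e-8))
    print(f"{name}: mean relative error = {mre:.3f}")
```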
Notes
In a recent TSE’21 article, we explained why SE hyperparameter optimization can be so simple: SE data can be intrinsically simpler than other kinds of data and, hence, simpler to explore (see Figure 6d of Agrawal et al. (2021)).
We use default settings for the baselines to determine whether they can provide good prediction performance, and how much room hyperparameter tuning has to improve on them. Using pre-selected parameter settings from the literature may introduce bias because of differences in data formats or prediction tasks.
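One of the tuners named in these notes is differential evolution (Storn and Price 1997). The sketch below shows what DE-based tuning of a CART learner could look like; the objective, bounds, data, and budget here are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch of hyperparameter tuning via differential evolution.
# The search space and fitness function are hypothetical examples.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)

def objective(params):
    # Decode DE's continuous vector into discrete hyperparameters.
    max_depth, min_leaf = int(round(params[0])), int(round(params[1]))
    model = DecisionTreeRegressor(max_depth=max_depth,
                                  min_samples_leaf=min_leaf, random_state=0)
    # DE minimizes, so return negated cross-validated R^2.
    return -cross_val_score(model, X, y, cv=3).mean()

result = differential_evolution(objective, bounds=[(1, 20), (1, 20)],
                                seed=0, maxiter=10, popsize=5)
print("best hyperparameters:", np.round(result.x), "score:", -result.fun)
```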
I.e., a maximum of 200 evaluations for Random Search, Grid Search, Flash, and DE; for ASKL, the maximum runtime for each project is restricted to 15 seconds. Please see Section 5.1 for details.
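The 200-evaluation budget above can be mirrored with scikit-learn's `RandomizedSearchCV` by setting `n_iter=200`. This is only a sketch of the budgeting idea: the CART search space below is a made-up example, not the paper's actual search space.

```python
# Sketch of a 200-evaluation Random Search budget (hypothetical search space).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=300)

param_space = {                          # illustrative hyperparameter ranges
    "max_depth": list(range(1, 21)),
    "min_samples_leaf": list(range(1, 21)),
    "min_samples_split": list(range(2, 21)),
}
# n_iter=200 caps the number of candidate configurations evaluated,
# matching the "maximum of 200 evaluations" budget described above.
search = RandomizedSearchCV(DecisionTreeRegressor(random_state=0),
                            param_space, n_iter=200, cv=3, random_state=0)
search.fit(X, y)
print("best configuration:", search.best_params_)
```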
In the Apache Software Foundation, projects can be canceled and “moved to the attic” (https://attic.apache.org) when they are unable to muster three votes for a release, lack active contributors, or cannot fulfill their reporting duties to the Foundation.
References
Aggarwal K, Hindle A, Stroulia E (2014) Co-evolution of project documentation and popularity within github. In: Proceedings of the 11th working conference on mining software repositories, pp 360–363
Agrawal A, Fu W, Chen D, Shen X, Menzies T (2019) How to “DODGE” complex software analytics. IEEE Trans Softw Eng
Agrawal A, Menzies T (2018) Is “better data” better than “better data miners”? In: 2018 IEEE/ACM 40th international conference on software engineering (ICSE), IEEE, pp 1050–1061
Agrawal A, Menzies T, Minku LL, Wagner M, Yu Z (2018) Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers. arXiv:1812.01550
Agrawal A, Yang X, Agrawal R, Yedida R, Shen X, Menzies T (2021) Simpler hyperparameter optimization for software analytics: Why, how, when. IEEE Trans Softw Eng, 1–1. https://doi.org/10.1109/TSE.2021.3073242
Bao L, Xia X, Lo D, Murphy GC (2019) A large scale study of long-time contributor prediction for github projects. IEEE Trans Softw Eng
Bergstra JS, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems, pp 2546–2554
Bidoki NH, Sukthankar G, Keathley H, Garibay I (2018) A cross-repository model for predicting popularity in github. In: 2018 international conference on computational science and computational intelligence (CSCI), IEEE, pp 1248–1253
Borges H, Hora A, Valente MT (2016a) Predicting the popularity of github repositories. In: Proceedings of the 12th international conference on predictive models and data analytics in software engineering, pp 1–10
Borges H, Hora A, Valente MT (2016b) Understanding the factors that impact the popularity of github repositories. In: 2016 IEEE international conference on software maintenance and evolution (ICSME), IEEE, pp 334–344
Shepperd M, MacDonell S (2012) Evaluating prediction systems in software project estimation. IST 54(8):820–827
Chen C, Twycross J, Garibaldi JM (2017) A new accuracy measure based on bounded relative error for time series forecasting. PLoS One 12(3)
Chen F, Li L, Jiang J, Zhang L (2014) Predicting the number of forks for open source software project. In: Proceedings of the 2014 3rd International workshop on evidential assessment of software technologies, pp 40–47
Coelho J, Valente M T, Milen L, Silva L L (2020) Is this github project maintained? measuring the level of maintenance activity of open-source projects. Information and Software Technology 122
Cohen PR (1995) Empirical methods for artificial intelligence. MIT Press, Cambridge, MA, USA
Crowston K, Howison J (2006) Assessing the health of open source communities. Computer 39(5):89–91
Das S, Mullick S S, Suganthan P N (2016) Recent advances in differential evolution–an updated survey. Swarm and Evolutionary Computation 27:1–30
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7:1–30
Feldt R, Magazinius A (2010) Validity threats in empirical software engineering research-an initial survey. In: SEKE, pp 374–379
Feurer M, Klein A, Eggensperger K, Springenberg J T, Blum M, Hutter F (2019) Auto-sklearn: Efficient and robust automated machine learning. In: Automated Machine Learning. Springer, Cham, pp 113–134
Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion mmre. TSE 29(11):985–995
Apache Software Foundation (2018) Apache Software Foundation projects. https://projects.apache.org/projects.html
Linux Foundation (2020) Community health analytics open source software. https://chaoss.community/
Linux Foundation (2020) Linux Foundation projects. https://www.linuxfoundation.org/projects/directory/
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics 11(1):86–92
Fu W, Menzies T, Shen X (2016) Tuning for software analytics: Is it really necessary?. IST Journal 76:135–146
Fu W, Nair V, Menzies T (2016) Why is differential evolution better than grid search for tuning defect predictors?. arXiv:1609.02613
Georg JPL, Germonprez M (2018) Assessing open source project health
Han J, Deng S, Xia X, Wang D, Yin J (2019) Characterization and prediction of popular projects on github. In: 2019 IEEE 43rd annual computer software and applications conference (COMPSAC), IEEE, vol 1, pp 21–26
Herbold S (2017) Comments on ScottKnottESD in response to “An empirical comparison of model validation techniques for defect prediction models”. IEEE Trans Softw Eng 43(11):1091–1094
Herbold S, Trautsch A, Grabowski J (2018) Correction of “A comparative study to benchmark cross-project defect prediction approaches”. IEEE Trans Softw Eng 45(6):632–636
Hohl P, Stupperich M, Münch J, Schneider K (2018) An assessment model to foster the adoption of agile software product lines in the automotive domain. In: 2018 IEEE international conference on engineering, technology and innovation (ICE/ITMC), IEEE, pp 1–9
Jansen S (2014) Measuring the health of open source software ecosystems: Beyond the scope of project health. Inf Softw Technol 56(11):1508–1519
Jarczyk O, Jaroszewicz S, Wierzbicki A, Pawlak K, Jankowski-Lorek M (2018) Surgical teams on github: Modeling performance of github project development processes. Inf Softw Technol 100:32–46
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German D M, Damian D (2014) The promises and perils of mining github. In: Proceedings of the 11th working conference on mining software repositories, pp 92–101
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German D M, Damian D (2016) An in-depth study of the promises and perils of mining github. Empir Softw Eng 21(5):2035–2071
Kikas R, Dumas M, Pfahl D (2016) Using dynamic and contextual features to predict issue lifetime in github projects. In: 2016 IEEE/ACM 13th working conference on mining software repositories (MSR), IEEE, pp 291–302
Kitchenham B A, Pickard L M, MacDonell S G, Shepperd M J (2001) What accuracy statistics really measure. IEEE Softw 148(3):81–85
Korte M, Port D (2008) Confidence in software cost estimation results based on mmre and pred. In: PROMISE’08, pp 63–70
Krishna R, Agrawal A, Rahman A, Sobran A, Menzies T (2018) What is the connection between issues, bugs, and enhancements?. In: 2018 IEEE/ACM 40th international conference on software engineering: software engineering in practice track (ICSE-SEIP), IEEE, pp 306–315
Krishna R, Nair V, Jamshidi P, Menzies T (2021) Whence to learn? transferring knowledge in configurable systems using BEETLE. IEEE Trans Softw Eng 47(12):2956–2972. https://doi.org/10.1109/TSE.2020.2983927
Langdon W B, Dolado J, Sarro F, Harman M (2016) Exact mean absolute error of baseline predictor, MARP0. IST 73:16–18
Liao Z, Yi M, Wang Y, Liu S, Liu H, Zhang Y, Zhou Y (2019) Healthy or not: A way to predict ecosystem health in github. Symmetry 11(2):144
Manikas K, Hansen K M (2013) Reviewing the health of software ecosystems-a conceptual framework proposal. In: Proceedings of the 5th international workshop on software ecosystems (IWSECO), Citeseer, pp 33–44
Minku L L (2019) A novel online supervised hyperparameter tuning procedure applied to cross-company software effort estimation. Empir Softw Eng 24 (5):3153–3204
Molokken K, Jorgensen M (2003) A review of software surveys on software effort estimation. In: Empirical Software Engineering, 2003. ISESE 2003. Proceedings. 2003 International Symposium on, IEEE, pp 223–230
Munaiah N, Kroh S, Cabrey C, Nagappan M (2017) Curating github for engineered software projects. Empir Softw Eng 22(6):3219–3253
Nagy A, Njima M, Mkrtchyan L (2010) A bayesian based method for agile software development release planning and project health monitoring. In: 2010 international conference on intelligent networking and collaborative systems, IEEE, pp 192–199
Nair V, Yu Z, Menzies T, Siegmund N, Apel S (2018) Finding faster configurations using flash. IEEE Transactions on Software Engineering 1–1. https://doi.org/10.1109/TSE.2018.2870895
Nemenyi PB (1963) Distribution-free multiple comparisons. Princeton University
Paasivaara M, Behm B, Lassenius C, Hallikainen M (2018) Large-scale agile transformation at ericsson: a case study. Empir Softw Eng 23(5):2550–2596
Parnin C, Helms E, Atlee C, Boughton H, Ghattas M, Glover A, Holman J, Micco J, Murphy B, Savor T et al (2017) The top 10 adages in continuous deployment. IEEE Softw 34(3):86–95
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: Machine learning in python. J Mach Learn Res 12:2825–2830
Port D, Korte M (2008) Comparative studies of the model evaluation criterion mmre and pred in software cost estimation research. In: ESEM’08, pp 51–60
Qi F, Jing X-Y, Zhu X, Xie X, Xu B, Ying S (2017) Software effort estimation based on open source projects: Case study of github. Inf Softw Technol 92:145–157
Santos A R, Kroll J, Sales A, Fernandes P, Wildt D (2016) Investigating the adoption of agile practices in mobile application development. In: ICEIS (1), pp 490–497
Sarro F, Petrozziello A, Harman M (2016) Multi-objective software effort estimation. In: ICSE, ACM, pp 619–630
Shepperd M, Cartwright M, Kadoda G (2000) On building prediction systems for software engineers. EMSE 5(3):175–182
Shrikanth NC, Menzies T (2021) The early bird catches the worm: Better early life cycle defect predictors. arXiv:2105.11082
Snoek J, Larochelle H, Adams R P (2012) Practical bayesian optimization of machine learning algorithms. arXiv:1206.2944
Stensrud E, Foss T, Kitchenham B, Myrtveit I (2003) A further empirical investigation of the relationship of mre and project size. ESE 8(2):139–161
Stewart K (2019) Personal communication
Storn R, Price K (1997) Differential evolution–a simple and efficient heuristic for global optimization over cont. spaces. JoGO 11(4):341–359
Tantithamthavorn C, McIntosh S, Hassan A E, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models. In: Proceedings of the 38th international conference on software engineering, pp 321–332
Tantithamthavorn C, McIntosh S, Hassan A E, Matsumoto K (2018) The impact of automated parameter optimization on defect prediction models. IEEE Trans Softw Eng 45(7):683–711
Tu H, Menzies T (2021) Frugal: Unlocking ssl for software analytics
Tu H, Papadimitriou G, Kiran M, Wang C, Mandal A, Deelman E, Menzies T (2021) Mining workflows for anomalous data transfers. In: 2021 IEEE/ACM 18th international conference on mining software repositories (MSR), pp 1–12
Wahyudin D, Mustofa K, Schatten A, Biffl S, Tjoa A M (2007) Monitoring the “health” status of open source web-engineering projects. International Journal of Web Information Systems
Wang T, Zhang Y, Yin G, Yu Y, Wang H (2018) Who will become a long-term contributor? a prediction model based on the early phase behaviors. In: Proceedings of the Tenth Asia-Pacific symposium on internetware, pp 1–10
Weber S, Luo J (2014) What makes an open source code popular on GitHub?. In: 2014 IEEE international conference on data mining workshop, IEEE, pp 851–855
Witten I H, Frank E, Hall M A (2011) Data mining: Practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA
Wu G, Shen X, Li H, Chen H, Lin A, Suganthan P N (2018) Ensemble of differential evolution variants. Inf Sci 423:172–186
Wynn Jr D (2007) Assessing the health of an open source ecosystem. In: Emerging Free and Open Source Software Practices. IGI Global, pp 238–258
Xia T (2021) Principles of project health for open source software
Xia T, Shu R, Shen X, Menzies T (2020) Sequential model optimization for software effort estimation. IEEE Transactions on Software Engineering
Yu Y, Wang H, Yin G, Wang T (2016) Reviewer recommendation for pull-requests in github: What can we learn from code review and bug assignment?. Inf Softw Technol 74:204–218
Zemlin J (2017) If you can’t measure it, you can’t improve it. https://www.linux.com/news/if-you-cant-measure-it-you-cant-improve-it-chaoss-project-creates-tools-analyze-software/
Acknowledgements
This work is partially funded by a National Science Foundation Grant #1703487.
Additional information
Communicated by: Federica Sarro
Cite this article
Xia, T., Fu, W., Shu, R. et al. Predicting health indicators for open source projects (using hyperparameter optimization). Empir Software Eng 27, 122 (2022). https://doi.org/10.1007/s10664-022-10171-0