
What makes a popular academic AI repository?


Abstract

Many AI researchers publish the code, data, and other resources that accompany their papers in GitHub repositories. In this paper, we refer to these repositories as academic AI repositories. Our preliminary study shows that highly cited papers are more likely to have popular academic AI repositories (and vice versa). Hence, we perform an empirical study on academic AI repositories to highlight the good software engineering practices of popular academic AI repositories for AI researchers. We collect 1,149 academic AI repositories, label the top 20% of repositories with the largest number of stars as popular, and label the bottom 70% as unpopular; the remaining 10% serve as a gap between the popular and unpopular groups. We propose 21 features to characterize the software engineering practices of academic AI repositories. Our experimental results show that popular and unpopular academic AI repositories differ statistically significantly in 11 of the studied features, indicating that the two groups follow significantly different software engineering practices. Furthermore, we find that the number of links to other GitHub repositories in the README file, the number of images in the README file, and the inclusion of a license are the most important features for differentiating the two groups of academic AI repositories. Our dataset and code are publicly available to the community.
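
The labeling scheme above can be illustrated with a short sketch. The Python snippet below is illustrative only; the list of (repository, star-count) pairs and all names are our own assumptions, not code from the authors' replication package. It splits a collection of repositories into the popular top 20% by stars, the unpopular bottom 70%, and the 10% gap in between.

```python
# Illustrative sketch of the star-based labeling described in the abstract.
# `repos` is assumed to be a list of (name, star_count) tuples.

def label_repositories(repos):
    """Label the top 20% of repositories by stars as 'popular', the bottom
    70% as 'unpopular', and leave the middle 10% as an unlabeled gap."""
    ranked = sorted(repos, key=lambda r: r[1], reverse=True)  # most stars first
    n = len(ranked)
    n_popular = int(n * 0.20)
    n_unpopular = int(n * 0.70)

    labels = {}
    for name, _ in ranked[:n_popular]:
        labels[name] = "popular"
    for name, _ in ranked[n - n_unpopular:]:
        labels[name] = "unpopular"
    # Repositories between the two cut-offs remain unlabeled (the 10% gap).
    return labels

if __name__ == "__main__":
    sample = [("repo-a", 5200), ("repo-b", 14), ("repo-c", 310),
              ("repo-d", 2), ("repo-e", 980)]
    print(label_repositories(sample))  # repo-a popular; c, b, d unpopular
```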


Notes

  1. https://github.com/YuanruiZJU/academic-ai-repos

  2. https://github.com/jwyang/graph-rcnn.pytorch

  3. https://github.com/zziz/pwc

  4. https://github.com/foss2serve/github-workflow-activity

  5. https://github.com/svenpeter42/LightGBM-CEGB

  6. https://github.com/microsoft/LightGBM

  7. https://developer.github.com/v3/

  8. https://mistune.readthedocs.io/en/latest/

  9. https://pmd.github.io/latest/pmd_userdocs_cpd.html

  10. https://github.com/NVIDIA/vid2vid

  11. https://github.com/yunjey/stargan

  12. https://github.com/Artifineuro/zole

  13. https://github.com/fmcp/EndToEndIncrementalLearning

  14. https://github.com/yunjey/stargan

  15. https://github.com/NVIDIA/vid2vid

  16. https://cran.r-project.org/web/packages/vcd/index.html

  17. https://cran.r-project.org/web/packages/rms/index.html

  18. https://cran.r-project.org/web/packages/randomForest/index.html

  19. https://developer.github.com/v3/

  20. https://radimrehurek.com/gensim/

  21. https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

  22. https://github.com/NVIDIA/vid2vid

  23. https://github.com/hujie-frank/SENet

  24. http://www.core.edu.au/conference-portal
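
The footnotes above point to part of the tooling used in the study, including the GitHub REST API v3, the mistune Markdown parser, PMD's copy-paste detector, and several R packages. As a rough, assumption-laden sketch of how the three most discriminative features reported in the abstract might be computed, the snippet below counts README links to other GitHub repositories and README images and checks for a license file in a locally cloned repository; it uses plain regular expressions and file-name heuristics of our own rather than the parser-based extraction described in the paper.

```python
# Illustrative sketch only: approximates three of the studied features
# (GitHub-repository links in the README, images in the README, and the
# inclusion of a license) with regular expressions instead of a Markdown parser.
import os
import re

GITHUB_REPO_LINK = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]+\)")
HTML_IMAGE = re.compile(r"<img\s", re.IGNORECASE)

def readme_features(repo_dir):
    """Return hand-approximated README/license features for a locally
    cloned repository at `repo_dir` (the README.md layout is assumed)."""
    readme_path = os.path.join(repo_dir, "README.md")
    text = ""
    if os.path.isfile(readme_path):
        with open(readme_path, encoding="utf-8", errors="ignore") as f:
            text = f.read()

    has_license = os.path.isdir(repo_dir) and any(
        name.lower().startswith("license") for name in os.listdir(repo_dir)
    )
    return {
        "github_links": len(GITHUB_REPO_LINK.findall(text)),
        "images": len(MARKDOWN_IMAGE.findall(text)) + len(HTML_IMAGE.findall(text)),
        "has_license": has_license,
    }

if __name__ == "__main__":
    print(readme_features("path/to/cloned/repo"))  # hypothetical local path
```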


Acknowledgements

This research was partially supported by the National Key R&D Program of China (No. 2018YFB1003904) and the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) (DE200100021).

Author information

Corresponding author

Correspondence to Xin Xia.

Additional information

Communicated by: Tim Menzies

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fan, Y., Xia, X., Lo, D. et al. What makes a popular academic AI repository?. Empir Software Eng 26, 2 (2021). https://doi.org/10.1007/s10664-020-09916-6
