
Towards a change taxonomy for machine learning pipelines

Empirical study of ML pipelines and forks related to academic publications

Empirical Software Engineering

Abstract

Machine Learning (ML) academic publications commonly provide open-source implementations on GitHub, allowing their audience to replicate, validate, or even extend the ML algorithms, data sets and metadata. However, thus far little is known about the degree of collaboration activity happening on such ML research repositories, in particular regarding (1) the degree to which such repositories receive contributions from forks, (2) the nature of such contributions (i.e., the types of changes), and (3) the nature of changes that are not contributed back by forks, which might represent missed opportunities. In this paper, we empirically study contributions to 1,346 ML research repositories and their 67,369 forks, both quantitatively and qualitatively, by building on Hindle et al.’s seminal taxonomy of code changes. We found that while ML research repositories are heavily forked, only 9% of the forks made modifications to the forked repository. Of the latter, 42% sent changes to the parent repositories, about half of which (52%) were accepted by the parent repositories. Our qualitative analysis of 539 contributed and 378 local (fork-only) changes extends Hindle et al.’s taxonomy with two new top-level change categories related to ML (Data and Dependency Management), and 16 new sub-categories, including nine ML-specific ones (input data, parameter tuning, pre-processing, training infrastructure, model structure, pipeline performance, sharing, validation infrastructure, and output data). While the changes that are not contributed back by the forks mostly concern domain-specific features and local experimentation (e.g., parameter tuning), the origin repositories do miss out on a non-trivial 15.4% of Documentation changes, 13.6% of Feature changes and 11.4% of Bug fix changes.
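
The percentages quoted in the abstract form a funnel, and chaining them gives a feel for how few forks end up with accepted contributions. The following is a minimal sketch of that arithmetic, not a reproduction of the study's counts: the function name `funnel` is hypothetical, the inputs are the rounded figures from the abstract, and it assumes that the 52% acceptance rate applies to the same unit (forks) as the two preceding rates, which the abstract does not state explicitly.

```python
# Rough funnel estimate from the rounded percentages quoted in the abstract.
# All derived counts are approximations, not figures reported by the study.
TOTAL_FORKS = 67_369      # forks studied (from the abstract)
MODIFIED_RATE = 0.09      # forks that modified the forked repository
CONTRIBUTED_RATE = 0.42   # of the modified forks, those that sent changes upstream
ACCEPTED_RATE = 0.52      # of those, the accepted share (assumption: same unit)


def funnel(total, rates):
    """Apply each rate in turn and return the rounded count left after each step."""
    counts = []
    remaining = float(total)
    for rate in rates:
        remaining *= rate
        counts.append(round(remaining))
    return counts


modified, contributed, accepted = funnel(
    TOTAL_FORKS, [MODIFIED_RATE, CONTRIBUTED_RATE, ACCEPTED_RATE]
)
print(f"modified forks (approx.):     {modified}")
print(f"contributing forks (approx.): {contributed}")
print(f"accepted (approx.):           {accepted}")
print(f"share of all forks:           {accepted / TOTAL_FORKS:.1%}")
```

Under these assumptions, roughly 2% of all forks would end up with an accepted contribution, which is consistent with the abstract's observation that heavy forking does not translate into heavy collaboration.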


Data Availability

All the scripts along with the mined data are provided in the replication package (see Note 27).

Notes

  1. https://www.tensorflow.org

  2. https://pytorch.org

  3. https://paperswithcode.com

  4. https://modeldepot.io/

  5. In August 2022, PapersWithCode indexed more than 77,640 academic AI publications along with their code bases.

  6. In the remainder of this paper, we use the term “ML research repositories” to identify such ML pipelines.

  7. https://twitter.com/paperswithcode/status/1091315540092768257

  8. https://web.archive.org/web/20190404211946/https://modeldepot.io/search/results?q=

  9. https://arxiv.org/abs/1512.03385

  10. https://github.com/thtrieu/darkflow

  11. https://arxiv.org/pdf/1506.02640.pdf

  12. https://arxiv.org/pdf/1612.08242.pdf

  13. https://github.com/mohammed-elkomy/two-stream-action-recognition

  14. https://arxiv.org/pdf/1406.2199.pdf

  15. https://arxiv.org/pdf/1604.07669.pdf

  16. https://arxiv.org/pdf/1507.02159.pdf

  17. https://docs.github.com/en/rest/reference/search

  18. PRs on GitHub are not limited to just one commit.

  19. https://github.com/BoseAslCohort/youtube-8m/commit/c1b01315bafc24e83248cd862a9324bb21d4d52d

  20. https://docs.github.com/en/rest/reference/repos

  21. https://github.com/SullyChen/Autopilot-TensorFlow

  22. https://arxiv.org/pdf/1604.07316.pdf

  23. https://github.com/shikorab/tf-faster-rcnn

  24. https://github.com/TSchattschneider/PointCNN/commit/1827a79b2ede15007a06d327d95f10bc0753420

  25. Note that Hindle et al.’s meta program change category is unrelated to the field of Metaprogramming (https://en.wikipedia.org/wiki/Metaprogramming).

  26. https://github.com/SAILResearch/suppmaterial-22-aaditya-ml_change_taxonomy

  27. https://github.com/SAILResearch/suppmaterial-22-aaditya-ml_change_taxonomy

References

  • Adding auto-generated files example (2018). https://github.com/alorozco53/text-detection-ctpn/commit/f90326f68522f3af3e4cdf5688138685de66bace

  • Adding/removing dependency example (2019). https://github.com/google/youtube-8m/commit/09774db80a515b667a91b14fe21a6134f3856c7a

  • Amershi S, Begel A, Bird C, DeLine R, Gall H, Kamar E, Nagappan N, Nushi B, Zimmermann T (2019) Software engineering for machine learning: a case study. In: 2019 IEEE/ACM 41st international conference on software engineering: software engineering in practice (ICSE-SEIP), pp 291–300

  • Arpteg A, Brinne B, Crnkovic-Friis L, Bosch J (2018) Software engineering challenges of deep learning. In: 2018 44th Euromicro conference on software engineering and advanced applications (SEAA). IEEE, pp 50–59

  • Benestad HC, Anda B, Arisholm E (2009) Understanding software maintenance and evolution by analyzing individual changes: a literature review. J Softw Maint Evol Res Pract 21(6):349–378


  • Biazzini M, Baudry B (2014) May the fork be with you: novel metrics to analyze collaboration on GitHub. In: Proceedings of the 5th international workshop on emerging trends in software metrics, pp 37–43

  • Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced? bias in bug-fix datasets. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, pp 121–130

  • Bissyandé TF, Thung F, Wang S, Lo D, Jiang L, Réveillère L (2013) Empirical evaluation of bug linking. In: 2013 17th European conference on software maintenance and reengineering, pp 89–98

  • Bloice MD, Holzinger A (2016) A tutorial on machine learning and data science tools with python. Machine Learning for Health Informatics, pp 435–480

  • Borges H, Valente MT (2018) What’s in a GitHub star? understanding repository starring practices in a social coding platform. J Syst Softw 146:112–129

  • Brisson S, Noei E, Lyons K (2020) We are family: analyzing communication in GitHub software repositories and their forks. In: 2020 IEEE 27th international conference on software analysis, evolution and reengineering (SANER). IEEE, pp 59–69

  • Bug fix example 1 (2019). https://github.com/piaosonglin1985/tf-faster-rcnn/commit/8e60b9dc92390f1bfb8cf6e62d93bcabbc123c4a

  • Bug fix example 2 (2017). https://github.com/MarvinTeichmann/KittiSeg/commit/ec6b5ccb6f30ac6591d03faa2fa0bf8b1fdbf3ef

  • Change file permission example (2017). https://api.github.com/repos/CodeRecipeJYP/fast-style-transfer/commits/7027a3843fa3d793697da5ba188887629a4d69eb

  • Chen Z, Zhang JM, Sarro F, Harman M (2022) Maat: a novel ensemble approach to addressing fairness and performance bugs for machine learning software. In: Proceedings of the 30th ACM joint European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE’22). ACM Press

  • Cheng D, Cao C, Xu C, Ma X (2018) Manifesting bugs in machine learning code: An explorative study with mutation testing. In: 2018 IEEE international conference on software quality, reliability and security (QRS). IEEE, pp 313–324

  • Constantino K, Zhou S, Souza M, Figueiredo E, Kästner C (2020) Understanding collaborative software development: an interview study. In: Proceedings of the 15th international conference on global software engineering, pp 55–65

  • Cortés-Coy LF, Linares-Vásquez M, Aponte J, Poshyvanyk D (2014) On automatically generating commit messages via summarization of source code changes. In: 2014 IEEE 14th international working conference on source code analysis and manipulation, pp 275–284

  • Decan A, Mens T, Grosjean P (2019) An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empir Softw Eng 24(1):381–416


  • Dey T, Mockus A (2020) Which pull requests get accepted and why? a study of popular npm packages. arXiv:2003.01153

  • Dwarakanath A, Ahuja M, Sikand S, Rao RM, Bose RJC, Dubash N, Podder S (2018) Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In: Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis, pp 118–128

  • External documentation example (2017). https://github.com/Raochuan89/TensorBox/commit/aeb45e8fdc100f74aa8cf2fa85b1324483a1fff1

  • Fan Y, Xia X, Lo D, Hassan AE, Li S (2021) What makes a popular academic AI repository? Empir Softw Eng 26(1):1–35


  • Faragó C, Hegedűs P, Ferenc R (2014) The impact of version control operations on the quality change of the source code. In: International conference on computational science and its applications. Springer, pp 353–369

  • Feature example (2018). https://github.com/tch/PointCNN/commit/891f3e04b44805b066865aeef1275ac6f217c58f

  • Fogel K (2005) Producing open source software: how to run a successful free software project. O’Reilly Media, Inc.

  • German DM, Adams B, Hassan AE (2016) Continuously mining distributed version control systems: an empirical study of how Linux uses Git. Empir Softw Eng 21(1):260–299

  • Ghadhab L, Jenhani I, Mkaouer MW, Messaoud MB (2021) Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model, vol 135

  • Gousios G, Pinzger M, van Deursen A (2014) An exploratory study of the pull-based software development model. In: Proceedings of the 36th international conference on software engineering, pp 345–355

  • Granger B, Pérez F (2021) Jupyter: thinking and storytelling with code and data. Authorea Preprints

  • Hindle A, German DM, Godfrey MW, Holt RC (2009) Automatic classification of large changes into maintenance categories. In: 2009 IEEE 17th international conference on program comprehension. IEEE, pp 30–39

  • Hindle A, German DM, Holt R (2008) What do large commits tell us? a taxonomical study of large commits. In: Proceedings of the 2008 international working conference on mining software repositories (MSR ’08). Association for Computing Machinery, New York, NY, USA, pp 99–108. https://doi.org/10.1145/1370750.1370773

  • Hu Y, Zhang J, Bai X, Yu S, Yang Z (2016) Influence analysis of GitHub repositories. SpringerPlus 5(1):1–19

  • Idowu S, Strüber D, Berger T (2021) Asset management in machine learning: a survey. In: 2021 IEEE/ACM 43rd international conference on software engineering: software engineering in practice (ICSE-SEIP), pp 51–60

  • Input data example (2017). https://github.com/google/youtube-8m/commit/4619056162f466293d99e0c59512f8d0f3427fe2

  • Internal documentation example-1 (2017). https://github.com/google/youtube-8m/commit/3439e33d81df8cd906987ee5889ebc937186114a

  • Internal documentation example-2 (2017). https://github.com/CharlesShang/FastMaskRCNN/commit/0d8ddfaa55dbd3d553b79aed34f40662c46aa45f

  • Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining GitHub. In: Proceedings of the 11th working conference on mining software repositories (MSR 2014). Association for Computing Machinery, New York, NY, USA, pp 92–101. https://doi.org/10.1145/2597073.2597074

  • Kim M, Cai D, Kim S (2011) An empirical investigation into the role of API-level refactorings during software evolution. In: Proceedings of the 33rd international conference on software engineering, pp 151–160

  • Krippendorff K (2011) Computing Krippendorff’s alpha-reliability

  • Li H, Shang W, Adams B, Sayagh M, Hassan AE (2020) A qualitative study of the benefits and costs of logging from developers’ perspectives. IEEE Transactions on Software Engineering

  • Lima A, Rossi L, Musolesi M (2014) Coding together at scale: GitHub as a collaborative social network. In: Eighth international AAAI conference on weblogs and social media

  • Martínez-Fernández S, Bogner J, Franch X, Oriol M, Siebert J, Trendowicz A, Vollmer AM, Wagner S (2021) Software engineering for AI-based systems: a survey. arXiv:2105.01984

  • Model structure example (2018). https://github.com/shikorab/tf-faster-rcnn/commit/327778b2c4f297b307ff0de552d2bfc47278e290

  • Mukherjee S, Almanza A, Rubio-González C (2021) Fixing dependency errors for python build reproducibility. In: Proceedings of the 30th ACM SIGSOFT international symposium on software testing and analysis, pp 439–451

  • Nahar N, Zhou S, Lewis G, Kästner C (2022) Collaboration challenges in building ML-enabled systems: communication, documentation, engineering, and process. In: 2022 IEEE/ACM 44th international conference on software engineering (ICSE)

  • Ng A (2021) MLOps: from model-centric to data-centric AI

  • O’Leary K, Uchida M (2020) Common problems with creating machine learning pipelines from existing code

  • Output data example (2018). https://github.com/Mappy/tf-faster-rcnn/commit/51e0889fbdcd4c48f31def4c1cb05a5a4db04671

  • Ozkaya I (2020) What is really different in engineering AI-enabled systems? IEEE Softw 37(4):3–6

  • Parameter tuning example (2017). https://github.com/google/youtube-8m/commit/0e526caace96d3cf6f0686757d568f9ffba998b4

  • Parameter tuning example 2 (2017). https://github.com/DeepLabCut/DeepLabCut/commit/6568c2ba6facf5d90b2c39af7b0f024a40f2b15f

  • Pashchenko I, Vu D-L, Massacci F (2020) A qualitative study of dependency management and its security implications. In: Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, pp 1513–1531

  • Pipeline performance example (2018). https://github.com/google/youtube-8m/pull/69

  • Polyzotis N, Roy S, Whang SE, Zinkevich M (2018) Data lifecycle challenges in production machine learning: a survey. ACM SIGMOD Rec 47(2):17–28


  • Pre-processing example (2018). https://github.com/lancele/Semantic-Segmentation-Suite/commit/d50b5c812392614fc2bdaf269921beb1f7086f63

  • Project data example (2017). https://github.com/Bruceeeee/facenet/commit/d9e6213cd8286334000ddf75529eba3662cef38a#diff-dbc5c3b9f46e69236207956b34904d0dea62ff866d442e97bb397ff49a03a86b

  • Rahman MM, Roy CK (2014) An insight into the pull requests of GitHub. In: Proceedings of the 11th working conference on mining software repositories, pp 364–367

  • Ren L, Zhou S, Kästner C (2018) Poster: forks insight: providing an overview of GitHub forks. In: 2018 IEEE/ACM 40th international conference on software engineering: companion (ICSE-Companion), pp 179–180

  • Salza P, Palomba F, Di Nucci D, D’Uva C, De Lucia A, Ferrucci F (2018) Do developers update third-party libraries in mobile apps? In: Proceedings of the 26th conference on program comprehension, pp 255–265

  • Sambasivan N, Kapania S, Highfill H, Akrong D, Paritosh P, Aroyo LM (2021) Everyone wants to do the model work, not the data work: data cascades in high-stakes AI. In: Proceedings of the 2021 CHI conference on human factors in computing systems, pp 1–15

  • Santos JAM, Santos AR, Mendonça MG (2015) Investigating bias in the search phase of software engineering secondary studies. In: CIbSE, pp 488

  • Sato D, Wider A, Windheuser C (2019) Continuous delivery for machine learning. https://martinfowler.com/articles/cd4ml.html#DeploymentPipelines

  • Sharing example (2016). https://github.com/anishathalye/neural-style/pull/40

  • Sharing example (2018). https://github.com/jerichooconnell/tf_unet/commit/60b67bb964d19dd4a4677f7557dc738838a116e9

  • Shivaji S, Whitehead EJ, Akella R, Kim S (2012) Reducing features to improve code change-based bug prediction. IEEE Trans Softw Eng 39(4):552–569

  • Swanson EB (1976) The dimensions of maintenance. In: Proceedings of the 2nd international conference on Software engineering, pp 492–497

  • Tizpaz-Niari S, Černý P, Trivedi A (2020) Detecting and understanding real-world differential performance bugs in machine learning libraries. In: Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis, pp 189–199

  • Training infrastructure example (2017). https://github.com/IAC-Team/SemSeg/commit/efbfffbd202cccbd54fca1125ed6de41b5df2f90

  • Update dependency example (2018). https://github.com/google/youtube-8m/commit/72f42cd938d3cf4f928614a5fcdca237489e7c92

  • Validation example (2017). https://github.com/bethesirius/TensorBox/commit/1eb41e944494e721f3c4b1a5d287af99f4035a42

  • Wang J, Li L, Zeller A (2020) Better code, better sharing: on the need of analyzing Jupyter notebooks. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering: new ideas and emerging results, pp 53–56

  • Washizaki H, Uchida H, Khomh F, Guéhéneuc Y-G (2019) Studying software engineering patterns for designing machine learning systems. In: 2019 10th international workshop on empirical software engineering in practice (IWESEP). IEEE, pp 49–495

  • Wu R, Zhang H, Kim S, Cheung S-C (2011) Relink: recovering links between bugs and changes. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, pp 15–25

  • Yan M, Fu Y, Zhang X, Yang D, Xu L, Kymer JD (2016) Automatically classifying software changes via discriminative topic model: supporting multi-category and cross-project. J Syst Softw 113:296–308


  • Zhang X, Chen Y, Gu Y, Zou W, Xie X, Jia X, Xuan J (2018) How do multiple pull requests change the same code: a study of competing pull requests in GitHub. In: 2018 IEEE international conference on software maintenance and evolution (ICSME). IEEE, pp 228–239

  • Zhang T, Gao C, Ma L, Lyu M, Kim M (2019) An empirical study of common challenges in developing deep learning applications. In: 2019 IEEE 30th international symposium on software reliability engineering (ISSRE). IEEE, pp 104–115

  • Zhao Y, Leung H, Yang Y, Zhou Y, Xu B (2017) Towards an understanding of change types in bug fixing code. Inf Softw Technol 86:37–53


  • Zhou S, Vasilescu B, Kästner C (2020) How has forking changed in the last 20 years? a study of hard forks on GitHub. In: 2020 IEEE/ACM 42nd international conference on software engineering (ICSE). IEEE, pp 445–456

  • Zhou S, Vasilescu B, Kästner C (2019) What the fork: a study of inefficient and efficient forking practices in social coding. In: Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, pp 350–361


Acknowledgements

We thank Greg Wilson for providing insightful ideas and comments for this work. We also thank Boyuan Chen, Minke Xiu, Javier Rosales and Wanqing Li for their contributions to the analysis and feedback on this work.

Funding

This research is partially supported by the NSERC grants RGPIN-2019-06014 and RGPAS-2019-00075.

Author information


Corresponding author

Correspondence to Aaditya Bhatia.

Ethics declarations

Ethics approval

No ethics approval was required for this paper.

Conflict of interests/Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Communicated by: Lei Ma

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bhatia, A., Eghan, E.E., Grichi, M. et al. Towards a change taxonomy for machine learning pipelines. Empir Software Eng 28, 60 (2023). https://doi.org/10.1007/s10664-022-10282-8


  • DOI: https://doi.org/10.1007/s10664-022-10282-8
