Empirical Software Engineering, Volume 21, Issue 1, pp 260–299

Continuously mining distributed version control systems: an empirical study of how Linux uses Git

  • Daniel M. German
  • Bram Adams
  • Ahmed E. Hassan


Distributed version control systems (D-VCSs, such as Git and Mercurial) and their hosting services (such as GitHub and Bitbucket) have revolutionized the way in which developers collaborate by allowing them to freely exchange and integrate code changes in a peer-to-peer fashion. However, this flexibility comes at a price: code changes are hard to track because of the proliferation of code repositories and because developers modify (“rebase”) and filter (“cherry-pick”) the history of these changes to streamline their integration into the repositories of other developers. As a consequence, researchers and practitioners, who typically only consider the (cleaned up) history in the official project repository, are unaware of important elements and activities in the collaborative software development process. In this paper, we present a method that continuously mines all known D-VCSs of a software project to uncover the complete development history of that project. We use this method to (1) show the divergence between the development history recorded in the official Linux kernel repository and the complete kernel development history, and (2) investigate the characteristics of the ecosystem of Git repositories of the Linux kernel. Finally, we discuss how continuous mining could be adopted by current D-VCS hosting services.


Keywords: Mining software repositories, Distributed version control, Rebasing, Empirical software engineering, Measuring bias, Linux, Open source development



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Daniel M. German (1)
  • Bram Adams (2)
  • Ahmed E. Hassan (3)

  1. University of Victoria, Victoria, British Columbia
  2. Polytechnique Montréal, Montréal, Canada
  3. Queen’s University, Kingston, Canada
