Empirical Software Engineering, Volume 21, Issue 5, pp 2035–2071

An in-depth study of the promises and perils of mining GitHub

  • Eirini Kalliamvakou
  • Georgios Gousios
  • Kelly Blincoe
  • Leif Singer
  • Daniel M. German
  • Daniela Damian

Abstract

With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers mine the information stored in GitHub’s event logs to understand how its users employ the site to collaborate on software, but so far there have been no studies describing the quality and properties of the available GitHub data. We document the results of an empirical study aimed at understanding the characteristics of the repositories and users in GitHub; we see how users take advantage of GitHub’s main features and how their activity is tracked on GitHub and related datasets to point out misalignment between the real and mined data. Our results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. For example, we show that the majority of the projects are personal and inactive, and that almost 40% of all pull requests do not appear as merged even though they were. Also, approximately half of GitHub’s registered users do not have public activity, while the activity of GitHub users in repositories is not always easy to pinpoint. We use our identified perils to see if they can pose validity threats; we review selected papers from the MSR 2014 Mining Challenge and see if there are potential impacts to consider. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

Keywords

Mining software repositories, git, GitHub, Code reviews

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Eirini Kalliamvakou (1)
  • Georgios Gousios (2)
  • Kelly Blincoe (1)
  • Leif Singer (1)
  • Daniel M. German (1)
  • Daniela Damian (1)
  1. University of Victoria, Victoria, Canada
  2. Radboud University of Nijmegen, Nijmegen, Netherlands
