Empirical Software Engineering, Volume 21, Issue 5, pp 2035–2071

An in-depth study of the promises and perils of mining GitHub

  • Eirini Kalliamvakou
  • Georgios Gousios
  • Kelly Blincoe
  • Leif Singer
  • Daniel M. German
  • Daniela Damian

Abstract

With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers mine the information stored in GitHub’s event logs to understand how its users employ the site to collaborate on software, but so far there have been no studies describing the quality and properties of the available GitHub data. We document the results of an empirical study aimed at understanding the characteristics of the repositories and users in GitHub; we see how users take advantage of GitHub’s main features and how their activity is tracked on GitHub and related datasets to point out misalignment between the real and mined data. Our results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. For example, we show that the majority of the projects are personal and inactive, and that almost 40 % of all pull requests do not appear as merged even though they were. Also, approximately half of GitHub’s registered users do not have public activity, while the activity of GitHub users in repositories is not always easy to pinpoint. We use our identified perils to see if they can pose validity threats; we review selected papers from the MSR 2014 Mining Challenge and see if there are potential impacts to consider. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

Keywords

Mining software repositories; git; GitHub; Code reviews

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Eirini Kalliamvakou (1)
  • Georgios Gousios (2)
  • Kelly Blincoe (1)
  • Leif Singer (1)
  • Daniel M. German (1)
  • Daniela Damian (1)
  1. University of Victoria, Victoria, Canada
  2. Radboud University of Nijmegen, Nijmegen, Netherlands
