Analysing Time-Stamped Co-Editing Networks in Software Development Teams using git2net

Data from software repositories have become an important foundation for the empirical study of software engineering processes. A recurring theme in the repository mining literature is the inference of developer networks capturing, e.g., collaboration, coordination, or communication from the commit history of projects. Many works in this area have studied networks of co-authorship of software artefacts, neglecting the detailed information on code changes and code ownership available in software repositories. To address this issue, we introduce git2net, a scalable Python tool that facilitates the extraction of fine-grained co-editing networks from large git repositories. It uses text mining techniques to analyse the detailed history of textual modifications within files. We apply our tool in two case studies using GitHub repositories of multiple Open Source projects as well as a proprietary software project. Specifically, we use data on more than 1.2 million commits and more than 25,000 developers to test a hypothesis on the relation between developer productivity and co-editing patterns in software teams. We argue that git2net opens up an important new source of high-resolution data on human collaboration patterns that can be used to advance theory in empirical software engineering, computational social science, and organisational studies.

Supplementary Information: The online version contains supplementary material available at 10.1007/s10664-020-09928-2.


S1 Data cleaning
As discussed in section 5.2 of the main manuscript, only edits in which existing code was modified are relevant for testing our research hypothesis. Therefore, any edits that are not labelled as "replacement" by git2net are dropped. Edits that originated from or resulted in an empty line are dropped for the same reason. Additionally, we do not consider edits to files for which no cyclomatic complexity can be computed. These are generally data files, images, etc., which are not part of this analysis. The exact numbers of disregarded edits per project are shown in Table S1.
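The filtering described above can be sketched as follows. The record layout and field names below are illustrative placeholders, not git2net's actual database schema:

```python
# Illustrative stand-in for the edit table that git2net mines into an
# SQLite database; field names and contents are hypothetical examples.
edits = [
    {"type": "replacement", "pre": "x = 1", "post": "x = 2", "complexity": 3},
    {"type": "addition",    "pre": "",      "post": "new",   "complexity": 1},
    {"type": "replacement", "pre": "",      "post": "z = 3", "complexity": 2},
    {"type": "replacement", "pre": "y = 2", "post": "",      "complexity": None},
]

def keep(edit):
    """Retain only replacements of non-empty lines in files with a
    computable cyclomatic complexity."""
    return (
        edit["type"] == "replacement"
        and edit["pre"].strip() != ""       # did not originate from an empty line
        and edit["post"].strip() != ""      # did not result in an empty line
        and edit["complexity"] is not None  # file is code, complexity computable
    )

clean = [e for e in edits if keep(e)]
```

Each of the four example records violates exactly one criterion except the first, so only the first survives the filter.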
While cleaning the data, we discovered a large number of commits with inter-commit times of 0 and 1 seconds, particularly for linux. Further analysis revealed that developers often make multiple consecutive commits after working on a section of code. In doing so, individual commit messages can be assigned to different sets of edits, facilitating the tracking of changes in the project. This behaviour invalidates our assumption that the inter-commit time represents an upper bound on the time a developer spent on the edits. As illustrated in Figure S1, we thus aggregate commits with inter-commit times of less than a given threshold ∆ to a single code contribution. While the threshold needs to be sufficiently high to avoid the cases mentioned above, setting it too high will merge commits that belong to adjacent contributions. After discussions with professional software developers, we aggregated consecutive commits with inter-commit times of less than ∆ = 5 minutes¹. Subsequently, we perform all analyses at the level of (aggregated) code contributions rather than commits.
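A minimal sketch of this aggregation step, assuming a developer's commit timestamps are available as a list (the function name and data layout are illustrative):

```python
from datetime import datetime, timedelta

def aggregate_commits(timestamps, delta=timedelta(minutes=5)):
    """Group one developer's commit times, sorted chronologically, into
    contributions: consecutive commits closer than `delta` are merged."""
    contributions = []
    for ts in sorted(timestamps):
        if contributions and ts - contributions[-1][-1] < delta:
            contributions[-1].append(ts)  # same contribution
        else:
            contributions.append([ts])    # start a new contribution
    return contributions

# Hypothetical commit times of a single developer.
times = [datetime(2020, 1, 1, 12, 0),
         datetime(2020, 1, 1, 12, 1),   # 1 min later  -> same contribution
         datetime(2020, 1, 1, 12, 3),   # 2 min later  -> same contribution
         datetime(2020, 1, 1, 13, 0)]   # 57 min later -> new contribution
groups = aggregate_commits(times)
```

The first three commits fall within the ∆ = 5 minute window of their predecessor and merge into one contribution; the last starts a new one.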
The GitHub repository of gentoo has only existed since August 2015, whereas development started as early as 1999. When creating the git repository, an initial commit was made that includes the entire history of the project up to that point. To avoid falsely attributing all previous development efforts to the author of this first commit, we drop all edits to lines initially added with the first commit. This amounts to almost 25% of the remaining edits in the database.
The distribution of developer productivity reveals a small number of outliers with very large values. A manual inspection showed that these are mostly due to automated changes of code style, or search-and-replace operations². We argue that such commits are not representative of typical software development and thus consider them as outliers. To avoid biasing our analysis, we removed them from the dataset by excluding the top ε quantile of contributions with respect to productivity. As with the aggregation time window, the removal threshold cannot be set too low, but setting it too high will also result in the removal of the most productive contributions in the respective project. After discussion with professional developers, we decided to remove the top ε = 5% of contributions with regard to developer productivity from the dataset³.

¹ We highlight that our results are robust with regard to different parameters ∆. Results for ∆ of 1 and 10 minutes are shown in the supplementary material.
² cf. commit 4be44fcd3bf648b782f4460fd06dfae6c42ded4b in linux or commit eaaface92ee81f30a6ac66fe7acbcc42c00dc450 in gentoo

Table S1: Overview of data collection and cleaning process. Only commits with fewer than 1000 modified files were originally mined. Edits were dropped due to not being replacements or not relating to code.

Fig. S1: Aggregation of commits to contributions.

The git protocol allows developers to use any name and email address when making commits. Hence, it frequently occurs that the same developer appears with differently spelt names (e.g. in the case of names with special characters) or different email addresses (e.g. work and personal emails) in the same repository (Bird et al., 2009; German, 2004). This is an essential problem, as it adds noise to both the data collection and any subsequent analyses. Unfortunately, it is also a challenging problem, particularly when dealing with large-scale data.
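A minimal sketch of a name-based normalisation heuristic of the kind applied here, removing capitalisation and special characters before matching (the example names are hypothetical):

```python
import re
import unicodedata

def normalise(name):
    """Normalise an author name for matching: lower-case it, strip
    accents and special characters, and collapse whitespace."""
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")  # drop accents
    name = re.sub(r"[^a-z ]", "", name.lower())            # drop special characters
    return " ".join(name.split())                          # collapse whitespace

# Three spellings of the same (hypothetical) developer map to one identity.
aliases = {"Jürgen Müller", "jurgen muller", "Jurgen  Muller"}
identities = {normalise(n) for n in aliases}
```

All three aliases collapse to the single normalised identity "jurgen muller".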
Aiming to correctly disambiguate authors, we compared two heuristic-based approaches. For the first approach, we matched authors based on the author email recorded in git. The second approach performs a matching based on the recorded author names after removing capitalisation and special characters. Upon comparison with a manually created ground truth for the igraph data set, we found that the second approach yielded better performance. Therefore, this approach was used for all projects.


S2 Data description
Figure S2 provides an overview of the amount and types of edits made in the projects included in our large-scale analysis of coordination overhead in software teams, based on a line-based extraction of edits. The first two columns show edit counts as well as relative edit counts for a moving window of 295 days. The window size was selected based on the finding of Scholtes et al. (2016) that after 295 days of inactivity, the probability of a subsequent commit by an Open Source Software developer is less than 10%, and hence the developer should no longer be considered a member of the development team. We find that, with values ranging between 60 and 90%, additions make up the largest part of all edits across projects. In contrast, almost no code is deleted without being replaced, as can be seen from the very low number of deletions. Code replacements, where an existing line is edited, make up around 20% of the data. As we aim to study coordination overhead, code replacements are the main focus of our analysis, as they allow us to directly link consecutive developers editing the same line with a co-editing relationship.
The third column shows the count of replacements in which code by other authors is edited, as a fraction of the total number of code replacements. Colours show the development of both team size and communication requirements over time. The dashed line shows a linear model of the form y = αx + β fitted to the data. We find that the slope is positive and significant for all projects, indicating an increase in coordination requirements for larger teams. These results confirm, for count data, the findings made in section 5.1 of the main manuscript for the Levenshtein edit distance.
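The linear trend y = αx + β can be recovered with ordinary least squares. The following minimal sketch uses illustrative numbers, not the measurements from the paper:

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = alpha * x + beta."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    alpha = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    beta = mean_y - alpha * mean_x
    return alpha, beta

# Hypothetical data: team size vs. fraction of replacements that touch
# other authors' code (not the paper's actual measurements).
team_size = [5, 10, 20, 40]
foreign_fraction = [0.20, 0.28, 0.35, 0.50]
alpha, beta = fit_line(team_size, foreign_fraction)
```

On the hypothetical data above the fitted slope α is positive, mirroring the direction of the trend reported in the text.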

S3 Feature correlations
In this section, we report the feature correlations used for the feature selection in section 5.2 of the main manuscript, where a description of all features is provided in Table 2. For the feature correlations of the linux kernel development project, please refer to Figures 13 and 14 in the main manuscript.

S4 Model selection
In this section, we report the AIC-based as well as Chi-square-test-based model selection results, finding that ME+ is the most suitable model to describe productivity in all considered projects. The three candidate models are defined in Table 4 of the main manuscript. For the model selection results of the linux kernel development project, please refer to Table 5.
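An AIC-based comparison of candidate models can be sketched as follows. All log-likelihood and parameter-count values below are placeholders, as are the model names other than ME+, which is the model named in the text:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L; lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fit results: model name -> (log-likelihood, #parameters).
# These numbers are illustrative, not the fitted values from the paper.
models = {"M1": (-1050.0, 4), "ME+": (-1020.0, 6), "M2": (-1045.0, 9)}

scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
best = min(scores, key=scores.get)  # model with the lowest AIC
```

With the placeholder numbers above, the model labelled ME+ attains the lowest AIC and is therefore selected.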