1 Introduction

Women are underrepresented in the software development industry and particularly within OSS communities [1,2,3,4,5]. In the software industry, females account for only 21% of the software development workforce and earn on average $22,251 less than their male peers [6]. On GitHub, a well-known open-source platform, the annual diversity report on OSS communities reported that the overall proportion of females decreased from 37% in 2017 to 33% in 2018. Repository data from GitHub projects provide an interesting source of information which, under appropriate analysis, may reveal facts about females’ activities and their dynamic interactions in a collaborative development environment. To this end, we undertake a socio-technical analysis to gain more insight into females’ involvement in OSS projects. More specifically, we seek to answer the following questions: Where are females in OSS projects? How do they evolve? And how do they contribute to the sustainability of the OSS social capital?

Why Diversity Matters?

Diversity in the workplace, including diversity of gender, ethnicity, and even religion, has been shown to improve retention and reduce the costs associated with employee turnover [4, 7,8,9]. In a diverse workplace, employees are more likely to remain loyal when they feel respected and valued for their unique contribution [10]. That said, while GitHub is supposed to be a meritocracy-based platform free of gender barriers [11], female involvement remains below expectations [6], which limits the workforce available for community growth. To benefit from gender diversity, OSS communities should lower the barriers to female developers’ self-guided personal development, which is crucial for them to achieve their own technical distinction.

To support gender diversity within GitHub communities, we should understand how members, especially women, interact in these community environments. Understanding the social interactions involved in females’ collaboration could reveal insights not only about the structure of collaboration but also about how productive cross-gender collaboration is: by examining historical data from GitHub, we know who contributed to what and when. From there, we can build a social network of contributors (both males and females) who have collaborated on the same source code files. Social Network Analysis (SNA) provides a class of metrics known as “centrality” metrics that offer a useful abstraction of the gender interaction information. Thus, we analyze and compare activity data (i.e., #commits, #comments, #fileModified, and code churn) between genders. Then we analyze SNA metrics of interactions (i.e., F-F, F-M, and M-M).

In particular, we answer the three following research questions:

  • RQ1: Where are females in OSS projects?

  • RQ2: How have the top female performers evolved over time?

  • RQ3: What is the role of females in building a sustainable collaborative social capital?

Paper Structure.

We first discuss related work in Sect. 2. Then we present our methodology in Sect. 3, including the data collection and gender detection approach. Section 4 introduces the basic terminology and concepts needed to understand how we built the social networks, while Sect. 5 offers an overview of our findings along with a discussion. In Sect. 6 we discuss possible threats to validity. Finally, Sect. 7 concludes and outlines future work.

2 Related Work

The study by Vasilescu et al. [4] is the one whose objectives are closest to ours. The authors acknowledge the importance of gender diversity and explore how diverse online teams are with respect to gender and tenure, based on the findings of a survey of GitHub contributors. They point out that, when forming or recruiting a software team, increased gender and tenure diversity are associated with greater productivity. Similarly, James et al. [12] studied perception, performance, team dynamics, and opportunities using a survey targeting software professionals and reported that there is no significant difference between genders. Mendez et al. [13] investigated a gender-inclusive method and how it can help increase gender inclusiveness in the tools used by OSS communities.

Wang et al. [11] discussed the existence of the competence-confidence gap. The authors developed a theoretical explanation for female developers’ low rate of initiating pull requests in OSS projects, arguing that it is not easy for female developers to directly translate competence into confidence [11]. Similarly, Terrell et al. [14] compared acceptance rates of contributions from men and women in an open-source software community and found that, overall, women’s contributions tend to be accepted more often than men’s, but when a woman’s gender is identifiable, her pull requests are rejected more often.

In this study, we investigate the socio-technical interactions of females in OSS communities using SNA. We leverage the historical data of six projects from GitHub. The benefit of assessing females’ position in the overall collaborative networks is that it does not require qualitative information that can only be provided by contributors, such as answers to questions about what they have done before. Analyzing networks of collaboration such as file co-edition, comments, and pull requests offers opportunities for understanding the value provided by women in software development and helps other researchers recommend improvements.

3 Methodology

3.1 Data Collection

The primary goal of our study is to understand how females interact and evolve within OSS communities. To this end, we performed a preliminary analysis of socio-technical interactions on publicly available historical data from six open source projects mined from GitHub: Angular.js, Moby, Rails, TensorFlow, Django, and Elasticsearch. We selected these well-known OSS projects because they are long-lived, have more than a thousand contributors, and are written in diverse programming languages. Table 1 shows descriptive statistics regarding the studied projects.

Table 1. Quantification of activities by gender. (1) unknown gender is filtered out; (2) numbers are normalized by the number of contributors for each gender.

To understand the distribution of females and their activities within OSS projects, we first extracted information for each commit, including the login of the contributor, the timestamp of the commit, the number of files modified, and code churn information (i.e., a quantification of the commit size). Next, we used the GitHub REST API to extract detailed information about the contributor of each commit. For instance, the HTTP GET request “https://api.github.com/users/Narretz” returns the available details of the account, including the contributor’s name “Martin Staffa”. However, users registered on GitHub can choose whether to disclose their names in their profiles, so across the six projects studied, 1059 contributor names were missing. To discover as many of the missing names as possible, we relied on the fact that the GitHub API provides access to a user’s public events, which list all public actions performed by that user on the site. This list provides another way to view the personal information associated with GitHub data. For instance, calling “https://api.github.com/users/bumbu/events/public” returns a JSON object that includes the full name associated with the user’s commits, ‘Alex Bumbu’. Using this method, we were able to recover only 155 names from the missing list, since GitHub users can tighten their privacy settings to hide personal information. We verified that fewer than 10% of contributors’ names are missing for each project. Because contributors whose names remain unknown cannot be assigned a gender, we ignored them in our analysis.
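The two-step name lookup described above can be sketched as follows. This is only an illustrative sketch, not the tooling used in the study: the helper names (`fetch_json`, `name_from_profile`, `name_from_events`, `resolve_name`) are hypothetical, and the fallback assumes that `PushEvent` payloads in the public event stream carry a `commits` list with an `author.name` field, which may not hold for every API version.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def fetch_json(path: str):
    """GET a GitHub REST API path and decode the JSON body (unauthenticated)."""
    request = urllib.request.Request(
        API_ROOT + path, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def name_from_profile(profile: dict) -> Optional[str]:
    """Return the disclosed display name, or None when the field is empty."""
    name = (profile.get("name") or "").strip()
    return name or None


def name_from_events(events: list, login: str) -> Optional[str]:
    """Scan public events for a commit author name pushed by `login`.

    Assumption: PushEvent payloads expose `commits[].author.name`.
    """
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        if (event.get("actor") or {}).get("login") != login:
            continue
        for commit in (event.get("payload") or {}).get("commits") or []:
            name = ((commit.get("author") or {}).get("name") or "").strip()
            if name:
                return name
    return None


def resolve_name(login: str, fetch=fetch_json) -> Optional[str]:
    """Try the profile first, then fall back to the public event stream."""
    name = name_from_profile(fetch(f"/users/{login}"))
    if name is None:
        name = name_from_events(fetch(f"/users/{login}/events/public"), login)
    return name
```

Passing a custom `fetch` callable makes it possible to replay stored API responses instead of querying GitHub again; contributors for whom `resolve_name` returns `None` would be the ones excluded from the gender analysis.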

3.2 Gender Detection

GitHub does not store information about contributors’ gender identification. There are a few approaches to automatically infer a contributor’s gender [4, 14, 15]. Vasilescu et al. [4] used first name and country to infer a user’s gender. Terrell et al. [14] suggested a method that uses the email address to link a user to their LinkedIn profile, which contains rich information. Although this method shows high precision, its recall is low, especially for female developers from East Asian countries where LinkedIn usage is not that high. Moreover, these heuristics have not been empirically validated with GitHub data. To alleviate the issues related to previous approaches, we used a commercial tool, the Namsor API. Table 1 shows a comparison of the percentage distributions of females and males.

4 Social Network Representation

Social Network Analysis (SNA) is the process of investigating social structures through the use of networks and graph theory [16]. In network analysis, nodes (i.e., contributors) are the vertices of a graph and connections (i.e., a type of interaction) are its edges. An interaction occurs when two developers modify the same source code file or comment on the same topic.
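As a minimal sketch of this representation, the snippet below links two contributors whenever they modified at least one common file. The function name `build_coedit_network`, the logins, and the file paths are hypothetical, and the adjacency-map layout is an illustrative choice rather than the structure used in the study.

```python
from collections import defaultdict
from itertools import combinations


def build_coedit_network(touches):
    """Build an undirected, unweighted graph from (contributor, file) pairs.

    `touches` must be a list (it is traversed twice). Returns an adjacency
    map: contributor -> set of contributors who edited a common file.
    """
    editors_by_file = defaultdict(set)
    for contributor, path in touches:
        editors_by_file[path].add(contributor)

    graph = defaultdict(set)
    for contributor, _ in touches:
        graph[contributor]  # register every contributor, even if isolated
    for editors in editors_by_file.values():
        for a, b in combinations(sorted(editors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Hypothetical commit history: (login, modified file)
touches = [
    ("alice", "src/core.js"),
    ("bob", "src/core.js"),
    ("bob", "src/router.js"),
    ("carol", "src/router.js"),
    ("dave", "docs/README.md"),
]
network = build_coedit_network(touches)
```

Here `bob` ends up connected to both `alice` and `carol`, while `dave` remains a disconnected node because nobody else touched the file he edited. Comment-based interactions could be added the same way by treating a commit or topic identifier as the shared item.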

SNA Metrics.

Connectivity metrics measure a node’s direct connections to other nodes. The first is the density of a network (see Fig. 1), which is the number of actual connections between members divided by the number of possible connections. Density values range from 0 to 1; a higher density indicates that network nodes are more tightly connected to each other. The second connectivity metric is the degree centrality of a node, which is the number of connections incident on that node. Each time two contributors change the same file or comment on the same topic, we record one non-weighted connection (i.e., a link). In contrast, a node is considered disconnected if it has no edges to other nodes, meaning no interactions with other contributors.

Centrality metrics quantify how closely contributors are indirectly connected to other contributors in the network. We consider two metrics: closeness and betweenness. Closeness is the average number of steps required to go from the current node to all other nodes. It measures the distance between each participant and all other participants (i.e., is a participant connected directly to all other participants, or would information need to pass through several other participants to reach that individual?). Betweenness is the average number of shortest paths between pairs of other nodes that run through the node, and it measures the extent to which a node lies between other nodes. The more shortest paths run through a node, the more important that node is likely to be.

The clustering coefficient is the probability that any two neighbors of the current node are connected. In our study, it measures the collective collaboration of contributors from different genders.
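The metrics above can be illustrated with a small pure-Python sketch over an adjacency map (node -> set of neighbours). This is an assumption-laden illustration rather than the study's implementation: the function names are hypothetical, closeness is reported as the average number of steps exactly as defined in the text (graph libraries such as NetworkX report the reciprocal), and betweenness is computed with Brandes' algorithm and normalised so that the centre of a star scores 1.

```python
from collections import deque
from itertools import combinations


def density(graph):
    """Actual edges divided by possible edges, n * (n - 1) / 2."""
    n = len(graph)
    edges = sum(len(nbrs) for nbrs in graph.values()) / 2
    return 0.0 if n < 2 else edges / (n * (n - 1) / 2)


def average_degree(graph):
    """Mean number of connections incident on a node."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def average_steps(graph, source):
    """Average shortest-path length from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else float("inf")


def clustering_coefficient(graph, node):
    """Share of pairs of neighbours of `node` that are themselves linked."""
    nbrs = graph[node]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(nbrs, 2))
    return sum(1 for a, b in pairs if b in graph[a]) / len(pairs)


def betweenness(graph):
    """Brandes' algorithm on an unweighted graph, normalised to [0, 1]."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        order, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(graph)
    scale = (n - 1) * (n - 2)  # every unordered pair was counted twice
    return {v: (b / scale if scale > 0 else 0.0) for v, b in score.items()}
```

Averaging `clustering_coefficient`, `average_steps`, and the `betweenness` values over all nodes of a network gives per-network summaries comparable in spirit to the averaged metrics discussed in this paper.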

Fig. 1. Social position of females within the socio-technical network (females are represented in yellow). (Color figure online)

We used SNA metrics to explore the diversity of Female-Female (F-F), Female-Male (F-M), and Male-Male (M-M) interactions using file co-edition networks from six GitHub projects.

5 Results

  • RQ1. Where Are Females in OSS Projects?

Motivation.

The social capital of open source software needs gender diversity, among other forms of diversity (e.g., cultural, ethnic), to be sustainable [17]. Gender diversity demands coordinated and delicate interactions [4]. Unfortunately, little is known about how females behave and interact within OSS communities. A comprehensive view is rarely discussed, even though it is needed to understand what is specific to females and what would ease their integration within OSS communities.

Approach.

We first extract historical data on commits, comments, and pull requests from GitHub, together with their respective authors, following an approach similar to that of Joblin et al. [18]. After identifying the gender of each contributor, we focus on three groups of interactions: F-F, M-M, and F-M. Thus, for each of the six studied projects, we built a social network of contributors’ interactions, along with one sub-network per group. More precisely, by interaction we mean the relation by which two or more contributors collaborate with each other through an activity such as working on the same source code file (i.e., co-edition) or commenting on the same commit. We traced this information through the version change history and other elements provided by the GitHub platform.
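One way to derive the three sub-networks from a project network is to partition its edges by the genders of their endpoints. The sketch below is a hypothetical illustration: the function name `split_by_gender`, the "F"/"M" labels, and the logins are assumptions, and contributors without a detected gender are simply skipped.

```python
def split_by_gender(edges, gender):
    """Partition undirected edges into F-F, M-M and F-M sub-networks.

    `edges` is an iterable of (contributor, contributor) pairs and `gender`
    maps a contributor to "F" or "M". Edges touching a contributor of
    unknown gender are dropped.
    """
    subnetworks = {"F-F": set(), "M-M": set(), "F-M": set()}
    for a, b in edges:
        if gender.get(a) not in ("F", "M") or gender.get(b) not in ("F", "M"):
            continue
        key = "-".join(sorted((gender[a], gender[b])))
        subnetworks[key].add(frozenset((a, b)))
    return subnetworks


# Hypothetical interactions and gender labels
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "erin"),
    ("bob", "dave"),
    ("zed", "bob"),  # zed has no detected gender and is ignored
]
gender = {"alice": "F", "erin": "F", "bob": "M", "carol": "M", "dave": "M"}
parts = split_by_gender(edges, gender)
```

Each edge is stored as a `frozenset` so that (a, b) and (b, a) count as the same undirected link; SNA metrics can then be computed separately on each of the three edge sets.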

Results.

Figure 1 shows the social networks of contributors’ interactions for the six studied projects. Red nodes represent males and yellow nodes highlight the position of females within the overall network. Visually, females are distributed throughout the overall network: they are positioned among both Core and Peripheral members and interact with both genders. Furthermore, to examine interactions with respect to gender, we calculate the SNA metrics of three sub-networks (Fig. 2): (i) interactions between females (F-F), (ii) interactions between males (M-M), and (iii) interactions with the opposite gender (F-M). Table 2 reports the average values of the most important SNA metrics across these sub-networks.

Fig. 2. Sub-networks representing cross-gender interactions.

Table 2. Summary of the most important SNA metrics using sub-networks related to gender.

For all projects, F-F interactions are as dense as M-M interactions, while F-M interactions are less dense, meaning that there are more interactions between contributors of the same gender than between contributors of opposite genders. For example, within the Angular.js project, the F-F and M-M networks have the same density of 0.05, whereas the density of the cross-gender network F-M is very low (0.007).

The average degree of the F-F sub-network is much lower than that of M-M, meaning that females are less directly connected to other females (2.7 for Angular.js) and also less connected to males (7.05). We hypothesize that this result is skewed by the unbalanced number of contributors of each gender.

The average clustering coefficient (Avg.ClusCoef) of the M-M sub-network is much higher than that of F-F (for Angular.js, 0.89 for M-M against 0.46 for F-F), meaning that there are established groups of collaboration among males who have been co-editing the same components for a long time. In contrast, the Avg.ClusCoef of the F-M sub-network is close to 0, suggesting that there are few cross-gender interactions.

The closeness metric is roughly similar across genders, meaning that gender has no effect on the proximity of members to other members of the community or on how fast members interact with each other (i.e., the number of steps needed to reach other nodes).

The betweenness metric has very small values, suggesting that nodes are only weakly connected (a value of 1 indicates a hub node and a value of 0 a disconnected node). For instance, in the Angular.js project, the average betweenness is 0.02 for the F-F sub-network, while it is 6e-4 for the M-M sub-network.

Also, as the F-M sub-network in Fig. 2 shows, females can play a broker role that ensures communication between sub-groups, which can decrease the centralization of the OSS community and increase communication between Core members (i.e., experts) and Peripheral members (i.e., casual contributors) [19].

  • RQ2. How have the top female performers evolved over time?

Motivation.

A common problem that OSS communities face is the high instability of their contributors [20]. This problem is amplified by the fact that females are underrepresented, so their retention is also an open issue. We sought to better understand how female contributors evolve over time and to learn from the experience of the top performers.

Approach.

For each project, we retrieved the female developers above the third quartile (i.e., the top 25%) of contribution, measured by the number of commits, comments, and pull requests. We then calculate the performance of each contributor according to the following formula.

$$ \mathrm{Performance}(C_i) = \frac{1}{\mathrm{Total\_activity}} \left[ \mathrm{date\_lastActivity}(C_i) - \mathrm{date\_firstActivity}(C_i) \right] $$
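The selection step and the formula can be sketched as below. Several points are assumptions made only for illustration: `Total_activity` is read as the contributor's own combined count of commits, comments, and pull requests, the date difference is expressed in days, the third quartile is computed with Python's default (exclusive) quantile method, and all names and numbers are hypothetical.

```python
from datetime import date
from statistics import quantiles


def upper_quartile(activity_counts):
    """Keep contributors whose activity count is at or above Q3."""
    q3 = quantiles(activity_counts.values(), n=4)[2]
    return {login for login, count in activity_counts.items() if count >= q3}


def performance(first_activity, last_activity, total_activity):
    """Active time span (in days, an assumed unit) times 1 / total activity."""
    return (last_activity - first_activity).days / total_activity


# Hypothetical activity counts (commits + comments + pull requests)
activity_counts = {"alice": 2, "beth": 5, "cara": 9, "dina": 40, "erin": 120}
top = upper_quartile(activity_counts)

# Hypothetical contributor active from 1 Jan 2015 to 11 Apr 2015 with 50 actions
score = performance(date(2015, 1, 1), date(2015, 4, 11), 50)
```

Under this reading, the score is the average number of days between two actions, so a lower value corresponds to a more frequently active contributor.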

Results.

Figure 3 shows the distribution of performance for males and females. We performed Mann-Whitney-Wilcoxon tests at a significance level of α = 0.05. As reported in Table 3, the tests were not statistically significant (p > 0.05) for any project except Moby, suggesting that there is no statistical difference in performance between genders in the other five projects.
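For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Mann-Whitney U test. It relies on the normal approximation without tie or continuity correction, so its p-values are approximate and it is not the procedure used to produce Table 3; the function name `mann_whitney_u` and the sample values are hypothetical. A statistics library (for example `scipy.stats.mannwhitneyu`) would normally be preferred.

```python
import math


def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Ties receive average ranks; the variance is not tie-corrected and no
    continuity correction is applied, so the p-value is approximate.
    """
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(xs), len(ys)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical performance scores for two groups of contributors
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = mann_whitney_u(group_a, group_b)
significant = p_value < 0.05
```

With completely separated samples such as these, U is 0 and the difference is flagged as significant at α = 0.05; identical samples give a p-value of 1.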

Fig. 3. Comparison of the gender performance

Table 3. Mann-Whitney significance test

  • RQ3. What is the role of females in building a sustainable collaborative social capital?

Motivation.

Sustained participation is crucial for a successful OSS project [19]. We expect that the more often individuals participate in OSS, the higher their chance of prolonged engagement.

Results.

Table 1 summarizes the involvement of females in OSS projects. The percentage of females ranges from 3.4% to 5.8%, which is drastically lower than the average numbers presented in the annual GitHub report [6]. When normalizing these numbers by gender, we noticed that women contribute slightly less than men, except for the Moby project where women outperform men (20.08% vs. 18.73% for men). In the Angular.js project, the productivity of men in terms of commits is about two times higher than that of women (5.86% vs. 2.79%). Similarly, the male-to-female productivity ratios for the other projects are: Moby = 0.9; Rails = 2.3; Django = 2.9; Elasticsearch = 1.6; TensorFlow = 1.4.
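The productivity comparison can be checked with a short calculation, assuming each figure is the ratio of the men's normalized share to the women's normalized share (a reading that matches the reported values for Angular.js and Moby). The function name `productivity_ratio` is hypothetical.

```python
def productivity_ratio(male_share, female_share):
    """Male-to-female ratio of normalised shares (above 1 favours men)."""
    return male_share / female_share


# Shares quoted in the text for Angular.js and Moby (percent)
angular = round(productivity_ratio(5.86, 2.79), 1)
moby = round(productivity_ratio(18.73, 20.08), 1)
print(angular, moby)  # 2.1 0.9
```

A ratio below 1, as for Moby, indicates that women contributed more than men once the numbers are normalized by group size.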

Moreover, Table 4 shows the evolution of activity for both genders (due to lack of space, we report data only for Angular.js). Although OSS projects keep attracting both genders, the number of active contributors (i.e., those committing and commenting) decreases proportionally for both genders.

Table 4. Activity evolution. Example of the Angular.js project.

6 Threats to Validity

This section discusses the threats that might have affected our findings, and how we have alleviated these threats.

Threats to internal validity concern alternative factors that could have influenced our results. This exploratory study might suffer from the same threats as other studies using GitHub data [21], related to possible issues with data gathering and the lack of validation. We believe that using a large amount of data mitigates this threat. Another threat relates to the accuracy of gender prediction. Various approaches and tools to infer gender based on names and countries have been proposed in the literature [4, 22, 23]. Most of these tools rely on English names stored in databases. However, these heuristics have not yet been empirically validated with GitHub data. We tried three tools [4, 14, 15] and found that each has strengths and drawbacks. We used Namsor, a commercial tool, on the assumption that it provides the most accurate results. We also assume that gender is a binary attribute.

Threats to construct validity consider the agreement between a theoretical concept and a specific measuring procedure. One threat concerns the intrinsic blind spots that arise when investigating consecutive snapshots of a dynamic social network taken at fixed time-stamps (years). Indeed, there is no way to ensure that years are the “optimal” time-stamps for the snapshots of our social networks in order to capture the evolution of, and interaction between, nodes (contributors). Therefore, tracking the evolution of a community under such a constraint can yield observations that do not evolve consecutively. Furthermore, some important events (e.g., major releases, vision changes, etc.) that communities may undergo between time-stamps will not be detected and will be difficult to predict.

Threats to external validity relate to the generalization of our findings. We only considered six OSS projects from GitHub. Thus, we cannot assume a generalization to all projects hosted on GitHub or other platforms such as GitLab and Bitbucket, even though there is no inherent reason why they would be biased.

7 Conclusion

We used an SNA approach to study gender diversity within six GitHub communities. We first examined the position of both genders within the social networks and their interactions, and found that females are spread over the overall network, among both Core and Peripheral members. Next, we traced the evolution of the upper quartile of female developers who contribute the most according to the number of commits, comments, and pull requests. We found that there is no statistical difference in performance between genders except for the Moby project. Finally, even though females are underrepresented in OSS communities, they contribute to building a sustainable collaborative social capital for open software.

The results of this study motivate further fine-grained analysis of female involvement in OSS that will (i) investigate the quality of work in terms of introduced bugs or complexity; (ii) explore further socio-technical interactions of women to understand the value of their contributions; and (iii) characterize the attractiveness of projects for women.