In this section, we present a more detailed overview of how the Stack Overflow website works and of the data we gathered about its users. We give a thorough outline of the features we extracted or created, which serve as the basis of the subsequent analyses, and describe our data collection and labeling methodology. Finally, we introduce our dataset, focusing on how it breaks down along gender lines.
The Website
Users who create accounts on Stack Overflow can ask and answer questions, as well as comment on questions or answers. To ease navigation between questions and topics, users label questions with tags indicating the topic of the question (for example, a specific programming language or algorithm). Users gain reputation points, our fundamental measure of success, by receiving explicit positive feedback called “upvotes” on their questions or answers. We outline the ways users accumulate reputation in Table 1. Any user who accrues 15 reputation points gains the ability to upvote questions and answers. Stack Overflow also rewards specific behaviors with badges, tokens given for particular accomplishments (for example, visiting the site every day for an extended period, or receiving a set number of upvotes on a question).
Table 1 How users gain or lose reputation points on Stack Overflow

We use the Stack Overflow API to collect information on all users with at least 100 reputation points, as these users can be considered active on the website (they are granted the basic rights to comment, upvote, flag, and edit on Stack Overflow). In all, we collected data on 565,171 users. To supplement the information provided by the API, we scraped data on users’ activity, including the badges they collected, the tags they used, and the number of questions and answers they posted.
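As an illustration, a minimal collection loop against the public Stack Exchange API could look like the sketch below. The endpoint version and pagination logic follow the public API documentation rather than our original collection code, and in practice an API key (omitted here) is needed to avoid throttling.

```python
import requests

API_URL = "https://api.stackexchange.com/2.3/users"  # API version assumed

def fetch_active_users(min_reputation=100):
    """Page through all Stack Overflow users with at least `min_reputation`."""
    users, page = [], 1
    while True:
        resp = requests.get(API_URL, params={
            "site": "stackoverflow",
            "sort": "reputation",     # sort by reputation so that...
            "min": min_reputation,    # ...only users at or above the threshold are returned
            "page": page,
            "pagesize": 100,          # maximum page size allowed by the API
        })
        resp.raise_for_status()
        data = resp.json()
        users.extend(data["items"])
        if not data.get("has_more"):  # the response wrapper signals remaining pages
            break
        page += 1
    return users
```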
Feature Creation
Several of the features we use in our analysis can be extracted directly from user profiles. First, we record users’ profile metadata, including their biography text, sign-up date, and whether they link to a personal website or to social networking accounts such as Twitter, LinkedIn, or GitHub. We operationalize these features in our models as a self-promotion index: the proportion of these self-promotion fields that the user has filled out, taking a value between 0 and 1. Within the biography field, we check the text for the substrings “senior”, “lead”, “head”, and “manage”, assigning each user a dummy variable equal to 1 if their bio contains any of these leadership or seniority indicators.
Second, we quantify users’ activity on the site: how many questions they ask and answer, how often they edit posts, how many upvotes and downvotes they cast, and which tags they attach to their posts and how often. Finally, we capture users’ outcomes and success on the site through their reputation scores and the number and types of badges they receive.
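A minimal sketch of the profile-based features follows, assuming hypothetical field names (`website_url`, `twitter`, `linkedin`, `github`, `about_me`) for the scraped profile record:

```python
PROMO_FIELDS = ("website_url", "twitter", "linkedin", "github")  # hypothetical field names
LEADERSHIP_TERMS = ("senior", "lead", "head", "manage")

def self_promotion_index(user: dict) -> float:
    """Proportion of self-promotion fields the user filled out (between 0 and 1)."""
    filled = sum(1 for field in PROMO_FIELDS if user.get(field))
    return filled / len(PROMO_FIELDS)

def leadership_dummy(user: dict) -> int:
    """1 if the biography contains any leadership/seniority substring, else 0."""
    bio = (user.get("about_me") or "").lower()
    return int(any(term in bio for term in LEADERSHIP_TERMS))
```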
Gender inference
Inferring the gender of individuals from their online profiles is a complex problem. We apply a two-step approach to infer user gender. First, we use genderComputer (Vasilescu et al. 2014), a tool specifically created to infer the gender of Stack Overflow users from their usernames and locations. GenderComputer applies a variety of string manipulations (for example, reversing “Nohj” to get “John”) to expand the scope of the inference. Location provides additional accuracy by distinguishing, for example, between an Andrea from the UK (likely a woman) and one from Italy (likely a man). This method classified the users in our sample into 238,150 men, 24,717 women, and 302,304 unidentified users. To evaluate the quality of this classification, we manually examined 100 users classified as men and 100 classified as women. While the method performed very well on men (97% agreement with our manual check), it agreed with our manual check in only 44 out of 100 cases for women. This replicates the recent finding by Ford et al. (2017) that genderComputer sacrifices precision for greater recall when identifying women users.
The second step of our inference corrects this bias by applying a more conservative method, Gender Guesser, which relies only on first names and location. By considering only users rated as likely male or likely female by both methods, we are left with a smaller but more accurate sample: 10,571 users are rated as highly likely to be women by both methods. We randomly chose 10,571 likely men (again, classified as such by both methods) to obtain a balanced sample. We repeated our manual check on a random sample of accounts, finding 96% agreement with our classification of men and 84% agreement with our classification of women. This ensemble approach resembles Ford et al.’s modification of genderComputer to focus on detecting first names within the username (Ford et al. 2017).
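The ensemble logic can be sketched as follows, using the genderComputer and gender-guesser Python packages; the constructor arguments and label vocabularies shown are our reading of those libraries and may differ by version.

```python
import gender_guesser.detector as gg          # pip install gender-guesser
from genderComputer import GenderComputer     # https://github.com/tue-mdse/genderComputer

computer = GenderComputer()   # constructor arguments (e.g., name-list path) may vary by version
guesser = gg.Detector()

def ensemble_gender(username, first_name, country):
    """Label a user only when both tools confidently agree; otherwise discard."""
    label_gc = computer.resolveGender(username, country)  # e.g., 'male', 'female', or None
    label_gg = guesser.get_gender(first_name, country)    # 'male', 'female', 'mostly_*', 'andy', 'unknown'
    if label_gc == "male" and label_gg == "male":
        return "male"
    if label_gc == "female" and label_gg == "female":
        return "female"
    return None  # excluded from the balanced sample
```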
We acknowledge several limitations of our approach to inference. First, we make the simplifying assumption that gender is binary; we argue that this is a fundamental limitation of examining questions about gender differences using harvested data. Second, discarding alias usernames rests on the assumption that men and women are equally likely to adopt usernames that can be mapped to their respective genders, and that this mapping does not substantially affect our hypotheses. However, previous research has shown that anonymity impacts behavior (Robertson et al. 2017), and it is possible that some users adopted an anonymous name to establish an independent identity. Such name selection is highlighted by the literature on gender swapping in online communities (Bruckman 1996; Szell and Thurner 2013), where, for example, women may pose as men if they feel they will be taken more seriously, or to avoid harassment. We also note the limitations of the geographic component of our inference: only a minority of users include location data, and a given location may not reflect a user’s origin (for example, if an Italian man named Andrea moved to the UK). Finally, name-gender databases have been shown to be significantly less accurate when used to infer gender from non-European names (Karimi et al. 2016).
Despite these limitations, we argue that our focus on identifiable names provides the best possible data to test our hypotheses about gender differences in behavior and outcomes on Stack Overflow. By limiting our dataset to users for whom we are highly confident in our gender inference, we gain greater confidence in the estimates of our econometric models. Moreover, our analysis includes robustness checks in which 5, 10, 20, and 50 percent of the gender labels in the balanced sample are randomly shuffled. These tests help us understand the effect of potential classification errors on our results; see Section 5.2 for details.
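One way to implement this robustness check is to permute a given fraction of the labels among themselves and refit the models, as in the sketch below:

```python
import numpy as np

def shuffle_fraction(labels, fraction, seed=0):
    """Return a copy of `labels` with `fraction` of the entries randomly permuted."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    idx = rng.choice(len(labels), size=int(fraction * len(labels)), replace=False)
    labels[idx] = rng.permutation(labels[idx])  # shuffle only the selected positions
    return labels

# e.g., refit the models at each noise level:
# for frac in (0.05, 0.10, 0.20, 0.50):
#     noisy_labels = shuffle_fraction(gender_labels, frac)
```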
Detecting user communities
Given the size of the site and the diversity of topics its users discuss, we expect that coherent communities of users exist, with potentially different patterns of behavior, norms, and outcomes for men and women. For example, users active in a more diverse community may be less likely to leave the site (Vasilescu et al. 2015), while women who encounter other women in a thread are more likely to engage in it (Ford et al. 2017). Following an approach similar to Bosu et al. (2013), we grouped users into communities by building a network in which two users are connected if their posts often share the same tags. Specifically, we quantify the similarity between two users with a weighted Jaccard measure, defined as
$$ s(u,v) = \frac{\sum_{t \in T} \min(t_{u}, t_{v})}{\sum_{t \in T} \max(t_{u}, t_{v})}, $$
where T is the set of all tags used at least 200 times, and $t_u$ denotes the number of times user u made a post with tag t. We then filtered the edges using Serrano’s disparity filter (Serrano et al. 2009), which, for each node, tests the weights of its adjacent links against the null hypothesis that they are uniformly distributed; each observed weight thereby receives a p-value, and we keep only edges with p < .01. The resulting network has approximately 150,000 edges connecting roughly 21,000 users. We use the Louvain algorithm (Blondel et al. 2008) to detect communities in this network, tuning the method’s resolution parameter toward larger communities to facilitate a qualitative interpretation of the communities found. We plot the network, laid out with a force-directed algorithm, in Fig. 1; nodes are colored by community membership.
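The construction can be sketched as follows, assuming per-user dictionaries of tag counts; we use networkx for the graph and the python-louvain package for community detection as stand-ins for our actual pipeline.

```python
import networkx as nx
import community as community_louvain  # pip install python-louvain

def weighted_jaccard(tag_counts_u, tag_counts_v):
    """s(u,v) over per-tag post counts (dicts mapping tag -> count)."""
    tags = set(tag_counts_u) | set(tag_counts_v)
    num = sum(min(tag_counts_u.get(t, 0), tag_counts_v.get(t, 0)) for t in tags)
    den = sum(max(tag_counts_u.get(t, 0), tag_counts_v.get(t, 0)) for t in tags)
    return num / den if den else 0.0

def disparity_pvalue(weight, strength, degree):
    """p-value of an edge weight at a node under the null of uniformly
    distributed weights across the node's links (Serrano et al. 2009)."""
    if degree <= 1:
        return 1.0
    return (1.0 - weight / strength) ** (degree - 1)

def backbone(G, alpha=0.01):
    """Keep an edge if it is significant from either endpoint's perspective."""
    strength = {n: sum(d["weight"] for _, _, d in G.edges(n, data=True)) for n in G}
    H = nx.Graph()
    H.add_nodes_from(G)
    for u, v, d in G.edges(data=True):
        if (disparity_pvalue(d["weight"], strength[u], G.degree(u)) < alpha or
                disparity_pvalue(d["weight"], strength[v], G.degree(v)) < alpha):
            H.add_edge(u, v, weight=d["weight"])
    return H

# partition = community_louvain.best_partition(backbone(G), resolution=...)
# (resolution tuned to yield larger communities)
```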
We manually checked the most commonly used tags in each community and found many clearly interpretable communities. The prominent programming languages and frameworks we identify in the largest communities coincide with those found in other analyses of programming language use, for instance on GitHub (Celinska and Kopczynski 2017). We describe the 10 largest communities, accounting for 80% of our users, in Table 2. Recall that we sampled men to achieve a 50-50 gender ratio in our dataset. We see small, occasionally statistically significant gender differences across communities. The communities around C#/asp.net, a Microsoft-developed software framework, and Ruby/Rails, a web development framework, have the highest representation of men, while the community around Android, a platform for mobile applications, has more women. We find that Ruby/Rails is the community with the lowest incidence of downvoting.
Table 2 Descriptive statistics of the 10 largest user communities based on a network of tag-use similarity among our balanced sample of users
As past work indicates, community structure has a significant impact on user behavior and on the opportunities for gaining reputation (Bosu et al. 2013). For instance, it may be easier to ask a new question or post answers in a newer community, for example around Angular/Node questions, than in a long-established community such as C++. Subsequent models explaining gender differences (see the Results section) therefore include fixed effects for user community. We also control for the size of the community, the percentage of the community that is male, and the percentage of the community’s reputation generated in the last year, the latter serving as a proxy for how new the community is.
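In a formula interface such as statsmodels', these controls could enter as in the sketch below; the variable names and the outcome shown are hypothetical illustrations, not our exact specification.

```python
import statsmodels.formula.api as smf

def fit_reputation_model(users_df):
    """C(community) adds one fixed effect per user community; the remaining
    terms are the community-level controls described above (names hypothetical)."""
    return smf.ols(
        "log_reputation ~ is_woman + C(community)"
        " + community_size + community_pct_male + community_pct_recent_rep",
        data=users_df,
    ).fit()
```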