The Role of Gender in Scholarly Authorship
To illustrate SML and UML in greater detail, I will now turn to an important case: gender differences in scientific publishing. Applying gender detection (SML) and topic models (UML) to a large sample of U.S. sociology dissertations will reveal new insights into how deeply research topics are divided by gendered preferences.
Despite growing awareness, gender differences in academia persist across all disciplines and countries (Barone 2011; Holman et al. 2018; Huang et al. 2020; Larivière et al. 2013). At the center of interest lies the “productivity puzzle” (Xie and Shauman 1998), i.e., evidence that male researchers publish more than their female colleagues. Explanations point to many related differences, for instance, in collaboration practices (Abramo et al. 2019; Jadidi et al. 2017; Uhly et al. 2017), family responsibilities (Carr et al. 1998; Fox 2005), or rank of alma mater (van den Besselaar and Sandström 2017). Those results appear in a new light given recent findings by Huang et al. (2020). By reconstructing over seven million researcher careers from a large sample of publications, they were able to show that gender differences in productivity and impact are stable, but that those differences are rooted in gender-specific dropout rates.
Although those findings have wide-ranging policy implications, there is another major aspect of gender biases in academic publishing that is still widely overlooked: research content and topics. Among the few exceptions, Nielsen et al. (2017) show that medical studies are more likely to include gender in their analyses if women are among the authors. More recently, Key and Sumner (2019) found gendered research topics in political science. I will also refer to their results in the Discussion section.
Using SML: Detecting Gender
To further our understanding of gender preferences for certain research topics, I base my analysis on dissertations. Theses are a formal requirement for becoming part of the scientific community (Collins 2002). Trying to gain recognition as experts, prospective graduates spend a long time on their respective projects. Most PhD candidates ponder the objectives and meaning of their thesis many times. Hence, chosen topics should reflect personal preferences, also because one’s thesis is a strategic decision (Bourdieu 1988, p. 94). In addition, theses have to be single-authored, circumventing problems present in studies using research articles with several coauthors.
The data are retrieved from the ProQuest database and represent a close approximation of all US-based dissertations (Hofstra et al. 2020). To reflect the sociological field, all theses written in a sociology department have been included (N = 41,045). Thus, the research topics represent a rather narrow perspective on sociology, excluding interrelated fields such as education or psychology (Footnote 8). Taken together, the analyzed texts comprise each dissertation’s abstract and title and range from 1980 to 2015.
To derive gender, almost all of the cited studies (see the Section “The Role of Gender in Scholarly Authorship”) use the first names of authors. Although this process is often moved to footnotes (if mentioned at all), I will now describe the three SML steps to classify gender from names and compare the proposed approach to a commercial application (genderize.io).
First and foremost, each SML task needs training data (step 1). Regardless of the particular classification, pre-labelled data are needed to establish the statistical models from which to derive the desired classifications. The larger the amount of training data, the better the subsequent predictions. A major obstacle for this undertaking, hence, is to obtain such “ground truths” at scale. Labeling training data gets particularly expensive (in time and/or money) when human coders (considered the “gold standard” of ML training sets) are needed.
One way of obtaining suitable training data is to use process-generated data, often derived, for instance, from public records or, in industry, from customer databases, sales, or web logs. In the case of gender prediction, I will utilize the largest collection of names that is publicly available, the US Social Security Administration (SSA) record. It contains the first names, collected annually, of each of the 355,149,899 babies born in the US between 1880 and 2019. Unlike West et al. (2013) or Karimi et al. (2016), we use the full records (all names with at least 10 occurrences per year). Like most social actions, however, name-giving yields a heavily skewed power-law distribution with relatively few high-frequency names (James, John, and Robert at the top for male names; Mary and Elizabeth being the most popular female names).
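To make the workflow more concrete, the following minimal Python sketch shows how such a name-based training set could be assembled from the SSA’s public “names by year of birth” files (one yobYYYY.txt file per year with the columns name, sex, count). File paths and variable names are illustrative assumptions, not the article’s original code.

```python
import glob
import pandas as pd

# Aggregate the public SSA "names by year of birth" files (yobYYYY.txt,
# columns: name, sex, count) into per-name counts of female and male births.
files = glob.glob("names/yob*.txt")  # assumed local path to the SSA download

ssa = pd.concat(
    (pd.read_csv(f, names=["name", "sex", "count"]) for f in files),
    ignore_index=True,
)

# Total births per name and sex across all years (1880-2019)
name_counts = (
    ssa.groupby(["name", "sex"])["count"].sum()
       .unstack(fill_value=0)                      # columns: F, M
       .rename(columns={"F": "female", "M": "male"})
)
# Share of female births per name; 0 or 1 for unambiguous names
name_counts["p_female"] = name_counts["female"] / (
    name_counts["female"] + name_counts["male"]
)
```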
In total, 99,444 unique names have been given in the US since 1880. Most are associated with one gender exclusively; only 10,942 have been assigned to both sexes over all those years. Although we can assume a probability of 1 for the names that indicate solely one gender, the ambiguous names provide us with an interesting case to apply SML.
Before performing the predictions, we need to match our test data to SSA. The test data comprise the first names of PhD students whose gender we do not know. The sample initially contains 41,045 dissertations; of those, 37,437 first names are found in SSA. The other 3608 are either relatively rare Asian names (e.g., Byung) or double-barrelled names (e.g., Zxy-Yann), but mostly first names recorded only as single letters in the database, i.e., effectively missing data. That is a rather low number compared with other approaches; for instance, West et al. (2013) did not find more than 26%, and Hofstra and de Schipper (2018) could not align more than 30% of their training data with the test data.
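A hedged sketch of this matching step, building on the `name_counts` table from the previous example; `diss` and its `first_name` column are hypothetical stand-ins for the ProQuest sample.

```python
# Match dissertation first names against the SSA table built above.
# `diss` is assumed to be a DataFrame with one row per dissertation and a
# `first_name` column extracted from the author field.
matched = diss.merge(
    name_counts.reset_index(), how="left",
    left_on="first_name", right_on="name",
)
unmatched = matched["name"].isna()   # e.g., rare names or single-letter initials
print(f"matched: {(~unmatched).sum()}, unmatched: {unmatched.sum()}")
```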
To apply SML, I will now focus on the 33,082 students who have names that are assigned to both sexes in SSA, resulting in 2545 unique names for which we want to predict the associated gender. For that purpose, we need to build a probabilistic classifier (step 2), i.e., a statistical model that links features (explanans, here: ambiguous first names) and outcome (explanandum, here: gender) to derive classifications based on the training data. In this article, I use logistic regression (generalized linear models, GLMs) (Footnote 9). Although this is arguably one of the most popular methods in the social sciences and should, hence, be familiar to most readers, it is only rarely used by social scientists to predict out-of-sample cases. That is in contrast to its use by computer scientists, who employ GLMs for predictions and see them as an essential part of the ML toolbox (Lantz 2019).
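The article does not print its estimation code, so the following is only a sketch of step 2 under explicit assumptions: the ambiguous first name is the sole (one-hot encoded) feature, and the SSA birth counts enter as case weights, so the fitted logistic regression essentially recovers the weighted female share per name as a predicted probability (scikit-learn’s default L2 penalty shrinks the estimates slightly).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Restrict to names that occur with both sexes in the SSA records.
ambiguous = name_counts[(name_counts["female"] > 0) & (name_counts["male"] > 0)]

# Long format: one row per name-sex combination, weighted by birth counts.
long = ambiguous.reset_index().melt(
    id_vars="name", value_vars=["female", "male"],
    var_name="gender", value_name="births",
)

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(long[["name"]])          # one-hot encoded first name
y = (long["gender"] == "female").astype(int)   # outcome: 1 = female

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=long["births"])

# Predicted probability of being female for each ambiguous name.
p_female = clf.predict_proba(enc.transform(ambiguous.reset_index()[["name"]]))[:, 1]
```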
To make the results of the proposed SSA approach comparable, in the next step, to results achieved with genderize.io (Wais 2016), the probability threshold is set to a rather strict level of 0.95. That means a student is only classified as female or male if the model predicts that gender with a probability of at least 95% (Footnote 10). If the probability is lower, the case is set to “unknown.” The results depicted in Tab. 1 show that more female students finish the PhD (around 56%), which is in accordance with official statistics (National Center for Education Statistics 2018). A total of 2643 students (around 7%) cannot be assigned to a gender with the desired certainty. That is also a very convincing value compared with other studies (Karimi et al. 2016; Larivière et al. 2013; West et al. 2013).
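Applying that decision rule to the predicted probabilities could look like this (a sketch continuing the example above; the cutoff of 0.95 is from the text, everything else is assumed):

```python
import numpy as np

# Only assign a gender if the predicted probability reaches 95%,
# otherwise label the name "unknown".
THRESHOLD = 0.95

labels = np.where(
    p_female >= THRESHOLD, "female",
    np.where(p_female <= 1 - THRESHOLD, "male", "unknown"),
)
```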
Table 1 Overview of gender predictions for the ProQuest sample (N = 41,045) based on the SSA approach

However, the final and most important step of any SML task is to assess the accuracy of predictions by comparing them with a “ground truth” and/or other approaches (step 3). Accuracy is most often assessed by calculating a confusion matrix as shown in Tab. 2. It evaluates the classification performance by counting the number of “true” and “false” instances. The ground truth in this article consists of 500 names (~ 20% of the test set) for which three experts manually coded female or male names. Thus, we match each of the machine’s predictions to the gold standard of human coders, which is the key to assessing the performance of any ML task.
Table 2 Confusion matrix for gender predictions of the SSA approach. Results for genderize.io are reported in brackets. Ground truth results are based on 500 first names manually coded by three researchers. “False” means ambiguous codings of the human coders

We compare the SSA predictions with one of the most popular databases for gender predictions, genderize.io, which has found many prominent applications in science (Huang et al. 2020). The simple SSA approach proposed here clearly outperforms genderize.io. From Tab. 2 we can easily calculate accuracy \(\frac{\left(TN+TP\right)}{\left(TN+TP+FN+FP\right)}\) and the F1 score \(\left(\frac{TP}{TP+\frac{1}{2}\left(FP+FN\right)}\right)\), two of the most important indicators of model quality in ML. While the SSA approach achieves an accuracy of 0.76 and an F1 score of 0.85, genderize.io only reaches an accuracy of 0.68 and an F1 score of 0.79. These are rather large differences for SML tasks (Karimi et al. 2016).
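For readers who want to reproduce the two indicators, the formulas above translate directly into a few lines of Python; the counts below are placeholders, not the actual cell values behind Tab. 2.

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Accuracy and F1 score exactly as defined in the text."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    f1 = tp / (tp + 0.5 * (fp + fn))
    return accuracy, f1

# Example with made-up counts, not the values reported in the article.
print(accuracy_and_f1(tp=300, tn=80, fp=40, fn=80))
```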
In addition, the SSA data are completely open to the public and easily downloadable in a machine-readable format. Even more importantly, results for SSA-based predictions are fully reproducible because of the fixed set of names for a given time span. In contrast, genderize.io is continuously expanding its database, so that (other) researchers are not able to reproduce previous classification results. Finally, genderize.io is not free of charge for more than 1000 names per day, which is an additional disadvantage.
Using UML: Deriving Topics
One of the crucial yet often unspoken steps in working with large amounts of texts is to prepare and clean the data. To provide an appropriate “how-to” description, I will spell out those details before describing the UML applied here.
In a first step, all stopwords have been removed (e.g., “and,” “or,” “the”) (Footnote 11). After that, the words have been lemmatized. Lemmatization is a common step in NLP to reduce different forms of a word (e.g., singular and plural) to a common base form (e.g., “women” becomes “woman”). As a final preprocessing step, I concatenated bigrams appearing more than 50 times (e.g., “united” and “states” become “united_states”). In so doing, we can detect meaningful phrases like “factor_analysis” or “statistical_significant” in the dissertation abstracts (Blaheta and Johnson 2001).
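A minimal sketch of this preprocessing pipeline, here using NLTK for stopword removal and lemmatization and gensim for bigram detection; the article does not specify its tooling, so these libraries and the `abstracts` input are assumptions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models.phrases import Phrases, Phraser

nltk.download("stopwords")
nltk.download("wordnet")

stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# `abstracts` is assumed to be a list of raw abstract/title strings.
tokens = [
    [lemmatizer.lemmatize(w) for w in doc.lower().split() if w not in stop]
    for doc in abstracts
]

# Concatenate bigrams that appear at least 50 times, e.g. "united_states".
bigram = Phraser(Phrases(tokens, min_count=50))
docs = [bigram[doc] for doc in tokens]
```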
After preprocessing, I use topic modeling in order to reduce large quantities of text to meaningful dimensions. Topic models are a popular instance of UML (Jordan and Mitchell 2015). Such models assign documents in a corpus to a combination of topics. Topics are derived directly from the documents by probabilistic algorithms and consist of words that co-occur across documents. In so-called generative models, each topic is seen as a probability distribution across all words of a given vocabulary, describing the likelihood of a word being chosen as part of a certain topic. This likelihood is independent of the position of the word in a text, which is why this is most often referred to as a “bag-of-words” representation of documents. Although this assumption is clearly not realistic (e.g., grammar is ignored), it has proven to be very reliable in practical applications (Landauer 2007).
Over the past decade, topic models have become very popular in the social sciences (Evans and Aceves 2016; McFarland et al. 2013). In particular, science-of-science studies make use of this sort of dimension reduction, for instance, by reconstructing the history of a field (Anderson et al. 2012; Hall et al. 2008), explaining scientists’ choice of research strategy (Evans and Foster 2011), tracing researchers’ interest changes (Jia et al. 2017), or relating relevant career outcomes to authors’ topic choices in medicine (Hoppe et al. 2019) or education (Munoz-Najar Galvez et al. 2020).
The core of most topic models was proposed by Blei et al. (2003): the latent Dirichlet allocation (LDA). Given a desired number of topics K and a set of D documents containing words from a vocabulary V, LDA models infer K topics that are each a multinomial distribution over V. Thus, topics are a mixture of words from V, with probability β of a word belonging to a topic. The more often words co-occur in documents, the higher the probability that those words constitute a topic. At the same time, a document is also considered a mixture of topics, so that a single document can be assigned to multiple themes. The topic proportions are given by the parameter θ. By design, all topics occur within each document; thus, the proportions in θ give us the strength of the connection between a topic (itself an ordered vector of words) and a document. Finally, it is important to note that the sampling process of LDA and all its extensions draws, for each topic and each document, from the eponymous Dirichlet distribution. Hence, the same prior distribution is used for all documents in a corpus.
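As a rough illustration of the generative model just described, a plain LDA can be fitted in a few lines with gensim on the preprocessed documents from the preceding sketch. The pruning thresholds and hyperparameters are assumptions, and the article itself fits an STM rather than this vanilla LDA.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Build the vocabulary V and the bag-of-words corpus from `docs`.
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # prune rare/ubiquitous terms
corpus = [dictionary.doc2bow(doc) for doc in docs]

K = 60
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K,
               passes=10, random_state=1)

# beta: topic-word probabilities; theta: per-document topic proportions.
beta = lda.get_topics()                                   # shape (K, len(dictionary))
theta = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
```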
In this article, I use an extension of LDA called structural topic modeling (STM) (Roberts et al. 2014, 2016). Its key feature is to enable researchers to incorporate document metadata and utilize such information (e.g., year) to improve the consistent estimation of topics. The covariates of a document d are denoted as X_d. The basic model relies on the same process explained above. However, in an STM the topic proportions θ depend on a logistic-normal generalized linear regression. Thus, for each word a topic is drawn from a document-specific distribution based on that document’s covariates X_d, not, as in regular LDA (Footnote 12), from a general distribution that is the same for all documents. Several simulations have shown that the incorporation of covariates substantially improves topic quality (Roberts et al. 2014, 2016).
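In formal terms, and roughly following Roberts et al. (2014, 2016), the difference lies in the prior on the topic proportions: LDA draws θ_d from one global Dirichlet, whereas STM replaces it with a covariate-dependent logistic-normal prior,

\[
\theta_d \sim \mathrm{LogisticNormal}\left(X_d \gamma,\ \Sigma\right),
\]

where γ collects the prevalence coefficients and Σ is a covariance matrix; the exact parameterization should be checked against the original papers.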
STM has proved especially useful for longer periods of time and changing discourses (Farrell 2016), which suits the data at hand well, for they span three decades of sociological dissertations. In addition, we can include gender as a covariate and thereby examine potential differences between male and female research preferences. Hence, the gender predictions obtained with SML as described above will be utilized as a covariate to predict gender differences in the choice of research topics in U.S. dissertations from 1980 to 2015.
As with most UML methods (e.g., cluster analysis, principal component analysis), researchers have to set K even though the number of relevant dimensions is not known a priori. Too low a number renders models too coarse, whereas high values can result in very specialized subthemes. This is a widely recognized issue in topic modeling (Chang et al. 2009; Heiberger and Munoz-Galvez 2021) and requires careful, qualitative judgment by the researchers.
However, we can base our judgment on two established metrics: semantic coherence (Mimno et al. 2011) and exclusivity (Roberts et al. 2014). Semantic coherence addresses whether a topic is internally consistent by calculating the frequency with which high-probability topic words tend to co-occur in documents. Yet, semantic coherence alone can be misleading, as high values can be obtained simply because very common words of a topic occur together in most documents. To account for the desired statistical discrimination between topics, we therefore also consider a topic’s exclusivity. This measure captures the extent to which the words of a topic are distinct to it. Exclusivity and coherence complement each other and are, hence, examined in concert to indicate where topics both represent word distributions in documents and provide differentiated dimensions. Accordingly, the STM developers recommend that researchers look for the “semantic coherence-exclusivity frontier” (Roberts et al. 2014, p. 1070). We can observe such a “plateau” at K = 60 (Fig. 1). Given the trade-off between more exclusive, yet less coherent (in the above sense) topics, those plateaus form the most parsimonious (i.e., smallest) choice of K.
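A sketch of such a model-selection loop is shown below; gensim’s coherence measure and the simple exclusivity score defined here are stand-ins for the exact metrics of the stm R package used in this literature, and the candidate values of K are illustrative.

```python
import numpy as np
from gensim.models import CoherenceModel, LdaModel

def exclusivity(model, top_n=10):
    """Average share of a top word's probability mass that falls on its own topic."""
    beta = model.get_topics()                      # (K, V) word probabilities
    norm = beta / beta.sum(axis=0, keepdims=True)  # each word's mass split across topics
    tops = np.argsort(beta, axis=1)[:, -top_n:]    # indices of top words per topic
    return np.mean([norm[k, tops[k]].mean() for k in range(beta.shape[0])])

results = {}
for K in (20, 40, 60, 80, 100):
    lda_k = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K,
                     passes=10, random_state=1)
    coherence = CoherenceModel(model=lda_k, texts=docs, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    results[K] = (coherence, exclusivity(lda_k))   # inspect the frontier across K
```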
Gender Preferences of Research Topics in US Sociological Dissertations
Topics consist of terms ordered by their probability of being used in a document that contains the given topic (denoted β above). Table 3 presents a ranking with FREX (Roberts et al. 2016), where a term is weighted by the harmonic mean of the word’s rank in terms of frequency (FR) and exclusivity (EX) within a topic. For instance, topic 7’s (T7) most descriptive words are “black,” “neighborhood,” “white,” and its FREX terms are “black_woman,” “black,” “segregation.” It seems intuitive to assume that a thesis loading highly on T7 engages with the topic Race.
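For reference, the FREX score of word v in topic k is commonly written as a weighted harmonic mean of the word’s frequency and exclusivity ranks; this rendering follows Roberts et al. (2016), with notation adapted to the β used above:

\[
\mathrm{FREX}_{k,v}=\left(\frac{\omega}{\mathrm{ECDF}\left(\beta_{k,v}\big/\sum_{j}\beta_{j,v}\right)}+\frac{1-\omega}{\mathrm{ECDF}\left(\beta_{k,v}\right)}\right)^{-1},
\]

where the ECDFs are taken over the words of topic k, the first term reflects exclusivity, the second frequency, and ω balances the two.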
Table 3 Overview of topics

So, what did young sociologists in the US write theses on during the last 30 years? Table 3 gives an overview of derived topics. PhD students’ research interests are widely spread, as was to be expected from a varied, fragmented discipline (Abbott 2001; Heiberger et al. 2021a). Topics comprise broad research themes used by many (e.g., T17 Culture, T35 Survey: Scales), thematic specialties (e.g., T25 Political sociology, T30 Economic sociology), methods (e.g., T42 Experiments), topics crossing many social spheres (e.g., T7 Race, T21 Social networks), and concepts related to other disciplines (e.g., T26 Social work).
Although exploring those topics in greater detail might be a worthwhile undertaking (and can be done by examining Tab. 3), this article focuses on the application of SML and UML in order to reveal different choices of research topics by gender. And indeed, we detect clear preferences for some of the most prevalent choices of students (Fig. 2). Although research on Culture (T17) and Survey (T2, T35) is almost equally spread across genders, we observe large differences when it comes to T4 Modeling and T27 Social theory. Both are much more frequently chosen by male students; in the case of T27, the probability that a thesis on social theory is written by a male student is more than twice as high.
Figure 2 also allows us to observe some general trends. In particular, research related to Culture (T17) is rising in popularity among students. This is connected to the influence of the “cultural change” on all social sciences (Jacobs and Spillman 2005). In contrast, US PhD students are writing about survey-related methods less and less frequently. T2 and T35 have steadily lost popularity. T35 started in 1980 as one of the most popular topics and has been declining starkly ever since. This trend might also reflect more general research currents; at least, it has also been observed for the discipline of education (Munoz-Najar Galvez et al. 2020).
Making further use of the STM results, we can also identify the topics exhibiting the largest differences across genders. For that purpose, we distinguish topics with an equal probability for both genders (i.e., a similar distribution of topic load) from topics revealing large differences. Thus, 0 represents no difference in topic usage, whereas higher values indicate deviations across gender preferences. The interpretation is straightforward. For instance, a value of 2 for female preferences in a certain topic means that females have twice as high a probability as males of writing about that topic.
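The ratio described here can be computed directly from the estimated topic proportions. The sketch below assumes a document-by-topic matrix built from the `theta` output of the earlier example and a hypothetical per-dissertation gender vector `diss_gender`; the stm R package offers a model-based alternative via estimateEffect().

```python
import numpy as np

# Dense document-by-topic matrix from the per-document topic loadings.
theta_mat = np.zeros((len(theta), K))
for d, doc in enumerate(theta):
    for topic_id, prob in doc:
        theta_mat[d, topic_id] = prob

gender = np.array(diss_gender)   # assumed array of "female"/"male"/"unknown" labels

# Ratio of mean topic prevalence in female- vs. male-authored theses.
ratio = (theta_mat[gender == "female"].mean(axis=0)
         / theta_mat[gender == "male"].mean(axis=0))
# ratio[k] == 2 would mean topic k is twice as prevalent among female authors.
```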
Figure 3 shows the five most pronounced differences for each gender. It reveals more than “fine distinctions.” The list of female research preferences reads like a list of tasks traditionally assigned to women, ranging from motherhood (T6) and childhood (T15) to socialization (T57) and caregiving (T3). The likelihood of engaging with these topics is at least twice as high for females as for their male colleagues. Even more striking, female PhD candidates in sociology are six times more likely to write about feminism (T56) than males. It is somewhat ironic that perhaps the most important movement for gender equality is itself quite gender-specific. When it comes to topic choices in sociology dissertations, at least, preferences are clear-cut between the sexes.
Correspondingly, male preferences lie in social arenas in which men constitute the majority (Fig. 3). Politics (T25), economy (T30, T50), and justice (T5, T31) exhibit the largest differences between the sexes. In all three areas, male PhD candidates are around twice as likely as their female counterparts to write about those topics in their dissertations.
Now, one might object that times change. However, deviations remain considerable, and differences are present for the whole observation period in most cases (Fig. 4). Yet, there are exceptions. For instance, the gap is closing in terms of Socialization (T57). In the 1980s, it was among the most popular choices for females, but it has declined ever since. In contrast, Crime (T5) has gained popularity across the sexes, though more among male students. The reverse is true for Caregiver (T3). T3 started in the 1980s at an equally marginal level, yet has attracted substantial interest since the 2000s, in particular among female students. Although we observe some ups and downs, gender-specific majorities have not flipped in any case during the 35 years of observation, revealing strong and persistent gender differences in research preferences (Footnote 13).