1 Introduction

Machine learning (ML) is an umbrella term for statistical methods with which computers learn from data and extract information. Applications of ML paved the way for some of the most promising technical innovations of recent years (e.g., artificial intelligence, gene prediction, or search engines) and changed the everyday life of many people (Jordan and Mitchell 2015). ML represents a breakthrough in computer science; yet, its adoption in the social sciences has been less enthusiastic. Although a recent article gives a comprehensive overview of sociological studies using ML (Molina and Garip 2019), an application-oriented introduction that might ease a sociologist’s way into the subject is still lacking.

Therefore, this article has three goals: First, I discuss and categorize ML methods. From a methodological perspective, ML can be classified into two paradigms: supervised machine learning (SML) and unsupervised machine learning (UML) (e.g., Jordan and Mitchell 2015; Molina and Garip 2019). Whereas SML uses labeled data as input to train algorithms in order to predict outcomes for unlabeled data, UML detects underlying patterns in unlabeled observations by exploiting the statistical properties of the data. I will give an overview of both areas, emphasizing that several of the tools used in ML are by no means new to social scientists interested in statistics.

The other two aims are intertwined. On the one hand, I present a “how-to” guide for both SML and UML. I do that, on the other hand, by applying SML and UML to an important substantive case, i.e., the mostly unexplored role that research topic choice plays in the academic gender gap. By shedding light on the empirical case, the application of ML is illustrated practically, “by doing.” Thus, I will not present a literature review, as this has been done comprehensively for sociology by Molina and Garip (2019) only recently. Instead, I will present an easy-to-use SML classifier to derive the associated gender from first names. The proposed approach outperforms a prominent commercial application (genderize.io). Detecting gender “automatically” might be useful in many cases of (quantitative or qualitative) content analysis. Using the predicted gender of authors, I then examine gender-specific preferences in research topics by means of UML. In particular, I explain and apply structural topic modeling (Roberts et al. 2014) in order to reduce a corpus of texts from a near-complete sample of US dissertations in sociology to its main dimensions. In so doing, the article reveals important differences in the research choices of female and male PhD students and, hence, adds a widely overlooked aspect to the rich literature on gender biases in academic publishing.

2 Principles of ML

2.1 SML

Many people might first think of SML when referring to machine learning, as it is the most widely used area of ML and comprises the methods that have witnessed the largest performance boost owing to larger and more detailed data in recent years (e.g., image recognition). Although SML was initially used primarily in computer science, its applications have nowadays spread to almost all scientific fields and business sectors (Jordan and Mitchell 2015). The main aim of SML is to predict an outcome with a given set of features. That corresponds to what social scientists describe as estimating a dependent variable using a set of independent variables.

So to what extent does SML actually differ from classic statistical methods? The answer lies in the regularization of variance and the empirical tuning of parameters (Molina and Garip 2019; Mullainathan and Spiess 2017). I would also like to emphasize that (apparent) differences stem from differing goals. Whereas classic statistics tries to infer parsimonious models that explain how an outcome is generated, SML does not care about interpretability but only about how best to forecast the outcome. “Generative modeling” (Donoho 2017) focuses on unbiased and consistent estimators for a given dataset, i.e., beta coefficients are the most interesting part of regressions for social scientists because they provide access to explaining the data at hand. This is a crucial epistemological difference yielding many practical consequences.

In contrast, SML prioritizes prediction. Regardless of meaningful interpretation or unbiased estimation, SML uses functions of high complexity as long as they perform well “out of sample,” i.e., as long as models are able to predict new data. That means issues such as autocorrelation or multicollinearity are treated as features, not problems. Consequently, functions may yield “black boxes” (e.g., when it comes to multi-layer neural networks or high orders of interaction effects); a large number of variables might be used; hard-to-interpret polynomials and interactions are included; and a certain degree of “in-sample error” may be allowed for the sake of predicting new data correctly.

Thus, unlike most social-science workflows, which rely on a single dataset for modeling, SML requires at least two datasets: training data and test data. The first dataset is used to develop (i.e., train) the model, the second to test its predictive capacity on out-of-sample data. Often, the training and test sets are randomly sampled from the same dataset, which is split, e.g., 50/50 (although there is no general rule of thumb that I am aware of).

Supervised machine learning aims at balancing under- and overfitting. Whereas classic statistical models are prone to overfitting and therefore possess only limited predictive abilities for new data, SML uses “regularizers” (i.e., parameters of algorithms) to balance underfitting and overfitting. To accomplish this task, SML uses the training data and tunes the regularizers to fit the data at hand (their number and effect differ by algorithm). Therefore, researchers can use many variables as input and consider complex functions up to outright mathematical black boxes, but still regulate their models to fit out-of-sample data. Note the difference compared with classic inferential statistics, in which models follow the idea of being most parsimonious. Although there is a wide array of potential algorithms to connect an outcome with features, the basic principle of almost all SML can be summarized in the following steps, which will be applied in Sect. 3.2 (a minimal code sketch follows the list):

1. Split the data into training and test data (questions to address include: What split ratio? What is the scale of the outcome? How many features? Does the data need to be matched?).

2. Train a model (choose an algorithm linking features to the outcome; decide which models perform best; tune the model parameters).

3. Evaluate model accuracy (assess model fit by out-of-sample predictions on the test data).
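
To make these steps concrete, the following minimal sketch runs through them in Python with scikit-learn. It is an illustration rather than the code used in this article; the toy data and all variable names are hypothetical.

```python
# Minimal sketch of the three SML steps using scikit-learn.
# The toy data stand in for any labeled social-science dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 1: split the data into training and test sets (here 50/50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Step 2: train a model (a logistic regression; its regularization
# strength C is one of the tunable "regularizers" mentioned above)
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Step 3: evaluate accuracy with out-of-sample predictions on the test data
y_pred = model.predict(X_test)
print("out-of-sample accuracy:", accuracy_score(y_test, y_pred))
```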

2.2 UML

Unlike SML, there is no “supervisor” for UML and no pre-labeled data from which algorithms learn. Instead, UML tries to reveal patterns that are hidden in the data. It detects underlying structures in observations by exploiting statistical properties of the data. In essence, UML is aimed at creating categorization schemes or typologies. Researchers can then define types along the derived (latent) dimensions and represent each case relative to the types given its underlying values.

Often, researchers have no access to a ground truth with which to set the number of dimensions and validate models. To determine types’ fuzzy empirical boundaries, inductive approaches such as cluster analysis, principal component analysis, or latent class analysis are therefore often combined with theoretical considerations. The main purpose of UML is to explore data and reduce its complexity. Researchers might use the output as input for further analysis (e.g., Munoz-Najar Galvez et al. 2020) or to develop theoretical models (Hall and Soskice 2001).

The resulting (ideal) types are arguably among the most important methodological tools of social scientists and have been used for a long time (Ahlquist and Breunig 2012). Thus, utilizing exploratory techniques is not at all new to the social sciences; yet, UML does provide novel ways of analyzing large amounts of text and social networks, both kinds of data often associated with the digital age and computational social science (Heiberger and Riebling 2016; Lazer et al. 2009). In particular, the “automatic” categorization of large corpora has found many applications to social phenomena (Evans and Aceves 2016). Topic models represent one of the most frequently used natural language processing (NLP) tools in the social sciences (McFarland et al. 2013). Their main idea is to summarize a large corpus of documents into relatively few meaningful themes (i.e., topics) and, hence, keep the most relevant information. For instance, social scientists have used NLP methods to reconstruct the discursive history of scientific fields (Wieczorek et al. 2021; Hall et al. 2008), analyze media effects on attitudes (Erhard et al. 2022), trace the fragmentation of political discourse (Heiberger et al. 2021a), or explain scientists’ choice of research strategy (Evans and Foster 2011).

Social networks constitute another branch with deep roots in UML. Since the 1970s, network researchers have applied “blockmodeling” to find structurally equivalent nodes and group them together (White et al. 1976). The rise of network data with the internet led to many new developments in this area, most often characterized as community detection (Fortunato 2010). Although many physicists are involved in developing new, mathematically sophisticated graph-partitioning methods, the idea is the same as for all UML: summarize data by finding its most important dimensions and/or group similar cases to derive types.

3 Application: Gender Differences in Scientific Publishing

3.1 The Role of Gender in Scholarly Authorship

To illustrate SML and UML in greater detail, I will now turn to an important case: gender differences in scientific publishing. Applying gender detection (SML) and topic models (UML) to a large sample of US sociology dissertations will reveal new insights into how research topics are deeply divided by gendered preferences.

Despite growing awareness, gender differences in academia still persist across all disciplines and countries (Barone 2011; Holman et al. 2018; Huang et al. 2020; Larivière et al. 2013). At the center of interest lies the “productivity puzzle” (Xie and Shauman 1998), i.e., the evidence that male researchers publish more than their female colleagues. Explanations point to many related differences, for instance, in collaboration practices (Abramo et al. 2019; Jadidi et al. 2017; Uhly et al. 2017), family responsibilities (Carr et al. 1998; Fox 2005), or the rank of the alma mater (van den Besselaar and Sandström 2017). Those findings appear in a new light given recent results from Huang et al. (2020). By reconstructing over seven million researcher careers from a large sample of publications, they could show that gender differences in productivity and impact are stable, but that those differences are rooted in gender-specific dropout rates.

Although those findings have wide-ranging policy implications, there is another major aspect of gender biases in academic publishing that is still widely overlooked: research content and topics. Among the few exceptions, Nielsen et al. (2017) find that medical studies are more likely to include gender in their analyses if women are among the authors. Only recently, Key and Sumner (2019) found gendered research topics in political science. I will also refer to their results in the Discussion section.

3.2 Using SML: Detecting Gender

To further our understanding of gender preferences for certain research topics, I base my analysis on dissertations. Theses are a formal requirement for becoming part of the scientific community (Collins 2002). Trying to gain recognition as experts, prospective graduates spend a long time on their respective projects. Most PhD candidates ponder the objectives and meaning of their thesis many times. Hence, the chosen topics should reflect personal preferences, also because one’s thesis is a strategic decision (Bourdieu 1988, p. 94). In addition, theses have to be single-authored, circumventing problems present in studies using research articles with several coauthors.

The data are retrieved from the ProQuest database and represent a close approximation of all US-based dissertations (Hofstra et al. 2020). To reflect the sociological field, all theses written in a sociology department have been included (N = 41,045). Thus, the research topics represent a rather narrow perspective on sociology, excluding interrelated fields such as education or psychology. Taken together, the analyzed texts comprise each dissertation’s abstract and title and range from 1980 to 2015.

To derive gender, almost all of the cited studies (see the Section “The Role of Gender in Scholarly Authorship”) use the first names of authors. Although this process is often moved to footnotes (if mentioned at all), I will now describe the three SML steps to classify gender from names and compare the proposed approach to a commercial application (genderize.io).

First and foremost, every SML task needs training data (step 1). Regardless of the particular classification, pre-labeled data are needed to establish the statistical models from which to derive the desired classifications. The larger the amount of training data, the better the subsequent predictions. A major obstacle, hence, is to obtain such “ground truths” at a sizeable scale. Labeling training data gets particularly expensive (in time and/or money) when human coders (considered the “gold standard” of ML training sets) are needed.

One way of obtaining suitable training data is to use process-generated data, derived, for instance, from public records or, in industry, from customer databases, sales, or web logs. In the case of gender prediction, I will utilize the largest collection of names that is publicly available, the US Social Security Administration (SSA) records. They contain the first names collected annually for each of the 355,149,899 babies born in the US between 1880 and 2019. Unlike West et al. (2013) or Karimi et al. (2016), we use the full records (all names with at least 10 occurrences per year). Like most social actions, however, name-giving follows a heavily skewed power-law distribution with relatively few high-frequency names (James, John, and Robert at the top for male names; Mary and Elizabeth being the most popular female names).

In total, 99,444 unique names have been recorded in the US since 1880. Most are associated with one gender exclusively; only 10,942 have been assigned to both sexes over all those years. Whereas we can assign a probability of 1 to the names that indicate solely one gender, the ambiguous names provide us with an interesting case for applying SML.
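
As a first practical step, the SSA records have to be read and aggregated by name and sex. The sketch below illustrates this in Python with pandas; the file layout (one comma-separated “yobYYYY.txt” file per year with the columns name, sex, and count) reflects the publicly distributed SSA files, but the paths and folder names are assumptions, not the article’s actual code.

```python
# Sketch: reading and aggregating SSA baby-name files with pandas.
# Assumes the public "yobYYYY.txt" files (comma-separated: name, sex, count)
# have been downloaded to a local folder named "names/".
import glob
import pandas as pd

frames = [
    pd.read_csv(path, names=["name", "sex", "count"])
    for path in glob.glob("names/yob*.txt")
]
ssa = pd.concat(frames, ignore_index=True)

# aggregate birth counts per name and sex across all years
counts = ssa.groupby(["name", "sex"])["count"].sum().unstack(fill_value=0)

# names recorded for both sexes are the ambiguous cases relevant for SML
ambiguous = counts[(counts["F"] > 0) & (counts["M"] > 0)]
print(len(counts), "unique names;", len(ambiguous), "ambiguous names")
```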

Before performing the predictions, we need to match our test data to the SSA records. The test data comprise the first names of PhD students whose gender we do not know. The sample initially contains 41,045 dissertations; of those, 37,437 first names are found in the SSA records. The other 3608 are either relatively rare Asian names (e.g., Byung) or double-barrelled names (e.g., Zxy-Yann), but mostly first names recorded only as single letters in the database, i.e., essentially missing data. That is a rather low number compared with other approaches: West et al. (2013), for instance, did not find more than 26%, and Hofstra and de Schipper (2018) could not align more than 30% of their training data with their test data.

To apply SML, I will now focus on the 33,082 students who have names that are assigned to both sexes in the SSA records, resulting in 2545 unique names for which we want to predict the associated gender. For that purpose, we need to build a probabilistic classifier (step 2), i.e., a statistical model that links features (explanans, here: ambiguous first names) and outcome (explanandum, here: gender) to derive classifications based on the training data. In this article, I use logistic regression (generalized linear models, GLMs). Although this is arguably one of the most popular methods in the social sciences and should, hence, be familiar to most readers, it is only rarely used by social scientists to predict out-of-sample cases. That is in contrast to its use by computer scientists, who employ GLMs for predictions and see them as an essential part of the ML toolbox (Lantz 2019).
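
The following sketch shows one way to set up such a classifier: a logistic regression fitted to SSA-style counts, weighting each (name, sex) row by its number of births. With a single categorical predictor, this essentially recovers each name’s female share. The counts shown are invented for illustration; this is not the article’s actual code.

```python
# Sketch: a logistic regression (GLM) estimating P(female | first name)
# from SSA-style counts. The counts below are invented for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# hypothetical aggregated SSA records: one row per (name, sex) with birth counts
ssa = pd.DataFrame({
    "name":  ["leslie", "leslie", "jordan", "jordan"],
    "sex":   ["F", "M", "F", "M"],
    "count": [520_000, 110_000, 130_000, 360_000],
})

# one-hot encode the name (the only feature of this model)
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(ssa[["name"]])
y = (ssa["sex"] == "F").astype(int)

# weight each (name, sex) row by how many babies carry that combination
clf = LogisticRegression()
clf.fit(X, y, sample_weight=ssa["count"])

# predicted probability of "female" for the ambiguous names
test = pd.DataFrame({"name": ["leslie", "jordan"]})
p_female = clf.predict_proba(enc.transform(test[["name"]]))[:, 1]
print(dict(zip(test["name"], p_female.round(3))))
```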

To align the results of the proposed SSA approach with those achieved with genderize.io (Wais 2016) in the next step, the probability threshold is set to a rather strict level of 0.95. That means a student is classified as female or male only if the model predicts that gender with a probability of at least 95%. If the probability is lower, the case is set to “unknown.” The results depicted in Tab. 1 show that more female students finish the PhD (around 56%), which is in accordance with official statistics (National Center for Education Statistics 2018). A total of 2643 students (around 7%) cannot be assigned to a gender with the desired certainty. That is also a very favorable value compared with other studies (Karimi et al. 2016; Larivière et al. 2013; West et al. 2013).
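
Applied to the predicted probabilities, this cut-off rule translates into a few lines of code (a sketch only; the 0.95 threshold is the one used above):

```python
# Sketch: mapping predicted probabilities to gender labels with the 0.95 cut-off
def to_label(p_female, cut=0.95):
    """Assign a gender only if it is predicted with at least 95% certainty."""
    if p_female >= cut:
        return "female"
    if p_female <= 1 - cut:
        return "male"
    return "unknown"

print([to_label(p) for p in (0.99, 0.63, 0.02)])  # ['female', 'unknown', 'male']
```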

Table 1 Overview of gender predictions for the ProQuest sample (N = 41,045) based on the SSA approach

However, the final and most important step of all SML is to assess the accuracy of the predictions by comparing them with a “ground truth” and/or other approaches (step 3). Accuracy is most often assessed by calculating a confusion matrix as shown in Tab. 2. It evaluates the classification performance by counting the number of “true” and “false” instances. The ground truth in this article consists of 500 names (~20% of the test set) that three experts manually coded as female or male. Thus, we match each of the machine’s predictions to the gold standard of human coders, which is the key to assessing the performance of any ML task.

Table 2 Confusion matrix for gender predictions of the SSA approach. Results for genderize.io are reported in brackets. Ground truth results are based on 500 first names manually coded by three researchers. “False” means ambiguous codings of the human coders

We compare the SSA predictions with one of the most popular databases for gender predictions, genderize.io, which has found many prominent applications in science (Huang et al. 2020). The simple SSA approach proposed here clearly outperforms genderize.io. From Tab. 2 we can easily calculate the accuracy \(\frac{\left(TN+TP\right)}{\left(TN+TP+FN+FP\right)}\) and the F1 score \(\left(\frac{TP}{TP+\frac{1}{2}\left(FP+FN\right)}\right)\), two of the most important indicators of model quality in ML. Whereas the SSA approach achieves an accuracy of 0.76 and an F1 score of 0.85, genderize.io reaches only an accuracy of 0.68 and an F1 score of 0.79. These are rather large differences for SML tasks (Karimi et al. 2016).
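
Both measures (and the underlying confusion matrix) can be computed directly from the predicted and human-coded labels. The following sketch uses scikit-learn with invented labels standing in for the 500 manually coded names:

```python
# Sketch: confusion matrix, accuracy, and F1 score with scikit-learn.
# The labels are invented; in the article, y_true would hold the 500 manually
# coded names and y_pred the corresponding model predictions.
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

y_true = ["female", "male", "female", "female", "male", "male"]
y_pred = ["female", "male", "male", "female", "female", "male"]

# rows: true classes, columns: predicted classes
print(confusion_matrix(y_true, y_pred, labels=["female", "male"]))
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred, pos_label="female"))
```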

In addition, the SSA data are completely open to the public and easily downloadable in a machine-readable format. Even more importantly, results for SSA-based predictions are fully reproducible because of the fixed set of names for a given time span. In contrast, genderize.io is continuously expanding its database, so that (other) researchers are not able to reproduce previous classification results. Finally, genderize.io is not free of charge for more than 1000 names per day, which is an additional disadvantage.

3.3 Using UML: Deriving Topics

One of the crucial yet often unspoken steps in working with large amounts of texts is to prepare and clean the data. To provide an appropriate “how-to” description, I will spell out those details before describing the UML applied here.

In a first step, all stopwords have been removed (e.g., “and,” “or,” “the”). After that, the words have been lemmatized. Lemmatization is a common step in NLP to reduce different forms of a word (e.g., singular and plural) to a common base form (e.g., “women” becomes “woman”). As a final preprocessing step, I concatenated bigrams appearing more than 50 times (e.g., “united” and “states” become “united_states”). In so doing, we can detect meaningful phrases like “factor_analysis” or “statistical_significant” in the dissertation abstracts (Blaheta and Johnson 2001).
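
A minimal sketch of such a preprocessing pipeline is shown below, using NLTK for stopwords and lemmatization and gensim for bigram detection. The tool choice and the toy abstracts are illustrative assumptions, not the pipeline actually used for the ProQuest corpus.

```python
# Sketch of the preprocessing pipeline: stopword removal, lemmatization,
# and bigram detection. Tool choice and toy abstracts are illustrative.
# Requires: nltk.download("stopwords"); nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models.phrases import Phrases

abstracts = [
    "the women of the united states are studied with factor analysis",
    "a factor analysis of women in the united states labor market",
]

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

docs = []
for text in abstracts:
    tokens = [t for t in text.lower().split() if t not in stops]
    docs.append([lemmatizer.lemmatize(t) for t in tokens])

# concatenate frequently co-occurring word pairs into bigrams
# (the article uses min_count=50; lowered here for the toy corpus)
bigrams = Phrases(docs, min_count=1, threshold=1)
docs = [bigrams[doc] for doc in docs]
print(docs)
```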

After preprocessing, I use topic modeling in order to reduce large quantities of text to meaningful dimensions. Topic models are a popular instance of UML (Jordan and Mitchell 2015). Such models assign the documents in a corpus to a combination of topics. Topics are derived directly from the documents by probabilistic algorithms and consist of words that co-occur across documents. In these so-called generative models, each topic is a probability distribution across all words of a given vocabulary, describing the likelihood of a word being chosen as part of a certain topic. This likelihood is independent of the position of the word in a text, which is why this is most often referred to as a “bag-of-words” representation of documents. Although this assumption is clearly not realistic (e.g., grammar is ignored), it has proven to be very reliable in practical applications (Landauer 2007).

Over the last decade, topic models have become very popular in the social sciences (Evans and Aceves 2016; McFarland et al. 2013). In particular, science of science studies make use of this sort of dimension reduction, for instance, by reconstructing the history of a field (Anderson et al. 2012; Hall et al. 2008), explaining scientists’ choice of research strategy (Evans and Foster 2011), tracing researchers’ changes of interest (Jia et al. 2017), or relating relevant career outcomes to authors’ topic choices in medicine (Hoppe et al. 2019) or education (Munoz-Najar Galvez et al. 2020).

The core of most topic models is the latent Dirichlet allocation (LDA) proposed by Blei et al. (2003). Given a desired number of topics K and a set of D documents containing words from a vocabulary V, LDA models infer K topics that are each a multinomial distribution over V. Thus, topics are mixtures over the words in V, with probability β of a word belonging to a topic. The more often words co-occur in documents, the higher the probability that those words constitute a topic. At the same time, a document is also considered a mixture of topics, so that a single document can be assigned to multiple themes. The topic proportions are given by the parameter θ. By design, all topics occur within each document; thus, the proportion θ gives us the strength of the connection between a topic (itself an ordered vector of words) and a document. Finally, it is important to note that LDA and all its extensions draw the topic proportions for each document from the eponymous Dirichlet distribution; hence, the same prior distribution is used for all documents in a corpus.
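
To make this less abstract, the following sketch fits a plain LDA model on a toy corpus with scikit-learn and prints each topic’s most probable words (β) as well as the document–topic proportions (θ). It illustrates vanilla LDA only; the article itself uses the structural topic model described below, which is implemented, for instance, in the R package stm. The toy abstracts and parameters are assumptions for illustration.

```python
# Sketch: fitting a plain LDA model with scikit-learn on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "gender inequality in the labor market",
    "labor market outcomes and wage inequality",
    "crime rates and law enforcement in urban neighborhoods",
    "urban neighborhoods segregation and crime",
]

# bag-of-words representation: word order is ignored
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(abstracts)

K = 2  # desired number of topics
lda = LatentDirichletAllocation(n_components=K, random_state=0)
theta = lda.fit_transform(dtm)   # document-topic proportions (theta)
beta = lda.components_           # topic-word weights (unnormalized beta)

vocab = vectorizer.get_feature_names_out()
for k in range(K):
    top_words = [vocab[i] for i in beta[k].argsort()[::-1][:4]]
    print(f"Topic {k}:", top_words)
print(theta.round(2))            # each row: one document's topic mixture
```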

In this article, I use an extension of LDA called structural topic modeling (STM) (Roberts et al. 2014, 2016). Its key feature is that it enables researchers to incorporate document metadata and to utilize such information (e.g., year) to improve the consistent estimation of topics. The covariates of a document d are denoted as Xd. The basic model relies on the same process explained above. In an STM, however, the topic proportions θ follow a logistic-normal distribution whose mean depends on the document covariates via a generalized linear regression. Thus, for each word a topic is drawn from a document-specific distribution based on the covariates Xd, not only, as in regular LDA, from a general distribution that is the same for all documents. Several simulations have shown that the incorporation of covariates substantially improves topic quality (Roberts et al. 2014, 2016).

STM proved to be especially useful for longer periods of time and changing discourses (Farrell 2016), which suits the data at hand well, for it spans three decades of sociological dissertations. In addition, we can include gender as an additional covariate and therefore examine potential differences between male and female research preferences. Hence, the gender predictions done with SML that are described above will be utilized as a covariate to predict gender differences in the choice of research topics in U.S. dissertations from 1980 to 2015.

As with most UML methods (e.g., cluster analysis, principal component analysis), researchers have to set K even though the number of relevant dimensions is not known a priori. Too small a number renders models too coarse, whereas high values could result in very specialized subthemes. This is a widely recognized issue in topic modeling (Chang et al. 2009; Heiberger and Munoz-Galvez 2021) and requires elaborate, qualitative judgment by the researchers.

However, we can base our judgment on two established metrics: semantic coherence (Mimno et al. 2011) and exclusivity (Roberts et al. 2014). Semantic coherence addresses whether a topic is internally consistent by calculating the frequency with which high-probability topic words tend to co-occur in documents. Yet, semantic coherence alone can be misleading, as high values can be obtained simply by very common words of a topic that occur together in most documents. To account for the desired statistical discrimination between topics, we therefore also consider a topic’s exclusivity. This measure tells us the extent to which the words of a topic are distinct to it. Exclusivity and coherence complement each other and are hence examined in concert to indicate where topics both represent the word distributions in documents and provide differentiated dimensions. Accordingly, the STM developers recommend that researchers look for the “semantic coherence-exclusivity frontier” (Roberts et al. 2014, p. 1070). We can observe such a “plateau” at K = 60 (Fig. 1). Given the trade-off between more exclusive, yet less coherent (in the above sense) topics, those plateaus form the most parsimonious (i.e., smallest) choice of K.
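
The general idea of comparing models across candidate values of K can be sketched as follows, here with gensim and a coherence metric (u_mass) on a toy corpus. The article relies on the semantic coherence and exclusivity measures implemented in the stm package; this sketch only illustrates the search over K.

```python
# Sketch: comparing LDA models across candidate values of K using a
# coherence metric (gensim's u_mass). The article additionally considers
# exclusivity; the toy corpus and the values of K are illustrative.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [
    ["gender", "inequality", "labor", "market"],
    ["labor", "market", "wage", "inequality"],
    ["crime", "law", "enforcement", "urban", "neighborhood"],
    ["urban", "neighborhood", "segregation", "crime"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, random_state=1)
    cm = CoherenceModel(model=lda, corpus=corpus,
                        dictionary=dictionary, coherence="u_mass")
    print(f"K={k}: coherence={cm.get_coherence():.3f}")
```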

Fig. 1 Distribution of exclusivity (right y‑axis) and semantic coherence (left y-axis) to approximate the number of topics (K)

3.4 Gender Preferences of Research Topics in US Sociological Dissertations

Topics consist of terms ordered by their probability of being used in a document that contains the given topic (denoted β above). Table 3 presents a ranking with FREX (Roberts et al. 2016), in which a term is weighted by the harmonic mean of the word’s rank in terms of frequency (FR) and exclusivity (EX) within a topic. For instance, topic 7’s (T7) most descriptive words are “black,” “neighborhood,” and “white,” and its FREX terms are “black_woman,” “black,” and “segregation.” It seems intuitive to assume that a thesis with high loads of T7 deals with the topic of Race.

Table 3 Overview of topics

So, what did young sociologists in the US write theses on during the last 30 years? Table 3 gives an overview of derived topics. PhD students’ research interests are widely spread, as was to be expected from a varied, fragmented discipline (Abbott 2001; Heiberger et al. 2021a). Topics comprise broad research themes used by many (e.g., T17 Culture, T35 Survey: Scales), thematic specialties (e.g., T25 Political sociology, T30 Economic sociology), methods (e.g., T42 Experiments), topics crossing many social spheres (e.g., T7 Race, T21 Social networks), and concepts related to other disciplines (e.g., T26 Social work).

Although exploring those topics in greater detail might be a worthwhile undertaking (and can be done by examining Tab. 3), this article focuses on the application of SML and UML in order to reveal different choices of research topics by gender. And indeed, we detect clear preferences for some of the most prevalent choices of students (Fig. 2). Whereas research on Culture (T17) and Survey (T2, T35) is almost equally spread across genders, we observe large differences when it comes to T4 Modeling and T27 Social theory. Both are much more frequently chosen by male students; in the case of T27, the probability that a thesis on social theory is written by a male student is more than twice as high.

Fig. 2 Topic prevalence by year and gender

Figure 2 also allows us to observe some general trends. In particular, research related to Culture (T17) is rising in popularity with students. This is connected to the influence of the “cultural change” on all social sciences (Jacobs and Spillman 2005). In contrast, US PhD students are writing about survey-related methods less and less frequently. T2 and T35 are constantly losing popularity. T35 started out in 1980 as one of the most in-demand topics and has been declining starkly ever since. This trend might also reflect more general research currents; at least, it has also been observed for the discipline of education (Munoz-Najar Galvez et al. 2020).

Making further use of the STM results, we can also identify the topics exhibiting the largest differences between genders. For that purpose, we identify topics with an equal probability for both genders (i.e., a similar distribution of topic load) and, in turn, topics revealing large differences. Thus, 0 represents no difference in topic usage, whereas higher values indicate deviations between gender preferences. The interpretation is straightforward. For instance, a value of 2 for female preferences in a certain topic means that females have a probability of writing about that topic that is two times higher than that of males.

Figure 3 shows the five most pronounced differences for each gender. It reveals more than “fine distinctions.” The list of female research preferences reads like a list of tasks traditionally assigned to women, ranging from motherhood (T6) and childhood (T15) to socialization (T57) and caregiving (T3). The likelihood of engaging with these topics is at least twice as high for females as for their male colleagues. Even more striking, female PhD candidates in sociology are six times more likely to write about feminism (T56) than males. It is somewhat ironic that perhaps the most important movement for gender equality is itself a markedly gender-specific research topic. Hence, at least when it comes to topic choices in sociology dissertations, preferences are clearly divided between the sexes.

Fig. 3 Gender preferences of research topics in US sociology theses. The five largest differences for each gender are depicted

Correspondingly, male preferences also lie in social arenas in which men form the majority (Fig. 3). Politics (T25), economy (T30, T50), and justice (T5, T31) exhibit the largest differences between the sexes. In all three areas, male PhD students are around twice as likely as their female counterparts to write about those topics in their dissertations.

Now, one might object that times change. However, deviations remain considerable, and the differences are present for the whole observation period in most cases (Fig. 4). Yet, there are exceptions. For instance, the gap is closing in terms of Socialization (T57). In the 1980s, it was among the most popular choices for females and has declined ever since. In contrast, Crime (T5) has gained popularity across the sexes, though more among male students. The reverse is true for Caregiver (T3). T3 started in the 1980s at an equally marginal level, yet has attracted substantial interest since the 2000s, in particular among female students. Although we observe some ups and downs, gender-specific majorities have not flipped for any topic during the 35 years of observation, revealing strong and persistent gender differences in research preferences.

Fig. 4 Gender preferences of research topics in US sociology theses, over time (1980–2015)

4 Discussion

The article at hand serves three purposes: first, it provides an introduction to ML methods with a focus on social sciences; second, it applies ML methods and, in so doing, provides a “how-to” guideline for using SML and UML (including code); and, third, by applying ML it discloses substantial gender differences in research preferences for a large sample of dissertations written by US PhD students in sociology departments.

The substantive results shed new light on gender differences in academia. Despite an abundance of studies (Barone 2011; Holman et al. 2018; Huang et al. 2020; Larivière et al. 2013), research topics are a widely overlooked factor when it comes to gender biases in academic publishing. The results show a surprisingly clear picture: female PhD students in the US prefer topics such as Caregiver, Motherhood, or Feminism. The prevalence of those research areas is at least twice and up to six times as high for theses of female PhD students as for those of their male counterparts. In contrast, theses written by men focus more than twice as often on Law enforcement, Crime, or Economic sociology. These pronounced gender preferences have been mostly stable for more than 30 years.

A potential explanation might be that those topics are closely related to the real-life experiences of students, i.e., that men and women undergo different socialization processes, live through different societal expectations and roles, and, hence, develop different research interests (Key and Sumner 2019). In favor of this explanation, a comprehensive study finds that curricular choices are strongly influenced by gender-specific interests similar to those seen in the research topics of PhD students (Charles and Bradley 2009). Still, it is surprising that the long, sometimes painful, yet highly reflexive process of writing a dissertation exhibits such a high degree of gender bias. It may very well be, though, as Key and Sumner (2019) suggest for political science, that many of the much-discussed biases in the publication behavior of both sexes rest on the choice of research topics.

While obtaining those substantive insights, the article tries to inform social scientists about ML methods by applying them. The results on SML suggest a clear recommendation when it comes to gender prediction from first names, a task often encountered in science of science studies or content analysis. Using a GLM framework and SSA data yields greater accuracy than genderize.io, despite the platform’s popularity among researchers (e.g., Huang et al. 2020). The rather simple approach presented here is not only more accurate in predicting gender, but also free of charge and, even more importantly, replicable. Genderize.io is none of those things.

That points to a larger issue of SML (and, to a lesser degree, UML): the more data, the better the results. And the largest amounts of data are held by tech companies such as Google, Microsoft, Facebook, Twitter, etc. I do not object to researchers’ usage of data gathered or provided by companies as long as the data are open and access for researchers is unrestricted. Clearly, that is often not the case, given that the proprietary use of information constitutes a large part of the value of those companies. This issue is not exclusive to ML; yet, owing to its reliance on large (and well-annotated) data to perform well, it is more apparent than in other parts of the social sciences. Scientific research is a public good and needs to be reproducible for peers (Merton 1973); the only solution to this issue therefore seems to be to act as transparently as possible, in both data and methods. The direct way to achieve this is to use open-source data and publish one’s code (Heiberger and Riebling 2016). It also implies forgoing data that stem from non-open sources or that cannot be provided to other researchers (if not the whole interested public) to replicate results or conduct further research.

Another, more technical reason for transparency is that most ML methods require many decisions, some of which may change the results considerably. However, this is no different from any other elaborate data collection or statistical analysis. Yet, given the complexity of many ML applications, it seems important to keep up social scientists’ statistical rigor, i.e., to include ML in the field’s methodological canon by exposing it to the same thorough critique that any other quantitative analysis would be subjected to. Therefore, I would strongly suggest that one of the best solutions is to use several options at crucial bifurcations (e.g., the choice of K) and, hence, check the robustness of the results.

However, running ML is costly; re-running ML to check robustness or tune parameters even more so. In terms of time (computer power) and money (having large enough numbers of human-annotated data), ML makes existing differences in resources between institutes or research groups more pronounced. It seems therefore crucial to come up with suitable infrastructures so that structural possibilities do not restrict researchers in a fundamental way. In contrast, SML might provide incentives to close a longstanding gap in social sciences, that is, between qualitative and quantitative research. The crucial annotation of data might build a bridge and establish an innovative division of labor between often separated qualitative coding and quantitative inferences (Kang and Evans 2020).

It is important to note that many ML methods are not new to social scientists. On the contrary, parts of the ML arsenal have been well known in the social sciences for decades; for instance, the popular “Ward” method of conducting hierarchical cluster analysis was published in the early 1960s (Ward 1963). Similar techniques for reducing data to their latent dimensions (i.e., cluster analysis) come in a new guise and are now often labeled UML. Building on that long-standing expertise, any of the various ML methods (two of which have been discussed here in some detail) should be readily accessible to researchers with a profound background in social science methods.

Although social scientists have been used to the idea of UML for a long time and apply it to describe higher-order patterns and explore datasets, the logic of SML may be considered more novel. One fruitful way of exploiting its possibilities is shown in this article, i.e., using SML to predict an independent variable and putting that prediction to further use (see Heiberger et al. 2021a for more complex examples). Another idea is spelled out in detail by Watts (2014), who argues that out-of-sample predictions may improve sociological explanations and could be used as a “hard” test of whether a model fits reality. Such out-of-sample tests would also help to amplify the reach of social scientists’ results (by being applicable to other data), reduce barriers to replicating one’s own results, and, hence, counter common “p-hacking” efforts (Molina and Garip 2019).

All that said, it is, in my opinion, important that social scientists use the possibilities offered by ML. One key facilitator of applying promising ML methods will be, of course, training students. Yet, it takes time for young researchers to enter the field. Another, more subtle concern relates to current epistemological boundaries. Sociologists are trained to be skeptical. Therefore, they often regard ML not as a set of potentially powerful methods one could use but as a much-criticized research object (e.g., Weber 2016). I am not arguing that the latter is not worthwhile. Yet, not applying ML does not seem to be an option. If social scientists are not involved, or act as mere bystanders, in analyzing social phenomena with cutting-edge methods (of which ML is a prime example), other, more technical disciplines will do it, and are already doing it on a large scale (see, for instance, the agenda formulated by physicists in Conte et al. 2012). This article may play a humble part in paving the way to spreading the use of ML among social scientists by introducing some useful ML methods with an application case many researchers in the field might relate to: the divide of research topics by gender.