1 Introduction

Despite the efforts undertaken by the economics profession as a whole to fight discrimination, women remain underrepresented in academia. Lundberg and Stearns (2019) assess the presence of female economists in the profession and report very slow improvement over the last two decades. The picture is as follows. At the beginning of this century, 35% of PhD students and 30% of assistant professors were female. Since then, these numbers have not increased.Footnote 1 Additionally, Siniscalchi and Veronesi (2020), summarizing Chevalier (2020) (Report of the Committee on the Status of Women in the Economics Profession), point out that the proportion of women among assistant professors in the “top 10” schools had declined to less than 20% by 2019. They also document that women have been less successful in being promoted to tenured associate or full professor.

In economics, the tenure path often requires publishing in the top five (Top 5, or just T5) journals, namely American Economic Review (AER), Econometrica (ECA), Journal of Political Economy (JPE), Quarterly Journal of Economics (QJE) and Review of Economic Studies (REStud). Heckman and Moktan (2020) analyze the tenure decisions of the top 35 Economics departments in the USA and conclude that T5 publications are a very powerful explanatory variable for promotion to tenure. Publishing in a T5 journal is becoming the main goal of young professors in economics because their professional career may depend on meeting this target. In addition, the content published in these journals also shapes the path of research in economics. As a consequence, the competition to publish in any of these journals has increased in recent years. Card and DellaVigna (2013) analyze the publication records of the Top 5 from 1970 to 2012 and show that the acceptance rate has fallen from 15% (1970) to 6% (2012). They explain this fact as a combination of an increasing number of submissions and a declining number of published papers. Card et al. (2019) further analyze the publication records of two of the T5 journals (the QJE and REStud), together with the Journal of the European Economic Association and the Review of Economics and Statistics. They report that the current proportion of accepted papers is 3%. Is the T5 entry barrier higher for women? The answer provided by Card et al. (2019) to this question is ambiguous. On the one hand, these authors do not find any gender bias in the refereeing process, and editors’ decisions are gender-neutral conditional on the referees’ advice. On the other hand, they find that, conditional on the referee process, female-authored papers end up accumulating more citations in later years.Footnote 2 A potential explanation for this second result is that journals hold female-authored papers to higher standards.
Hengel (2020) uses readability scores and finds that female-authored papers are better written, and that their writing improves during peer review and as their authors publish more papers. These results could be related to some “horizontal” features or characteristics of female-authored papers that lead to more citations or better writing standards, but not to higher acceptance rates in the editorial process. As Card et al. (2019) control for research fields (JEL codes), their results may be linked to more subtle horizontal differences. For instance, within the same research field, males may choose a more theoretical approach and females a more applied perspective (which tends to be more cited or to involve less complicated wording), leading to particular career outcomes.Footnote 3

Several papers have pointed out persistent gender differences in the choice of research fields in economics. Dolado et al. (2012) analyze the gender distribution of research fields in the top-50 economics departments in 2005 and show that women are unevenly distributed across fields. Similarly, Chari and Goldsmith-Pinkham (2017) use data from submissions to the National Bureau of Economic Research Summer Institute (2001–2016) and show that the distribution of female researchers is not uniform across fields. From these studies, we learn that women are particularly underrepresented in macro, finance and economic theory, and more prevalent in labor and applied microeconomics fields. Beneito et al. (2021) find similar results using data from the annual AEA meetings from 2010 to 2016, while Lundberg and Stearns (2019) focus on PhD dissertations in Economics from 1991 to 2017 in almost all major PhD-granting departments in the USA. Using the JEL codes for research areas, they find that women are more prone to study topics in Labor and Public Economics than in Macro and Finance. They also show that this pattern has not changed over time.

We want to contribute to this literature in two directions. First, we explore the horizontal distribution of gender across research topics in the leading economics journals. We do so by using a new methodological approach, based on machine learning techniques, that classifies our database of abstracts into latent topics. We collect all the articles published in T5 journals for the period 2002–2019. We obtain 5311 articles and, for each article, we keep track of the authors’ names, year of publication, journal and abstract. With this information, we can provide a very accurate picture of the publishing records of men and women in these leading journals. Our primary objective is to describe what these latent topics are and the gender distribution across them. Notice, though, that this is a very particular sample of researchers.

Second, from the universe of algorithms for topic modeling, we implement the structural topic model (STM) developed by Roberts et al. (2019). We choose this algorithm because it allows us to incorporate document-level meta-data into a probabilistic text model. Specifically, we keep track of journal names and publication years as covariates to improve the estimation of the prevalence of topics in our data. Our abstracts come from different sources and different periods of time, so it is natural to allow this meta-data to affect the frequency with which a topic appears. The output of the algorithm is a stochastic model that generates latent topics and allocates the documents to them in a probabilistic way. The main advantage of this unsupervised machine learning approach is that latent topics are mixtures over words, where each word has a probability of belonging to each of the different topics. Therefore, these topics can capture, conditional on covariates and without human intervention, research fields, information regarding the style of writing, methodology, conversational patterns or even different ways of thinking.

We start by identifying the number of latent topics for which the stochastic model best fits our data. The result is that female authors are unevenly distributed across latent topics. It turns out that the dispersion of female prevalence across these topics is higher than under other approaches. Moreover, we show that although the proportion of females among the population of T5 authors has been slightly increasing over the years, the identified horizontal differences persist. We compute the empirical distribution of latent topics by gender and show some striking differences between male and female expected proportions. We want to emphasize the importance of these results, not only because latent topics may capture subtle horizontal differences, but also because the gender differences we estimate are “automatically” generated given the documents, without any arbitrary allocation to particular categories (such as JEL codes or declared areas). Thus, they are possibly more robust.

Nevertheless, the choice of the number of latent topics, even if optimal as we discuss, is subject to clustering issues. To address these issues, we also reduce the number of topics the algorithm has to generate, in order to capture the mixtures of words that more closely resemble research areas. There is a trade-off when choosing ex ante the number of latent topics. On the one hand, a relatively high number of topics usually fits the data better. On the other hand, a lower number of latent topics facilitates their broad semantic interpretation. In our setting, a lower number of topics turns out to make them closer to traditional research fields. Consistent with our main findings, we corroborate the uneven distribution of topics/research fields by gender, but now much more in line with the existing literature cited above. Thus, we can also discuss the link between the existing literature and our class of probabilistic results. Our approach provides evidence complementary to the previous literature on horizontal research differences between males and females. The idea is that the larger set of research topics may allow us to identify the gender gaps more precisely and, more importantly, may help us understand the driving forces behind these gaps.

There are several channels through which the gender differences in the choice of research topic that we identify can have an impact on the probability of publishing in top journals, on earning tenure and, in general, on career success. Conde-Ruiz et al. (2017, 2021) and Siniscalchi and Veronesi (2020) provide two dynamic mechanisms that may explain how “horizontal” gender differences, together with an initially uneven distribution of researchers by gender, may generate an unintentional discrimination trap linked with the functioning of academic organizations (journals, departments, etc.). In particular, Conde-Ruiz et al. (2017, 2021) analyze a promotion setting in which workers’ skills are assessed by committees whose members have different abilities to evaluate workers’ signals (they are better at evaluating workers from their own group). This “homo-accuracy” assumption naturally translates to the present academic setting, where promotions and editorial processes are handled by “committees” and where evaluators doing research in the same field are better able to assess the underlying quality of the candidate. Under this “homo-accuracy bias,” the group that is most represented in the evaluation committee generates more accurate signals and, consequently, has a greater incentive to invest in human capital. This gives rise to a discrimination trap. If, for some exogenous reason, one group is initially poorly evaluated (less represented in evaluation committees), this translates into lower investment in human capital by the individuals of that group, which leads to lower representation in the evaluation committee in the future, generating a persistent discrimination process. Siniscalchi and Veronesi (2020) focus specifically on the academic labor market and point out a similar unintentional discrimination trap linked to the so-called self-image bias: evaluators are biased toward young researchers with characteristics similar to their own.
The authors build an overlapping-generations model with two groups of researchers with equally desirable (but slightly different) research characteristics and identical ex ante productivity distributions. If one group is slightly over-represented in the evaluation group, this group (and its specific research characteristics) may dominate forever. These theoretical results are in line with the empirical findings of Dolado et al. (2012), who show that the probability that a female researcher works in a given field is positively related to the share of women already working in that field (path-dependence). The proportions these authors find based on JEL codes are very similar to what we find automatically at the same level of aggregation, but we can uncover considerably more field-specific detail under an extended optimal topic choice. At the end of the paper, we discuss various issues for further research in related applications.

The paper is organized as follows. The next section presents the raw data and the descriptive analysis of the patterns of publication in T5 journals. Section 3 presents the structural topic model. Section 4 studies the gender differences in the latent estimated topics. Section 5 extends the model to analyze topics as research fields. The last section concludes, and in the Appendix we explore several extensions and provide details about the functioning of the structural topic model (STM) algorithm.

2 Raw Data and Descriptive Analysis

We collect the publicly available information from all articles published between 2002 and 2019 in the T5 leading journals in economics, as already indicated: The American Economic Review, Econometrica, The Journal of Political Economy, The Quarterly Journal of Economics, and The Review of Economic Studies. For each article, we collect the information about the journal, year of publication, authors and the abstract of the paper.

Fig. 1 Number of articles published per year in T5. Note: Publications exclude notes (without abstract), comments, announcements, and Papers and Proceedings (P&P)

We have 5311 articles in total over the period 2002–2019. The average number of papers published in top-5 journals per year is 295, with a maximum of 351 (in 2017) and a minimum of 234 (in 2002). Figure 1 shows that the distribution of published papers by journal is uneven. AER accounts for 34.3%, while JPE represents only 13.4% of the sample. AER publishes regular articles as well as shorter papers.Footnote 4 We include the shorter papers in our sample (as long as they have an abstract) since their editorial process is similar to that of regular articles. We exclude the articles published in AER as Papers and Proceedings since their requirements and editorial processes are different.Footnote 5 We want to compare this descriptive information with Card and DellaVigna (2013), who analyze all the articles published in the T5 from 1970 to 2012. They obtain several interesting facts, among them that the total number of articles published in these journals declined from 400 per year in the late 1970s to 300 per year in 2012. They also show that one journal, the American Economic Review, accounted for 40% of T5 publications in 2012, up from 25% in the 1970s. In our updated sample, as shown in the figure, we find that this trend has stabilized after 2012.

Card and DellaVigna (2013) also find that the number of authors per paper increased from 1.3 in 1970 to 2.3 in 2012. We observe the same trend in recent years; in particular, in 2019 the average number of authors was above 2.5. Figure 2 reports the share of articles by number of authors, from one to five or more. Clearly, the steepest downward trend is for solo authorship, whereas the three-author case (and even the four-author case) exhibits the opposite pattern. The share of two-author articles has remained fairly stable over the entire sample at around 40% of articles (base sample, not augmented). Articles with five or more authors are still a rare event in leading economics journals.

Fig. 2 Number of authors of published papers in T5

Next, we move on to analyze gender issues. We do not observe gender directly in our data. To solve this problem, we classify authors by gender according to their first name. We rely on three different databases: the first-names database published by the US Social Security Administration, created using data from Social Security card applications; the database constructed by Tang et al. (2011), who use Facebook to collect data on first names and self-reported gender; and, finally, the names database developed by Bagues and Campa (2017). We manually check any candidate who (a) falls within the [0.05, 0.95] probability interval of being male/female or (b) cannot be found in any of the databases.
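The classification rule just described can be illustrated with a minimal sketch. The lookup table, its probabilities, and the function name below are hypothetical placeholders, not the actual databases or code used for the paper; the sketch only assumes that each first name can be mapped to an estimated probability of being female, and that names inside the [0.05, 0.95] interval, or missing from the lookup, are flagged for a manual check.

```python
from typing import Optional

# Hypothetical toy lookup: first name -> estimated probability that the name is female.
# In practice these values would be derived from name databases such as those cited above.
NAME_PROB_FEMALE = {
    "maria": 0.99,
    "john": 0.01,
    "robin": 0.45,
}


def classify_first_name(first_name: str, low: float = 0.05, high: float = 0.95) -> Optional[str]:
    """Return 'female' or 'male', or None when the name needs a manual check."""
    p_female = NAME_PROB_FEMALE.get(first_name.strip().lower())
    if p_female is None:
        return None  # case (b): the name is not found in the lookup
    if low <= p_female <= high:
        return None  # case (a): the probability falls within the ambiguous interval
    return "female" if p_female > high else "male"


for name in ["Maria", "John", "Robin", "Zyx"]:
    print(name, classify_first_name(name))
```

Returning `None` rather than a best guess keeps the ambiguous cases visible, so that they can be resolved by hand as the text describes.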

We convert the original sample of articles into an articles-authors sample. This transforms the original 5311 articles into a total sample of 11,721 observations (9840 article-male author observations and 1881 article-female author observations). Unless otherwise indicated, all measures below are computed over this augmented articles-authors sample.
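As an illustration of this conversion, the sketch below expands a list of articles into one row per article-author pair and computes the female share of observations. The record layout, field names, and toy data are assumptions made for the example only.

```python
# Hypothetical toy records: each article carries its list of (author, gender) pairs.
articles = [
    {"id": 1, "journal": "AER", "year": 2010,
     "authors": [("Ann", "female"), ("Bob", "male")]},
    {"id": 2, "journal": "ECA", "year": 2015,
     "authors": [("Carl", "male")]},
]


def to_article_author_rows(articles):
    """Expand each article into one row per author (the augmented sample)."""
    rows = []
    for art in articles:
        for name, gender in art["authors"]:
            rows.append({
                "id": art["id"],
                "journal": art["journal"],
                "year": art["year"],
                "author": name,
                "gender": gender,
            })
    return rows


rows = to_article_author_rows(articles)
n_female = sum(r["gender"] == "female" for r in rows)
female_share = n_female / len(rows)
print(len(rows), n_female, female_share)
```

Applied to the counts reported above, the same ratio would be 1881 / 11,721, that is, roughly 16% of article-author observations.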

Fig. 3 Number of article-author observations by gender and the share of female articles

Figure 3 depicts the share of female authors (right axis), which has been increasing (with fluctuations) at a rate of 6.2% per year (compared with an average rate of 3.7% for men), reaching a 20% share in a couple of recent years. Although the number of female authors is growing at a higher rate, and despite the important improvement in recent decades, women are clearly under-represented in T5 publications. These data are consistent with the data from the report of the Committee on the Status of Women in the Economics Profession, Chevalier (2020). Figure 4 compares the evolution of the share of women in the different professor categories of the top 20 Schools of Economics in the USA in 2020 with the proportion of female authors in the top 5. Notice that the share of female authors is very similar to the 20.4% share of women in the faculty of the top 20 Schools in the USA in 2020. In line with Heckman and Moktan (2020), the rate of increase of female coauthors in T5 seems to parallel the rate of increase of female full professors in these departments. The proportion of women among full professors in Spain and the EU average are very similar as well.

Fig. 4 The pipeline for top 20 economics departments: percent and numbers of faculty and students who are women. Source: CSWEP Report, 2020 and own elaboration

Fig. 5 Co-authorship patterns in T5 journals

Fig. 6 Distribution of number of T5 papers published by gender

We have split the description of the data into two figures: one for single-gender groups and another for mixed teams. Figure 5a shows the co-authorship pattern when the set of co-authors is a single-gender group. The most salient feature of these data is that while the share of sole male authors has been declining from 30% of the total to slightly above 10%, the share of sole female authors has been stable over the entire sample, at close to 5%. We also want to point out that, despite a slow decline, two males remain the most common co-author team.

The share of articles with equal numbers of male and female authors has been fairly stable at about 12% (92.7% of these articles have, in particular, one male and one female author). By contrast, the share of articles with at least one woman and at least two men has been increasing from nearly 5% of the total to around 14%. Thus, the strongest trend in the data seems to be associated with the participation of female authors in articles with relatively more male authors.

Figure 6 shows the distribution of the number of published papers by gender. Conditional on having published in T5 journals, females are more likely than males to publish only one or two papers, while the proportion of authors who have published more than three papers is greater for males than for females. Clearly, though, more than 80% of either female (15% of the distribution) or male authors have published fewer than two T5 papers over the last 20 years. This is an important fact for understanding the role of superstars in the profession as well as the mechanisms underlying the formation of networks of coauthors.

3 The Empirical Model: Structural Topic Model (STM)

Our empirical strategy is to use unsupervised machine learning techniques to uncover the hidden structure of our text documents.Footnote 6 By unsupervised we denote the absence of human intervention in order to identify the latent topics behind the abstracts of articles published in the T5 journals during the period 2002–2019. For us, an abstract is a set of words and these words have different probabilities to belong to one or several latent topics. Informally, when we are writing on a particular topic there are words that are used more often than others. Our objective is to provide a low-dimensional representation (topics) of a high-dimensional object (abstracts) while retaining as much as possible its informational content.

The baseline for topic modeling is the LDA algorithm (latent Dirichlet allocation) developed by Blei et al. (2003), which is also the most popular machine learning algorithm for reducing the dimensionality of text documents.Footnote 7 In this paper, we use an algorithm called STM (structural topic model) developed by Roberts et al. (2019), which can be understood as a refinement of the LDA algorithm. This topic model is said to be structural because it allows the use of “covariates” to inform about the structure (partial pooling of parameters). In our case, these covariates are the different journal names and the different years in the sample. The idea is to better capture, along these dimensions, the changing relationship between words in abstracts and the latent topics. Next, we explain the algorithm and the outcome variables, and in “Appendix A” we provide a more technical discussion of STM and LDA.

We start by describing the inputs. We extract all the words from our 5311 abstracts (or documents). First, we have to “clean” this set of words in order to reduce the vocabulary and select terms with more informational content. This helps us obtain a better estimation of more semantically meaningful topics. The corpus is the set of unique words that we obtain after converting the original raw text to lower case and removing common stopwords,Footnote 8 such as “for” or “in.” Also, we prune the words until we get their original linguistic root (“educ” instead of “education”) and eliminate the words that appear only one or two times.Footnote 9 In our case, we start with a set of 13,835 different terms and end up with a corpus of 4241 unique words.
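The cleaning steps described above can be sketched as follows. This is only an illustration under stated assumptions: the stopword list is a tiny hypothetical sample, and `crude_stem` is a placeholder suffix stripper rather than the stemmer actually used for the paper; `min_count=3` mirrors the rule of dropping words that appear only once or twice.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"for", "in", "the", "of", "and", "a", "on", "we", "to"}


def crude_stem(word):
    """Placeholder stemmer that strips a few common suffixes (not the paper's stemmer)."""
    for suffix in ("ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(abstracts, min_count=3):
    """Lowercase, drop stopwords, stem, and prune terms seen fewer than min_count times."""
    tokenized = []
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())
        tokenized.append([crude_stem(t) for t in tokens if t not in STOPWORDS])
    counts = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    keep = set(vocab)
    docs = [[t for t in doc if t in keep] for doc in tokenized]
    return docs, vocab


abstracts = [
    "Education and wages in the labor market",
    "Education policy for wages",
    "Wages and education of women",
]
docs, vocab = preprocess(abstracts)
print(vocab)
print(docs)
```

In this toy example only the two stems that occur three times survive the pruning step, which shows how the vocabulary shrinks relative to the raw set of terms.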

The second step is to represent our text data in a document-term matrix of D rows (5311 abstracts) and V columns (4182 unique words in our corpus), where the element \((d,v)\) of the matrix is the number of times the \(v\)th unique word appears in the \(d\)th abstract. This document-term matrix, which reduces the dimensionality of our original text variables, is the input of the algorithm. Our objective is to find a probabilistic topic model that is able to explain the document-term matrix in two additional steps: first, by identifying K topics in our corpus, and then by representing documents as a combination of those topics. What is a topic? Topic k is a probability distribution \(\beta _k\) over all the unique words of our corpus, where \(\beta _k^v\) is the probability that topic k generates word v. Each document d has its own distribution \(\theta _d\) over the set of topics. This captures the fact that each document/abstract can refer to several topics. Then, \(\theta _d^k\) is the weight of topic k in document d. The probabilistic topic model is described by these topic (\(\beta _k\)) and document (\(\theta _d\)) distributions. Given these, the probability that an arbitrary word in document d coincides with the \(v\)th term is \(p_{d,v}=\sum _k\beta _k^v\theta _d^k\). Using these probabilities, we can obtain the total likelihood of our data, \({\prod _{d}}{\prod _{v}}p_{d,v}^{n_{d,v}}\), where \(n_{d,v}\) corresponds to the elements of the document-term matrix (the number of times the \(v\)th unique word appears in the \(d\)th abstract).Footnote 10
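To make these objects concrete, the following sketch builds a small document-term matrix and evaluates the logarithm of the total likelihood for given topic and document distributions. The toy vocabulary, the values of \(\beta\) and \(\theta\), and the function names are assumptions for illustration; the sketch evaluates the likelihood only and does not estimate the model.

```python
import math
from collections import Counter


def document_term_matrix(docs, vocab):
    """Return the D x V matrix of counts n[d][v] for tokenized documents."""
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for d, tokens in enumerate(docs):
        for word, count in Counter(tokens).items():
            if word in index:
                matrix[d][index[word]] = count
    return matrix


def log_likelihood(n, beta, theta):
    """Sum over d and v of n[d][v] * log(sum_k beta[k][v] * theta[d][k])."""
    total = 0.0
    for d, row in enumerate(n):
        for v, count in enumerate(row):
            if count == 0:
                continue  # words that do not appear contribute nothing
            p_dv = sum(beta[k][v] * theta[d][k] for k in range(len(beta)))
            total += count * math.log(p_dv)
    return total


vocab = ["educ", "wag"]
docs = [["educ", "educ", "wag"], ["wag"]]
n = document_term_matrix(docs, vocab)

beta = [[0.9, 0.1], [0.2, 0.8]]   # K x V: each row is a topic's distribution over words
theta = [[1.0, 0.0], [0.0, 1.0]]  # D x K: each row is a document's distribution over topics
ll = log_likelihood(n, beta, theta)
print(n, ll)
```

Working with the logarithm turns the double product into a double sum, which is numerically more stable than multiplying many small probabilities.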

This total likelihood is our “objective” function. In a nutshell, the LDA and STM algorithms are designed to find numerically the stochastic model of latent topics (the distributions \(\beta _k\) and \(\theta _d\)) that best fits our document-term matrix, that is, the model that maximizes this total likelihood. We skip further details on the algorithms here and refer the interested reader to “Appendix A” (and also to Roberts et al. 2014). However, we want to make two important observations.
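As a compact restatement of the objective described above, and using only the symbols already defined, taking logarithms of the total likelihood gives the following stylized problem. It is a simplification: the actual STM estimation also involves priors and covariates, as discussed in “Appendix A.”

\[
(\hat{\beta },\hat{\theta })=\arg \max _{\beta ,\theta }\ \sum _{d}\sum _{v}n_{d,v}\,\log \Big (\sum _{k}\beta _k^v\theta _d^k\Big ),\qquad \text {subject to } \sum _{v}\beta _k^v=1 \text { for all } k \text { and } \sum _{k}\theta _d^k=1 \text { for all } d.
\]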

First, as indicated above, we implement STM instead of LDA. The main advantage of STM for our data is that we can use very relevant covariate information about our documents in order to improve parameter estimation.Footnote 11 In particular, for each document/abstract we interact the year of publication with the journal name. We take advantage of the variability of the abstracts over time and across journals to improve the estimation of our stochastic model (in particular, of the distribution \(\theta _d\)).

The second important observation refers to the determination of the number of topics. We can follow two strategies. One is to find the number of topics that best fits the data, which usually leads to a large (optimal) K. The alternative is to force the algorithm to use a given number of topics in order to facilitate their interpretation. For our baseline analysis, we use the first approach and work with 54 topics, but we also estimate our stochastic model using a fixed number of topics to facilitate comparison with the results in the existing literature.

Previous literature, using JEL codes (for example, Card et al. 2019) or research areas in top departments (for example, Dolado et al. 2012), has concentrated on a broad definition of topics as fields of research, say, Labor or Econometrics. However, the unsupervised learning methodology we use allows us to go beyond pre-labeled research areas so as to capture more subtle differences, such as writing style, particular methodologies, or variation in research questions. For example, when identifying latent topics, our methodology allows us to separate two papers in labor economics, one more applied and the other with a theoretical contribution. We consider our approach a promising tool for analyzing whether there are horizontal gender differences in economics research, that is, whether or not males and females write different articles even within the same research field. For this reason, in the next section we analyze our stochastic model with \(K=54\) topics, while in Sect. 5 we focus on estimating our stochastic model with \(K=15\) topics. In addition to these two exercises, in the Appendix we extend our original sample to include the abstracts of 1117 articles published as Papers and Proceedings in AER between 2011 and 2018 (before 2011 these types of papers do not have abstracts, and after 2018 they are published in a different journal). We show that for this extended sample the optimal number of topics increases to \(K=70\). While we have preferred to exclude these papers from the main baseline analysis because they are very short papers with editorial processes very different from those of regular submissions, this extended sample generates interesting new insights.

4 Gender Differences in Latent Estimated Topics

As we said above, the number of topics that best fits the text data is 54.Footnote 12 We estimate the probability that each document belongs to each of these latent topics using the structural topic model. The STM output is summarized by the latent topics displayed in Fig. 7, which shows the key words associated with each of the 54 topics. The words within each row are ordered from left to right by the probability with which they appear in each latent topic. We could, in principle, assign labels to the latent topics based on well-known field names in economics. For instance, we can associate the most prevalent topic in the sample in expectation, topic 28, with international trade. Likewise, the second most prevalent topic in the distribution, topic 9, may be associated with Econometric Theory. However, this is not the goal of the analysis, as we have indicated above. The important point is that latent topics may be related to something beyond research fields, such as methodology or style of writing. These latent characteristics hide gender differences too.

4.1 Topic Prevalence

Once we have identified the estimated latent topics, we can analyze how our documents/abstracts are distributed among them. In allocating an abstract to a particular topic, we consider our underlying \(\theta _d\) distribution. That is, we assign document d to different topics with different probability weights. Following this approach, Fig. 8 shows the latent estimated topics in a way that also illustrates the number of documents in each topic. Notice that in Fig. 8 the size of each circle is proportional to the expected number of documents in the topic (we have also reproduced this information numerically in a column in Fig. 7). As we cannot map our 54 topics to particular fields of research, it is difficult to interpret the information in Fig. 8 regarding the size of the topics. For example, topics 11, 9 and 21 in Fig. 8 are related to “Econometric Theory” and are relatively large compared with other topics. However, if the algorithm had introduced more topics within “Econometric Theory,” each topic would have had a smaller mass, the weight of the research field being the same. In other words, our perception of the successful topics is affected by how the research field is split into topics.
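The expected number of documents in a topic mentioned above can be read as the sum of the document weights \(\theta _d^k\) over all documents. The sketch below illustrates this with a toy \(\theta\) matrix; the numbers and the function name are hypothetical and only meant to show the computation.

```python
def expected_docs_per_topic(theta):
    """Sum the topic weights over documents: element k is the sum over d of theta[d][k]."""
    n_topics = len(theta[0])
    return [sum(row[k] for row in theta) for k in range(n_topics)]


# Toy D x K matrix: each row is one document's distribution over K = 2 topics.
theta = [
    [0.7, 0.3],
    [0.2, 0.8],
    [0.5, 0.5],
]
expected = expected_docs_per_topic(theta)
prevalence = [e / len(theta) for e in expected]  # share of the corpus per topic
print(expected, prevalence)
```

Because each row of \(\theta\) sums to one, the expected counts add up to the number of documents, so splitting a field into more topics necessarily spreads the same mass over smaller circles.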

Fig. 7 Optimal K topics ranked by prevalence in the corpus

Fig. 8 Connectedness between topics and the fraction of documents/abstracts in each topic (\(\theta _d\) distribution)

Figure 8 also contains information on the connectedness between topics. For example, if latent topic k is closer to \(k'\) than to \(k''\), it means that the distribution \(\beta _k\) is more similar to the distribution \(\beta _{k'}\) than to the distribution \(\beta _{k''}\). Looking at Fig. 7 and at the description of the latent topics in Fig. 8, some interesting patterns arise. For example, the previously discussed topics 11, 9 and 21 (“Econometric Theory”) are somewhat isolated from the rest of the topics. In Fig. 8, we can also identify some other clusters of topics: for example, (east in Fig. 8) 51, 34, 23, 2, etc., are topics related to Macro-Finance, closer to those in Econometric Theory, but not by much; (west in Fig. 8) 50 is a central node of a set of topics related to Political Economy and Institutions; (southwest in Fig. 8) 29, 32, 22, etc., are topics related to microeconomics (contract theory, decision theory, etc.). Finally, applied areas such as labor, international-development, or public economics are located around topics 19, 49, 28, and 48 (north in Fig. 8). In “Appendix D”, we undertake a more formal analysis of the distance between topics using a simple correspondence analysis of the matrix of probabilities that documents belong to the different latent topics. We find the corpus organized along two dimensions: Dimension 1 can be interpreted as going from Applied to Theory, whereas Dimension 2 goes from, say, Economics to Econometrics.

Fig. 9 Connectedness between topics and the female authors documents/abstracts in each topic

Fig. 10 Topic Word Clouds: Topic 49 vs Topic 16

Fig. 11 On the presence of women, by topic: mean and one standard deviation across time

Using our classification of authors’ names by gender and the allocation of documents to latent topics, we can build up a similar figure with information about the gender distribution. Figure 9 shows latent topics where the sizes of circles are proportional to the percentage of female authors working in such topics (we have also reproduced numerically this information in the last column in Fig. 7).
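One possible way to combine the gender classification with the probabilistic allocation of documents to topics is sketched below. It is an assumption made for illustration, not necessarily the exact computation behind Fig. 9: each article-author observation is weighted by the topic weight \(\theta _d^k\) of its article, and the female share of a topic is the female-weighted mass divided by the total mass. The data and names are hypothetical.

```python
def female_share_by_topic(rows, theta):
    """rows: (document index, gender) per article-author observation; theta: D x K weights."""
    n_topics = len(theta[0])
    female = [0.0] * n_topics
    total = [0.0] * n_topics
    for doc, gender in rows:
        for k in range(n_topics):
            weight = theta[doc][k]  # weight of topic k in this observation's article
            total[k] += weight
            if gender == "female":
                female[k] += weight
    return [f / t if t > 0 else float("nan") for f, t in zip(female, total)]


# Toy example: document 0 has one female and one male author, document 1 one male author.
theta = [[0.8, 0.2], [0.1, 0.9]]
rows = [(0, "female"), (0, "male"), (1, "male")]
shares = female_share_by_topic(rows, theta)
print(shares)
```

In this toy example the first topic, which loads mostly on the mixed-gender article, ends up with a much higher female share than the second, which is the kind of uneven pattern the circle sizes are meant to display.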

Figure 9 provides interesting evidence for the main message of this paper: males and females display different patterns when doing research. Regardless of the degree of under-representation of women in the profession, if there were no significant horizontal gender differences, we would expect the sizes of the latent topics, measured by the proportion of females, to be similar. On the contrary, we observe an uneven distribution of sizes.

There is a small subset of topics (north in Fig. 9), especially topic 49, with a relatively high proportion of females, which moreover seem to be closely connected (sharing terminology typical of applied economics fields). By contrast, there is another set of topics (for example, southwest in Fig. 9) that are also closely connected and where the presence of females is scarce (around terms common to economic theory research questions).

4.2 Topic Analysis and the Gender Distribution

As we said above, it is difficult to describe the precise semantic meaning of the latent topics when we are working with \(K=54\). We are able, however, to look more closely at the latent topics where females are more or less prevalent and at the potential implications. In particular, the latent topic with the highest proportion of female authors is topic 49 (32.8% as indicated in Fig. 7). On the contrary, topic 16 turns out to be the topic with the lowest proportion of females (10.1% as indicated in Fig. 7). As a simple illustration, Fig. 10 represents these topics as word clouds, where the size of each term in the cloud reflects its probability in the latent topic distribution \(\beta _k\).

Fig. 12. Empirical distributions across topics between males and females (conditional on having published an article in Top 5)

Fig. 13. Relative propensity of publishing papers by females over topics

Fig. 14. Empirical distributions across topics between males, females and mixed authorship (conditional on having published an article in Top 5)

Fig. 15. Diversification across latent topics by gender (HHI)

The most prominent words in the cloud for topic 49 are women, men, parent, children, health, etc. These words can easily be linked to research fields, such as gender or health economics, traditionally associated with women. Similarly, the word cloud of topic 16 seems to be related to Micro theory, which has often been labeled (though not on statistical grounds) as an area with fewer female authors than average.

Latent topics may differ in other dimensions besides semantic content. For instance, Hengel (2020) uses readability scores to measure the quality of writing of article abstracts.Footnote 13 We have implemented E. Hengel’s Python module Textatistic to compute readability scores for the article abstracts across our latent topics. The finding is that abstracts in topics with more female authors obtain better scores than those in topics with more male authors. However, it is hard to disentangle the role of the prevalence of female authors from that of the wording within a topic. Moreover, outlying scores should be properly treated to ease comparisons. We leave the study of these readability issues, and of what they imply for fundamental gender differences, for further research.

Instead, Fig. 11 shows the mean presence of women authors by topic, together with the standard deviation of this presence over the sample of years. For some latent topics, the proportion of females is larger than the average (which is 15.9% over the period 2002–2019), reaching 33% for topic 49. On the contrary, females are especially underrepresented in other topics, such as topic 16, with only 10%. Dispersion over time also differs across topics, and it seems to be higher for topics with a higher proportion of females (the correlation between dispersion and the proportion of females is 0.35). While it is true that the proportion of female authors has increased in the last two decades, from around 13% in 2002 to 19% in 2019, we do not see a trend in the dispersion of the proportion of females by topic. Consequently, we see the prevalence of females across topics as a signal of gender “horizontal” differences in research.
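The summary statistics behind this kind of figure can be sketched as follows. This is only an illustrative sketch: the yearly shares are made-up toy numbers, the topic labels are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical yearly female shares per topic (toy data, not from the paper).
female_share_by_year = {
    "topic_a": [0.30, 0.36, 0.33],
    "topic_b": [0.10, 0.11, 0.09],
    "topic_c": [0.15, 0.19, 0.17],
}

# Mean presence of women and its dispersion over the years, topic by topic.
topic_mean = {k: mean(v) for k, v in female_share_by_year.items()}
topic_sd = {k: pstdev(v) for k, v in female_share_by_year.items()}

# Correlation across topics between dispersion and mean female share.
corr = pearson(list(topic_mean.values()), list(topic_sd.values()))
```

In this toy example the topic with the largest mean female share also shows the largest dispersion, so the correlation is positive.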

Nevertheless, to obtain a more accurate picture of these “horizontal” differences, we need to add information regarding the relative prevalence of the topics. It could be that females are underrepresented in a particular topic, but that this has little impact because the topic contains very few published papers.

Figure 12 shows the distributions of males and females across topics, normalized to have the same size. This gives us the propensity that, say, a female-authored paper belongs to any of the 54 topics. We rank the topics according to the probability of being chosen by a male author. This figure provides evidence that male and female authors either have different preferences or follow different strategies when pursuing and publishing their research. We observe that topics with higher “demand” by males are also highly demanded by females. However, there is a set of topics for which the proportion of papers published by men is high and which are less attractive (or more difficult to publish in) for females. In general, the male and female distributions are different, the salient feature being topic 49, which is a clear spike in the female distribution of published papers.

We confirm this evidence with the complementary Fig. 13, which represents the dispersion of published female-authored papers across topics while also accounting for the prevalence of latent topics. In particular, for each topic we take the proportion of published papers by female authors (taken from Fig. 12) minus the proportion of published papers in this topic overall. If, conditional on having published a paper, male and female authors were equally likely to publish a paper in a specific topic, this difference would be zero. We can therefore interpret this difference as the excess propensity of females to publish a paper in a particular topic. These differences can be positive or negative, and their sum over all topics is zero. The figure shows that there are topics for which the propensity of females to publish papers is higher than that of males, and others for which the opposite holds. Again topic 49, but also topics 41 (health) and 30 (applied IO), are on one side, while theory topics such as 16 or 37 are on the other side.
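The excess propensity described above can be illustrated with a minimal sketch. The topic labels and paper counts below are hypothetical toy values, and the function names are ours; the only ingredient taken from the text is the definition (female share of a topic minus the overall share of that topic).

```python
def topic_shares(weights_by_topic):
    """Normalize topic weights (e.g. paper counts) so that they sum to one."""
    total = sum(weights_by_topic.values())
    return {topic: w / total for topic, w in weights_by_topic.items()}


def excess_propensity(female_weights, all_weights):
    """Female share of each topic minus the overall share of that topic."""
    female = topic_shares(female_weights)
    overall = topic_shares(all_weights)
    return {topic: female.get(topic, 0.0) - overall[topic] for topic in overall}


# Hypothetical counts of published papers per topic (toy data).
female_papers = {"topic_a": 30, "topic_b": 5, "topic_c": 15}
all_papers = {"topic_a": 100, "topic_b": 150, "topic_c": 50}

excess = excess_propensity(female_papers, all_papers)
```

Because both distributions sum to one, the differences necessarily sum to zero across topics; a positive value marks a topic where female-authored papers are over-represented relative to the corpus.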

In order to analyze the pattern of coauthorship, we have pooled the articles into three groups: papers written by male authors, by female authors, and by gender-mixed teams of authors. The main results are summarized in Fig. 14, which shows that there is an important difference in the distribution over latent topics between all-male teams and all-female teams, while mixed teams generate an intermediate distribution over the latent topics.

Finally, we want to address a related but different question: how male and female authors diversify across topics. For example, when writing an article, an author may contribute to a single latent topic or to several, and authors who have published several papers may have written similar articles or more diverse ones. Are these diversification patterns different for males and females? To address this question, the first step is to choose a measure of latent topic dispersion/concentration. A natural candidate is the Herfindahl–Hirschman Index (HHI), which is used to measure concentration in a market.

The HHI is calculated by squaring the market share of each firm competing in a single market and then summing the resulting numbers: \(\mathrm{HHI}=\sum _{i=1}^{N}s_{i}^{2}\). We apply this index to our problem as follows. For each author (the market), we identify all the latent topics that she has contributed to (the firms). For each article, the algorithm computes a probability distribution over the latent topics. We repeat the process for all articles by the same author. Then, for each latent topic, the cumulative probability across her articles divided by the number of articles is the contribution of the author to that latent topic (the market share, \(s_i\)). For example, if an author publishes very similar papers related to a single or a few latent topics, her HHI will be high. On the contrary, authors with a more diverse research agenda will have a lower HHI. Figure 15 shows the corresponding average HHI for males and females.
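As a minimal sketch of this computation, the following function takes the per-article topic distributions of one author and returns her HHI. The function name and the toy distributions (with only three topics) are illustrative assumptions, not the data or code used in the paper.

```python
def author_hhi(article_topic_probs):
    """HHI of one author from her articles' topic distributions.

    article_topic_probs: one list per article, each holding that article's
    probabilities over the K latent topics (each list sums to one).
    """
    n_articles = len(article_topic_probs)
    n_topics = len(article_topic_probs[0])
    # Share s_i of topic i: cumulative probability divided by number of articles.
    shares = [
        sum(article[i] for article in article_topic_probs) / n_articles
        for i in range(n_topics)
    ]
    # HHI is the sum of squared shares.
    return sum(s * s for s in shares)


# Hypothetical authors with two articles each and K = 3 topics (toy data).
focused_author = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
diverse_author = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]

hhi_focused = author_hhi(focused_author)  # shares 0.85, 0.15, 0.00
hhi_diverse = author_hhi(diverse_author)  # shares 0.45, 0.10, 0.45
```

The author whose two articles load on the same topic has the higher index, in line with the interpretation that a more concentrated research agenda yields a higher HHI.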

We have computed the HHI controlling for the number of papers by author. An author who has published more papers is likely to have contributed to a larger set of latent topics and therefore tends to have a lower HHI. Interestingly, the figure shows some differences between genders in terms of diversification. Females are more diverse (lower HHI) when publishing one or two papers, but less diverse (higher HHI) when publishing a larger number of papers in the Top 5.Footnote 14

5 Topics as Research Fields

In this section, we estimate the stochastic model with a lower number of topics, with two objectives. On the one hand, a low K facilitates the semantic interpretation of topics and thus allows us to analyze, for instance, whether the weight of a particular field in the T5 has increased over time. On the other hand, a low number of topics allows us to frame our results within the previous literature, which has used a small number of categories linked to JEL codes and research areas in top departments. After estimating the model for a range of \(K \in \{10, \ldots , 20\}\), we have found that \(K=15\) is a number of topics for which the estimated model performs better in terms of both fit to the data and the semantic content of the latent topics. The model with \(K=15\) latent topics is summarized in Fig. 16.

Fig. 16. Latent topics ranked by prevalence in the corpus with \(K=15\)

Fig. 17. A topic with “labor”: topic 8 in the set with \(K=15\)

Fig. 18. Word clouds for topics with the stem “labor” among the fifteen more frequent words in the set with \(K=54\)

Fig. 19. Connectedness for \(K = 15\)

The reader may then wonder what additional information is contained in the unrestricted version of the structural topic model (STM). One way to illustrate the importance of an adequate selection of the number of topics is to explore in detail the composition effects we already discussed above. We proceed as follows. First, we consider the stem “labor,” and we look for it among the fifteen most frequent words within the restricted version of the STM, that is, the version with just 15 latent topics (\(K = 15\)). We find that word with the required frequency only within topic 8 in Fig. 16. Figure 17 depicts the word cloud for topic 8 in the restricted version of the model with \(K = 15\). Clearly, in this particular case, one may say this cloud describes well the research field corresponding to JEL code J, which is Labor and Demographic Economics.

The key idea with the structural topic model is that a field like “Labor” can comprise many research lines in the unrestricted version of the model, in our case the one with 54 latent topics. When we look for the stem “labor” within the 54 latent topics, we find it among the fifteen most frequent words in as many as six topics. Figure 18 illustrates the most prevalent among these topics, which are Labor Search, Labor Supply, Human Capital, and Productivity Analysis. Notice, in particular, that there are important differences in the prevalence of females across these subtopics, from 18 per cent in the more policy-oriented topic, “labor supply”, to 14 per cent in the more theoretical “labor search” (go back to Fig. 7 for these shares). Important variability can be washed out when the methodology accounts for the research field rather than for the research topic.
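The search for a stem among the most frequent words of each latent topic can be sketched as follows. The word probabilities and topic labels are hypothetical toy values standing in for the estimated distributions \(\beta _k\), and the function name is ours; the toy example uses the two most frequent words rather than fifteen only to keep the data small.

```python
def topics_with_stem(topic_word_probs, stem, top_n=15):
    """Return the topics whose top_n most probable words include the stem."""
    hits = []
    for topic, probs in topic_word_probs.items():
        # Rank the words of this topic by probability, highest first.
        top_words = sorted(probs, key=probs.get, reverse=True)[:top_n]
        if stem in top_words:
            hits.append(topic)
    return hits


# Hypothetical word distributions for three topics (toy data).
topic_word_probs = {
    "topic_1": {"labor": 0.20, "wage": 0.15, "search": 0.05},
    "topic_2": {"auction": 0.30, "bid": 0.20, "labor": 0.01},
    "topic_3": {"school": 0.25, "labor": 0.10, "human": 0.05},
}

labor_topics = topics_with_stem(topic_word_probs, "labor", top_n=2)
```

Here the stem appears in every topic's vocabulary but is counted only where it ranks among the top words, which is why the number of matching topics depends on how finely the corpus is split.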

As we have anticipated, though, reducing the number of topics to \(K=15\) makes it easier to label the latent topics as meaningful research fields. Following our previous analysis, Fig. 19a plots the latent topics, showing the relative semantic distance between topics as well as their weight in terms of the fraction of documents/abstracts that they contain.

If we compare Fig. 7 (with \(K=54\)) and Fig. 19a (with \(K=15\)), they have a similar “geography” in terms of general areas of knowledge. Therefore, similar patterns in terms of the distances between topics arise. For example, “Econometric Theory” seems to be isolated, whereas applied fields such as Labor and Public Economics are closely connected.

Figure 19b (as Fig. 8 with \(K=54\)) provides evidence of the “horizontal” differences between males and females in doing research. The results are in line with the previous literature, such as Dolado et al. (2012), Chari and Goldsmith-Pinkham (2017), Beneito et al. (2021) and Lundberg and Stearns (2019), which points out that females are unevenly distributed across fields. We concur with previous literature that females are over-represented in Applied-Micro fields, especially Health-Gender, Experimental and Education, and underrepresented in Econometric and Economic Theory fields, Macro-Monetary and Finance.

For example, Dolado et al. (2012) use the classification of women by research areas (JEL 20 fields) in the top 50 economics departments in 2005. The proportions they find are very similar to ours: (i) I-Health, Education and Welfare, 25%; (ii) D-Microeconomics, 14%; (iii) J-Labour and Demographic Economics, 15%; or (iv) C2-Econometrics, 14.3%. In our analysis, we found that the percentage of female authors is, for example: (i) Health and Gender, 23%; (ii) Decision Theory (13.6%), Game Theory (11.4%); (iii) Macroeconomics and Monetary, 14.2%; or (iv) Econometrics, 14.4%. Having said that, the distribution of the proportion of females across these restricted topics seems to be slightly less dispersed than those identified in the previous literature with other sources of data. This may be because our methodology is more “continuous” than allocating females to fixed categories, since the probabilistic model allocates females’ articles to latent topics with statistical weights.

Fig. 20. Growth rates of prevalence and female proportion by topics

Figure 20 jointly analyzes the evolution of the prevalence of the topics and the proportion of female authors. To build this figure, we have computed the growth rates of the topics’ prevalences and of the topics’ female proportions from the averages in the latest seven years (2013–2019) and the first seven years (2002–2008) of the sample. First, we can observe that the proportion of females has increased in all topics except Finance (\(-6.6\)%). Regarding prevalence, only four topics have lost weight: Mechanism Design (\(-10.3\)%), Econometrics (\(-29\)%), Game Theory (\(-22.5\)%) and Experimental (\(-8.4\)%). On the one hand, the topics where the percentage of women authors has risen the most are Political Economy (\(+67.7\)%), Decision Theory (\(+42.5\)%), Macroeconomics and Monetary (\(+32.3\)%), Experimental (\(+40\)%) and Labor (\(+35\)%). In all of them, women were clearly underrepresented. On the other hand, the topics where the percentage of women has grown the least, besides Finance, have been Health and Gender (\(+11.4\)%), Econometrics (\(+9.4\)%), and IO (\(+9.2\)%).
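A growth rate of this kind can be sketched as below. We assume here that it is the percentage change of the 2013–2019 average relative to the 2002–2008 average; the exact formula is not spelled out in the text, and the yearly series is a made-up toy example.

```python
from statistics import mean


def window_growth(series_by_year, first=(2002, 2008), last=(2013, 2019)):
    """Percentage change between the averages of two windows of years.

    Assumption: growth = 100 * (late average - early average) / early average.
    """
    early = mean(v for y, v in series_by_year.items() if first[0] <= y <= first[1])
    late = mean(v for y, v in series_by_year.items() if last[0] <= y <= last[1])
    return 100.0 * (late - early) / early


# Hypothetical female proportion in one topic, by year (toy data).
female_share = {}
for year in range(2002, 2020):
    if year <= 2008:
        female_share[year] = 0.10
    elif year <= 2012:
        female_share[year] = 0.12
    else:
        female_share[year] = 0.15

growth = window_growth(female_share)  # early average 0.10, late average 0.15
```

The same function can be applied to a topic's prevalence series, so that both growth rates plotted in a figure of this type are computed on a comparable basis.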

Finally, there is no clear relationship between the growth rate of topic prevalence and the increase in female prevalence. This is surprising. We do not have data about the seniority of authors, but as the proportion of females is increasing, we can expect the proportion of females among the new entrants in the T5 market to be relatively large. New entrants should be more likely to work in “hot” topics than in declining ones. The combination of both effects should lead to a positive correlation between the increase in the prevalence of a topic and the increase in female representation, something that we do not observe clearly in the data. However, an alternative explanation for the increase in the proportion of women in some topics is that females who have already published in the top five in the past have extended their network of male coauthors and are getting more papers published.

6 Conclusions

Using unsupervised machine learning techniques and a new database composed of the abstracts of all articles published in T5 journals in Economics for the period 2002–2019, we have shown that there are persistent and significant horizontal differences in the way males and females approach research in Economics. Using the structural topic model, we have identified latent topics for which the distribution of female authors is more uneven than across research fields. These findings are important for several reasons: (i) T5 publications are key for research careers and also for determining the path of economic research; (ii) the results are robust in the sense that they are automatically generated with a probabilistic model without any deterministic allocation of papers to pre-established categories or fields of research; (iii) finally, recent theoretical results by Conde-Ruiz et al. (2017, 2021) and Siniscalchi and Veronesi (2020) show that “horizontal” gender differences in the choice of research topic may lead to a gender discriminatory trap.

Beyond the scope of the present paper, we plan to extend our analysis in several directions. Firstly, we want to collect more information about the authors, in order to be able to capture dynamic effects. For instance, we want to differentiate between the research patterns of senior and junior authors. We also want to investigate how male and female authors build their networks of coauthors and how this process determines the choice of latent topics. Secondly, we want to show the usefulness of the methodology and of the latent topics we have identified by revisiting research questions analyzed by the previous literature on academic gender gaps. For example, Hengel (2020) analyzes the differences in the quality of writing of papers. She shows that female-authored manuscripts are better written and concludes that female authors are subject to higher writing standards. The reason might be an unwelcome gendered culture throughout the entire editorial process when it comes to deciphering complicated texts. We are currently applying Hengel’s readability scores methodology to the latent topics. Our preliminary findings suggest that papers belonging to topics with a higher prevalence of females are better written. Although this evidence can be interpreted as supporting the view that female-authored articles are better written than equivalent articles by men, it may also be the case that the results are driven by the particular topics. In other words, we need a deeper econometric analysis to disentangle whether the writing quality of the papers is driven by the gender of the author or by the choice of latent topics.

Likewise, Card et al. (2019) show that female-authored papers have more citations, suggesting that journals hold female-authored papers to higher standards. They have obtained this result controlling for research field. We plan to collect data on citations and revisit this result while controlling for latent topic. Finally, we also want to use algorithms (for example, LASSO, a widely used regression method in machine learning) to test whether the differences between gender research patterns are important enough to build a predictive model of gender given an observed abstract.