1 Introduction

A long line of research suggests a U-shaped relationship between female labour force participation (FLFP) and economic development (Sinha 1967; Goldin 1995; Tam 2011): while the early stages of economic growth are accompanied by a de-feminisation of the labour force, women tend to become economically active again as income rises further. Figure 1 documents this pattern in a pooled cross-section of 169 countries during 1990–2013. Some of the classic contributions on the subject link this pattern to structural change: while industrialisation leads women to exit the labour force, the transition from an industry- to a service-based economy induces a re-entry (Boserup 1970; Goldin 1995). In other models, the Feminisation U is produced by changes in fertility (Galor and Weil 1996; Lagerlof 2003) and the gender education gap (Hiller 2014), both of which tend to first increase and then decline with rising income per capita.

Fig. 1 FLFP and income per capita: binned scatterplot (1990–2013). Note: the diagram groups the observations into 60 equal-size bins (based on income per capita) and plots the means of income per capita and FLFP for the observations within each bin. The data refer to 169 countries during 1990–2013, for a total of 3,891 observations.
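The binning described in the note to Fig. 1 can be sketched as follows. This is a minimal illustration on synthetic data, not the data behind the figure: the function name `binned_means`, the simulated U-shape, and its noise level are all assumptions made for the example.

```python
import numpy as np

def binned_means(x, y, n_bins=60):
    """Sort by x, split into n_bins (near-)equal-size groups, return per-bin means."""
    order = np.argsort(x)
    x_groups = np.array_split(x[order], n_bins)
    y_groups = np.array_split(y[order], n_bins)
    x_means = np.array([g.mean() for g in x_groups])
    y_means = np.array([g.mean() for g in y_groups])
    return x_means, y_means

# Synthetic data (NOT the paper's data): a U-shape in log income plus noise.
rng = np.random.default_rng(0)
ln_gdp = rng.uniform(6.0, 11.0, size=3891)
flfp = 60.0 + 4.0 * (ln_gdp - 8.6) ** 2 + rng.normal(0.0, 10.0, size=3891)

# One point per bin: mean log income against mean FLFP.
x_bar, y_bar = binned_means(ln_gdp, flfp)
```

Plotting `y_bar` against `x_bar` gives the binned scatterplot. Averaging within bins suppresses observation-level noise, so the underlying shape is easier to see than in a raw scatter of all observations.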

Despite the status of the Feminisation U as a ‘stylised fact’ of development economics, recent empirical contributions have cast doubt on its veracity. Gaddis and Klasen (2014) argue that empirical support for a U-shaped pattern is feeble, showing that the estimated relationship vanishes under dynamic panel-data estimations. Sub-national studies have generally produced mixed results (Roncolato 2016; Lahoti and Swaminathan 2016), while some cross-country regressions provide evidence of an inverted U-shaped relationship (Çağatay and Özler 1995).

In this paper, we re-engage with the empirical controversy surrounding the Feminisation U by exploring variation in the dynamic path of FLFP across countries. We argue that the mechanisms that generate the theoretical U-curve—structural change, fertility dynamics, and gender differences in education—depend critically on initial conditions. In particular, these mechanisms play out more strongly in societies that espouse less gender-equal cultural norms when they embark on the process of modern economic development.

Income per capita begins to rise above subsistence levels when an economy emerges from the Malthusian trap (Galor and Weil 2000). If cultural norms at this point emphasise the role of women as homemakers, industrialisation is more likely to be accompanied by a withdrawal of women from the labour market; women will bear more of the brunt of childcare as fertility rises; and they will likely be prevented from accessing new educational opportunities on an equal footing with men. In these societies, the path of FLFP is strongly U-shaped. When initial cultural norms assign a more equal role to men and women, by contrast, the mechanisms that produce the U-curve are more muted: industrialisation does not completely deter women from seeking and obtaining manufacturing jobs; a rise in fertility is accompanied by a distribution of care labour that is less biased towards women; and new educational opportunities are less likely to be monopolised by men. As a result, fewer women quit the labour force as the economy travels through the middle-income band, leading to a less distinctly U-shaped path of FLFP.

Thus, the dynamic path of FLFP depends critically on the cultural norms that prevail when the economy enters a post-Malthusian growth regime. Here, we test this proposition empirically. To measure pre-historical gender norms, we rely on the seminal work of Boserup (1970) and Alesina et al. (2013). These authors link the emergence of cultural norms about the appropriate role of women in the economy to the form of traditional agriculture practiced in pre-industrial society, when the economy is in the Malthusian regime and income per capita is constant. The key contrast is between plough-based agriculture and forms of shifting cultivation that rely on lighter, hand-held tools such as the hoe or the digging stick. Unlike the latter, the plough requires significant upper body strength, putting a productivity premium on male labour. Thus, ancestral societies with a plough-based agricultural technology developed an early specialisation of labour along gender lines. Over time, this division of labour facilitated the emergence of cultural norms that cast women as ‘natural’ home-makers.

By linking the plough argument to the mechanisms that generate the Feminisation U, we suggest that the legacy of ancestral plough use may be exerting a moderating influence on the dynamics of FLFP in the course of development. In societies that traditionally practiced plough agriculture, the gender norms that prevail at the transition to a post-Malthusian regime tend to disadvantage women. As a result, the path of FLFP is likely to be more strongly U-shaped as income per capita rises. The less a society’s ancestors relied on a plough-based technology, the more equal are the norms that prevail when per-capita income begins to rise above subsistence, leading to a less markedly U-shaped path of FLFP.

To model the plough as an effect modifier, we allow the parameters of the U-curve to depend linearly on Alesina et al.’s (2013) measure of ancestral plough use, leading to a specification with interaction terms. Based on a global panel dataset covering 169 countries during 1990–2013, we present stylised facts that are consistent with initial cultural norms exerting a modifying influence on the dynamics of FLFP.Footnote 1 The pattern of results found in the data is extremely robust. A significant contrast between the dynamic paths of FLFP in plough vs. non-plough societies is observed in specifications that correct for dynamic panel bias (e.g. GMM) and treat economic development and historical plough use as endogenous to FLFP. We also consider other historical events (e.g. the Neolithic revolution) as potential correlates of initial gender-role cultural norms and hence as potential effect modifiers. In all cases, we find no evidence that these factors exert a significant effect-modifying influence once the time path of FLFP is allowed to depend on historical plough use. On the contrary, the observed heterogeneity in the shape of the Feminisation U appears to be robustly linked to the legacy of historical plough adoption only.

Our findings contribute to the literature in at least three ways. Recent studies have examined the historical emergence (Alesina et al. 2013; Hansen et al. 2015), transmission (Fernandez and Fogli 2009), persistence (Grosjean and Khattar 2019), and transformation (Fernandez 2013) of the cultural norms that assign different socio-economic roles to men and women. Yet, for the most part, this work has developed in isolation from the Feminisation U literature. Thus, our first contribution is to integrate two strands of research that have developed on parallel tracks.

Second, we suggest a possible way of resolving the ongoing empirical controversy surrounding the Feminisation U: empirical studies may fail to observe a U-shaped relationship if they focus on ‘non-plough’ countries or if ‘non-plough’ settings dominate the sample. To the best of our knowledge, this is the first paper that investigates a potential source of heterogeneity in the relationship between economic development and FLFP.

Third, our results point to an important nuancing of Alesina et al.’s (2013) argument. Their seminal analysis provides estimates of the average effects of historical plough use on labour market outcomes today, providing broad evidence of the role of the plough in shaping gender roles. We nuance their results by revealing substantial heterogeneities along the income path. In particular, the detrimental effect of plough legacies is only observed in middle-income economies. Our findings imply that the labour-supply effects of the plough shock are neither immediate nor permanent: a substantial reduction in FLFP does not appear until a plough-based economy has reached a middling level of income, and this reduction is later reversed completely as more advanced levels of economic development are achieved. We think that this qualification reflects more accurately Boserup’s (1970) original formulation of the ‘plough thesis’.

2 The Feminisation U and initial conditions: bringing two strands of literature together

We begin by reviewing the main theoretical mechanisms that are known to lead to a Feminisation U—structural change, fertility dynamics, and gendered educational choices. In each case, we highlight how the prediction of a U-shaped path of FLFP depends on assumptions about initial conditions—in particular, the gender norms that prevail when the growth rate of per-capita income begins to rise above zero. This framework motivates the subsequent empirical analysis.

2.1 Structural change

Early explanations of the Feminisation U placed an emphasis on sectoral shifts in production and employment (Boserup 1970; Goldin 1995). Initially, economic growth shifts the locus of production from family farms and home workshops to factories, firms, and other places of wage work. This transition induces women to exit the labour force. For one thing, the physical separation between home and workplace makes it more difficult to reconcile productive and reproductive tasks (Benerìa 1979).Footnote 2 For another, work in factories and industrial farms is generally considered ‘dirty’. A married woman engaged in paid manual labour outside the home brings ‘stigma’ to the family, lowering the household’s utility (Boserup 1970). ‘This stigma is a simple message: only a husband who is lazy, indolent, and entirely negligent of his family would allow his wife to do such labour’ (Goldin 1995: 71). As per-capita income rises further, however, paid jobs that do not mark women with a stigma, such as service-sector jobs, become more widely available, and women re-enter the labour force.

Ngai and Olivetti (2015) formalise this argument, showing that a model of structural transformation with home/market production and gender-specific comparative advantages implies a U-shaped female labour supply. Ngai and Petrongolo (2017) and Rendall (2017) examine alternative models, putting greater emphasis on the rise of service-sector jobs in generating an increase in female participation.

In these models, the initial fall in FLFP comes from the emergence of occupations to which stigma is attached, while the subsequent rise is due to an increased availability of non-stigma jobs. Thus, the U-pattern is implicitly sustained by the cultural norms that prescribe the appropriate role of women in the economy.Footnote 3 Indeed, Goldin’s (1995) original toy model came in two variants—with and without social stigma. In the ‘non-stigma case’, Goldin shows that female labour supply is not necessarily U-shaped.Footnote 4 Here, we link the cultural norms that prevail in the early stages of growth of per-capita income to the legacy of ancestral plough use.

2.2 Fertility dynamics

Other models reproduce a U-shaped path of FLFP by focusing on the role played by fertility dynamics in the demographic transition. In a seminal paper building on Becker (1960), Galor and Weil (1996) consider a growth model with gender heterogeneity and endogenous fertility. The relative wages of men and women have income and substitution effects on fertility and labour supply decisions. Both genders are equally endowed with mental human capital (‘brains’). Men, however, have more ‘brawn’ and hence a comparative advantage in labour-intensive tasks. In low-income economies with a labour-intensive technology, economic growth raises the male relative wage. The resulting income effect increases the demand for children, reducing FLFP. At a higher level of income, the technology is less brawn-intensive, and economic growth has a positive effect on the female wage, closing the gender wage gap. At this point, women substitute out of childrearing and into market work. As Galor and Weil (1996) acknowledge, simple extensions of their model can generate a non-monotonic U-shaped relationship between per-capita income and FLFP (see also Lagerlof 2003; Bloom et al. 2009; Kimura and Yasui 2010).

Implicitly, these models are premised on a specific set of gender norms. Although Galor and Weil ‘do not assume that women are better at raising children than are men’ (1996: 378), they nonetheless assume that as a matter of fact ‘all childrearing is done by women’ (1996: 375). Accordingly, the time (opportunity) cost of children is an increasing function of the female wage only, implying a pure income effect from a rise in the male wage. In some societies, however, cultural norms are more gender-equal, and men and women contribute more equally to childrearing. Here, the opportunity cost of having children is also a function of the male wage. A rising demand for children is partly offset by the opportunity cost of children, and the dynamic path of FLFP is less strongly U-shaped. Again, here we link the cultural norms affecting the gender allocation of care labour to the legacy of the plough.

2.3 The gender education gap

A final argument, first proposed by Boserup (1970), links the Feminisation U to men’s privileged access to education and technological knowledge. Although there is no biological constraint on women and men attaining equal quantities of ‘brains’, gender-biased cultural norms favour the education of boys. This mechanism is examined formally by Hiller (2014), who considers a two-sex, overlapping generations model of the gender education gap in the course of development.Footnote 5 The household’s utility is derived from both wages (returns to education) and status. The education of daughters negatively impacts status if cultural norms favour women’s specialisation in housework. Parents maximise the household’s utility by choosing how much to invest in their children’s education. As in Fernandez (2013), cultural norms (and hence FLFP) are determined endogenously.

In this model, FLFP follows a U-shaped path. Yet, the model’s dynamics depend critically on the initial difference in productivity between uneducated men and women. This difference leads to a cultural bias in the allocation of the household’s education budget towards boys. When male and female labour are perfect substitutes, the allocation of the household’s education budget is gender-neutral. As soon as the gender gap in labour productivity is even slightly larger than zero, however, ‘there is room for an inegalitarian equilibrium […]. Other things being equal, a higher [initial gender productivity differential] indicates a greater likelihood that an economy will fall into a gender-inequality trap’, producing a more markedly U-shaped path of FLFP (Hiller 2014: 474).

In the model, the two genders are characterised by biological differences in physical strength. Whether these differences are economically salient, giving rise to a productivity differential, depends on the extent to which the pre-industrial production technology relies on physically demanding labour (Hiller 2014). The more brawn-intensive this technology is (as is the case in plough-based agricultural systems), the greater the initial male advantage in productivity. This sets the stage for an education bias and a more distinctly U-shaped path of FLFP.Footnote 6

3 The empirical literature on the Feminisation U

Despite its strong theoretical rationale, the Feminisation U is supported by a mixed body of evidence. Early findings based on cross-country regressions were suggestive of a U-shaped relationship (Sinha 1967; Pampel and Tanaka 1986). Goldin’s (1995) cross-sectional evidence from 100 countries in 1985 is also consistent with a U pattern in which female labour supply reaches its lowest point in countries with a per-capita GDP of around 3,000 US$ (1985). Based on a pooled cross-section of 193 countries in 1980 and 1990, however, Çağatay and Özler (1995) find statistically significant evidence of an inverted U-shaped relationship.Footnote 7

Early studies also reported time-series evidence based on individual countries. Focusing on the USA, Goldin (1995) argues that FLFP probably traced out a U-shaped pattern over time, reaching a bottom in the 1920s, when per-capita GDP in the USA was around 5,000 US$ (PPP). Similar findings for England and France are documented by Tilly and Scott (1987).

As already noted by Durand (1975), the cross-sectional relationship may be biased, as the omission of unobserved country-level heterogeneities (e.g. cultural differences) may give rise to a Kuznets-type fallacy (Tam 2011). Meanwhile, results based on time series from individual countries may have limited external validity. For these reasons, the most recent work on the Feminisation U has turned to panel-data methods. Based on data from 90 countries during 1970–1985, Mammen and Paxson (2000) are the first to exploit variation within countries over time to identify the shape of the Feminisation U. Their fixed-effects panel estimates reveal a more muted U-shape than the corresponding pooled model, with a much lower (but statistically significant) turning point at 1,600 US$ per capita, as compared to 2,550 US$ in their pooled cross-section.

Luci (2009) and Tam (2011) estimate both static (OLS) and dynamic (GMM) panel-data specifications, confirming more formally that the Feminisation U shows up as an intertemporal relationship. Their models, however, do not control for potential time-varying confounders, nor do they address the potential endogeneity of income levels to female labour supply (although the GMM estimators they employ would allow them to instrument for per-capita GDP).

In a noteworthy contribution, Gaddis and Klasen (2014) employ more recent and comprehensive labour market data than either Luci (2009) or Tam (2011). They also use dynamic panel-data estimation techniques (GMM) and internal instruments to correct for the endogeneity of GDP per capita in ‘GMM style’. In a break from most of the previous literature, they conclude that ‘there is no clear evidence for the feminization U hypothesis from […] dynamic estimations’ (Gaddis and Klasen 2014: 660).

Gaddis and Klasen’s (2014) findings are in line with recent studies that exploit variation in the level of development across sub-national units within countries. Using Indian state-level data spanning the period 1983–2012, Lahoti and Swaminathan (2016) estimate dynamic GMM models and find no systematic evidence of a U-shaped relationship between state-level income and FLFP. Similarly, Roncolato’s (2016) study of South Africa employs micro data to investigate the effect of municipality-level income on a woman’s probability of being in the labour force, concluding that in South Africa, the U-shaped relationship is more ambiguous than implied by theory.

Overall, these recent findings are mixed, and they strengthen the conclusion reached by Humphries and Sarasua (2012), based on their review of historical evidence from a number of now developed economies, that the Feminisation U cannot be assumed to hold universally.

A key limitation of all existing empirical studies is that they do not systematically investigate the potential heterogeneity of female labour supply dynamics across different contexts. Such an investigation could not only help reconcile some of the evidence summarised here but also shed new light on the mechanisms driving the Feminisation U. In particular, we are not aware of any study examining how the cultural norms shaped by historical events from the distant past (e.g. the adoption of the plough) exert a modifying influence on the dynamic path of FLFP. The analysis that follows addresses this missing link in the empirical literature.

4 Empirical strategy

4.1 Data

To measure female labour force participation (FLFP), we use the (most recent) sixth revision of the International Labour Organisation (ILO)’s Estimates and Projections of the Economically Active Population (EAPEP) database. These data, which were downloaded from the World Bank’s website, are based on ILO staff estimates. FLFP is defined as the number of women in the labour force as a share of the total working-age (15–64) female population. In the System of National Accounts (SNA), the labour force is composed of all individuals who are working or seeking work. Any kind of employment for pay is included, as are self-employment and unpaid work performed to produce a good (as opposed to a service) for auto-consumption (Klasen 2018).

On this definition, subsistence farming counts as labour force participation. By contrast, housework and care work (e.g. childrearing and the care of the elderly) do not count as market work. This definition is consistent with our theoretical framework: women employed on the family farm or in household industries are classified as supplying their labour to market, while women engaged in childrearing or other forms of reproductive labour are considered to be economically inactive.

The EAPEP’s sixth edition covers the period 1990–2019.Footnote 8 In 2013, however, the 19th Conference of Labour Statisticians, convened by the ILO, modified the definitions of work and labour force, leading to a break in the ILO time series on labour force participation (Klasen 2018). Thus, to obtain estimates based on consistent definitions, we restrict our analysis to the period 1990–2013. In the sample available for estimation, which covers 169 countries, \(FLFP\) ranges between 10.1 (Jordan in 1991) and 91.5 percent (Burundi in 1991), with a mean (median) value of 56.3 (58.7) percent and a standard deviation of 17.5.

To measure the historical adoption of the plough (and the cultural norms it brought about), we use Alesina et al.’s (2013) original variable, \(Plough\). The authors use comparable ethnographic information to construct ‘estimates of the fraction of the population currently living in a […] country with ancestors that traditionally engaged in plough agriculture’ (2013: 486).Footnote 9 \(Plough\) is a continuous variable ranging between 0 and 1, with a mean (median) value of 0.52 (0.73) and a standard deviation of 0.47. In 55.5 percent of the countries entering the estimation sample, either all (26.5 percent) or none (29.5 percent) of their present-day inhabitants had plough-using ancestors. Only 5.4 percent of countries in the sample have a value of \(Plough\) between 0.2 and 0.8.

An important limitation of Alesina et al.’s (2013) plough variable, which the authors acknowledge, is that it does not provide information on the exact timing of plough adoption. Yet, the introduction and diffusion of the plough certainly occurred well before the transition to a post-Malthusian growth regime, that is, before per-capita income starts to rise above subsistence levels and the economy begins to travel down the U-curve.Footnote 10 For this reason, Alesina et al.’s (2013) variable can be taken as an appropriate measure of initial conditions.

The data on income per capita, expressed in constant 2011 international dollars, are taken from the World Bank’s World Development Indicators and are PPP-adjusted. The mean (median) value of \(\mathrm{ln}GDPpc\) is 8.9 (9), and the standard deviation is 1.3, which provides us with sufficient variation to identify changes in FLFP across levels of income.

4.2 Model specification

We explore the heterogeneity in the dynamic path of FLFP in a panel-data framework. We model the relationship between FLFP and economic development as a quadratic function of (the log of) per-capita GDP—the standard approach in the literature. In contrast to previous studies, however, we allow the coefficients of the quadratic function to depend linearly on Alesina et al.’s (2013) \(Plough\) variable, leading to a specification with interaction terms:

$${FLFP}_{it}=\rho {FLFP}_{it-1}+{\beta }_{1}{\mathrm{ln}GDPpc}_{it}+{\beta }_{2}{\mathrm{ln}GDPpc}_{it}^{2}+{\gamma }_{1}{Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}+{\gamma }_{2}{Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}+\varphi {\mathbf{X}}_{it}+{\sigma }_{i}+{\tau }_{t}+{\varepsilon }_{it}$$
(1)

where \(i\) indexes countries and \(t\) time.Footnote 11

In line with previous contributions (Tam 2011; Gaddis and Klasen 2014), Eq. (1) is a dynamic specification in which the outcome variable is allowed to depend on its own previous-period realisation. Substantively, the first lag of \(FLFP\) models the persistence of the cultural norms that shape and constrain women’s labour supply decisions. A dynamic specification is motivated by the fact that cultural attitudes regarding married women’s work are passed down from one generation to the next (Fernandez et al. 2004; Fernandez and Fogli 2009). As such, they have a strong tendency to persist unchanged over time (Grosjean and Khattar 2019). Statistically, the inclusion of a lagged dependent variable helps to remove serial correlation from the error term, which is a precondition for consistent dynamic panel-data estimates.

Our specification includes country-fixed effects (\({\sigma }_{i}\)), which flexibly absorb all time-invariant country-level factors, including the main effect of historical plough use. Thus, Eq. (1) allows us to estimate the (heterogeneous) relationship between FLFP and per-capita income based on within-country variation only, comparing the average time path of FLFP in ‘plough’ vs. ‘non-plough’ settings.
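The following sketch illustrates, on simulated data, why country-fixed effects absorb the main effect of \(Plough\) while leaving the interaction terms identified. It is a deliberately simplified static version of Eq. (1): it omits the lagged dependent variable, the time dummies, and the controls. All parameter values, variable names, and the binary coding of `plough` are assumptions made for the illustration, not estimates or data from this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_countries, n_years = 150, 24

# Hypothetical "true" parameters for the simulation (not estimates from the paper).
beta1, beta2 = 0.0, 0.0          # flat path when Plough = 0
gamma1, gamma2 = -12.0, 0.7      # U-shaped path when Plough = 1

country = np.repeat(np.arange(n_countries), n_years)
plough = np.repeat(rng.choice([0.0, 1.0], size=n_countries), n_years)
base_income = np.repeat(rng.uniform(6.5, 9.5, size=n_countries), n_years)
growth = np.tile(np.linspace(0.0, 1.5, n_years), n_countries)
ln_gdp = base_income + growth + rng.normal(0.0, 0.05, size=country.size)
alpha = np.repeat(rng.normal(50.0, 10.0, size=n_countries), n_years)  # country effects

flfp = (alpha + 5.0 * plough            # time-invariant main effect of Plough
        + beta1 * ln_gdp + beta2 * ln_gdp**2
        + gamma1 * plough * ln_gdp + gamma2 * plough * ln_gdp**2
        + rng.normal(0.0, 0.2, size=country.size))

def within(v):
    """Subtract country means (the within transformation)."""
    means = np.bincount(country, weights=v) / np.bincount(country)
    return v - means[country]

# Regressors: ln GDPpc, its square, and their interactions with Plough.
X = np.column_stack([ln_gdp, ln_gdp**2, plough * ln_gdp, plough * ln_gdp**2])
X_w = np.column_stack([within(col) for col in X.T])
coef, *_ = np.linalg.lstsq(X_w, within(flfp), rcond=None)

# Demeaning a time-invariant variable leaves a column of zeros,
# so the main effect of Plough cannot be estimated separately.
plough_within = within(plough)
```

In this simulation, `coef` holds the within estimates of \({\beta }_{1}\), \({\beta }_{2}\), \({\gamma }_{1}\), and \({\gamma }_{2}\), in that order. The interactions are identified because \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}\) varies over time within a country even though \({Plough}_{i}\) does not.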

\({\tau }_{t}\) denotes a full set of time-period dummies that capture labour market shocks affecting all countries simultaneously (e.g. global economic crises and business cycle fluctuations), as well as any global trend in gender equality.Footnote 12 In addition, the inclusion of \({\tau }_{t}\) makes it less likely that the idiosyncratic disturbances (\({\varepsilon }_{it}\)) are contemporaneously correlated across countries, which would bias the variance estimator (Roodman 2009). In additional models, we also consider richer specifications of the time-trend component.

We also present specifications that condition the estimated relationship between FLFP and per-capita GDP on a set of time-varying observables. We choose these control variables based on a review of the literature on the determinants of female labour supply. In particular, we include a measure of armed conflict, which may create labour shortages and new labour-market opportunities for women (Goldin and Olivetti 2013); a country’s dependence on oil exports, which have been shown to crowd out female-intensive tradable sectors (Ross 2008); a measure of aid dependence, on the assumption that aid agencies may favour male-biased technology transfer, feeding negative gender stereotypes (Boserup 1970; Jaquette and Summerfield 2006); an index capturing the extent to which a country’s society and culture are globalised and hence exposed to the diffusion of new values and attitudes, some of which may concern gender roles (Potrafke and Ursprung 2012); and a measure of the quality of democracy (Beer 2009). A detailed description of these variables and of the associated data sources is provided in Appendix Table 7.

The parameters of interest in Eq. (1) are \({\beta }_{1}\), \({\beta }_{2}\), \({\gamma }_{1}\), and \({\gamma }_{2}\). The \(\beta\)’s define the average curvature of the Feminisation U in countries in which \(Plough=0\), while the \(\gamma\)’s measure the difference between the curvatures observed in countries with and without historical plough use. In post-estimation, we also compute the parameters of the Feminisation U in countries in which \(Plough=1\), that is, \({\beta }_{1}+{\gamma }_{1}\) and \({\beta }_{2}+{\gamma }_{2}\).
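To make the role of these parameters explicit, the slope and turning point implied by Eq. (1) can be written out as a short derivation. It holds \({FLFP}_{it-1}\) and \({\mathbf{X}}_{it}\) fixed; the starred symbol \({\mathrm{ln}GDPpc}_{i}^{*}\) is notation introduced here for the turning point and does not appear elsewhere in the paper. Differentiating Eq. (1) with respect to log income gives

$$\frac{\partial {FLFP}_{it}}{\partial {\mathrm{ln}GDPpc}_{it}}=\left({\beta }_{1}+{\gamma }_{1}{Plough}_{i}\right)+2\left({\beta }_{2}+{\gamma }_{2}{Plough}_{i}\right){\mathrm{ln}GDPpc}_{it}$$

Setting this slope to zero yields the turning point for a given value of \({Plough}_{i}\):

$${\mathrm{ln}GDPpc}_{i}^{*}=-\frac{{\beta }_{1}+{\gamma }_{1}{Plough}_{i}}{2\left({\beta }_{2}+{\gamma }_{2}{Plough}_{i}\right)}$$

The turning point is a minimum, and the path U-shaped, only if \({\beta }_{2}+{\gamma }_{2}{Plough}_{i}>0\). With \({Plough}_{i}=0\) the expression reduces to \(-{\beta }_{1}/(2{\beta }_{2})\), and with \({Plough}_{i}=1\) to \(-({\beta }_{1}+{\gamma }_{1})/(2({\beta }_{2}+{\gamma }_{2}))\). Because the long-run coefficients of a dynamic model are the short-run ones divided by \(1-\rho\) (for \(|\rho |<1\)), this ratio, and hence the turning point, is the same in the short and the long run.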

5 Main results

To explore the patterns in the data, we first present OLS estimates of Eq. (1). The results are summarised in Table 1. Column 1 reports the estimated coefficients of a pooled model. The specification in column 2 adds country (\({\sigma }_{i}\)) and year (\({\tau }_{t}\)) fixed effects. This and all the following specifications use only time variation within countries to estimate the relationship between FLFP and economic development. In column 3, we enrich the specification by replacing the year fixed effects (which enter as jointly insignificant) with a full set of year × continent fixed effects, allowing for fully flexible trends in FLFP at the continent level. Column 4 presents an alternative specification that conditions the estimates on country-specific linear trends (\({\sigma }_{i}t\)). Here, the functional form of the time trend is more restrictive, but each country is allowed to have a different trend in FLFP. In both models 3 and 4, the time trends are highly statistically significant.Footnote 13 Next, the specification in column 5 includes the set of time-varying observables discussed in Sect. 4.2. Lastly, column 6 reports a benchmark specification without \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\), as in previous empirical studies of the Feminisation U. Due to data limitations, the sample available to estimate models 5 and 6 is substantially smaller.

Table 1 Determinants of female labour force participation (1991–2013), OLS

Throughout models 1–5, the coefficient on the lagged dependent variable is close to (but always statistically distinguishable from) one, suggesting a high degree of persistence in the gender norms that shape and constrain women’s decision to supply their labour to market. The estimates of \({\beta }_{1}\) and \({\beta }_{2}\) are always statistically insignificant, both individually and jointly, implying that when \(Plough=0\), the relationship between per-capita GDP and FLFP is statistically indistinguishable from a flat line.

The estimates of \({\gamma }_{1}\) and \({\gamma }_{2}\), by contrast, are always statistically significant (individually and jointly) at conventional levels, including in the more ‘demanding’ specifications that net out the influence of time trends in FLFP (columns 3–5). This finding indicates that in ‘plough’ countries, the expected relationship between GDP per capita and FLFP is significantly different from the expected relationship in ‘non-plough’ settings. In some specifications (for instance, model 3), the difference is only significant at the 10% level. This is arguably a consequence of multicollinearity. Indeed, in more parsimonious specifications that omit either \({\mathrm{ln}GDPpc}_{it}\) or \({\mathrm{ln}GDPpc}_{it}^{2}\), \({\widehat{\gamma }}_{1}\) and \({\widehat{\gamma }}_{2}\) are similar in magnitude to the estimates presented in Table 1 (column 3) but much more precisely estimated (see Appendix Table 8, columns 1 and 2, for the full results).

In panel B, we also present estimates of the U-curve’s parameters in ‘plough’ countries (\(Plough=1\)). Throughout models 1–5, the relationship between GDP per capita and FLFP is consistent with a U-shaped pattern, with a positive and significant coefficient on the quadratic term of log GDP per capita and a negative and significant coefficient on the linear term. As noted by Lind and Mehlum (2010: 111), ‘the requirement for a U shape is that the slope of the curve is [significantly] negative at the start and [significantly] positive at the end of a reasonably chosen interval of x-[\(\mathrm{ln}GDPpc\)] values’, say [6, 11]. In models 1–5, this is always the case if \({Plough}_{i}=1\), unequivocally indicating a non-monotonic relationship, but never the case if \({Plough}_{i}=0\).Footnote 14 In addition, we find that, in ‘plough’ countries, the 98 percent confidence interval for the estimated turning point of the U (in logs) is always comfortably inside the data range (see panel C). In ‘non-plough’ countries, by contrast, the same confidence interval always exceeds the data range on either or both sides, leading to a rejection of a U-shaped relationship at the 1% level (Lind and Mehlum 2010: 113).
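The endpoint-slope condition quoted above can be sketched in a few lines. The coefficients and covariance terms below are hypothetical values chosen for illustration, not entries from Table 1. The function name `endpoint_slope` and the normal critical value of 1.645 (one-sided 5% level) are likewise assumptions of this sketch rather than the exact procedure used in the paper.

```python
import math

def endpoint_slope(b1, b2, x, v11, v22, v12):
    """Slope b1 + 2*b2*x of a quadratic at x, with the t-ratio of this linear combination."""
    slope = b1 + 2.0 * b2 * x
    # Var(b1 + 2x*b2) = Var(b1) + 4x^2 Var(b2) + 4x Cov(b1, b2)
    variance = v11 + 4.0 * x**2 * v22 + 4.0 * x * v12
    return slope, slope / math.sqrt(variance)

# Hypothetical coefficients and covariance terms (illustration only, not Table 1).
b1, b2 = -12.0, 0.7
v11, v22, v12 = 4.0, 0.0144, -0.235

slope_lo, t_lo = endpoint_slope(b1, b2, 6.0, v11, v22, v12)    # lower bound of [6, 11]
slope_hi, t_hi = endpoint_slope(b1, b2, 11.0, v11, v22, v12)   # upper bound of [6, 11]

# A U-shape needs a significantly negative slope at the lower bound
# and a significantly positive slope at the upper bound.
is_u_shaped = (t_lo < -1.645) and (t_hi > 1.645)
```

Testing both endpoints jointly is stricter than checking the sign and significance of the quadratic term alone, since a convex but monotone relationship over the chosen interval would fail one of the two one-sided tests.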

We also note some relevant differences across models with and without country-fixed effects. In ‘plough’ countries, the estimated parameters of the U-curve are larger in specifications that allow for country-specific effects (columns 2–5), implying a more pronounced U-shaped pattern (see panel B). Furthermore, the within-country estimates indicate that in ‘plough’ countries, women’s labour-force participation reaches a minimum at a per-capita income of around 5,000–6,000 PPP dollars (around 8.6 on a log scale).Footnote 15 This estimated turning point is around twice as high as the turning point implied by a pooled model without country-fixed effects (column 1).
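As a rough check on the conversion between logs and levels, write \(b_{1}\) and \(b_{2}\) for the linear and quadratic income coefficients that apply in ‘plough’ countries (a shorthand introduced here for illustration, not the notation of Table 1). The turning point of a quadratic in log income, and the level equivalent of a log turning point of 8.6, then follow from:

\[
\frac{\partial FLFP}{\partial\,\mathrm{ln}GDPpc}=b_{1}+2b_{2}\,\mathrm{ln}GDPpc=0
\;\Rightarrow\;
\mathrm{ln}GDPpc^{*}=-\frac{b_{1}}{2b_{2}},
\qquad
e^{8.6}\approx 5{,}430\ \text{PPP dollars}.
\]

Since \(\mathrm{ln}(5{,}000)\approx 8.52\) and \(\mathrm{ln}(6{,}000)\approx 8.70\), a log turning point of about 8.6 sits inside the 5,000–6,000 range.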

Lastly, in column 6, we present the estimates of a ‘traditional’ specification without interaction terms. The results imply that, on average, the time path of FLFP is U-shaped in GDP per capita, in line with previous findings in the literature on the Feminisation U. Yet, the curvature of this ‘average’ U-curve (as captured by the coefficient on the squared income term, i.e. 0.292, p-value = 0.011) is more pronounced than in ‘non-plough’ countries (− 0.104, p-value = 0.513) but less pronounced than in ‘plough’ countries (0.686, p-value = 0.000).

To further examine the heterogeneity in the relationship between FLFP and economic development, we also run OLS models on split samples, as an alternative to our preferred specification with interaction terms. In particular, we estimate a version of specification 5 in Table 1 on subsamples of countries where \(Plough\) is greater than 0.8 (‘plough’ countries) and lower than 0.2 (‘non-plough’ countries).Footnote 16 To facilitate a visual interpretation of the results, we plot the expected relationship between FLFP and per-capita GDP based on these two models (reported more extensively in Appendix Table 8, columns 6–7). The results, shown in Fig. 2, confirm our previous findings, although in this specification the estimated relationship in ‘non-plough’ countries (right-hand side panel) is negatively sloped and marginally distinguishable from a flat line at the 10% level (see Appendix Table 8, column 8).

Fig. 2

FLFP and income per capita (1991–2013). Note: the diagrams plot the conditional mean (predictive margins) of FLFP in ‘plough’ (Plough > 0.8) and ‘non-plough’ (Plough < 0.2) countries, at different levels of per-capita GDP (in logs), averaging over the remaining covariates. The 90% confidence interval is also shown. The plots are based on two regressions estimated on split samples. The model specification is similar to that of model 6, Table 1. The regressions control for country FE, country-specific linear trends, and five time-varying controls, including armed conflict, oil and aid dependence, globalisation, and democracy

Since almost all countries in Europe (Africa) have a history (no history) of ancestral plough use, we also run additional regressions (based on model 5 in Table 1) that exclude European (African) countries from the sample. In both cases, the results (reported in Appendix Table 8, columns 3–4) are qualitatively similar, although the coefficients of interest are much less precisely estimated. In additional tests, we also find that excluding the so-called neo-Europes (USA, Canada, Australia, New Zealand) from the sample leads to near-identical results (see Appendix Table 8 column 5). Lastly, we inspect a diagram plotting the leverage score for each observation against its normalised residual squared, finding no evidence that the results are driven by influential outliers.

Taken together, these findings suggest a stylised fact that is consistent with the theoretical mechanisms discussed in Sect. 2. The time path of FLFP is only significantly U-shaped in countries whose ancestors employed a plough-based agricultural technology. The shape of this Feminisation U becomes progressively more muted the lower the share of a country’s ancestors that practised plough agriculture. In countries with little or no legacy of historical plough use, the time path of FLFP is effectively flat in most specifications.

6 Robustness and extensions

In this section, we examine the robustness of the conditional relationships presented in Sect. 5 in a variety of ways.

6.1 Dynamic panel bias

The first issue concerns the dynamic nature of Eq. (1). In dynamic panel-data models, the lagged dependent variable is typically correlated with the disturbance, and the OLS (or the least-squares dummy variable or LSDV) estimator is subject to a finite-sample bias of order \(1/T\) (Nickell 1981). To address this problem, the most recent studies of the Feminisation U (Tam 2011; Gaddis and Klasen 2014) employ Arellano and Bond’s (1991) ‘difference’ GMM estimator. The ‘difference’ GMM estimator first-differences the estimating equation to expunge \({\sigma }_{i}\), while using lagged levels of \({FLFP}_{it-1}\) as internal (‘GMM-style’) instruments for \(\Delta {FLFP}_{it-1}\). The problem is that when the outcome variable is highly persistent, as is the case with \(FLFP\), ‘difference’ GMM suffers from a version of the weak-instrument problem, leading to severe small-sample bias (Blundell and Bond 1998; Soto 2009).

To address this problem, we take a two-fold approach. First, we turn to Kiviet’s (1999) Nickell bias-corrected least-squares dummy variable estimator (or LSDVC), as extended and implemented by Bruno (2005). Kiviet (1999) derives expressions for the small-sample bias of the LSDV estimator, thus offering a method to correct the LSDV estimates.Footnote 17 Second, we divide the panel into 5-year intervals (1995, 2000, 2005, 2010). Using a panel dataset with gaps reduces the autocorrelation of FLFP, allowing us to apply ‘difference’ GMM consistently. Furthermore, the specification with 5-year intervals has the advantage of smoothing out the influence of short-term cyclical fluctuations in FLFP, e.g. those potentially induced by economic recessions (Albanesi and Sahin 2018).

The results are presented in Table 2. Models 1 and 2, based on the full dataset with panels pooled over consecutive years, are estimated using Kiviet’s (1999) LSDVC, with and without time-varying controls. Model 1 (Table 2) can be compared with the corresponding pooled (model 1, Table 1) and LSDV specifications (model 2, Table 1). While the coefficient on the lagged dependent variable is known to be biased upwards in the former, it is biased downwards in the latter (Roodman 2009). Reassuringly, the LSDVC estimate of \(\rho\) (0.965) falls well within the pooled–LSDV bracketing range (0.935–0.992), confirming the plausibility of our LSDVC results. In both models 1 and 2 (Table 2), the pattern of results on the coefficients of interest is qualitatively similar to our previous findings, with no U-shaped relationship between FLFP and per-capita GDP when \(Plough=0\) and a (statistically significantly different) U-shaped pattern when \(Plough=1\).
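The bracketing logic used above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a replication of our estimates: the data-generating process, the parameter values (300 units, 10 periods, a true \(\rho\) of 0.8) and all function names are hypothetical. The sketch simulates a dynamic panel with unit fixed effects and compares the pooled OLS estimate of \(\rho\) (which ignores the fixed effects) with the LSDV estimate (obtained after within-unit demeaning).

```python
import random


def simulate_panel(n_units, n_periods, rho, seed=0):
    """Simulate y_it = alpha_i + rho * y_i,t-1 + e_it with unit effects alpha_i."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n_units):
        alpha = rng.gauss(0.0, 1.0)
        # Start each unit at a draw from its stationary distribution.
        y = alpha / (1.0 - rho) + rng.gauss(0.0, 1.0) / (1.0 - rho**2) ** 0.5
        series = [y]
        for _ in range(n_periods):
            y = alpha + rho * y + rng.gauss(0.0, 1.0)
            series.append(y)
        panel.append(series)
    return panel


def _ols_slope(xs, ys):
    """Slope from a bivariate regression of ys on xs (with an intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def pooled_ols_rho(panel):
    """Pooled OLS of y_it on y_i,t-1, ignoring the unit effects."""
    xs = [v for series in panel for v in series[:-1]]
    ys = [v for series in panel for v in series[1:]]
    return _ols_slope(xs, ys)


def lsdv_rho(panel):
    """LSDV (within) estimator: demean each unit's series, then run OLS."""
    xs, ys = [], []
    for series in panel:
        lagged, current = series[:-1], series[1:]
        mean_lag = sum(lagged) / len(lagged)
        mean_cur = sum(current) / len(current)
        xs.extend(v - mean_lag for v in lagged)
        ys.extend(v - mean_cur for v in current)
    return _ols_slope(xs, ys)


if __name__ == "__main__":
    panel = simulate_panel(n_units=300, n_periods=10, rho=0.8)
    print("pooled OLS:", pooled_ols_rho(panel))
    print("LSDV:      ", lsdv_rho(panel))
```

In simulations of this kind, the pooled estimate tends to lie above the true \(\rho\) and the LSDV estimate below it, which is why a bias-corrected estimate is expected to fall between the two.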

Table 2 Correcting for dynamic panel bias

In models 3–5, we consider panels with 5-year intervals. Column 3 reports a benchmark LSDVC model, while columns 4 and 5 report ‘difference’ GMM estimates, with and without time-varying controls. The LSDVC results in column 3 are directly comparable (and, indeed, fairly similar) to the GMM estimates reported in column 4. Both GMM models pass the standard diagnostic tests.Footnote 18 Again, across models 3–5, the main pattern of results is qualitatively unaltered. We conclude that the stylised facts presented in Table 1 are not an artefact of dynamic panel bias.

6.2 Main effect of historical plough legacies

In all the models presented so far (except for model 1 in Table 1), the main effect of historical plough use was either absorbed by the country-fixed effects (LSDV and LSDVC) or removed by the first-difference transformation (Δ-GMM). In other words, while these models allow us to estimate the different curvatures of the time path of FLFP in ‘plough’ vs. ‘non-plough’ settings, they cannot identify the vertical distance between these time paths (i.e. the difference in the intercepts of the two U-curves). Thus, we also report estimates obtained using Blundell and Bond’s (1998) ‘system’ GMM estimator, which permits the identification of coefficients on time-invariant regressors. While ‘difference’ GMM transforms the equation to eliminate \({\sigma }_{i}\), ‘system’ GMM estimates the equation in levels, first-differencing the instruments to make them exogenous to \({\sigma }_{i}\) (Roodman 2009).Footnote 19
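To see why the main effect drops out, consider a stylised version of the estimating equation. The form below is an assumption made for illustration: it omits the time effects and time-varying controls, and the symbols \({\beta }_{1}\), \({\beta }_{2}\), \(\delta\) and \({\varepsilon }_{it}\) are our shorthand here, whereas \(\rho\), \({\gamma }_{1}\), \({\gamma }_{2}\) and \({\sigma }_{i}\) are used as in the text.

\[
{FLFP}_{it}=\rho \,{FLFP}_{it-1}+{\beta }_{1}{\mathrm{ln}GDPpc}_{it}+{\beta }_{2}{\mathrm{ln}GDPpc}_{it}^{2}+{\gamma }_{1}{Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}+{\gamma }_{2}{Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}+\delta \,{Plough}_{i}+{\sigma }_{i}+{\varepsilon }_{it}
\]

\[
\Delta {FLFP}_{it}=\rho \,\Delta {FLFP}_{it-1}+{\beta }_{1}\Delta {\mathrm{ln}GDPpc}_{it}+{\beta }_{2}\Delta {\mathrm{ln}GDPpc}_{it}^{2}+{\gamma }_{1}{Plough}_{i}\times \Delta {\mathrm{ln}GDPpc}_{it}+{\gamma }_{2}{Plough}_{i}\times \Delta {\mathrm{ln}GDPpc}_{it}^{2}+\Delta {\varepsilon }_{it}
\]

Because \({Plough}_{i}\) and \({\sigma }_{i}\) do not vary over time, both vanish in first differences, so \(\delta\) cannot be recovered. The interaction terms survive because income varies over time.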

The regression results (reported in full in Appendix Table 9) are qualitatively consistent with our previous findings. In Table 3, we report the marginal effects of historical plough legacies, that is, the vertical distance between the time path of FLFP in ‘plough’ vs. ‘non-plough’ countries. We report these effects both on average and at varying levels of development, based on models with (column 1) and without (column 2) continent dummies. The point estimates are similar across models 1 and 2, although they are less precisely estimated in model 2. The average marginal effect of \(Plough\) is negative, with a coefficient implying a long-run effect of − 10.40 (\(=-1.716/(1-\widehat{\rho })\) where \(\widehat{\rho }=0.835\)). This is close to Alesina et al.’s (2013) estimate of − 12.40 based on cross-section data.Footnote 20 This mean, however, conceals a substantially heterogeneous effect. In particular, the effect of historical plough use (the distance between the two curves) is significantly negative only at middling levels of income (8,100 PPP$), where it is almost twice as large in magnitude as the average.Footnote 21 At very low (245 PPP$) and very high levels of per-capita GDP (60,000 PPP$), by contrast, the effect is statistically insignificant (and indeed positively signed).
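The long-run figure follows from the standard steady-state argument for a dynamic model. Writing \(\widehat{\delta }\) for the short-run average marginal effect of \(Plough\) (a symbol introduced here for illustration) and setting \({FLFP}_{it}={FLFP}_{it-1}\), the effect accumulates geometrically through the lagged dependent variable:

\[
\frac{\widehat{\delta }}{1-\widehat{\rho }}=\frac{-1.716}{1-0.835}=\frac{-1.716}{0.165}=-10.40
\]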

Table 3 Marginal effects of historical plough use (Plough), ‘system’ GMM

Consistent with our previous findings, these results suggest that the adverse legacy effects of the plough only appear when a country has travelled down a sufficient portion of the U-curve’s downward-sloping branch. In addition, these negative effects disappear as a plough country develops further and women re-enter the labour force.

These findings are also consistent with anecdotal evidence suggesting that in pre-industrial plough societies, women did not immediately abandon productive labour to specialise in childrearing and other reproductive tasks (Boserup 1970). Rather, the initial consequence of plough adoption was a gender-based segmentation of the labour market: men were responsible for operating heavy agricultural machinery, while women specialised in food processing, drying, and storing tasks, as well as weaving and the production of other essentials (ibid.).

6.3 Omitted time-varying confounders

Our estimates of the relationship between FLFP and per-capita GDP may be confounded by the omission of unobserved time-varying factors. In the models reported in Table 1, this concern was addressed by presenting specifications that control for unobserved continent- or country-level trends in FLFP that may be correlated with GDP per capita and/or for a set of five time-varying observables that may exert a joint influence on FLFP and economic development.

Here, to further allay concerns of omitted variable bias, we present estimates that instrument for per-capita GDP and its square in GMM style (as in Gaddis and Klasen 2014), after splitting the sample into two groups as in the regressions underlying Fig. 2. Table 4 compares ‘difference’ GMM models that treat income and its square as strictly exogenous (columns 1 and 2) with corresponding specifications that treat them as endogenous (columns 3 and 4). In ‘plough’ settings, the U-shaped relationship between FLFP and per-capita income is still statistically significant, although slightly shallower and less precisely estimated, after instrumenting for per-capita income (compare models 1 and 3). In ‘non-plough’ settings, by contrast, the time path of FLFP is statistically indistinguishable from a flat line, again, whether or not the income terms are treated as endogenous (compare models 2 and 4). These findings reassure us that the conditional relationships we present are unlikely to result from omitted confounders.

Table 4 Endogenous development: instrumenting for per-capita income, Δ-GMM

In addition, we note that, in the models by Galor and Weil (1996) and Hiller (2014), female labour enters the aggregate production function as a separate factor input, so that per-capita income is endogenous to FLFP. Thus, a disadvantage of the models with ‘endogenous income’ (Table 4) is that they remove from the estimates any reverse causal effect running from FLFP to income, which is part of the theoretical U-curve. This might explain why the specifications treating income as endogenous (columns 3 and 4) reveal a more muted U-curve.

6.4 Alternative measures of historical plough legacies

Next, we present reduced-form estimates that replace \(Plough\) with the instrumental variables used by Alesina et al. (2013). These variables capture the suitability of ancestral environments to the cultivation of crops that are technically compatible with the plough. More specifically, they measure ‘the proportion of the [present-day country’s] population with ancestors that lived in climates that could grow plough-positive cereals (wheat, barley and rye), and the proportion that lived in climates that could grow plough-negative cereals (sorghum, foxtail millet, and pearl millet)’ (Alesina et al. 2013, 516). In Table 5, we present the estimated coefficients on the income terms (obtained in post-estimation) when the interactant is the measure of plough-positive crop suitability (column 1) or the measure of plough-negative crop suitability (column 2) instead of \(Plough\).

Table 5 Alternative measures of historical plough use, Δ-GMM

Confirming the pattern of results presented so far, we find in column 1 that countries whose ancestors lived in climates that were suitable to the cultivation of crops benefitting from plough use (plough-positive crops = 1) experience a significantly U-shaped path of FLFP. By contrast, countries whose ancestors lived in climates that were not suitable to the cultivation of crops displaying plough complementarities (plough-positive crops = 0) experience a path of FLFP that is, statistically speaking, flat. Similarly, in column 2, we find that countries whose ancestors lived in climates that were suitable to the cultivation of crops that did not benefit from plough use (plough-negative crops = 1) experience a flat time path of FLFP. By contrast, in countries whose ancestors lived in climates that were not suitable to the cultivation of crops requiring other tools than the plough (plough-negative crops = 0), the time path of FLFP is (marginally) U-shaped at the 10% level.

6.5 Other effect modifiers

Lastly, we investigate the potential effect-modifying role of variables other than \(Plough\). Conceivably, the U-shaped path of FLFP may also be modified by other historical events, other features of pre-industrial societies, or other correlates of historical gender norms. If so, the heterogeneous shape of the Feminisation U may result partly from factors other than historical plough legacies. Furthermore, if these alternative factors are correlated with historical plough use, the coefficients on the interaction terms (\({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\)) could be picking up their unobserved influence, spuriously attributing their effect-modifying role to the plough. To address these two related concerns, we present models that control for two additional interaction terms: \({Z}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Z}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\), where \({Z}_{i}\) is a time-invariant country characteristic that may exert a moderating influence on the dynamic path of FLFP.

As \({Z}_{i}\), we consider several potential factors. The first one is the pre-historical transition from a hunter-gatherer to an agricultural society. Based on cross-sectional regressions and archaeological evidence, Hansen et al. (2015) argue that historical gender inequalities pre-date the introduction of the plough. Indeed, these authors emphasise the transition to sedentary agriculture more generally as a critical historical juncture that (a) accelerated a rise in fertility and (b) placed a premium on male brawn, promoting the emergence of a gendered division of labour and, over time, a system of norms and beliefs that downgrade the status of women in the productive sphere.Footnote 22 If so, the timing of the Neolithic revolution may exert a moderating influence on the subsequent evolution of FLFP. Since the shift to sedentary agriculture and the introduction of the plough are correlated, part of this influence may be spuriously picked up by our interaction terms. To capture the timing of the Neolithic revolution, we use a measure of the number of years of settled agriculture in 1500 AD compiled by Putterman and Trainor (2006).

Second, we examine the potential modifying influence of the cultural norms that are ingrained in languages. Linguists have shown that the rules governing grammatical gender, which differ greatly across languages, ‘arose from evolutionary pressures concerning specialisation, reproduction and the division of labor’ (Shoham and Lee 2018, 1217).Footnote 23 Indeed, the degree to which a language grammatically emphasises gender ‘acts as a [stable] cultural marker for historical gender norms’ (ibid., see Shoham and Lee 2018 for a review of the literature).Footnote 24 Thus, we consider the possibility that the cultural norms crystallised in languages may exert an additional moderating influence on the path of FLFP and that part of this influence may be spuriously picked up by the estimated coefficients on \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\). To measure the degree of grammatical gender marking, we use the Gender Intensity Index (GII) developed by Gay et al. (2013).

Third, we investigate the potential role of a country’s exposure to the transatlantic slave trade. This variable is non-zero in a number of African countries, many of which have little or no tradition of historical plough use. Recent contributions suggest that the slave trade gave rise to female-biased sex ratios, allowing women to take up occupations that were traditionally the preserve of men (Teso 2019). If this occupational reallocation was responsible for the emergence of more equal gender-role attitudes, the legacy of the slave trade may exert an additional moderating influence on the path of FLFP, potentially confounding the effect-modifying role of the plough. We use a measure of a country’s total slave exports normalised by land area (Nunn 2008).

Fourth, we consider the potential effect-modifying impact of state antiquity—that is, the number of years each country was ruled by state institutions above tribal level (Borcan et al. 2018). In the literature, early statehood has been associated with the emergence and consolidation of patriarchal norms (Lerner 1986), as well as with the introduction of settled agriculture.

Lastly, we interact income with some of the anthropological variables used by Alesina et al. (2013) as controls. These are measures of the share of a present-day country’s inhabitants whose ancestors lived in societies with an extended family structure, private property (land inheritance) rules, an economy that relied on large domesticated animals, and an economy that relied on hunting. These characteristics of ancestral societies may have facilitated the emergence of slow-moving gender norms that in turn made it more likely for a pre-historical society to invent or adopt the plough. A detailed description of all the variables used as interactants, and the associated data sources, is provided in Appendix Table 7.

The results are presented in Table 6. All models control for country and time-period (5-year) fixed effects, as well as our set of time-varying controls. Across specifications 1–8, the estimated coefficients on \({Z}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Z}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\) are always statistically insignificant, after controlling for \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\). We find no evidence that any of the \({Z}_{i}\)’s we considered exerts an additional moderating influence on the relationship between FLFP and per-capita GDP over and above the effect-modifying role of the plough. At the same time, including \({Z}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Z}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\) in the regression leaves our results concerning the moderating influence of historical plough use qualitatively unaltered. Throughout models 1–7, the estimated coefficients on \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}\) and \({Plough}_{i}\times {\mathrm{ln}GDPpc}_{it}^{2}\) are always statistically significant, while in model 8, the estimates are marginally insignificant at conventional levels (p-values = 0.119). We conclude that the estimated effect-modifying influence of the plough is unlikely to be spuriously picking up the impact of other historical correlates of the plough. Meanwhile, the observed heterogeneity in the path of FLFP across countries, while consistently related to the legacy of the plough, appears to be unrelated to these other historical factors.

Table 6 Ruling out alternative effect modifiers, Δ-GMM

7 Conclusion

In the literature, the evidence in support of the Feminisation U is mixed. In this paper, we investigated empirically the observed heterogeneity in the dynamic path of FLFP in the course of development. Building on the theoretical literature, we argued that initial conditions are critical in shaping the subsequent path of FLFP. In particular, the cultural norms that prevail when per-capita income begins to grow above subsistence levels determine whether the mechanisms that are known to generate the Feminisation U (structural change, fertility dynamics, and the gender education gap) are operative or not.

Building on the previous literature on the historical origins of gender roles, we presented stylised facts that are consistent with initial gender-role cultural norms exerting a moderating influence on the dynamics of FLFP in the course of development. The conditional relationship between FLFP and per-capita GDP is only significantly U-shaped in countries with a history of plough agriculture, where historical gender norms were initially shaped by brawn-induced productivity differences. In countries without such history, by contrast, the time path of FLFP is effectively flat. These conditional relationships are robust to controlling for time-varying observables and unobservables, to using estimators that correct for dynamic panel bias, to instrumenting for per-capita GDP, to using alternative measures of historical plough use, and to controlling for other plausible historical effect modifiers.

Our findings have important implications. First, they integrate two literatures (on the Feminisation U and on the historical origins of gender roles) that have so far developed on parallel tracks. Second, they suggest that empirical studies may fail to estimate a significant U-shaped path of FLFP if they focus on countries or regions without a history of traditional plough use. For instance, a possible reason why Roncolato’s (2016) study of South Africa does not reveal a U-shaped relationship between FLFP and per-capita income is that South Africa is not a plough country (\(Plough=0.54\)). Third, our results point to an important nuancing of Alesina et al.’s (2013) version of Boserup’s (1970) ‘plough argument’. The introduction and diffusion of the plough leaves a legacy that persists up to the present day, as argued by Alesina et al. (2013). Yet, rather than leading to a permanent shift in gender norms and FLFP, the evidence presented here suggests that the plough shock had the effect of modifying the evolution of female labour supply (and arguably, the evolution of attitudes and beliefs about working women) in the post-Malthusian era. The labour market effects of plough adoption are neither immediate nor permanent, as they only appear as an economy reaches a middling level of income and tend to vanish completely as a high level of income is attained.

Overall, our research raises important questions about the interaction of culture and structure in shaping development outcomes. It also points to further avenues of research. While we show that historical legacies can modify the effects of post-Malthusian development on labour market outcomes for women, we have not presented direct evidence of a cultural transmission channel. Although it seems difficult to conceive of a variable that measures gender-role norms and attitudes coherently over time and space, and that could thus directly demonstrate cultural transmission, there is well-documented indirect evidence that supports our interpretation of the stylised facts presented here (e.g. Fernandez and Fogli 2009; Fernandez 2013).

Future research should also look for direct evidence of the mechanisms that, according to theory, are responsible for generating a U-shaped time path of FLFP. In particular, evidence should be sought that these mechanisms are triggered and sustained by the historical adoption of the plough. Answering these questions will also shed light on the time allocation of women as they leave and re-enter the labour force in the course of development. Is women’s time used to raise more children? Is girls’ education at its lowest level near the U-curve’s turning point? Is there a stronger son preference in middle-income ‘plough’ countries? Future research will also have to pay more attention to the timing of plough adoption and to how subsequent technological innovations (e.g. the invention of the heavy plough in mediaeval Northern Europe) influenced labour market outcomes for women.