Introduction

Gender dysphoria (GD) describes a distressing mismatch between one’s gender identity and natal sex (American Psychiatric Association, 2013). Not much is known about the origin of GD (Meyer-Bahlburg, 2013), but twin studies show that genetic factors play a substantial role (for a review, see Polderman et al., 2018). Also, prenatal testosterone (T) might have an effect on GD.

A Role for Organizational Testosterone Effects

A vast body of experimental research suggests that androgen action during critical periods in early development plays a crucial role in the creation of somatic and behavioral sex differences across mammalian species (for reviews, see McCarthy & Arnold, 2011; Motta-Mena & Puts, 2017). Behavioral effects on copulation, aggression, and rough-and-tumble play result from organizational (i.e., permanent) effects of androgens or their metabolites on the brain. A number of key principles have been distilled from animal experiments: (1) behaviors that are affected by organizational androgen effects show sex differences; (2) critical periods for organizational androgen effects are marked by higher T production in males than females; (3) organizational androgen effects contribute to between- as well as within-sex differences; (4) finally, within the physiological range, organizational androgen effects are roughly linear (Hines, Constantinescu, & Spencer, 2015).

In early human development, T production is higher in males than in females, particularly from about weeks 12 to 16 during gestation (Reyes, Winter, & Faiman, 1973) and from 1 to 5 months after birth (Lamminmäki et al., 2012), which suggests that these perinatal periods are critical for organizational T effects. Later in life, puberty might be a final period for organizational effects (see Schulz, Molenda-Figueira, & Sisk, 2009). In humans, experiments into such effects are not feasible, but observations on individuals who experienced atypical hormonal effects during development (e.g., due to medical conditions) suggest prenatal T effects on a suite of psychological variables (Berenbaum & Beltz, 2016). Among them are a moderate effect of prenatal T on the gender of desired romantic partners (e.g., Khorashad et al., 2016, 2017; Zucker et al., 1996) and a strong effect on human play behavior (Hines, 2010), which in itself shows very large sex differences regarding preferred play activities and gender of playmates (e.g., Golombok & Rust, 1993; Hines et al., 2002; Hönekopp & Thierfelder, 2009).

Organizational T effects might play a similar role in GD. Transmen might have experienced stronger masculinization than is typical for their natal sex, and the opposite might be true for transwomen. We consider two major lines of research supporting this hypothesis.

Brain Differences Between People With and Without a History of Gender Dysphoria

The first line of evidence draws on potential differences in the brain structures between transpeople and controls without a history of GD. Reviews (Guillamon, Junque, & Gómez-Gil, 2016; Kreukels & Guillamon, 2016) concluded that transpeople’s brains show some atypical changes away from their natal sex and toward their experienced gender identity. However, independent replications of these findings have not yet been conducted, and therefore some caution is required (e.g., LeBel & Peters, 2011; Open Science Collaboration, 2015).

Gender Dysphoria in People with Disorders of Sex Development

A second major line of evidence regarding organizational T effects on GD stems from disorders of sex development (DSD). Across a range of DSDs, gender of rearing and presumed masculinization of the brain can be aligned, partly misaligned, or at odds, irrespective of sex chromosomes (Hughes et al., 2006).

Females (46, XX) with congenital adrenal hyperplasia (CAH) produce excessive amounts of androgens as fetuses, which often leads to genital masculinization. In this group, relatively high rates of gender change or GD were observed: 5% among those raised as girls, and 12% among those raised as boys (Callens et al., 2016; Dessens, Slijper, & Drop, 2005; Mattila, Fagerholm, Santtila, Miettinen, & Taskinen, 2012). In individuals raised as girls, the cause might be the comparatively high prenatal T and its effects on the brain. In individuals raised as boys, early medical treatment of their condition implies that they did not experience the boy-typical T-surge after birth; consequently, comparatively low perinatal brain masculinization might contribute to the development of a female gender identity.

In 46, XY individuals with partial or complete androgen insensitivity syndrome (PAIS/CAIS), testicular androgen production is normal but has little or no effect (Hughes et al., 2012). In CAIS, which typically is only detected in adolescence, female genitalia develop and no perinatal brain masculinization occurs. Thus, the lack of perinatal brain masculinization and gender of rearing are aligned, and no gender change was observed in this group. In contrast, PAIS leads to ambiguous genital development, and incomplete perinatal brain masculinization appears plausible. In this group, frequent gender change was observed, regardless of gender of rearing (Callens et al., 2016; Mazur, 2005). This possibly reflects that incomplete perinatal brain masculinization can be at odds with either male or female upbringing and therefore constitutes a risk factor for GD.

In other conditions in which 46, XY individuals undergo normal perinatal brain masculinization but show atypical genital development (penis loss through accidents; micropenis; aphallia; cloacal and classical exstrophy of the bladder), cases of gender change were not observed among individuals raised as boys but occurred frequently among individuals raised as girls (Mazur, 2005; Meyer-Bahlburg, 2005). This supports again the idea that T affects gender identity.

Across DSDs, levels of gender change or GD are consistently high when (inferred) brain androgenization mismatches gender of rearing but consistently low when brain androgenization and gender of rearing match. This supports the idea that organizational T effects have an impact on GD. However, confounding variables pose potential problems (for contrasting views, see Cohen-Bendahan, van de Beek, & Berenbaum, 2005; Jordan-Young, 2012). Further, GD typically develops in people without DSD, and generalization from people with DSD to people without DSD might be problematic. For example, gender change in individuals with a DSD does not necessarily reflect GD (Cadet, 2011), and only DSD individuals have a lifelong history in which medical service providers and parents problematize their gender (Meyer-Bahlburg, 2009). Consequently, converging evidence for a role of organizational T effects on gender identity is desirable.

2D:4D Digit Ratio

In a classic paper, Manning, Scutt, Wilson, and Lewis-Jones (1998) suggested that low values for digit ratio 2D:4D (i.e., the length of the 2nd digit divided by the length of the 4th digit) reflect high prenatal T activity in humans. A suite of observations suggests that 2D:4D might be a useful measure of prenatal T effects: 2D:4D shows a moderate sex difference across numerous countries studied (Grimbos, Dawood, Burriss, Zucker, & Puts, 2010; Hönekopp & Watson, 2010). This sex difference is established in utero, probably at the end of the first trimester (Galis, Ten Broek, Van Dongen, & Wijnaendts, 2010; Malas, Dogan, Evcil, & Desdicioglu, 2006), but the sex difference appears largely unaffected by puberty (Králík, Ingrová, Kozieł, Hupková, & Klíma, 2017; McIntyre, Ellison, Lieberman, Demerath, & Towne, 2005; Trivers, Manning, & Jacobson, 2006). 2D:4D shows longitudinal stability (Králík et al., 2017; McIntyre, Cohn, & Ellison, 2006; McIntyre, Ellison, Lieberman, Demerath, & Towne, 2005; Wong & Hines, 2016), but see Knickmeyer, Woolson, Hamer, Konneker, and Gilmore (2011) for an exception. Probably the best evidence that 2D:4D reflects prenatal T effects stems from individuals who were exposed to atypical T effects during early development: Females with high prenatal T levels due to CAH have strongly masculinized 2D:4D (Brown, Hines, Fane, & Breedlove, 2002; Hönekopp & Watson, 2010; Kocaman et al., 2017; Rivas et al., 2014); similarly, men affected by Klinefelter’s syndrome, which causes low T levels throughout development, show strongly feminized 2D:4D (Manning, Kilduff, & Trivers, 2013); finally, genetic males affected by CAIS show moderately feminized 2D:4D (Berenbaum, Bryk, Nowak, Quigley, & Moffat, 2009; van Hemmen, Cohen-Kettenis, Steensma, Veltman, & Bakker, 2017).

Aims

As discussed above, observations from DSDs suggest that the incidence of GD in these conditions is increased when the sex of rearing mismatches early brain masculinization. This observation points to a potential role of prenatal T in the development of GD in individuals unaffected by DSD: strong prenatal T effects in females and weak prenatal T effects in males might increase the risk for GD. To address this hypothesis, we first report the largest sample for expert-measured 2D:4D in transpeople so far. Based on their statistical significance, findings from earlier studies have been interpreted as inconsistent (Manning, 2017). However, comparing studies on their statistical significance is an ill-suited criterion to judge the agreement of results (e.g., Cumming, 2014; Hunter, 1997). Therefore, we then present a meta-analysis of the pertinent evidence here. We hypothesize that transmen show, on average, masculinized (i.e., lower than female-typical) 2D:4D, whereas transwomen show feminized (i.e., higher than male-typical) 2D:4D.

The Mashhad Study

Method

Participants

Between January 2015 and December 2016, 203 individuals with GD were consecutively referred to the Transgender Studies Centre, at Mashhad University of Medical Sciences, in Mashhad, Iran. The diagnostic process of GD was largely based on the Standards of Care, version 7 of the World Professional Association of Transgender Health (WPATH) (Coleman et al., 2012). All individuals had been interviewed by at least two experienced psychiatrists according to DSM-5 criteria, and the GD diagnosis was confirmed for all participants in this group. The interviews also assessed participants’ sexual orientation, their history of GD, and their childhood play behavior. Those who reported identifying as a member of the other sex and/or sex incongruent behaviors before puberty were classified as “early onset”; the other participants were classified as “late onset.” An endocrinologist examined all participants with GD for DSDs, which was ruled out in all cases. Also, none of these participants had been treated with sex hormones.

To limit the effect of potential confounders, potential transsexual participants were screened with regard to a number of exclusion criteria (psychiatric comorbidity, medical disorders including hormonal and chromosomal abnormalities, and a history of finger fracture or dislocation). Three potential participants were excluded for a possible diagnosis of schizophrenia; six with bipolar disorder; and one with thalassemia. A total of 117 medical students and staff at Ibne-sina Hospital and Imam-reza Hospital volunteered as cisgender participants. Absence of the same exclusion criteria that were applied in the transsexual group was confirmed. 2D:4D ratios could not be obtained for one transwoman and nine controls because of obscured or ambiguous landmarks for finger lengths. We performed analyses on the remaining 104 transmen (natal females; M age 25.3 years, SD = 6.2; 103 gynephilic; 92 early onset), 89 transwomen (natal males; M = 26.0 years, SD = 6.6; all androphilic; 81 early onset), 53 control females (M = 24.9 years, SD = 4.5), and 56 control males (M = 28.9 years, SD = 9.5).

The study was approved by the ethics committee of Mashhad University of Medical Sciences. We explained the purpose of the study to the participants, and an informed written consent was obtained.

Measures

The palmar surface of the right and left hand of all participants was photocopied. Participants stretched their fingers and applied minimal pressure to the glass plate. All 2D and 4D lengths were then measured with digital vernier calipers from the digit tip to the middle of its most proximal crease. To check the reliability of the resulting 2D:4D, the photocopies of 37 hands were digitized and their 2D and 4D lengths were independently re-measured. The resulting two series of 2D:4D correlated satisfactorily, r = .86. All measurements were performed blind to participants’ gender identity.
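The reliability check described above can be illustrated with a minimal Python sketch. It computes 2D:4D from two digit lengths and the Pearson correlation between two measurement series. The function names and the example lengths are hypothetical and are not taken from the study data; the sketch is not the authors' own procedure.

```python
from math import sqrt


def digit_ratio(len_2d, len_4d):
    """2D:4D = length of the 2nd digit divided by length of the 4th digit."""
    return len_2d / len_4d


def pearson_r(xs, ys):
    """Pearson correlation between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length (n >= 2)")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical digit lengths in mm as (2D, 4D) pairs, measured twice per hand.
first = [(70.1, 72.9), (68.4, 69.0), (74.2, 77.5), (66.0, 69.3)]
second = [(70.4, 72.6), (68.1, 69.4), (74.0, 77.9), (66.3, 69.0)]

ratios_1 = [digit_ratio(a, b) for a, b in first]
ratios_2 = [digit_ratio(a, b) for a, b in second]
print(round(pearson_r(ratios_1, ratios_2), 2))
```

Correlating the ratios rather than the raw lengths matters here: raw digit lengths vary far more between people than their ratio does, so they would typically yield a higher correlation than the ratio actually analyzed.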

Open Materials

The publication of data is widely regarded as an important step for safeguarding the validity of published research in psychology (e.g., Cumming, 2014; Munafò et al., 2017; Shrout & Rodgers, 2018). Therefore, all data are available at https://osf.io/jtyf4/.

Results and Discussion

Table 1 provides descriptive statistics for 2D:4D. A preliminary 2 (Sex: Female vs. Male) × 2 (Hand: Left vs. Right) mixed ANOVA on 2D:4D that included only control participants revealed the expected normative effect of sex on 2D:4D, F(1, 107) = 12.64, p = .001. In order to compare 2D:4D between transsexuals and controls of the same natal sex, we ran a 2 (Group: Transsexual vs. Cisgender) × 2 (Hand: Right vs. Left) mixed ANOVA separately on natal males and natal females. Group proved statistically significant for natal males, F(1, 143) = 4.33, p = .039, indicating higher (less masculine) 2D:4D in transwomen compared to control men. Group did not prove statistically significant for natal females, F(1, 155) = 0.07, p = .789. Figure 1 provides separate comparisons for each hand with standardized effect sizes.

Table 1 Means (and SD) for 2D:4D in the left and right hand for transmen, transwomen, control women, and control men
Fig. 1 Digit ratio 2D:4D in transgender versus cisgender participants. Note: Individual 2D:4D scores (open circles against white background) are read against the left-hand ordinates; horizontal markers indicate group means. Standardized mean differences (filled square against gray background) are read against the right-hand ordinates; positive values indicate that mean 2D:4D was higher in transpeople than in controls, and error bars indicate 95% CIs

Sexual orientation and age of onset have been suggested to differentiate subtypes of transpeople, and these variables might indicate potential differences in GD etiology (Lawrence, 2010, 2017). Sexual orientation has also been linked to 2D:4D (e.g., Breedlove, 2017, 2019; Grimbos et al., 2010; McFadden et al., 2005; Williams et al., 2000). Both of our transpeople samples lacked variation in sexual orientation, thus precluding analyses of this factor. However, we could investigate the effect of GD onset, for which Table 2 shows descriptive statistics. We ran separate 2 (GD onset: Early vs. Late) × 2 (Hand: Right vs. Left) mixed ANOVAs on transwomen and transmen. In transwomen, neither GD onset nor its interaction with hand proved statistically significant (p ≥ .40). But in transmen, GD onset proved statistically significant, F(1, 102) = 8.55, p = .004; transmen with early GD onset (N = 92) had lower (more masculine) 2D:4D than those with late onset (N = 12). This difference proved large in the right hand, d = 1.04, 95% CI [0.42, 1.66], and medium-sized in the left hand, d = 0.62, 95% CI [0.02, 1.23]. Given the small number of transmen with late onset and the exploratory nature of this analysis, some caution seems appropriate. Nonetheless, these results lend some support to the idea that different GD subtypes can be differentiated (Lawrence, 2010, 2017).
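To show how the width of such intervals follows from the group sizes, the sketch below applies a common large-sample approximation for the sampling variance of d. Whether this is the formula behind the intervals reported above is an assumption; the function name is hypothetical.

```python
from math import sqrt


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for a standardized mean difference d.

    Uses the large-sample variance (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)).
    This is an assumed formula, not necessarily the one used in the study.
    """
    variance = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    half_width = z * sqrt(variance)
    return d - half_width, d + half_width


# Right hand: d = 1.04 with 92 early-onset and 12 late-onset transmen.
low, high = d_confidence_interval(1.04, 92, 12)
print(f"[{low:.2f}, {high:.2f}]")
```

With these inputs the approximation gives [0.42, 1.66], matching the right-hand interval. For the left hand (d = 0.62) it gives a lower bound of about 0.01 rather than the reported 0.02, so a slightly different formula or unrounded inputs were probably used. The term (n1 + n2) / (n1 * n2) is dominated by the small late-onset group (n = 12), which is why the intervals are wide despite 104 participants.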

Table 2 Means (and SD) for 2D:4D in transsexuals’ left and right hand as a function of early or late onset of gender dysphoria

Meta-Analysis of 2D:4D in Transsexuals

Method

Study Retrieval and Inclusion Criteria

Childhood GD (unlike transsexualism) is often transient (e.g., Drummond, Bradley, Peterson-Badali, & Zucker, 2008). We therefore focused exclusively on adult samples. To locate relevant previous studies, we searched the topic field in Web of Science for 2D:4D or digit ratio in conjunction with either transsex* or transgender or gender dysphoria or gender identity disorder or sex reassignment. This led to 19 hits. Closer scrutiny identified eight relevant studies that compared 2D:4D between transpeople and controls with the same natal sex. Examination of their reference sections did not uncover any additional studies.

In order to be suitable for our meta-analysis, information for the calculation of a standardized mean difference (d) needed to be available for at least one natal sex (e.g., descriptive statistics, suitable test statistics or p values, figures with error bars). For five reports, relevant information was not (fully) available. We emailed corresponding authors for additional information and repeated this once when we received no response to our initial message. In this way, we could include six previous studies in our meta-analysis (Hisasue, Sasaki, Tsukamoto, & Horie, 2012; Kraemer et al., 2009; Leinung & Wu, 2017; Mas et al., 2009; Schneider, Pickel, & Stalla, 2006; Wallien, Zucker, Steensma, & Cohen-Kettenis, 2008). Two studies had to be excluded due to incomplete or contradictory information.

Analyses

We used d as our effect size measure. As in our Mashhad study, positive effect sizes indicate higher average 2D:4D in transpeople than in controls of the same natal sex (expected in transwomen), whereas negative effect sizes indicate lower average 2D:4D in transpeople (expected in transmen).
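As a sketch of how d and its sign convention can be obtained from the kinds of information mentioned above, the following Python functions compute d from descriptive statistics (using the pooled standard deviation) and from an independent-samples t statistic. The function names and example values are hypothetical; they are not drawn from any primary study.

```python
from math import sqrt


def cohens_d(mean_trans, sd_trans, n_trans, mean_ctrl, sd_ctrl, n_ctrl):
    """Standardized mean difference (trans minus control) with pooled SD.

    Positive values mean higher average 2D:4D in the trans group.
    """
    pooled_var = (
        (n_trans - 1) * sd_trans ** 2 + (n_ctrl - 1) * sd_ctrl ** 2
    ) / (n_trans + n_ctrl - 2)
    return (mean_trans - mean_ctrl) / sqrt(pooled_var)


def d_from_t(t, n_trans, n_ctrl):
    """Convert an independent-samples t statistic into d."""
    return t * sqrt(1 / n_trans + 1 / n_ctrl)


# Hypothetical descriptive statistics for one natal sex and one hand.
d = cohens_d(0.97, 0.03, 50, 0.96, 0.03, 50)
print(round(d, 2))
```

Subtracting the control mean from the trans mean fixes the direction of the effect, so that the same code yields positive values for feminized 2D:4D in transwomen and negative values for masculinized 2D:4D in transmen.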

For each of the four combinations of natal sex by hand, we ran a separate random-effects meta-analysis. Two central outcome measures are of particular interest (Schmidt, Oh, & Hayes, 2009): The first, d, is simply the pooled estimate for the unknown population effect size δ (with values of 0.2/0.5/0.8 typically considered small/medium/large in psychology). The second outcome measure, T, reflects heterogeneity in the results and is important for the appropriate interpretation of d. Even when a study is closely replicated with different random samples from the same population, the resulting study effect sizes will vary, thus reflecting the randomness inherent in sampling. Heterogeneity emerges when the observed variability in study effect sizes exceeds the level expected from sampling alone. Typically, heterogeneity indicates that the question “What is the true population effect size?” cannot be answered with a single figure (e.g., “we estimate δ = 0.5”). The effectiveness of a drug, for example, might depend on the selected dose, on patients’ sex, age, diet, and genetic make-up, on their concurrent medical conditions, etc. Consequently, a set of efficacy studies using different regimes and patient groups should demonstrate heterogeneity. Therefore, the answer to “how effective is the drug?” should be “it depends…” The true effectiveness of the drug cannot be sensibly described with a single effect size, but only as a distribution of effect sizes. d estimates the mean of this distribution and T (our measure of heterogeneity) estimates its standard deviation. For example, d = 0.5 and T = 0.1 would indicate that the drug’s true effectiveness varies under most circumstances within a moderate band (from δ = 0.3 to δ = 0.7 if we use M ± 2SD); but for d = 0.5 and T = 0.4 we would infer that effectiveness varies so widely that it includes systematic harm (from δ = − 0.3 to δ = 1.3). 
Similarly, in our meta-analyses T > 0 suggests that the respective primary studies tapped into populations that differ in true effect size. For both d and T, significance tests can indicate if they deviate more strongly from zero than expected by chance alone. We performed all analyses with the R package metafor (Viechtbauer, 2010). Again, all materials and data are available at https://osf.io/jtyf4/.
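To make the roles of d and T concrete, here is a minimal Python sketch of a random-effects meta-analysis. It uses the DerSimonian-Laird estimator because it has a simple closed form; this is a substitute chosen for illustration, since metafor's default estimator is REML and the estimator actually used is not stated above. The effect sizes and sampling variances are hypothetical.

```python
from math import sqrt


def random_effects_dl(effects, variances):
    """DerSimonian-Laird random-effects pooling.

    Returns the pooled effect, its standard error, the heterogeneity
    estimate T (estimated SD of the true effects), and Cochran's Q.
    """
    k = len(effects)
    w = [1 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at 0
    w_star = [1 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1 / sum(w_star))
    return pooled, se, sqrt(tau2), q


# Hypothetical study effect sizes and their sampling variances.
effects = [0.10, 0.35, 0.20, 0.55, -0.05]
variances = [0.02, 0.03, 0.015, 0.04, 0.025]

pooled, se, T, q = random_effects_dl(effects, variances)
print(f"d = {pooled:.2f}, T = {T:.2f}, Q({len(effects) - 1}) = {q:.2f}")
# Band in which most true effects are expected to fall (M +/- 2 SD).
print(f"{pooled - 2 * T:.2f} to {pooled + 2 * T:.2f}")
```

When Q does not exceed its degrees of freedom, the estimate of T is set to zero and the random-effects result coincides with the fixed-effect result; larger T widens both the band of plausible true effects and the standard error of the pooled d.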

Results and Discussion

For left hands, our meta-analyses were based on 252 transmen with 440 female controls and on 353 transwomen with 488 male controls, with k = 5 samples in each case. For right hands, our meta-analyses were based on 301 transmen with 456 female controls and on 420 transwomen with 505 male controls, with k = 6 samples in each case.

Primary study effect sizes and meta-analytic results are shown in Fig. 2. As expected, the two meta-analyses for transmen showed negative effects (masculinized 2D:4D) (see top panels in Fig. 2). However, the effect sizes (d = − .20 for left hands and d = − .36 for right hands) were not statistically significant (left hand: p = .195; right hand: p = .123).

Fig. 2 Four meta-analyses comparing 2D:4D in transgender and cisgender participants of the same natal sex. Note: The top part in each panel shows individual study effect sizes and their 95% CIs; the bottom part shows, for each random-effects meta-analysis, the overall effect size with its 95% CI. Negative effect sizes (masculinized 2D:4D in transmen) were hypothesized for the top panels; positive effect sizes (feminized 2D:4D in transwomen) were hypothesized for the bottom panels

Strong heterogeneity was observed for both the left hand (T = 0.27, p = .028) and the right hand (T = 0.52, p = .0002). Manning (2017) suggested that clearer results emerge when digit lengths are not measured from photocopies or scans but directly on the fingers. We could test this idea for right hands. Type of measurement did not emerge as a statistically significant moderator, Q(1) = 0.72, p = .398. However, given that only one study (Leinung & Wu, 2017) used direct measurements, statistical power was presumably low. When both meta-analyses were repeated without the largest effect (Hisasue et al., 2012), heterogeneity remained large and statistically significant for both hands. We are not aware of any factors that would explain the large observed heterogeneity.

As expected, the two meta-analyses for transwomen showed positive effects (feminized 2D:4D) (see bottom panels in Fig. 2). Both effects were small, but statistically significant (left hand: d = 0.19, p = .010; right hand: d = 0.29, p = .0009). For left hands, no heterogeneity was observed (T = 0). For right hands, some heterogeneity was observed (T = 0.11). However, this was not statistically significant, Q(5) = 6.8, p = .237; thus, chance on its own would be fairly likely to generate the moderate level of heterogeneity observed. For right hands, we could again test Manning’s (2017) idea that type of finger measurement moderates results. Again, no such evidence emerged, Q(1) = 0.81, p = .370. But again, statistical power must be presumed to be low because only Leinung and Wu (2017) used direct measurements.
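The p values attached to the Q statistics above follow from the chi-square distribution. The sketch below, using only the Python standard library, evaluates the upper-tail probability with closed-form expressions for integer degrees of freedom. The function name is hypothetical, and the code is an illustration rather than the routine used by metafor.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be >= 0 and df a positive integer")
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return exp(-half) * total
    # Odd df: erfc(sqrt(x/2))
    #         + sqrt(2/pi) * exp(-x/2) * sum_j x^(j - 1/2) / (1*3*...*(2j - 1))
    total, term = 0.0, sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return erfc(sqrt(half)) + sqrt(2 / pi) * exp(-half) * total


# Heterogeneity test for right hands in transwomen: Q(5) = 6.8.
print(round(chi2_sf(6.8, 5), 3))
# Moderator test for type of measurement: Q(1) = 0.81.
print(round(chi2_sf(0.81, 1), 3))
```

Both results land within a few thousandths of the reported p values (.237 and .370); the small gaps are plausibly due to Q having been rounded before reporting.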

A preference for the publication of statistically significant studies can cause a systematic bias in the published literature (e.g., Sterling, 1959), which is a natural concern for meta-analysis (and any other type of literature review). Tests for this kind of publication bias have low statistical power when the number of primary studies is small (Ioannidis, 2008), as was the case here. We therefore did not perform any formal tests. However, as Fig. 2 shows, the results of most primary studies were not statistically significant on their own. (This is indicated by 95% CIs containing zero.) If a powerful publication bias had shaped the literature we reviewed here, this pattern of results would not be expected (e.g., Francis, Tanzman, & Matthews, 2014). Consequently, we believe that publication bias is not a concern in the body of literature reviewed here.

Although transsexualism and sexual orientation are distinct, they are not independent: Sexual attraction to women is much more common among transmen than among non-transsexual natal females; similarly, sexual attraction to men is much more common among transwomen than among non-transsexual natal males (e.g., Wallien et al., 2008). It is therefore interesting to note parallels and differences regarding their relationship with 2D:4D. Whereas our meta-analysis established a clear link between 2D:4D and transsexualism for natal males only, the opposite holds for sexual orientation. A meta-analysis found a link with 2D:4D for females only (Grimbos et al., 2010). As expected, lesbians showed lower (masculinized) 2D:4D than heterosexual women, with a small-to-medium effect size similar to the one we observed for natal males.

One limitation of our meta-analyses is that, due to insufficient information in the primary studies, we could not include data on 495 transwomen and 160 transmen. Given the observed homogeneity for results in transwomen, it seems plausible that these additional data would not change the overall picture in this group. More importantly, the discussions of the relevant reports suggest that the observed effects were in line with expectations (Veale, Clarke, & Lomax, 2010; Vujović et al., 2014). We therefore think it is fair to conclude that inclusion of these effects would not have substantially changed our findings.

Regrettably, a lack of detailed information in the primary studies precluded more fine-grained analyses of sexual orientation and age of GD onset, which might characterize distinct subtypes of transwomen and transmen (Lawrence, 2010, 2017). We hope that future studies will address these variables in greater detail.

General Discussion

As we discussed in the introduction, the pattern of GD rates across DSDs suggests a role for prenatal T in GD development. More specifically, a mismatch between perinatal brain masculinization and sex of rearing appears to increase GD risk. Similarly, a relatively high degree of masculinization of the brain in females and a relatively low degree of masculinization of the brain in males might increase GD risk in individuals unaffected by DSDs. Studies in people with DSDs cannot address the latter question, and atypical perinatal T effects in DSDs are confounded with a lifelong history in which medical service providers and parents problematize gender. For these reasons, convergent evidence beyond observations in DSDs is desirable to corroborate any role of perinatal T in GD. Digit ratio 2D:4D, an index of prenatal T effects, strikes us as a suitable tool.

Here, we presented the largest sample for expert-measured 2D:4D in transpeople and a meta-analysis of pertinent 2D:4D studies. As a mix of statistically significant and non-significant findings in individual studies might easily mask a clear pattern of results (e.g., Hunter, 1997), we focus our discussion on the meta-analysis. For transwomen, a clear pattern emerged. In line with our hypothesis, transwomen showed feminized (higher) 2D:4D. The effect was small, but consistent across studies and across both hands. For transmen, the evidence was more tentative: In both hands, transmen showed masculinized (lower) 2D:4D, and the average effects were slightly stronger than for transwomen. However, large heterogeneity between studies was observed and rendered the observed mean effects not statistically significant (Schmidt et al., 2009). Additional studies might increase statistical power and thus turn the overall effect statistically significant; however, it seems unlikely that heterogeneity would disappear. The studies analyzed here stem from markedly different cultures, among them Iran, Japan, and Switzerland. The observed heterogeneity in effect sizes might reflect that the role of prenatal androgenization for the development of GD in natal females differs across cultures.

Our results suggest that weak prenatal T effects in natal males contribute to GD risk. The meta-analyses tentatively suggest that strong prenatal T effects in natal females increase GD risk under circumstances yet to be identified.

Convergent Results from CAH and Digit Ratio Beyond Gender Dysphoria Underpin the Validity of 2D:4D

In the introduction, we discussed various strands of evidence that 2D:4D tracks prenatal T effects. Perhaps the strongest comes from observations in DSDs and other conditions in which atypical prenatal T effects are reliably accompanied by analogous changes in 2D:4D (Berenbaum et al., 2009; Brown, Hines, Fane, & Breedlove, 2002; Kocaman et al., 2017; Manning et al., 2013; Rivas et al., 2014; van Hemmen et al., 2017). Some researchers, though, have called the validity of 2D:4D into question and argued that it is unsuitable for studying perinatal T effects (Berenbaum & Beltz, 2016; Hines et al., 2015). We address this criticism in two ways. In this section, we review to what extent 2D:4D studies do or do not converge with CAH studies (which critics of 2D:4D regard as particularly useful for studying prenatal T effects) beyond the domain of GD. After that, we look in greater detail at the arguments levelled against 2D:4D and use them to discuss the strengths and limitations of our current findings.

Observations from CAH-affected females offer particularly strong evidence that gendered play behavior is affected by prenatal T (Berenbaum & Beltz, 2016; Hines et al., 2015). A number of studies had parents describe their child’s play behavior with the Pre-School Activities Inventory (Golombok & Rust, 1993), which asks for the popularity of various play activities, in order to investigate the relationships with 2D:4D (Hönekopp & Thierfelder, 2009; Körner, Pause, & Heil, 2017; Mitsui et al., 2016; Wong & Hines, 2016). Analyses were performed separately for girls and boys. Almost all correlations were in the expected direction (indicating that more feminine 2D:4D tended to go with more female-typical play), typically showing small-to-medium effects. Six out of 16 effects proved statistically significant. 2D:4D studies therefore offer considerable support for an effect of prenatal T on gendered play behavior in children. Thus, evidence from CAH studies and 2D:4D studies converges, although CAH studies found much larger effect sizes (Berenbaum & Beltz, 2016).

Another area of interest is autism spectrum disorder, for which prenatal T is believed to be a risk factor (Baron-Cohen, 2002). In line with this idea, a meta-analysis found clear evidence for lower (masculinized) 2D:4D in individuals affected by autism spectrum disorder (Hönekopp, 2012), with a medium effect size and no clear sign of heterogeneity in results (see also Al-Zaid, Alhader, & Al-Ayadhi, 2015, for a later study with similar results). Investigations of whether autism risk is increased in CAH-affected individuals are hardly feasible because both conditions are rare. However, three studies investigated whether autism-typical symptoms tend to be elevated in CAH-affected individuals (Knickmeyer et al., 2006; Kocaman et al., 2017; Kung et al., 2016). Across three female and two male samples, four effects pointed in this direction (d = 0.30 to d = 0.54), with two of them being statistically significant; one male sample found a small, statistically non-significant effect in the opposite direction (d = − 0.25, Kung et al., 2016). Given their different outcome measures (diagnosis of autism spectrum disorder versus degree of autism-typical symptoms), 2D:4D studies and CAH studies are difficult to compare directly; nonetheless, both garnered evidence that elevated prenatal T increases autism risk.

Another area of interest is sexual orientation. A review found that all eight pertinent studies reported higher bisexual/homosexual orientation in CAH-affected women than in unaffected controls (Meyer-Bahlburg, Dolezal, Baker, & New, 2008). This suggests that strong prenatal T effects shift female sexual behavior and fantasies in a male-typical direction. In line with this, a meta-analysis of studies comparing 2D:4D in homosexual and heterosexual individuals found lower (masculinized) 2D:4D in lesbians (Grimbos et al., 2010); no effect was observed for gay versus heterosexual men. Whereas the effect size for masculinized 2D:4D in lesbians was small, rates of bisexual/homosexual orientation in women with CAH were often substantially increased.

A final area of interest is aggression. Findings on aggression in CAH-affected females appear somewhat inconsistent. In comparison with controls, females with CAH have been reported to show statistically significantly lower levels of aggression (Helleday, Edman, Ritzén, & Siwers, 1993), about the same levels of aggression (e.g., Berenbaum & Resnick, 1997; Money & Schwartz, 1976), and statistically significantly higher levels of aggression (Berenbaum & Resnick, 1997; Pasterski et al., 2007). Overall, however, results appear to lean toward greater aggression in CAH-affected females. Two pertinent meta-analyses of 2D:4D studies found no analogous link with aggression in females (and at best tentative evidence for a relationship in males, cf. Hönekopp & Watson, 2011; Turanovic, Pratt, & Piquero, 2017). Therefore, results from CAH studies and from 2D:4D studies do not converge when it comes to aggression. Two points are of note, though. First, the evidence from CAH studies appears less conclusive than is the case for GD, play behavior, autism-typical symptoms, and sexual orientation. Second, CAH studies into aggression typically rely on healthy female controls, although chronically ill controls might be more appropriate because the burdens of illness might increase aggression; we are aware of only one study with such a control group, and it presented no evidence for greater aggression in CAH-affected females (Slijper, 1984).

In sum, there is broad convergence in the results of CAH studies and 2D:4D studies: Both provide evidence for prenatal T effects on GD, gendered play behavior, autism, and sexual orientation; aggression, however, is a potential exception. It should be noted, though, that effect sizes tend to be considerably larger in CAH studies than in 2D:4D studies whenever the use of similar outcome measures makes such comparisons meaningful. A broader review that goes beyond CAH studies to investigate convergence between 2D:4D studies and studies using other means to investigate prenatal T effects would be desirable, but is beyond the scope of our paper.

Strengths and Limitations of 2D:4D

Some researchers have argued against the usefulness of 2D:4D for studying prenatal T effects in humans, primarily because 2D:4D appears to be a noisy measure (e.g., Berenbaum & Beltz, 2016; Hines et al., 2015). For example, a large difference in prenatal T effects between typically developing males and 46,XY individuals with CAIS translates into 2D:4D distributions that show strong overlap between both groups (Berenbaum et al., 2009). The general pattern discussed earlier, that 2D:4D studies tend to find smaller effect sizes than comparable CAH studies, points in the same direction. However, the value of a noisy measure depends on the relative merits and limitations of available alternatives. For example, the easy availability of 2D:4D means that it can be used to study small populations of interest that are difficult to address via CAH, amniocentesis, or other means that address prenatal T directly (e.g., Coates, Gurnell, & Rustichini, 2009; Manning, Baron-Cohen, Wheelwright, & Sanders, 2001; Voracek, Reimer, Ertl, & Dressler, 2006). Possibly owing to ease of measurement, 2D:4D studies have also looked at a much broader range of outcome measures than has been achieved with alternative means for studying prenatal T effects. For example, negative correlations between 2D:4D and physical performance are well established (e.g., Hönekopp, Manning, & Müller, 2006; Manning, 2002). These relationships are strong when performance hinges upon endurance but absent or weak when performance hinges on strength (Hönekopp & Schuster, 2010), and a detailed understanding of these relationships at a physiological level might be within reach (Holzapfel, Chomentowski, Summers, & Sabin, 2016). Naturally, convergent evidence for prenatal T effects on endurance from future studies relying on different methods would be desirable. In general, our meta-analysis on transpeople and our review of other domains demonstrate that 2D:4D and other avenues for studying prenatal T can sensibly complement each other.
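
The link between a noisy measure, small observed effect sizes, and strongly overlapping distributions can be made concrete with a short sketch. It is an illustration only, not an analysis from this paper: it assumes that the proxy equals the underlying prenatal T effect plus independent normal error, that both groups have equal variance, and the function names and all numbers are hypothetical. Under these assumptions a true standardized difference d shrinks by the square root of the proxy's reliability, and the overlap of two normal distributions is 2Φ(−|d|/2).

```python
from math import sqrt
from statistics import NormalDist


def attenuated_d(true_d: float, reliability: float) -> float:
    """Standardized group difference observed on a noisy proxy.

    reliability is the (assumed) share of within-group proxy variance
    that reflects the underlying trait; 1.0 means a noise-free measure.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return true_d * sqrt(reliability)


def overlap_coefficient(d: float) -> float:
    """Overlap of two equal-variance normal distributions whose means differ by d SDs."""
    return 2.0 * NormalDist().cdf(-abs(d) / 2.0)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the pattern.
    true_d = 3.0
    for reliability in (1.0, 0.25, 0.05):
        d_obs = attenuated_d(true_d, reliability)
        print(f"reliability={reliability:.2f}  observed d={d_obs:.2f}  "
              f"overlap={overlap_coefficient(d_obs):.2f}")
```

With low assumed reliability, even a very large underlying group difference yields proxy distributions that mostly overlap, which is the pattern described above for 2D:4D.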

Although sex of rearing seems to powerfully shape gender identity, rates of GD are strongly elevated in a number of DSDs (Callens et al., 2016; Dessens et al., 2005; Mazur, 2005). Among other predictors of a positive outcome, degree of prenatal T effects can play a role in gender assignment after birth (Hughes et al., 2006). Nonetheless, it seems unlikely that 2D:4D can meaningfully contribute to treatment planning: Given that it is probably a noisy measure of individual prenatal T effects (Breedlove, 2010), 2D:4D can only be a weak predictor of outcome quality, which encompasses physical health, fertility, well-being, and other facets. Moreover, adding a weak predictor to one or more existing stronger predictors usually does not increase prediction accuracy (Cohen, 1990; Dawes, 1979; Gigerenzer & Goldstein, 1999).
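
As a rough illustration of why a weak predictor adds little, the sketch below uses the standard formula for the squared multiple correlation with two predictors, R² = (r_y1² + r_y2² − 2·r_y1·r_y2·r_12) / (1 − r_12²), and reports the gain over the stronger predictor alone. It is a population-level illustration rather than the argument made in the cited sources, and the function names and correlation values are hypothetical.

```python
def r_squared_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """Squared multiple correlation of an outcome with two predictors.

    r_y1, r_y2: correlations of each predictor with the outcome.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1**2 + r_y2**2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12**2)


def incremental_r_squared(r_y1: float, r_y2: float, r_12: float) -> float:
    """Variance explained by predictor 2 over and above predictor 1."""
    return r_squared_two_predictors(r_y1, r_y2, r_12) - r_y1**2


if __name__ == "__main__":
    # Hypothetical correlations: a strong established predictor (0.5)
    # and a weak, noisy one (0.1), under two assumed predictor intercorrelations.
    for r_12 in (0.0, 0.2):
        gain = incremental_r_squared(0.5, 0.1, r_12)
        print(f"r_12={r_12:.1f}  incremental R^2={gain:.4f}")
```

In this sketch the weak predictor adds at most one percentage point of explained variance, and nothing at all when its link to the outcome runs entirely through what it shares with the stronger predictor. In real samples, estimation error in the extra regression weight can further offset such small gains.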

However, 2D:4D advances our understanding of GD. The higher 2D:4D observed in our meta-analysis for transwomen suggests that low prenatal T levels increase GD risk in natal males. Our meta-analysis also found some evidence that high prenatal T levels increase GD risk in natal females. Overall, prenatal T effects appear to play a small role in GD development. Future studies might show whether early postnatal T has similar effects, and what other factors increase GD risk.