1 Introduction

Testosterone has been hypothesised to be associated with a wide range of economic decisions. One aspect of this hypothesis is the theory that prenatal testosterone exposure affects brain development and can therefore explain some of the heterogeneity in behaviour between individuals. A putative proxy for the level of prenatal testosterone exposure is the ratio of the length of the second digit to the length of the fourth digit (2D:4D) on each hand, as suggested by Manning et al. (1998). Subsequently, many studies have reported associations between 2D:4D and a variety of traits, such as sexual orientation, spatial ability and personality traits, although the results are often conflicting and there is some possibility of publication bias [see, e.g., Puts et al. (2008), Voracek and Loibl (2009), Grimbos et al. (2010), Voracek et al. (2011); but see Hönekopp and Schuster (2010) and Hönekopp and Watson (2011), who do not find evidence for publication bias]. Furthermore, a sizeable literature uses 2D:4D to explore the effect of prenatal testosterone exposure on economic decisions, also with mixed results.

This paper aims to test hypotheses from previous papers on the association between 2D:4D and risk taking, dictator game giving, and the willingness to compete. These preferences are relevant for explaining variation in many economic outcomes. We use a sample of 330 women, which is large relative to most previously reported sample sizes, in an experiment to measure 2D:4D and economic preferences.

Whilst the 2D:4D measure has been used in many studies, the link between prenatal testosterone and 2D:4D is not strongly established (McIntyre 2006). The oft-cited study by Lutchmaya et al. (2004), which indirectly investigates the link between 2D:4D and prenatal testosterone exposure, finds a statistically significant negative correlation in a sample of 29 children between the testosterone-to-estradiol ratio in amniotic fluid and right hand 2D:4D only, even after controlling for gender (the left hand is reported insignificant). An additional method of investigation is to compare same-sex and opposite-sex twins, based on the theory of sex-hormone transfer in utero (Miller 1994). van Anders et al. (2006) find that females with a male rather than female co-twin have lower left hand 2D:4D, which the authors argue is due to hormone transfer from male to female foetuses; however, they find no statistically significant results for the right hand. Whilst Voracek and Dressler (2007) report a statistically significant result for mean 2D:4D in a similar study, studies with much larger sample sizes fail to find statistically significant differences (Hiraishi et al. 2012; Cohen-Bendahan 2005; Medland et al. 2008).Footnote 1 In a study of umbilical cord androgen and estrogen concentrations and 2D:4D measured in young adulthood, Hollier et al. (2015) find no statistically significant association for either hand, using a mixed gender sample of 341 participants. Lastly, other methods of establishing a link between 2D:4D and androgen exposure both post- and peri-natally include using congenital adrenal hyperplasia (CAH) and the CAG repeat polymorphism (McIntyre 2006; Brown et al. 2002), and here too there is a mix of positive and null results.

Even though the link between 2D:4D and prenatal testosterone is not well established, there are many papers investigating the association of 2D:4D with economic decision making. Whilst 2D:4D is an easy-to-measure proxy for prenatal testosterone exposure, many of these papers use multiple tests and have relatively small sample sizes. As far as we are aware, none of the previous studies pre-registered their analyses. There are often multiple hypotheses involving different ways of measuring the explanatory variable (left hand, right hand, average of both hands or even squared 2D:4D), as well as choices about which controls to include (such as gender, age or sexual orientation) and which subsamples to analyse (such as ethnicity and gender), giving rise to many ‘forking paths’ (Gelman and Loken 2013) and researcher degrees of freedom (Simmons et al. 2011). As discussed in Simmons et al. (2011), researchers have many options available in choosing among outcome variables, controls and subsample selection, creating ambiguity in the research process and potentially generating higher rates of false positives than 5%, even if researchers do not intend to do so. In our review of the literature in the following subsections, we follow convention and consider statistically significant results to be cases where the p value is less than 0.05, and we report anything above that threshold as insignificant. We present tables to summarise the results of studies that use comparable measures of economic preferences to our experiments.Footnote 2 However, in our own results in this paper, we instead consider a p value less than 0.05 to indicate suggestive evidence, whilst statistical significance requires a p value less than 0.005, following Benjamin et al. (2018).

Benjamin et al. (2018) suggest a change in the p value defining statistically significant new discoveries from 0.05 to 0.005, to improve the reproducibility of scientific studies (in terms of reducing rates of false positives). The authors propose that where p values are below 0.05 but above 0.005, this should be interpreted as suggestive evidence. Whilst our study aims to be a replication of past studies, the results of past studies are mixed and therefore we think it is appropriate to use the more conservative 0.005 threshold for statistical significance. An additional motivation for a more conservative threshold than 0.05 is that we, following the existing literature, run several tests for each outcome measure.
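The two-threshold reading of p values described above can be written as a small decision rule. The following is a minimal sketch; the function name and the label strings are illustrative assumptions, and only the 0.005 and 0.05 cut-offs come from the text.

```python
def classify_p_value(p, significant=0.005, suggestive=0.05):
    """Label a p value using the two thresholds described in the text.

    The function name and the returned labels are illustrative only.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p < significant:
        return "statistically significant"
    if p < suggestive:
        return "suggestive evidence"
    return "not significant"


if __name__ == "__main__":
    for p in (0.001, 0.02, 0.30):
        print(p, classify_p_value(p))
```

Under this rule, a result with a p value of 0.02 would count as significant in the literature review (where the 0.05 convention is used) but only as suggestive evidence in our own results.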

1.1 Dictator game giving

Several papers have looked at the relationship between 2D:4D and giving in the dictator game.Footnote 3 The dictator game removes any repercussions of failure to reciprocate (unlike the ultimatum game), and in all the below studies the participants were told that the recipient in the game is another participant whose identity is unknown.Footnote 4 The hypothesised relationship between 2D:4D and dictator game giving is positive, with higher exposure to testosterone (low 2D:4D) being associated with lower levels of dictator game giving. The results from studies using the dictator game are summarised in Table 1, showing that insignificant findings are common. When statistically significant, regressions using squared 2D:4D measures find an inverse U-shaped relationship between 2D:4D and dictator game giving (low dictator game giving is associated with both low and high testosterone). From the five previous papers summarised in Table 1, 2 out of the 43 total tests find statistically significant positive results, 1 out of 43 finds statistically significant negative results, 8 out of 43 find an inverse U-shaped relationship and 32 out of 43 find no statistically significant results (where here significance is \(p<0.05\)).

Table 1 Dictator game giving studies

1.2 Risk taking

While several review papers find that women are on average more risk averse than men [see, e.g., Eckel and Grossman (2008), Croson and Gneezy (2009), Charness and Gneezy (2012)], there is also evidence from a meta-analysis by Nelson (2015) suggesting that the difference (in terms of effect size) is not very large. Nevertheless, there is a substantial literature looking into a biological explanation for this gender difference through prenatal testosterone exposure and the 2D:4D ratio. As far as we are aware, only one study finds an association between 2D:4D and risk taking in men and not in women (Stenstrom et al. 2011). The hypothesis is that risk taking is negatively related to 2D:4D: higher testosterone exposure is associated with higher risk taking (and lower risk aversion). The results from studies using risk taking tasks are summarised in Table 2. We limit our analysis of the previous literature to the areas of financial or general risk taking. There are numerous ways to measure risk taking in experimental tasks, as well as to measure the digit ratio (such as by scanner or calliper), which can add measurement error. From the 18 previous papers summarised in Table 2, 1 out of the 109 total tests finds positive statistically significant results, 15 out of 109 find negative statistically significant results, and 93 out of 109 find no statistically significant results (significance here is \(p<0.05\)).

Table 2 Risk-taking studies

1.3 Competitiveness

Whilst there is evidence for gender differences in self-selection into competition (Niederle and Vesterlund 2007; Dariel et al. 2017),Footnote 5 there exists substantially less literature looking at the relation between prenatal testosterone exposure and willingness to compete, relative to the other economic preferences discussed. Given the gender differences observed in this scenario, the hypothesis tested in the existing literature is that higher testosterone is associated with higher competitiveness, leading to a negative relationship between 2D:4D and the willingness to compete. Table 3 summarises the results from previous studies. Out of the 10 total tests reported across previous studies, 2 find statistically significant negative results and 8 find no statistically significant results (here significance is \(p<0.05\)).

Table 3 Competitiveness studies

2 Method

2.1 Experimental procedures and design

The data on 2D:4D were collected in conjunction with a study on the influence of the oral contraceptive pill (Ranehill et al. 2017). The pre-analysis plan specifying the analysis prior to completion of data collection for this study was posted on the Open Science Framework website on the 21st of August 2015 (available at http://osf.io/he8nb/). However, the 2D:4D measure was not part of the main planned analyses in this double-blind randomised study. The exact analyses for the 2D:4D measure were therefore not specified in the pre-analysis plan. Instead it was stated in the pre-analysis plan that the 2D:4D data would be used to carry out tests of previous 2D:4D results reported as statistically significant in the literature (i.e., the data were collected to be able to replicate previous findings). The previously reported results in the literature are therefore the starting point for our analyses, but ideally our tests should have been exactly specified in the pre-analysis plan.

The participants in the study were 340 healthy women aged 18–35 years recruited following the criteria used in the oral contraceptive study.Footnote 6 Participants in this study had thus agreed to take part in a randomized controlled trial on the effects of the contraceptive pill. Participants attended two sessions for the overall study: one at baseline, and one at follow-up (the end of the study medication treatment period). Both sessions took place at the Karolinska University Hospital. The economic experiment was performed during the second session. During both sessions, we first collected blood samples from the participants before they filled out surveys on sexual function, general well-being and depressive symptoms. Participants then filled out a survey on facial preferences. In the second session, participants completed the economic experiment after the survey of facial preferences. The economic experiment was computerized.Footnote 7 The economic part took about 30 minutes, while the other parts took about 20 minutes. Participants were not informed about their earnings for any task during the experiment but were paid at a later date (within 2 months after having participated in the experiment).

For details on how participants were recruited, the criteria for inclusion and exclusion, and further sample characteristics see Ranehill et al. (2017). Approximately 60% of participants reported an education level of university studies (ongoing) or a university degree. Unfortunately, we do not have ethnicity data for our sample of participants. While the majority of the participants were Caucasian, we cannot rule out that controlling for ethnicity would affect our results. The statistical analysis is based on 330 participants as 10 participants did not complete the data collection (7 discontinued treatment and thus did not complete the data collection, and 3 had missing hand measurements).

The economic experiments on decision making were also reported and analysed in Ranehill et al. (2017). The tests measured dictator game giving, financial risk taking, and willingness to compete. The order of the experimental tasks was kept constant across all participants, starting with the dictator game, followed by the risk task, and thereafter the three stages of the competitiveness task.Footnote 8

The dictator game giving measure was elicited in a modified dictator game where the participant was asked to allocate SEK 100Footnote 9 between herself and a charitable organization, repeated five times with a different charity organisation in each repetition. The average donation across the five decisions is used as our measure of dictator game giving. We include five dictator game decisions to reduce measurement error.

We measure risk taking with repeated lottery choices, involving 18 decisions between a certain payoff, and a 50:50 gamble to win either a larger amount of money than the safe option or SEK 0. The certain payoff amounts varied from SEK 40 to 280, and the gamble amounts were either SEK 200, 300 or 400. The percentage of choices of the gamble (i.e., the number of times the gamble was chosen over the certain payoff) is used as our measure of risk taking.
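The risk-taking measure is a simple share of gamble choices, and the risk-neutral benchmark for any one decision follows from comparing the certain payoff with the expected value of the 50:50 gamble. The sketch below illustrates both. The function names are hypothetical, and the individual certain-payoff and gamble pairings used in the example are assumptions chosen from within the stated ranges, since the exact 18 pairings are not listed here.

```python
def risk_taking_share(chose_gamble):
    """Percentage of decisions in which the gamble was chosen.

    `chose_gamble` is a sequence of booleans, one per lottery decision.
    """
    if len(chose_gamble) == 0:
        raise ValueError("at least one decision is required")
    n_gambles = sum(1 for choice in chose_gamble if choice)
    return 100.0 * n_gambles / len(chose_gamble)


def risk_neutral_prefers_gamble(certain_payoff, gamble_prize):
    """True if the 50:50 gamble (prize or SEK 0) has the higher expected value."""
    return 0.5 * gamble_prize > certain_payoff


if __name__ == "__main__":
    # Hypothetical participant: gamble chosen in 9 of 18 decisions.
    choices = [True] * 9 + [False] * 9
    print(risk_taking_share(choices))
    # Hypothetical pairings within the stated ranges (SEK).
    print(risk_neutral_prefers_gamble(40, 200))
    print(risk_neutral_prefers_gamble(280, 400))
```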

Willingness to compete was measured by asking participants to solve simple number-addition tasks for 3 minutes, first under a non-competitive piece-rate payment scheme of SEK 5 for each correct answer, and then under a competitive tournament payment scheme. The tournament paid SEK 10 for each correct answer only if the participant solved more tasks than a random competitor (a participant selected from a previous session); otherwise the pay was zero (with SEK 5 for each person in the case of a tie). Then, in the last part, the participant could choose to be paid under either the non-competitive piece-rate scheme or the competitive tournament scheme. For our willingness to compete measure, we used the choice of the competitive tournament scheme in this part (a dummy variable where 1 is the choice of the competitive tournament scheme).
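The two payment schemes can be sketched as payoff functions. This is an illustration under stated assumptions rather than the experiment's actual software: the function names are hypothetical, and the tie rule is read here as SEK 5 per correct answer, which is one possible interpretation of the description above.

```python
PIECE_RATE_SEK = 5    # SEK per correct answer, non-competitive scheme
TOURNAMENT_SEK = 10   # SEK per correct answer when the tournament is won


def piece_rate_pay(correct):
    """Payment under the non-competitive piece-rate scheme."""
    return PIECE_RATE_SEK * correct


def tournament_pay(correct, competitor_correct):
    """Payment under the competitive tournament scheme."""
    if correct > competitor_correct:
        return TOURNAMENT_SEK * correct
    if correct == competitor_correct:
        # Assumption: a tie pays SEK 5 per correct answer.
        return PIECE_RATE_SEK * correct
    return 0


def willingness_to_compete(chose_tournament):
    """Dummy outcome: 1 if the tournament scheme was chosen, else 0."""
    return 1 if chose_tournament else 0


if __name__ == "__main__":
    print(piece_rate_pay(12))
    print(tournament_pay(12, 10), tournament_pay(10, 10), tournament_pay(8, 10))
```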

2D:4D results in the literature are sometimes presented for the left hand, sometimes for the right hand, and sometimes for the average of both hands. Following the existing literature, we therefore present results for all these three 2D:4D measures. In the literature results are sometimes presented for a linear model and sometimes a squared term is added to allow for a non-linear relationship. Following the existing literature, we therefore present results both without (the linear model) and with a quadratic term. In total we therefore estimate 18 regression models; 6 models for each outcome measure. In the models with a squared term we evaluate the significance of 2D:4D as the significance of the regression coefficient for the squared 2D:4D, but we also report the significance of an F-test for the joint significance of 2D:4D and the squared 2D:4D.
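The count of 18 models follows from crossing three outcome measures, three 2D:4D measures and two functional forms. A minimal sketch of that enumeration is given below; the variable names are hypothetical labels, not the names used in our data.

```python
from itertools import product

# Hypothetical variable names for illustration.
OUTCOMES = ("dictator_game_giving", "risk_taking", "willingness_to_compete")
DIGIT_RATIOS = ("left_2d4d", "right_2d4d", "average_2d4d")
FORMS = ("linear", "quadratic")


def build_specifications():
    """Enumerate every outcome x digit-ratio x functional-form combination."""
    specifications = []
    for outcome, ratio, form in product(OUTCOMES, DIGIT_RATIOS, FORMS):
        regressors = [ratio]
        if form == "quadratic":
            # The quadratic model adds the squared 2D:4D term.
            regressors.append(ratio + "_squared")
        specifications.append(
            {"outcome": outcome, "form": form, "regressors": regressors}
        )
    return specifications


if __name__ == "__main__":
    specs = build_specifications()
    print(len(specs))
    for spec in specs:
        print(spec["outcome"], "~", " + ".join(spec["regressors"]))
```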

2.2 Power calculations

We first estimate our power to detect previously reported effects, based on all statistically significant findings in the literature (for models without the squared term and where the necessary information was available). For dictator game giving, the range of power is 0.896 to 0.999 with a mean of 0.941 at the 5% level and 0.656 to 0.994 with a mean of 0.791 at the 0.5% level. For risk taking, the range of power is 0.423 to 0.999 with a mean of 0.748 at the 5% level and 0.148 to 0.999 with a mean of 0.535 at the 0.5% level. For the willingness to compete, the range is 0.441 to 0.468 with a mean of 0.454 at the 5% level and 0.159 to 0.176 with a mean of 0.167 at the 0.5% level. However, we note that there are drawbacks to doing such power calculations, since it is very likely that original results are exaggerated even if they are true positives [see, e.g., Gelman and Carlin (2014)]. Lastly, with our sample size of 330, we have 90% power to find a small effect size of \(r = 0.17\) with \(\alpha = 0.05\), and \(r = 0.22\) with \(\alpha = 0.005\).
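For readers who want to approximate the last calculation, the sketch below computes two-sided power for a correlation using the Fisher z transformation. This is a standard textbook approximation and is not necessarily the method used for the figures above, so small differences from the reported values are to be expected. The function name is hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha):
    """Approximate two-sided power to detect a correlation r with n observations.

    Uses the Fisher z approximation: atanh(sample r) is treated as normal
    with standard error 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    standard_normal = NormalDist()
    effect = atanh(r) * sqrt(n - 3)
    critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Probability of rejecting in either tail.
    return standard_normal.cdf(effect - critical) + standard_normal.cdf(-effect - critical)


if __name__ == "__main__":
    print(round(correlation_power(0.17, 330, 0.05), 3))
    print(round(correlation_power(0.22, 330, 0.005), 3))
```

With \(n = 330\), this approximation gives power of roughly 0.87 for \(r = 0.17\) at \(\alpha = 0.05\) and roughly 0.89 for \(r = 0.22\) at \(\alpha = 0.005\), close to the 90% quoted above.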

2.3 Measuring 2D:4D

Digit length, expressed in millimetres (mm), was measured for digit two (2D) and digit four (4D) using a Vernier digital calliper 0–150 mm (USA, Cocraft) with a precision of 0.01 mm. Digit length was directly measured by two raters from the mid-point of the proximal crease of the proximal phalanx to the distal tip of the distal phalanx for 2D and 4D on both the left and right hands. The reliability of direct measurement of digits was tested, demonstrating high repeatability and differences between subjects greater than measurement errors (Savic et al. 2017). The mean value of two measurements of the 2D and 4D length was calculated and then divided to create the 2D:4D ratio, which was used for further statistical analysis.
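The construction of the ratio from repeated measurements can be sketched as follows. The function names and the example lengths are hypothetical, and taking the two-hand average as the mean of the left and right ratios is an assumption about how that measure is formed.

```python
from statistics import mean


def digit_ratio(d2_mm, d4_mm):
    """2D:4D for one hand from repeated length measurements in millimetres.

    Each argument is a sequence of measurements of the same digit; the
    measurements are averaged first and the means are then divided.
    """
    if len(d2_mm) == 0 or len(d4_mm) == 0:
        raise ValueError("each digit needs at least one measurement")
    if min(d2_mm) <= 0 or min(d4_mm) <= 0:
        raise ValueError("digit lengths must be positive")
    return mean(d2_mm) / mean(d4_mm)


def average_digit_ratio(left_ratio, right_ratio):
    """Average 2D:4D across hands (assumed: mean of the two hand ratios)."""
    return (left_ratio + right_ratio) / 2


if __name__ == "__main__":
    # Hypothetical measurements (mm), two per digit.
    left = digit_ratio([69.8, 70.2], [71.9, 72.1])
    right = digit_ratio([70.5, 70.7], [72.4, 72.8])
    print(left, right, average_digit_ratio(left, right))
```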

3 Results

Overall we report results for 18 regression variations, with 6 different specifications for the explanatory variables run separately using OLS for the 3 dependent variables, representing our outcome measures of dictator game giving, risk taking, and the willingness to compete. We note that the Pearson correlation between left and right hand 2D:4D in our sample is 0.63.Footnote 10 Table 4 shows the means and standard deviations for the 2D:4D measures and the outcome variables.

Table 4 Summary statistics

We report the regression results in the following three tables, grouped by outcome measure. Table 5 shows the results for the dictator game giving measure, whilst Table 6 shows risk taking and Table 7 shows the willingness to compete as the dependent variable.

Table 5 Dictator game giving results
Table 6 Risk-taking results
Table 7 Willingness to compete results

We find no evidence of a statistically significant relation between 2D:4D and either dictator game giving or risk taking (\(p>0.05\)). For competitiveness we find no evidence in the linear models either (\(p>0.05\)). When we add a squared term we find statistically significant evidence (\(p<0.005\) for the squared 2D:4D coefficient) in both the regression for left hand 2D:4D and competitiveness, and the regression for the average 2D:4D of the two hands and competitiveness.Footnote 11 However, these regression specifications are not among those that have previously been reported in the literature for the willingness to compete. We plot the predicted relationships from these statistically significant specifications to illustrate the interpretation of the predicted relationships, using the range of 2D:4D that we see in our data (Fig. 1).Footnote 12

Fig. 1 Plot of the predicted relationships between 2D:4D and willingness to compete, for the regression with left hand 2D:4D and left hand 2D:4D squared, and also for the regression with average 2D:4D and average 2D:4D squared

The willingness to compete outcomes predicted by our regression equations show an inverse U-shaped relationship where, across a range of 2D:4D values from 0.85 to 1.1, low 2D:4D (interpreted as high prenatal testosterone exposure) predicts low competitiveness, which does not fit with the pre-existing hypothesis that high testosterone correlates with high competitiveness.Footnote 13 The highest willingness to compete is instead associated with mid-range 2D:4D for this predicted relationship. If the hypothesis tested in the existing literature were to hold here, we would see a decreasing relationship. As most 2D:4D measurements are below 1, most of the distribution of observations lies to the left of the peak, in the region of an increasing relationship, which is the opposite of the hypothesised relationship. The estimated inverse U-shaped relationship is thus unlikely to represent a real effect.
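The location of the peak can be made explicit with a short derivation. The notation is introduced here for illustration only: \(x\) denotes the 2D:4D measure, \(\widehat{y}\) the predicted willingness to compete, and \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2\) the estimated intercept, linear and squared-term coefficients of the quadratic model.

```latex
\[
\widehat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^{2},
\qquad
\frac{d\widehat{y}}{dx} = \hat{\beta}_1 + 2\hat{\beta}_2 x = 0
\;\Longrightarrow\;
x^{*} = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}.
\]
```

An inverse U shape corresponds to \(\hat{\beta}_2 < 0\), in which case the predicted relationship is increasing for \(x < x^{*}\) and decreasing for \(x > x^{*}\). The hypothesised negative relationship would require the observations to lie mostly to the right of \(x^{*}\), whereas the text above notes that most of them lie to its left.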

4 Discussion

In this study we find little evidence of 2D:4D correlating with economic preferences in a sample of 330 women. The only two statistically significant regression specifications (\(p<0.005\)) are not in the hypothesised direction and are not consistent with any previous findings, and are thus likely to be false positives. The study by Ranehill et al. (2017) that was run in conjunction, but looking at the effect of the oral contraceptive pill, also did not find any impact of the pill on economic preferences.

Our null results could be due to several reasons. First, 2D:4D may be a reliable proxy of prenatal testosterone exposure but prenatal testosterone exposure may not correlate with economic preferences, in which case previous results are false positives. Second, 2D:4D may be a reliable proxy of prenatal testosterone but the relation between prenatal testosterone exposure and economic preferences is so weak that with 330 women we do not have sufficient statistical power to detect true positive results. Third, 2D:4D may be a weak or noisy proxy of prenatal testosterone but the relation between prenatal testosterone exposure and economic preferences is actually strong; but again we could then be underpowered to detect true positive results. Fourth, 2D:4D may be a weak or noisy proxy of prenatal testosterone and there is also a weak relation between prenatal testosterone exposure and economic preferences; again we could then be underpowered to detect true positive results. Fifth, 2D:4D may not correlate with economic preferences among women, in which case our study would be set up to not find anything since we have only women in our sample. Given previous literature it is not clear to us why this should make a difference, but additional high-powered studies with pre-analysis plans, on men or mixed-gender samples, would be useful.

Sixth, perhaps there is something special about our sample that makes us not find a true correlation between 2D:4D and economic preferences that exists in more general samples. The editor pointed out that the selection of women who are non-smokers and who are willing to use oral contraceptives might generate a sample that is more risk-averse than the general population, or has a higher 2D:4D ratio. With respect to risk taking, the closest comparison of our sample to the general population is Boschini et al. (2018), who explore risk preferences in a random sample of 487 Swedish women in a similar risk preference elicitation task of choices over lotteries versus safe options. In these samples, the average switching point is very similar: just below the risk neutral point. With respect to 2D:4D, our sample is within a similar range to previous studies.Footnote 14 In sum, more work is needed to disentangle these six possible explanations for our null results.

In a related vein, the evidence linking sex hormone administration to economic preferences is also inconclusive with most studies failing to reject the null hypothesis of no effect. The few statistically significant findings (as well as the null results) need, however, to be interpreted with caution because of low statistical power and the many researcher degrees of freedom [see the recent review by Dreber and Johannesson (2018) for more information].

In sum, more work is needed with larger sample sizes and pre-registered hypotheses to have enough statistical power to find small effects of 2D:4D on economic preferences. Additionally, studies using improved indicators of prenatal testosterone exposure may be warranted.