1 Introduction

In 2000, the field of psychology declared the nature-nurture debate to be “over” by positing that all human behavioral traits are heritable (Turkheimer 2000). This “first law” of behavior genetics is backed by a vast body of literature comprising thousands of heritability studies (Polderman et al. 2015; Turkheimer 2000). Since 2008, several studies have shown that this law also holds for entrepreneurship (Nicolaou et al. 2008a, b, 2010; Shane and Nicolaou 2015; Van der Loos et al. 2013; Zhang et al. 2009). Inspired by these findings and advances in genetics research, Koellinger et al. (2010) provided a rough forecast in this journal of the expected identification of relationships between genetic variants and entrepreneurship. Nevertheless, despite several attempts in the past decade (Nicolaou et al. 2011; Quaye et al. 2012; Van der Loos et al. 2011, 2013; Wernerfelt et al. 2012), no single robust association between a genetic variant and entrepreneurship has been found. Therefore, the first question we address in the present study is “Why has the identification of robust associations between genetic variants and entrepreneurship been unsuccessful in the last decade?” We answer this question from a methodological point of view. In doing so, we also provide a review of the literature in this field of research.

The second question we address is “Would the identification of associations between genetic variants and entrepreneurship help to advance the field of entrepreneurship research?” Despite the unsuccessful attempts so far, we provide methodological and empirical reasons for why we may expect the identification of the first robust associations between genetic variants and entrepreneurship in the not too distant future. Entrepreneurship scholars have argued that the prediction of entrepreneurial behavior using genetic data could have practical applications in business and for individual decision-making (Nicolaou et al. 2008a; Nicolaou and Shane 2010; Shane 2010). Moreover, several private companies already offer genetic tests to predict someone’s leadership and managerial qualities. We explain how summary indices of genetic variants (so-called polygenic risk scores) can be used for such prediction analyses, but by drawing on the broader behavior genetics literature, we stress the caveats associated with applying population-level results to the individual level. By relating the promises of “genoeconomics” as outlined by Benjamin et al. (2012a) to entrepreneurship research, we then sketch how we think the use of genetic information may advance the field of entrepreneurship research.

To illustrate the answers to our two research questions, we include an empirical analysis of data from the US Health and Retirement Study. The inclusion of the empirical analyses in this study serves three purposes. First, the results of the analyses show how polygenic risk scores constructed for a range of traits (and not just entrepreneurship) can help to identify regions in the human genome particularly important for entrepreneurial behavior. Second, these analyses illustrate how polygenic risk scores can significantly predict entrepreneurship (even when proxied by the relatively episodic activity of self-employment). Third, we use these analyses to illustrate that the estimated relationships between polygenic risk scores and entrepreneurship at the population level only marginally improve the prediction of entrepreneurial behavior at the individual level.

In the following section, we review the studies providing evidence for the heritability of entrepreneurship. By exploiting family-based relationships rather than molecular genetic information, these studies show that approximately 40% of the differences in entrepreneurial behavior can be explained by genes. In Section 3, we review the molecular genetic analyses of entrepreneurship. We provide a comprehensive overview and discussion of the methodological approaches taken to identify relationships between genetic variants and entrepreneurship. Our empirical analyses are introduced and presented in Section 4. Finally, Section 5 concludes by discussing the added value of genetics for entrepreneurship research.

2 The heritability of entrepreneurship

Nofal et al. (2018) provide a review of the literature on “biology and management,” which also covers studies analyzing entrepreneurship. This section discusses all entrepreneurship studies in their category “Quantitative genetics,” alongside other studies. Section 3 discusses all entrepreneurship studies in their category “Molecular genetics,” again alongside other studies.

Heritability is a technical term denoting the proportion of observed differences in a trait among individuals from a certain population that is due to the genetic differences among these individuals (Visscher et al. 2008). The main challenge in the estimation of heritability is the statistical separation of the effect of genes from the effect of the family environment on the trait of interest. One way to address this challenge is to compare adoptees with biological children. Using this approach, Lindquist et al. (2015) find that parental entrepreneurship increases the likelihood of children’s entrepreneurship by 60%. In their Swedish sample, they show that post-birth factors (i.e., adoptive parents) are two times more important than pre-birth factors (i.e., biological parents) for explaining entrepreneurial involvement.

Another approach to separating the effect of genes from the effect of the family environment is the comparison of monozygotic and dizygotic twins reared together. This approach is more common because the number of available twin samples is much larger than the number of available adoptee samples (Knopik et al. 2016). Monozygotic twins are genetically identical, whereas dizygotic twins are only as genetically similar to each other as regular siblings. Under the assumption that monozygotic and dizygotic twins are influenced by their family environment to the same extent, it is possible to decompose the variance in a trait into three components: the additive genetic effect, the common environment (family specific) effect, and the unique (individual specific) environment effect. Nicolaou et al. (2008a, b, 2010), Shane and Nicolaou (2015), Van der Loos et al. (2013), and Zhang et al. (2009) use the classical twin study methodology to estimate the heritability of entrepreneurship in American, British, and Swedish samples. These studies draw on a broad range of empirical measures for entrepreneurship, such as self-employment and the number of start-up efforts, and provide general support for the heritability of entrepreneurship. Overall, the heritability estimates are in the neighborhood of 40%, indicating that almost one-half of the differences in entrepreneurship in these countries can be attributed to genetic differences across population members.
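The three-component decomposition described above can be illustrated with Falconer's textbook formulas, which recover the variance shares from the trait correlations within monozygotic and dizygotic twin pairs. This is a minimal sketch only: the cited studies typically fit structural equation models rather than these closed-form expressions, and the function name and the two correlations below are hypothetical values chosen for illustration.

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Split trait variance into A, C, and E shares from twin correlations.

    Assumes additive genetic effects and equal family environments for
    monozygotic (MZ) and dizygotic (DZ) twins, as in the classical twin design.
    """
    a2 = 2.0 * (r_mz - r_dz)  # additive genetic share (heritability)
    c2 = 2.0 * r_dz - r_mz    # common (family-specific) environment share
    e2 = 1.0 - r_mz           # unique (individual-specific) environment share
    return {"A": a2, "C": c2, "E": e2}


# Hypothetical within-pair correlations, not taken from any cited study.
shares = falconer_ace(r_mz=0.45, r_dz=0.25)
print(shares)
```

With these illustrative inputs, the additive genetic share equals twice the gap between the two correlations, and the three shares sum to one by construction.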

Although adoptee and twin studies can establish that genetic factors account for variation in a trait, they do not identify specific genes or the biological pathways through which genes function, because the genetic component is inferred from family relationships rather than observed in these studies. The completion of the sequencing of the human genome at the beginning of the present century (Venter et al. 2001) enabled the identification and measurement of locations in the human genome that differ among population members and hence led to the search for the specific genes underlying the heritable variation in entrepreneurship.

3 The molecular genetic analysis of entrepreneurship

3.1 The human genome

A complete human genome consists of 23 pairs of chromosomes, of which the 23rd pair determines the biological sex of an individual. One of each pair of chromosomes is inherited from the mother, and the other is inherited from the father. A chromosome is composed of two intertwined strands of deoxyribonucleic acid (DNA), each made up of a sequence of nucleotide molecules. There are four different nucleotide molecules in the DNA: adenine, cytosine, thymine, and guanine. Adenine on one strand is always paired with thymine on the other strand, and cytosine is always paired with guanine. These combinations are called base pairs. Every human genome consists of approximately 3 billion base pairs. The stretches of base pairs in the DNA that code for a protein are called genes. There are approximately 20,000 genes of varying lengths in the human genome.

A random pair of individuals shares approximately 99.9% of their DNA (National Human Genome Research Institute 2018b), and most genetic differences across population members can be attributed to single nucleotide polymorphisms (SNPs, pronounced “snips”). Therefore, behavioral genetics researchers focus primarily on SNPs when analyzing heritable genetic variation. A SNP is defined as a location in the DNA strand at which two different nucleotides are present in the population. Each of the two possible nucleotides is called an allele for that SNP. The allele that is least common in the population is called the minor allele; the other allele is called the major allele. For each SNP, an individual’s genotype is coded as 0, 1, or 2, depending on the number of minor alleles present. Individuals who inherited the same allele from each parent are called homozygous for that SNP (and have genotype 0 or 2), while individuals who inherited different alleles are called heterozygous (and have genotype 1). SNPs can be found in every part of the genome, within genes or in regions in between genes, and may influence the production of proteins.
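As a small illustration of the 0/1/2 coding described above, the following sketch determines the minor allele of a single SNP in a toy sample and counts its copies for each individual. The helper name, the two-letter genotype strings, and the sample itself are assumptions made for this example; real genotype data come in dedicated file formats and are processed with specialized software.

```python
from collections import Counter


def code_genotypes(genotypes):
    """Code each genotype as its number of minor alleles (0, 1, or 2).

    `genotypes` is a list of two-letter strings such as "AG", one per
    individual, for a single SNP with two alleles in the sample.
    """
    allele_counts = Counter(allele for g in genotypes for allele in g)
    # The minor allele is the least common allele in the sample.
    minor = min(allele_counts, key=allele_counts.get)
    return [g.count(minor) for g in genotypes], minor


# Toy sample of six individuals (hypothetical data).
sample = ["AA", "AG", "AA", "GG", "AG", "AA"]
coded, minor = code_genotypes(sample)
print(minor, coded)
```

In this toy sample, G occurs 4 times among the 12 allele copies and is therefore the minor allele; homozygous individuals receive code 0 or 2, and heterozygous individuals receive code 1.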

In the human genome, there are approximately 85 million SNPs with a minor allele prevalence of at least 1% (The 1000 Genomes Project Consortium 2015). When relating so many SNPs xij (coded as 0, 1, or 2) to a specific outcome yi in a regression framework such as

$$ {y}_i=\mu +{\sum}_{j=1}^J{\beta}_j{x}_{ij}+{\varepsilon}_i, $$

with intercept μ, SNP effects βj, and residual term εi, it is evident that we have to deal with an underidentified model, because there are fewer individuals I than SNPs J and hence more parameters than observations (Benjamin et al. 2012a). Two basic approaches have been developed to deal with this identification problem. Hypothesis-driven methods such as the candidate gene approach do not consider all J SNPs, whereas hypothesis-free methods such as the genome-wide association study (GWAS) consider all J SNPs but not in one model. We continue by discussing these two basic approaches from a methodological point of view, and we review how they have been used for unravelling the genetic architecture of entrepreneurship.

3.2 Hypothesis-driven approaches

The candidate gene approach consists of testing a subset of genetic variants for association with the outcome of interest. These genetic variants are selected based on what is known or believed about their biological function (Benjamin et al. 2012a, b; Ebstein et al. 2010; Nicolaou and Shane 2009). This approach resembles the classic way of justifying and then testing a hypothesis. A clear advantage of this approach is that the interpretation of revealed significant relationships is relatively straightforward. Adopting this approach, Nicolaou et al. (2011) were the first to report an association between a SNP in the DRD3 gene (a dopamine receptor gene) and entrepreneurial behavior in a British sample. Their selection of candidate SNPs was based on the observation that dopamine receptor genes have been associated with novelty seeking/sensation seeking and attention deficit hyperactivity disorder (ADHD). These traits were reported to be particularly prevalent among entrepreneurs (Nicolaou et al. 2008b; Antshel 2017). Unfortunately, Van der Loos et al. (2011) failed to replicate this association in a Dutch sample seven times larger than the sample Nicolaou et al. (2011) drew upon.

This non-replication is typical of candidate gene studies (Benjamin et al. 2012a, b; Ioannidis 2005; Rietveld et al. 2014a). In principle, a theoretical framework guides empirical research by reducing the number of hypotheses being tested. However, the analytical rigor that a theory-guided approach provides is not helpful in the context of behavioral genetics because it is difficult to reduce the number of plausible hypotheses purely on theoretical grounds. For instance, 70% of all genes (thus approximately 14,000) are expressed in the brain (Ramsköld et al. 2009), and for many of these genes (and hence the SNPs within these genes), a seemingly plausible relation between genes and behavior, including entrepreneurship, could be hypothesized ex ante. As a matter of fact, in 2012, the editor of the leading field journal Behavior Genetics issued an editorial policy on candidate gene studies of behavioral traits that reads “The literature on candidate gene associations is full of reports that have not stood up to rigorous replication” and went on to say “…it now seems likely that many of the published findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt 2012). This editorial policy outlines strict quality criteria that candidate gene studies must meet to be considered for publication. Most importantly, the editorial stressed the importance of sufficient statistical power in genetic discovery studies (Hewitt 2012).

Statistical power refers to the probability of rejecting the null hypothesis when it is false. Statistical power of 80% or higher is generally considered to be adequate (Ellis 2010). Low statistical power results in a high chance of false negatives, i.e., non-rejections of the null hypothesis when the alternative hypothesis is true. Even more problematic, because of the winner’s curse, low statistical power also results in the overestimation of effect sizes for significant findings (Benjamin et al. 2018; Button et al. 2013; Wacholder et al. 2004). Statistical power is (among other things) a function of the effect size (of the SNP), the size of the analysis sample, and the significance level adopted. Nicolaou et al. (2011) report that their identified SNP explained 0.5% of the likelihood of being an entrepreneur. With their sample of 1335 individuals, they had only 6% power to detect such an effect at p < 0.05. Hence, it is not surprising that this finding could not be further replicated (Van der Loos et al. 2013).

3.3 Hypothesis-free approaches

3.3.1 Genome-wide association studies

GWAS is a hypothesis-free approach to genetic discovery because no prior selection is made on the set of SNPs used in the analysis. To deal with the identification problem that arises when SNPs outnumber individuals, a GWAS runs a separate regression for every SNP. Hence, millions of regressions are performed in a GWAS. An advantage of the hypothesis-free study design of GWAS is that it makes the need to correct for multiple testing transparent. If the null hypothesis of no association is true for all these millions of SNPs, one still expects to find a p value < 0.05 for 5% of the SNPs. Therefore, in a GWAS, the significance threshold is set to 0.05/1,000,000 = 5 × 10⁻⁸ (“genome-wide significance”) because there are approximately 1 million independent SNPs in the human genome (adjacent SNPs in the genome are often inherited together). A clear disadvantage of this approach is that GWASs may prioritize SNPs for which the biological function is as yet unknown or unclear. Hence, GWAS usually identifies SNPs that need to be subjected to further analyses to understand the pathways between the SNPs and the outcome. Close collaboration with geneticists and biologists in consortia, such as the Gentrepreneur Consortium (Van der Loos et al. 2010) and the Social Science Genetic Association Consortium, is therefore a prerequisite for the success of GWAS analysis.
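The multiple-testing arithmetic in this paragraph can be checked in a few lines. The sketch below takes the figure of roughly 1 million independent SNPs from the text and applies a Bonferroni-type division of the conventional 5% level; the variable names are chosen for this example only.

```python
alpha = 0.05                  # conventional per-test significance level
independent_snps = 1_000_000  # approximate number of independent SNPs (from the text)

# Expected number of SNPs with p < alpha if no SNP is truly associated.
expected_false_positives = alpha * independent_snps

# Bonferroni-type correction: divide alpha by the number of independent tests.
genome_wide_threshold = alpha / independent_snps

print(expected_false_positives)
print(genome_wide_threshold)
```

Without correction, about 50,000 independent SNPs would be expected to pass the 5% level by chance alone, which is why the threshold is tightened to 5 × 10⁻⁸.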

The combination of a very stringent significance level and the small effect sizes of individual SNPs implies that large samples are needed to be adequately powered for gene discovery. The typical dataset has only several thousands of observations, and therefore, datasets need to be combined into mega-analyses or meta-analyses. In a mega-analysis, individual-level genetic data are merged and jointly analyzed. However, legal and privacy issues generally make it impossible to pursue this strategy. In a meta-analysis, the summary results of specific analyses are combined. The GWAS meta-analysis approach has enabled an unprecedented surge in genetic discoveries that are consistently replicated (Hindorff et al. 2009; Visscher et al. 2017), including the discovery of genetic associations with behavioral outcomes such as educational attainment (Lee et al. 2018; Okbay et al. 2016b; Rietveld et al. 2013), subjective well-being (Okbay et al. 2016a), and more recently preferences such as attitudes toward risk-taking (Linnér et al. 2019). The large sample sizes in these studies (N > 1,000,000 in some of them) could be obtained due to the dramatic decline in the cost of genotyping in the last decade (National Human Genome Research Institute 2018a).
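To make the idea of combining summary results concrete, the sketch below pools per-cohort estimates for a single SNP using fixed-effects inverse-variance weighting, a common choice in GWAS meta-analysis. It is an illustration under stated assumptions, not the procedure of any particular cited study: the function name and the three cohort estimates are hypothetical.

```python
import math
from statistics import NormalDist


def inverse_variance_meta(betas, standard_errors):
    """Pool per-cohort SNP effects with fixed-effects inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    # Two-sided p value from the standard normal distribution
    # (adequate for illustration; loses precision for extremely small p).
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return pooled_beta, pooled_se, p_value


# Hypothetical summary statistics for one SNP from three cohorts.
betas = [0.010, 0.014, 0.008]
standard_errors = [0.006, 0.008, 0.005]

pooled_beta, pooled_se, p_value = inverse_variance_meta(betas, standard_errors)
print(pooled_beta, pooled_se, p_value)
```

Because only effect estimates and standard errors are exchanged, no individual-level genetic data need to leave the contributing cohorts, and the pooled standard error is smaller than that of any single cohort.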

In 2010, Koellinger et al. (2010) calculated that at least 30,000 observations were needed to find a relationship between an individual genetic variant and entrepreneurship at the genome-wide significance level. Quaye et al. (2012) used the GWAS approach in a sample of 3933 British females to assess whether there are associations between specific SNPs and entrepreneurship. Not surprisingly, because of the small sample size, they did not find SNPs that are significant at the genome-wide significance level. Van der Loos et al. (2013) conducted a large-scale GWAS meta-analysis on entrepreneurship in a combined sample of 53,898 individuals from Europe and the USA. Despite the sample size, this study did not find any genome-wide significant SNPs. Moreover, this study found no evidence that any of the genes that were previously suggested in the literature to influence entrepreneurship (Shane 2010) show significant associations with entrepreneurship. From a statistical point of view, this null result could have been driven by the attenuation of the effect sizes through the meta-analysis of samples from different countries and with different birth year profiles. However, GWASs from the past few years on other behavioral outcomes indicate that the effect sizes used in the power calculations by Koellinger et al. (2010) were too high.

The past years of research in behavioral genetics showed that individual SNPs typically explain less than 0.02% of the variance in a behavioral outcome (Chabris et al. 2015; Rietveld et al. 2014a). These findings imply that a sample of at least 197,984 individuals is needed to identify a SNP at the genome-wide significance level with 80% power. Hence, by now, we know that the GWAS meta-analysis of Van der Loos et al. (2013) was underpowered. Although the availability of genetic data is rapidly increasing, genetic data are collected primarily for medical purposes, and measures for entrepreneurship are not always available in medical datasets. There is progress in the collection of genetic data in surveys with an economic focus (such as the US Health and Retirement Study and the English Longitudinal Study of Ageing), but at this moment, a sufficiently large analysis sample for a GWAS on entrepreneurship is not available.
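The order of magnitude of this sample size requirement can be reproduced with a normal approximation to the power of a single-SNP test. The sketch below is a simplification under stated assumptions: it treats the test statistic as normal with a non-centrality determined by the sample size and the share of variance explained, and it ignores the negligible opposite tail. The function names are hypothetical, and the result may differ slightly from the exact figure reported above.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_single_snp(n, r2, alpha=5e-8):
    """Approximate power to detect a SNP explaining share r2 of the variance."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    ncp_root = math.sqrt(n * r2 / (1.0 - r2))  # square root of the non-centrality
    return STD_NORMAL.cdf(ncp_root - z_alpha)


def required_sample_size(r2, alpha=5e-8, power=0.80):
    """Invert the approximation above to obtain the sample size for a target power."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) ** 2 * (1.0 - r2) / r2


r2 = 0.0002  # 0.02% of the variance, as in the text
n_needed = required_sample_size(r2)
power_at_53898 = power_single_snp(53_898, r2)  # combined sample of Van der Loos et al. (2013)
print(round(n_needed), power_at_53898)
```

Under these assumptions, the required sample size comes out at roughly 198,000 individuals, and a sample of 53,898 individuals has only a few percent power for an effect of this size.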

Nevertheless, the heritability estimates for entrepreneurship and the successful discovery of SNPs related to other behavioral outcomes indicate that we can be confident about the eventual success of a GWAS on entrepreneurship. Visscher et al. (2017) showed that the number of identified genetic associations in a GWAS is positively related to the size of the (meta-) analysis sample. For example, whereas the first GWAS meta-analysis on educational attainment (N ≈ 100,000) found only three genome-wide significant SNPs (Rietveld et al. 2013), the second one (Okbay et al. 2016b) identified 74 SNPs (N ≈ 300,000), and the third one (Lee et al. 2018) identified 1271 SNPs (N ≈ 1,100,000). Hence, a GWAS with a sufficiently large sample size—at least four times larger than the sample of ~ 50,000 individuals used by Van der Loos et al. (2013)—will also reveal the SNPs that are associated with entrepreneurship.

3.3.2 Genetic discovery using proxy traits

A novel way to boost statistical power in GWASs is the identification of genetic associations using a two-step procedure in the so-called proxy-phenotype method. Rietveld et al. (2014b) introduced this approach to identify genetic associations with cognitive performance. Similar to entrepreneurship, cognitive performance is not often measured in genotyped samples. Therefore, the first step in this method is conducting a large-scale GWAS on a genetically related trait. In the second step, the genetic variants associated with this proxy trait are tested for association with the main trait of interest. In this spirit, Rietveld et al. (2014b) used the results of a GWAS on educational attainment to select 69 independent SNPs, which were then tested for association with cognitive performance. The significance threshold adopted in the second step equals α = 0.05/69 rather than the genome-wide significance threshold of α = 5 × 10⁻⁸.

Linnér et al. (2019) used this approach in their GWAS on risk tolerance to study the genetic architecture of related traits, such as self-employment. Based on their main GWAS on risk tolerance, 99 SNPs were selected for further analysis regarding their association with entrepreneurship. In the second stage, the discovery GWAS (N = 50,627) results of Van der Loos et al. (2013) were used. Using a more lenient threshold for significance, Linnér et al. (2019) found one SNP that was significantly associated with entrepreneurship. The sign of the effect was in the expected direction, meaning that the SNP was related to higher risk tolerance and a higher likelihood of being an entrepreneur. Linnér et al. (2019) claimed in their supplementary materials that “if the association with rs7387531 is robust, this would be the first genetic variant to be found to be significantly associated with self-employment.” However, in the replication sample (N = 3271) of Van der Loos et al. (2013), the effect of the SNP (rs7387531) was in the opposite direction with p > 0.05, so it seems that the first robust association between a SNP and entrepreneurship is yet to be identified. Nevertheless, this approach illustrates that the genetic analysis of related traits may help to find genetic variants associated with entrepreneurship.

3.4 Polygenic risk scores

Individual SNPs typically explain less than 0.02% of the variance in a behavioral outcome (Chabris et al. 2015), and the GWAS on self-employment by Van der Loos et al. (2013) has shown that the effects of individual SNPs on entrepreneurship are also small (otherwise they would have been found). Hence, individually, genetic variants are of little practical use in empirical studies. However, the tiny explanatory power of individual genetic variants has encouraged researchers to develop methods that combine individual genetic variants into so-called polygenic risk scores with larger explanatory power. A polygenic risk score is a weighted sum of SNPs and is constructed as follows:

$$ PG{S}_i=\sum \limits_{j=1}^J{\beta}_j{x}_{ij}, $$

where PGSi is the value of the polygenic risk score for individual i, βj is the regression coefficient of SNP j from the GWAS, and xij is the genotype of individual i for SNP j (coded as 0, 1, or 2). This simple approach has proven to be effective in the out-of-sample prediction of behavioral outcomes. For example, Rietveld et al. (2013) found only three SNPs significantly associated with educational attainment at the genome-wide significance level. Each SNP explained approximately 0.02% of the variance in educational attainment. However, the polygenic risk score based on all SNPs (including the non-significant ones) explained approximately 2.5% of the variance. This percentage increased with the sample size of the GWAS. For example, the most recent polygenic risk score for educational attainment now explains 9.4% (Lee et al. 2018). The prediction attempt of Van der Loos et al. (2013) was unsuccessful in the sense that their polygenic risk score for entrepreneurship captured a statistically insignificant share of less than 0.2% of the variance. Nevertheless, this percentage will increase if the GWAS for entrepreneurship increases in terms of sample size (Dudbridge 2013).
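The weighted sum defined above is straightforward to compute once GWAS coefficients and coded genotypes are available. The following sketch applies the formula to two individuals and four SNPs; the weights and genotypes are hypothetical, and practical score construction usually involves additional steps (such as accounting for correlation between nearby SNPs) that are omitted here.

```python
def polygenic_score(weights, genotypes):
    """Weighted sum of genotypes (coded 0, 1, or 2) with GWAS coefficients as weights."""
    if len(weights) != len(genotypes):
        raise ValueError("weights and genotypes must cover the same SNPs")
    return sum(beta * x for beta, x in zip(weights, genotypes))


# Hypothetical GWAS coefficients for four SNPs.
gwas_weights = [0.02, -0.01, 0.03, 0.005]

# Hypothetical genotypes of two individuals for the same four SNPs.
individuals = {
    "person_1": [0, 1, 2, 1],
    "person_2": [2, 0, 0, 1],
}

scores = {name: polygenic_score(gwas_weights, g) for name, g in individuals.items()}
print(scores)
```

Each SNP contributes its coefficient once per copy of the minor allele, so individuals with identical genotypes always receive identical scores.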

The weights (βj) used in the calculation of the polygenic risk score capture almost the full relationship between the SNP and entrepreneurship: the only control variables used in the GWAS on self-employment by Van der Loos et al. (2013) are sex, age, and variables to account for genetic relatedness between individuals. The relationship between someone’s genetic makeup and behavior is assumed to be extremely complex and to run through many (possibly also multiplicative) pathways. Therefore, a “direct” relationship between a SNP and entrepreneurship is unlikely to exist. Many pathways, possibly comprising gene-gene and gene-environment interactions, are likely to explain the relationship between a SNP and behavior. Nevertheless, in a GWAS, these pathways are all included in βj and therefore also in the polygenic risk score. In the spirit of the proxy-phenotype approach used in GWAS (see Section 3.3.2), we can therefore use the polygenic risk scores of traits that we think are in the pathway between some SNPs and entrepreneurship to foster our understanding about the genetic architecture of entrepreneurship.

One obvious example of such a pathway is risk tolerance. The recent GWAS by Linnér et al. (2019) on risk tolerance shows that the polygenic risk score for risk tolerance does indeed predict entrepreneurship out of sample. Although the explanatory power of this polygenic risk score is relatively small, between 0.57% and 1.36% in terms of (pseudo-)R² for different proxies of entrepreneurship, it contributes significantly to the fit of the model. Moreover, the variance explained is already larger than we may expect it to be for individual SNPs. Risk tolerance may be an obvious trait to investigate when analyzing the pathway between SNPs and entrepreneurship. However, other less obvious traits may also be investigated. For example, earlier research shows that body height is associated with entrepreneurship (Rietveld et al. 2015). The newest polygenic risk score for height explains approximately 34.7% of the variance (Yengo et al. 2018). If the effect of the SNPs explaining entrepreneurship runs through height, we will be able to find an association between the polygenic risk score for body height and entrepreneurship.

Hence, polygenic risk scores constructed for traits other than entrepreneurship may help to identify regions in the human genome that are related to entrepreneurship. Moreover, these genetic summary indices may facilitate the gene-based prediction of entrepreneurship. In the next section, we present empirical analyses that illustrate these two conclusions.

4 Empirical illustration

For our empirical illustration, we draw on data from the US Health and Retirement Study (HRS). The HRS is a representative panel of Americans over 50 years old and their spouses. The HRS focuses on a variety of labor market, health, and retirement outcomes. Genetic data were collected from consenting HRS participants between 2006 and 2012 (Health and Retirement Study 2012). We use the RAND HRS Longitudinal File 2014 (V2) for the data on self-employment (Health and Retirement Study 2018a). This longitudinal data file includes the harmonized biennial data of the HRS (1992–2014). Our dependent variable indicating whether an individual is self-employed or not is derived from the question: “Do you work for someone else, are you self-employed, or what?”. The respondents could answer “for someone else” or “self-employed.” If respondents said they were self-employed, they were coded as 1, and if they replied that they worked for someone else, they were coded as 0. Self-employment is the most commonly used measure of entrepreneurship in studies drawing on survey data (Parker 2018), although engagement in self-employment can be episodic. We restrict our analyses to those aged between 50 and 65 to exclude individuals active in the labor market after retirement age. Moreover, following the recommendations of the genotyping center, we restrict the analysis to individuals of recent European descent to preempt bias from unobserved relationships between genetic and environmental factors (Health and Retirement Study 2012).

For the polygenic risk scores, which are the main independent variables in our regressions, we use the HRS Polygenic Scores 2006-2012 Genetic Data - Release 3 (Health and Retirement Study 2018b). In the present illustrative analyses, we use all available polygenic risk scores in this file that relate to mental health. We choose to limit ourselves to the polygenic risk scores of only these traits, as the recent entrepreneurship literature suggests an important link between entrepreneurship and mental health in terms of person-job fit (Benz and Frey 2008; Stephan 2018). In total, we analyze 16 different polygenic risk scores. In our analyses, we control for sex, birth year (dummies for each birth year), and survey waves (dummies for each survey wave). We also control for the first ten principal components of the genetic relationship matrix, as is common in genetic association studies. These ten variables control for common ancestry, which may be correlated with cultural or environmental factors and could thereby induce spurious correlations between the polygenic risk scores and the outcome of interest (Rietveld et al. 2014a). To estimate the relationships between self-employment and the polygenic risk scores, we use a linear probability model with random effects (to deal with the time-invariant nature of the polygenic risk scores as well as the longitudinal nature of our data):

$$ {SE}_{it}=\sum \limits_{k=1}^K{\gamma}_k PG{S}_{ik}+\boldsymbol{\delta}\ {\boldsymbol{Z}}_{it}+{\alpha}_i+{\varepsilon}_{it}, $$

where SEit is the binary variable indicating the self-employment status of individual i at time t, γk is the effect of the polygenic risk score PGSik for trait k, δ is a vector of coefficients for the vector of control variables Zit, αi is an unobserved random variable for individual i, and εit is the residual for individual i at time t.

Overall, 31,927 (person-year) observations are available from 7948 different individuals. In this sample, 47% of the individuals are male, the average age is 57.4 years (with standard deviation 4.1), and 19.9% of the person-year observations report self-employment. Table 1 displays the estimates of the associations between the different polygenic risk scores and self-employment. We observe that there are six (out of 16) significant associations at the 5% level: the polygenic risk scores for ADHD, autism, bipolar disorder, educational attainment, general cognition, and well-being. For these traits, an increase of one standard deviation in the polygenic risk score is associated with an increase or decrease in the likelihood of being self-employed of approximately 1%. These results indicate that polygenic risk scores can significantly predict entrepreneurship (even when proxied by the relatively episodic activity of self-employment) and that genes influencing entrepreneurship are likely to be found in regions in the human genome associated with these six traits.

Table 1 The association between the polygenic risk scores for traits in the mental health domain and self-employment (random effects regression, Nindividual-year = 31,927, Nindividual = 7948)

At the same time, these results illustrate that the predictive power of these polygenic risk scores is small (although larger than the predictive power of individual SNPs). Compared to that of a model without the polygenic risk scores, the explained variance of this model increased by only 0.42%. Table 2 shows that, from a prediction point of view (taking the percentage of person-year observations in self-employment in our sample, 19.9%, as the classification threshold), the correct individual-level prediction of self-employment status increases only marginally with the current model (an increase of 0.14 percentage points).

Table 2 In-sample prediction results for self-employment (versus wage work) for the models with and without polygenic risk scores; observations in the top 19.9% (percentage of person-year observations reporting self-employment in the sample) of the predicted values in each model are classified as self-employed
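The classification rule behind Table 2 can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names and the ten toy observations are hypothetical, and the predicted values are invented solely to show the mechanics (the toy gain in accuracy is far larger than the gain reported in Table 2).

```python
def classify_top_fraction(scores, fraction):
    """Label the highest-scoring `fraction` of observations as 1, the rest as 0."""
    n_positive = round(fraction * len(scores))
    # Rank observation indices from the highest to the lowest predicted value.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for i in ranked[:n_positive]:
        labels[i] = 1
    return labels


def accuracy(predicted, observed):
    """Share of observations whose predicted label equals the observed label."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)


# Hypothetical toy data: ten person-year observations, two of them self-employed.
observed = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
baseline_scores = [0.10, 0.22, 0.30, 0.35, 0.12, 0.18, 0.21, 0.15, 0.09, 0.20]
with_pgs_scores = [0.10, 0.22, 0.33, 0.31, 0.12, 0.18, 0.34, 0.15, 0.09, 0.20]

# The classification threshold is the observed share of self-employment.
share_self_employed = sum(observed) / len(observed)
acc_baseline = accuracy(
    classify_top_fraction(baseline_scores, share_self_employed), observed
)
acc_with_pgs = accuracy(
    classify_top_fraction(with_pgs_scores, share_self_employed), observed
)
print(acc_baseline, acc_with_pgs)
```

Comparing the two accuracies mirrors the comparison of the models with and without polygenic risk scores; only the difference between them is informative about the added predictive value of the scores.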

5 Conclusion: a second decade?

The “quest for the entrepreneurial gene” (Thurik 2015; Van der Loos et al. 2011) is largely motivated by the struggle of scholars to have a better understanding of entrepreneurs and entrepreneurship: what makes entrepreneurs decide to start a business, what motivates them, what makes them successful or fail, and what makes them different from other people? Various research approaches, as well as tools and theories from economics, psychology, and sociology, have been proposed and applied to these questions. However, the answers to “what makes an entrepreneur” remain uncertain and incomplete (Shane and Venkataraman 2000; Parker 2018). Empirical evidence that genes may be part of the answer (Nicolaou et al. 2008a, b, 2009, 2011; Shane and Nicolaou 2013; Van der Loos et al. 2011, 2013; Zhang et al. 2009) has been received by scholars and the media with both hopes and enthusiasm, as well as with skepticism and criticism.

Despite several attempts in the past decade, until now, no robust association between genetic variants and entrepreneurship has been discovered. Our overview and discussion of these works gives a clear answer to our first research question, “Why has the identification of robust associations between genetic variants and entrepreneurship been unsuccessful in the last decade?” Irrespective of whether a hypothesis-driven or hypothesis-free approach was used, genetic discovery studies on entrepreneurship have until now been underpowered. Nevertheless, based on the results of large-scale genetic discovery studies on other behavioral traits (such as educational attainment), we may expect that robust associations between genetic variants and entrepreneurship will be identified if a sufficiently large sample can be gathered. Datasets that contain both genetic data and entrepreneurship information are relatively scarce (Van der Loos et al. 2013), but the advent of large genotyped biobanks such as the UK Biobank (Bycroft et al. 2018) and the Estonian Biobank (Leitsalu et al. 2015) is currently changing the landscape. Hence, a sufficiently powered GWAS on entrepreneurship may soon become feasible.

Because of data constraints, the latest and largest GWAS on entrepreneurship used self-employment as a proxy for entrepreneurship (Van der Loos et al. 2013). With more data becoming available, future GWASs of entrepreneurship may benefit from the analysis of an entrepreneurship measure less episodic in nature, such as serial or high-performance entrepreneurship. With more precise classification of individuals into occupational groups, the GWAS becomes more powerful and hence the chance to detect associations between individual genetic variants and entrepreneurship becomes larger. Nevertheless, in combination with other GWAS results, the analysis of the relatively heterogeneous self-employment measure may help identify specific underlying types of self-employment. For example, by drawing on GWAS results for schizophrenia and educational attainment, Bansal et al. (2018) reveal that the binary schizophrenia diagnosis aggregates over at least two different subtypes. The first type is associated with high intelligence and bipolar disorder, while the second type is a cognitive disorder that is independent of bipolar disorder. With GWAS results publicly available for many traits,Footnote 16 similar analyses may also be interesting to conduct on self-employment to possibly identify unexpected subtypes.

However, rather than directly analyzing entrepreneurship, it is possible to shift attention (at least for the time being) to variables mediating the relationship between genes and entrepreneurship. Examples of such variables that can be measured in large samples include traits such as preferences for risk and uncertainty, confidence, and optimism. In addition to these well-known measures in the world of entrepreneurship research, one may also consider characteristics such as body height, body mass index, and mental disorders (possibly in a hypothesis-free setting). One advantage of this approach is that genetic effects on more proximate outcomes are likely to be stronger and hence easier to detect, for a given sample size, than the genetic effects on distal outcomes, such as entrepreneurship (Rietveld et al. 2014b). By using the proxy-phenotype approach, as discussed in Section 3.3.2, it will be possible to identify associations with entrepreneurship, for example, by using the (publicly available) GWAS results of Van der Loos et al. (2013) in the second step of the analysis.Footnote 17 This approach circumvents to some extent the problem of the currently insufficient sample size needed for a well-powered GWAS on entrepreneurship.

Although a regular GWAS looks only at the linear association between a genetic variant and entrepreneurship, the genetic architecture of entrepreneurship may comprise interactions between two or more genetic variants. Theoretically, it is possible to include cross-products of SNPs as explanatory variables in a GWAS to advance our understanding of the possibly complex biological mechanisms that are associated with entrepreneurship. However, in a hypothesis-free setting, such an approach would also require an even more stringent correction of the significance level (as the number of statistical tests increases exponentially with the number of interacting SNPs). Hence, if we assume the size of the interaction effects is not larger than the effects of individual SNPs, this approach is unlikely to be productive in the near future because of data limitations. The interaction effect may also be identified with (nonlinear) machine learning techniques. Relatively simple machine learning techniques have been shown to have relatively high predictive power for traits such as human height (Pare et al. 2017; Lello et al. 2018). Despite the massive computational burden of these methods, it is promising to analyze to what extent these techniques are also useful for predicting entrepreneurship. Nevertheless, the biological interpretation of the results obtained with machine learning techniques is arguably even more difficult than that of results obtained with a regular GWAS.
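The multiple-testing burden of interaction analyses can be made concrete with a back-of-the-envelope calculation. The sketch below assumes one million independent SNPs (an illustrative figure, chosen because a Bonferroni correction of a 5% level over one million tests gives the conventional genome-wide significance level of 5 × 10^-8) and counts every unordered combination of SNPs once; the function name is hypothetical.

```python
from math import comb


def bonferroni_threshold(n_snps, order, alpha=0.05):
    """Return (number of tests, per-test significance level) when every
    unordered combination of `order` SNPs is tested once."""
    n_tests = comb(n_snps, order)
    return n_tests, alpha / n_tests


# Assumption: one million independent SNPs (illustrative, not from the study).
N_SNPS = 1_000_000
for order in (1, 2, 3):
    n_tests, threshold = bonferroni_threshold(N_SNPS, order)
    print(f"order {order}: {n_tests:.3e} tests, per-test alpha {threshold:.1e}")
```

Under these assumptions, the number of tests grows roughly as M^k / k! for M SNPs and interactions of order k, so even pairwise interactions require a significance level several orders of magnitude stricter than that of a regular GWAS.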

To answer our second research question, “Would the identification of associations between genetic variants and entrepreneurship help to advance the field of entrepreneurship research?,” we relate the promises of “genoeconomics,” as outlined by Benjamin et al. (2012a), to entrepreneurship research in light of recent developments in behavioral genetics. Benjamin et al. (2012a) outlined four main reasons why the genetic analysis of behavioral traits is important and relevant. First, studies using directly observed genes may reveal the genetic pathways and mechanisms underlying behavior and may lead to a more complete understanding of entrepreneurial behavior. For example, as already discussed above in light of the findings of Bansal et al. (2018), it may be possible to identify to what extent different mechanisms and cognitive processes are involved in the identification and exploitation of business opportunities. Second, these studies have the potential to provide measures for constructs that are difficult to measure empirically. Benjamin et al. (2012a) use the example that specific genetic variants can be used as a proxy for the taste for fatty foods. In this spirit, rather than using self-reported measures for entrepreneurial intention, one could draw on the genes related to entrepreneurship. Third, based on someone’s genetic profile, interventions may be channeled. In this vein, entrepreneurship scholars argue that the prediction of entrepreneurial behavior using genetic data could have practical applications in business and for individual decision-making (Nicolaou et al. 2008a; Nicolaou and Shane 2010; Shane 2010). Fourth, genes can be used to enrich otherwise nongenetic models. For example, the inclusion of control variables for genetic endowments may absorb the residual variance in regression models or experimental settings and allow for stronger statistical inference (DiPrete et al. 2018; Rietveld and Webbink 2016). In some instances, it will also be possible to infer causal relationships in observational data by using genes as instrumental variables (Van Kippersluis and Rietveld 2018; Von Hinke et al. 2016). Hence, the use of genes may be instrumental for better understanding the effects of environmental factors.

Regarding the first two promises, we have seen that for behavioral outcomes (such as entrepreneurship), one should not expect values of R² in excess of 0.02% for individual SNPs. Hence, it is unlikely that such a SNP will provide much information about the mechanisms underlying entrepreneurial behavior. In contrast to focusing on individual genetic variants, there are good arguments for shifting our attention to polygenic risk scores that summarize the contribution of several genetic variants to a trait. A clear advantage of this approach is that polygenic risk scores can be used as regular variables in empirical research, and expertise for working with raw genetic data is not necessary, as some polygenic risk scores are already publicly available (such as in the HRS).Footnote 18 In the present absence of a polygenic risk score for entrepreneurship with significant explanatory power, we have to shift our focus to the analysis of polygenic risk scores for entrepreneurship-related traits. By doing so, we also come closer to the common practice in entrepreneurship research of testing particular hypotheses (i.e., particular pathways through which genes influence entrepreneurship). For example, we may hypothesize and test whether the genetic variants contributing to the development of ADHD are also related to entrepreneurship. In this spirit, a polygenic risk score can also serve as a proxy for a trait. For example, Patel et al. (2019) use the polygenic risk score for ADHD to study the influence of ADHD on entrepreneurship and entrepreneurial performance in a sample of individuals for which the diagnosis of ADHD was not available.
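To illustrate why a polygenic risk score can be handled as a regular variable, the sketch below builds one in the usual way, as a weighted sum of effect-allele counts, and then standardizes it so that coefficients can be read per standard deviation. The three SNP weights and the genotypes are hypothetical values chosen for illustration only; real scores combine GWAS effect estimates for a very large number of SNPs, and this is not the construction procedure of any particular dataset.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of effect-allele counts (0, 1, or 2 per SNP)."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight per SNP is required")
    return sum(w * x for w, x in zip(weights, allele_counts))


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 within the sample."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]


# Hypothetical GWAS effect estimates for three SNPs.
weights = [0.02, -0.01, 0.03]
# Hypothetical genotypes: one row per individual, one allele count per SNP.
genotypes = [[0, 1, 2], [2, 2, 0], [1, 0, 1]]

raw_scores = [polygenic_score(g, weights) for g in genotypes]
scores = standardize(raw_scores)
print(raw_scores, scores)
```

Once computed, the standardized score enters a regression like any other covariate, which is why users of publicly available scores do not need to work with the raw genetic data.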

Regarding the third and fourth promises (the use of genetic information to predict individual behavior and to enrich otherwise nongenetic models), the current state of the behavioral genetics literature as well as the analyses presented in the present study make clear that the added value of genetics for entrepreneurship scholars should be thought of in terms of enriching population-level models rather than improving individual-level prediction (Morris et al. 2019). Van der Loos et al. (2013) show that all SNPs together may explain up to 25% of differences in entrepreneurial behavior between individuals. Even if we are able to realize this prediction R², the likelihood of misclassification of individuals into occupational groups remains high. Hence, early speculations about the use of molecular genetic data for understanding and predicting entrepreneurship (Shane 2010) remain premature, at a minimum. Even though it may be useful to capture some of the (otherwise residual) variance with polygenic risk scores, the gene-based prediction of individual entrepreneurial behavior will remain of limited value for individuals and entities such as governments and banks.Footnote 19
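The point about misclassification can be explored with a small simulation. The sketch below rests on strong simplifying assumptions that are not taken from the study: self-employment is modeled as the top 19.9% of a normally distributed latent liability, and a hypothetical predictor explains 25% of the variance of that liability. The function name, sample size, and seed are illustrative.

```python
import random


def simulate_classification(r2, prevalence, n=20_000, seed=1):
    """Simulate how well a score explaining a share `r2` of a latent liability
    classifies a binary outcome with the given prevalence.

    Returns (accuracy, sensitivity). All modeling choices are assumptions."""
    rng = random.Random(seed)
    score = [rng.gauss(0, 1) for _ in range(n)]
    # Liability = predictable part + noise, scaled so that its variance is 1.
    liability = [
        (r2 ** 0.5) * s + ((1 - r2) ** 0.5) * rng.gauss(0, 1) for s in score
    ]
    n_pos = round(prevalence * n)

    def top(values):
        """Indices of the n_pos largest values."""
        order = sorted(range(n), key=values.__getitem__, reverse=True)
        return set(order[:n_pos])

    actual, predicted = top(liability), top(score)
    true_pos = len(actual & predicted)
    # Every missed case is matched by one false alarm, because both sets
    # contain exactly n_pos individuals.
    accuracy = (n - 2 * (n_pos - true_pos)) / n
    sensitivity = true_pos / n_pos
    return accuracy, sensitivity


acc, sens = simulate_classification(r2=0.25, prevalence=0.199)
print(f"accuracy: {acc:.3f}, sensitivity: {sens:.3f}")
print(f"accuracy of classifying everyone as wage worker: {1 - 0.199:.3f}")
```

Comparing the simulated accuracy with the accuracy of the trivial rule that classifies everyone as a wage worker gives a sense of how much individual-level misclassification would remain under these assumptions.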

Nevertheless, capturing residual variance with polygenic risk scores may improve the understanding of the effects of environmental factors. In so-called gene-by-environment (“GxE”) studies (Keller 2014; Thompson 2014), polygenic risk scores could be used to investigate how entrepreneurship results from the interplay between genetic endowments and environmental factors. For example, a recent study argues that cultural factors (as proxied by the taste for alcoholic drinks) may influence how genes shape different types of entrepreneurship (Acs and Lappi 2019). In general, a good fit between individuals and their occupations has been shown to be important for high levels of productivity (Kristof-Brown et al. 2005). Importantly, the identifiable occurrence of matches and mismatches between an individual and his or her career choices and the possible impact on stress and health was a crucial argument for the medical profession to cooperate with behavioral researchers in the search for the genes associated with entrepreneurship (Koellinger et al. 2010; Van der Loos et al. 2010). Because of the large-scale collections of genetic data and expertise on the biological functioning of genes in the medicine and biology fields, the involvement of researchers in these fields will remain crucial to find associations between genetic variants and entrepreneurship.

In sum, although the attempts to identify specific genetic variants underlying the heritable variation in entrepreneurship have until now been unsuccessful, there is reason to be confident about the eventual success of the “quest for the entrepreneurial gene” (Van der Loos et al. 2011). The benefits of using individual genetic variants for empirical research in the entrepreneurship domain are likely to be small. However, the use of polygenic risk scores may promote the realization of the promises of genoeconomics for entrepreneurship research. Although the gene-based prediction of individual entrepreneurial behavior will be of limited value, the use of polygenic risk scores in models may help to increase our understanding of which regions in the genome and which combinations of genetic endowments and environmental circumstances drive entrepreneurship and person-job fit at the population level.