The Garden of Forking Paths: Reinterpreting Haseman-Elston Regression for a Genotype-by-Environment Model

Haseman-Elston regression (HE-reg) has been known as a classic tool for detecting an additive genetic variance component. However, in this study we find that HE-reg can capture GxE under certain conditions, so we derive and reinterpret the analytical solution of HE-reg. In the presence of GxE, it leads to a natural discrepancy between linkage and association results, the latter of which is not able to capture GxE if the environment is unknown. Considering linkage and association as symmetric designs, we investigate how the symmetry can and cannot hold in the absence and presence of GxE, and consequently we propose a pair of statistical tests, Symmetry Test I and Symmetry Test II, both of which can be tested using summary statistics. Test statistics, and their statistical power issues are also investigated for Symmetry Tests I and II. Increasing the number of sib pairs is important to improve statistical power for detecting GxE.


Introduction
To investigate linkage between a quantitative trait and a marker locus, Haseman and Elston proposed the seminal "Haseman-Elston regression" (HE-reg) method (Haseman and Elston 1972).Since its inception, HE-reg has been a rich garden, which fertilizes many other methods and leads to various improvements (Elston et al. 2000).It has been adapted as a major tool for linkage scans (Cardon and Fulker 1994;Fulker and Cardon 1994;Xu et al. 2000;Sham and Purcell 2001), or for estimating heritability at a genomewide scale using sib pairs (Visscher et al. 2006).Since the rise of genome-wide association studies, HE-reg has been reinvigorated for addressing the problem of "missing heritability" (Yang et al. 2010).Through the linkage kernel (identical-by-descent, IBD) of HE-reg being replaced by IBS (identity-in-state, IIS, also known as IBS), HE-reg can estimate SNP-heritability with good accuracy (Chen 2014; Golan et al. 2014;Bulik-Sullivan 2015;Wu and Sankararaman 2018).Recently, because of the possibility of accumulating new pedigree data (Kong et al. 2018;Kaplanis et al. 2018), HE-reg has been further modified to deal with linkage analysis at an unprecedented scale (Young et al. 2018;Zajac et al. 2023).
Due to technical difficulties, there has been some debate on the properties of HE-reg (DeFries 2010).Nevertheless, there is a consensus that HE-reg is a model-free framework to detect an additive genetic variance component.We here report a long overlooked ambiguity that HE-reg captures if there is any genotype-environment (GxE) variance component besides the additive genetic variance.Following the original approach of HE-reg, we derive the analytical result for the regression coefficient of HE-reg when GxE is present.Due to its historical importance, this reinterpretation for HE-reg will shed light on a forking path in the garden of HE-reg, in which both linkage and association studies have deep roots.
GxE plays a role in shaping complex traits, and, if ignored, it can affect the marginal effect estimation of a locus (Zhu et al. 2024).This study focuses on an unexplored GxE mechanism, which can yield different heritability in linkage and in association studies, respectively.This paper is organized around two kinds of heritability: h 2 , which is consistent with the conventional definition of heritability, say 2pq⊣ 2 , and h 2 , which additionally allows for GxE in the model.We briefly review HE-reg and introduce GxE for a labile trait that is subject to the norm of reaction.GxE is then found to be naturally integrated into HE-reg without the need to include any additional parameters, and this leads to a new interpretation of HE-reg.As HE-reg has been widely used for both linkage and association studies, either in a single-locus or whole-genome context (Sham and Purcell 2001;Visscher et al. 2006;Chen 2014), we here pursue symmetric and asymmetric forms between linkage and association studies under the new interpretation of HE-reg.For linkage and association, two pairs of symmetries are developed: Symmetry I for single-locus estimation of heritability, and Symmetry II for whole-genome estimation of heritability.These two symmetry groups are represented by their respective test statistics-t-tests here.On the basis on the good analytical properties of HE-reg, we envisage linkage and association in a complementary manner to each other and empower the investigation of genetic architecture.When there is GxE, these two symmetries should not hold, but a large sample size-especially if there are only sib pairs-is needed to detect possible asymmetry between linkage and association studies.

Methods
We adapt the original sib pair design of HE-reg, and have a linear model for the additive effect of a casual variant in which Y ij = y i − y j 2 for a pair of individuals i and j who are a sib pair, the mean of the model, l , the regression coefficient, ij the identity-by-descent (IBD) of the sib pair, and e ij the residual.We consider K pairs of sibs from K independent families, so Y ij is a K × 1 vector.Now we take a close look at its signature result, the interpretation of the regression coefficient l , which is written as in which c is the recombination fraction between the l th marker and a causal variant, and h 2 the variance explained by the causal variant.Equation 2 is the commonly accepted standard interpretation for HE-reg (Lynch and Walsh 1998).
Here we have not distinguished h 2 from 2 a (additive genetic variance of a locus), which can be reconciled by scaling. (1) In the context below, we choose the appropriate notation either h 2 or 2 a .

Norm of reaction for a labile trait
Sibling differences such as one drinks but the other not makes a form of personalized "environment", but we only focus on environmental factors that can apply to the whole family, such as highland habitants vs lowland habitants.
A typical non-removable GxE is present if both genotype effects "crossover" (Wang et al.When high-or low-land habitants are clearly defined, it is easy to include this as a covaria t e i n a c o nve n t i o n a l G x E GWA S m o d e l y = + b × SNP + c × Habitant + d × (SNP × Habitant) + e , and d captures the linear by linear proportion of GxE.How- ever, environmental factors can be, if not infinite, many, each of which can lead to norm of reaction for each locus; for the l th locus, the acting environmental factor could be habitation, but could be water quality for the (l + 1) th locus.In other words, if not completely impossible, it is both logistically and computationally impractical to explore such models exhaustively.Here, we propose an alternative route via HEreg, which does not need to articulate the environmental interaction, but is able to capture its effect.
We assume K = K 1 + K 2 sib pairs, and K 1 sib pairs are highland habitants and K 2 sib pairs lowland.Now, we have Table 1, which is a modification of the original HE paper (please refer to "Table I we drop off 2 d = 4p 2 q 2 d 2 because of ignorable dominance genetic variance, and Eq. 4 is identical to the Eq. 3, indicating that HE-reg does not distinguish GxE, and consequently does not distinguish GxE from a pure additive model.Of the l th locus, we arrive at a generalized expression for the HE-reg regression coefficient: ( in which E is the number of different environments.Equa- tion 5 actually slices the original HE-reg into much thinner layers, each of which covers a fraction f of sib pairs, who are sampled from environment f .For the l th locus, we assume a l follows a l ∼ N(a l , E 2 l ) , in which a l the inter-family additive effect (marginal effect) and E 2 l is the variation due to familial or environmental origin-an analogue of GxE.E a 2 l = a 2 l + E 2 l , and Eq. 5 can be written in an even more general form as 2 " to have GE shown as scheme I and scheme II.
the proportion of sib pairs having scheme I genetic effects, and Table 2 The test statistics for various Haseman-Elston regressions , see Appendix 2; E( ) = 0.5 here for sib pairs.We consider the marker itself Symmetry II: whole-genome model (collapsed genetic architecture) HE-reg Even though the model is identical to the original HEreg, Eq. 6 indicates that h 2 has an additive variance com- ponent, which can be detected via an additive effect model, together with a GxE component.As discovered by recent large-sample GWAS, causal variants are very saturated (Yengo 2022).We now omit the factor (1 − 2c) 2 in Eq. 6 in the text below.The complexity of the original HE-reg paper arises largely to deal with (1 − 2c) 2 .
In particular, for HE-reg we have its null and alternative hypotheses: H 0 ∶ = 0;H 1 ∶ ≠ 0 , so for Eq. 6 we construct a t-test t LS for single-locus linkage analysis: For sib pairs, = 0.5 in Eq. 7, and the approximation holds when h 2 is small.The detailed elements for con- structing Eq. 7 are given in Table 2.
Nevertheless, for a single-locus GWAS linear model y = + a l z l + e , in which z is the standardized genotype for the l th locus, the t-test , in which h 2 = 2p l q l a 2 l does not include a GxE component.This casts a discrepancy between linkage and association studies.
Replacing IBD with IBS, for n unrelated pairs HE-reg is y i − y j 2 = a + ⌊s ij + e ij .y i and y j are the standardized phe- notypes for unrelated individuals i and j , s ij the genetic related- ness score, e ij the residual and b is the regression coefficient.
Given an unrelated GWAS sample, the analytical solution for SNP heritability is in which Q is the number of causal variants, often unknown, and for HE-reg (Chen 2014).
There are various unexplored paths for possible variation in Eq. 8, but we here direct our attention to a collapsed form (6) k because the summation of the covariances of all pair of variables is zeroed out.A pair of assumptions are needed for Eq.9: I) the first approximation occurs under the general polygenic assumption that causal variants are randomly allocated along the genome, and II) the second approximation holds when the markers are saturated enough, indicating that nearly every SNP is possibly causal (Yengo 2022).So, the approximation should be retained for Eq. 8 in view of these two assumptions.Of note, albeit pathologically, counter-examples can be found in which these two assumptions do not hold: see discussion in other studies (Chen 2014).The corresponding t-test t AW for whole-genome association analysis is (Table 2) However, when there is only one marker 2 mm = 1 , and Eq. 10 leads to a single-locus test for association t-test, t AS , which is about (Table 2) If we replace the in Eq. 1 with aggregated genome-wide IBD π = 1 m ∑ m k=1 k , it leads to whole-genome estimation of heritability for a linkage design (Visscher et al. 2006): Using the same two assumptions for Eq. 8, the expecta- k , and its sampling variance is about Morgan is the genetic length of the 22 human autosomes.However, the number of markers in the linkage study determines the sampling variance of B .Consequently, the t-test t LW for whole-genome linkage analysis is (Table 2)

The statistics for the forking path of HE-reg
As derived above and summarized in non-centrality parameter (NCP) of the squared t-test statistic as shown below: Symmetry II ∶ whole genome For each paired test statistic, Symmetry I or Symmetry II, under the pure additive model, the test derived from linkage often has less power than that from association; otherwise GxE is available and captured by HE-reg.

Validation for HE-reg in the presence of GxE
We want to validate the accuracy of the t-test statistics in Eq. 7 ( ) in the presence of GxE.We simulated K nuclear families, each of which consisted of a pair of unrelated parents and a sib pair.The genotypic effects were set as 1, 0, and -1 for BB , Bb , and bb (Scheme I) for the first K 2 families, and as -1, 0, and 1 for bb , Bb , and BB (Scheme II) for the second K 2 families.The reference allele frequency was uniformly sampled from 0.01 ~ 0.5; h 2 = 0.1, 0.25 , and 0.5, respectively; K = 200 and 500 respectively.For each simulated sib pair, (1, 0.5, and 0, and var( ) = 1 8 ) was directly known for the simulation data, and the marker itself was causal so that (1 − 2c) 2 = 1 .We resampled each scenario 200 times.As a contrast, for K families, we also pooled their 2K unrelated parents together, which made a sample for GWAS.
As illustrated in Fig. 1, since Scheme I and Scheme II were similar scenarios-but opposite effect size-they had very consistent results within HE-reg.When K increased from 200 to 500, the consistency remained.As expected, HE-reg was not as powerful as GWAS in catching a single Scheme-I or Scheme-II locus.However, when the Scheme I and II samples were pooled together and an unremovable GxE was present, HE-reg was more powerful to detect GxE, not just because of the doubled sample size; in contrast, GWAS had no power at all to detect the GxE effect because the effects cancel out under a GWAS model.
One criticism might be: why not include GxE in a GWAS model?A GWAS model could be more powerful for detecting a GxE locus, if we knew what the environment was.However, the number of environments is infinite in practice.

Power calculation for GxE using single-locus HE-reg for a linkage design
It should be noticed that HE-reg has much lower power in linkage than in the comparable association analysis, so it is trivial to compare the statistical power for linkage and association.We considered the statistical power for HEreg to detect a single locus, which could be in the context of either h 2 or h 2 .In the simulations, we set the Type I Error rate = 0.05 1,000,000 after 1 million genotyped SNPs, and we set h 2 = 0.005 , 0.0075, 0.01, 0.015 and 0.02, respec- tively; the total number of sib pairs ranged from 500,000 to 10,000,000, with an increment of 500,000.Given type I error rate = 0.05 1,000,000 and if the statistical power 0.85 is considered acceptable in practice, for h 2 = 0.015 , n ≈ 3,000,000 sib pairs are required, and for h 2 = 0.01 , n ≈ 7,000,000 sib pairs are required (Fig. 2).

Statistical test for symmetry I
. We evaluated statistical power for Eq. 15 by setting a locus with GxE ( h 2 LS from 0.005 to 0.05) but no additive effect, h 2 AS = 0 , as illustrated in Fig. 3.The number of SNPs was 1,000,000, and the Type I error rate was = 0.05 1,000,000 .The sib pairs for n LS =50,000, 150,000, and 300,000, respectively, and n AS = 10,000 and 100,000, respectively.It was obvious that the statistical power benefited from a larger n LS (Fig. 3).Of note, Sym- metry I could be implemented purely on a pair of summary statistics from single-locus linkage and single-locus GWAS. (15

Statistical test for symmetry II
Analogously, we could test for Symmetry II, and its t-test We conducted simulation for Symmetry II, and key parameters were set as L = 32 Morgan and 2 mm = 1 100,000 .Two traits were considered, the first trait had h 2 = 0.25 and the second h 2 = 0.45 , which could be detected by whole-genome association HE-reg; and in contrast, with the inclusion of GxE (from 0 to 0.2, with increments of 0.02), its h 2 = h 2 + GxE , which could be detected by whole-genome linkage HE-reg.The sample size for association was n AW = for the transfomred pair of t-tests in Eq. 15.The y asix is statstical power given h 2 LS ranged from 0.005 to 0.05 (x-axis), evenly broken into ten values, and h 2 = 0 in association.The two horizontal lines were for statistical power of 0.25, 0.5, and 0.85, respectively.The sample sizes for linkage ( n LS ) and for association ( n AS ) are as shown each subtitle 10,000, and 100,000, respectively, and for linkage it was n LW = 50,000 , 150,000, and 300,000, respectively.The sam- ple size should be larger than 50,000 for linkage, otherwise hardly a sensible difference could be statistically detected.

Discussion
Both marginal environmental effect and GxE interactions can shape genetic architecture (Zhu et al. 2024).As revealed in this study, the original HE-reg actually captures not only an additive effect but also GxE.However, a GWAS model is very likely to miss such a GxE location if the environmental variable is unknown.So, given the intriguing behaviour as investigated between HE-reg and GWAS, it naturally leads to a discrepancy between linkage and association studies as long as genes harbor familial variation.The real advantage of our reinterpretation for HE is that a gene desert in GWAS may lead to rich results in single-locus HE-reg linkage, which can detect such a locus without specifying the environment (Fig. 1).If there are unexplored hotspots for GxE across the genome, it is likely to observe a GxE landscape for various traits and enrich our understanding of genetic architecture.According to our power calculation (Fig. 2) more sib pairs may enhance the statistical power of HE-reg linkage for detecting GxE.Symmetry I and II tests can be implemented on summary statistics (Figs. 3 and 4), so if we have a large sibling data resource, it is possible to construct summary statistic tests for GxE.So far, the publicly available sib pairs may be drawn from UK Biobank (Bycroft et al. 2018), but unfortunately there are far less, say about 20,000 sib pairs, than a sample size that leads to practical statistical power.As shown in the simulation results, a linkage study would have lower power to detect a signal than a GWAS, given the same sample size, which could be about 280,000 unrelated UKB samples, and consequently we anticipate finding only few signals in a linkage study.However, one technical difficulty may be how to distinguish between an independent signal and the synthetic signal of a LD block.One of the agenda items in addressing the "missing heritability" question is to narrow down the heritability gap between the family-based studies (linkage) and unrelated GWAS samples (association).In the absence of GxE, increasing the number of variants gives a possible way to solve this issue.The trait most studied is height, the heritability of which is estimated in European descendants using various markers.The benchmark source of familybased heritability estimate of ĥ2 FAM = 0.8 ± 0.1 was based on 950 quasi-independent full-sib pairs, and their wholegenome IBD was estimated using 791 autosome microsatellite markers (Visscher et al. 2006); the compared SNP-heritability was ĥ2 SNP = 0.45 ± 0.083 , which was estimated from 3,925 unrelated samples and 294,831 SNP markers (Yang et al. 2010).The ĥ2 SNP so far has approached 0.68 ± 0.10 using 25,465 unrelated European descend- ants on the whole-genome sequenced TopMed data with 33.7 × 10 6 variants (Wainschtein et al. 2022).Given such a small sample size for the compared family study, it was not likely to reach statistical significance.Symmetry II test indicates another way for interpreting the heritability gap.The gap is not necessarily about inclusion of more variants but rather about using a family-based design instead.

Appendix 1: Least squares estimation
To a linear model of n observations Its least squared estimation is in which MSE is
There are six unique complex terms involved, we need to treat them one by one.
As y 1 and y 2 are related, using Cholesky decomposition, we rewrite y 1 and y 2 as below in which the relatedness score ( = 0.5 for a pair of sibs), z 1 ∼ N(0, h 2 ) , e 1 ∼ N(0, 1 − h 2 ) , and z 2 ∼ N(0, h 2 ) , e 2 ∼ N(0, 1 − h 2 ) , and e the correlation between residuals., where d is the genetic distance measured in Morgan.For the maternal haploids, if the recombination fractions are different from that of paternal, the similar table should be made for maternally raised IBD.

Item var y
2010), as the norm of reaction defined below for a causal variant linked the l th locus: Scheme I (highland habitants): genotypes BB , Bb and bb with their corresponding effects G BB = a , G Bb = d and G bb = −a; Scheme II (lowland habitants): genotypes BB , Bb and bb with their corresponding effects G BB = −a , G Bb = d and G bb = a.

Fig. 1
Fig. 1 Validation for test statistics for HE-reg and GWAS.200 simulations for each scenario, and sib pairs were analysed by HE-reg (first row) and their parents were pooled together and analysed by GWAS (second row), such as y = a + bz + e .In each cluster, the three coloured bars were for scheme I, scheme II, and scheme I + II samples.Of each pair of the same colour bars, the first one was the expected t-test statistic calculated by Eq. 7 ( t HE = � − √ n−2 � 3+16 ( 1−h 2 ) h4

Fig. 3
Fig. 3 Statistical power for Symmetry I single-locus test for GE.The NCP of the test statistic is

Fig. 4
Fig.4Statistical power for Symmetry II whole-genome test for GE.Numerical evaluation for statistical power of Eq. 16, and its converted NCP is t 2 II ≈ − y 2 2 = var y 2 1 + var y 2 2 + 4var y 1 y 2 + 2cov y 2 1 , y 2 2 + 2cov y 2 1 , −2y 1 y 2 + 2cov(y 2 2 , −2y 1 y 2 ) p ) + 1 4 var( m ) , and cov( p , m ) = 0 because the inde- pendency between the paternal (with superscript p ) and maternal (with superscript m ) meiosis.For a chromosome, with k genotyped markers, E( consider the IBD transmitted from paternal origins.We know, t h e t e r m E( p i p j ) c a n b e c a l c u l a t e d b e l owvar y 1 − y 2 2 = 8(1 − ) 2 h 4 + 8 1 − e 2 1 − h 2 = 8(1 − ) 2 h 4 + 8(1 − h 2 ) j ), where = 1 if the alleles are IBD for the sib pair or 0 if not.u m i n g t h e H a l d a n e m a p p i n g f u n c t i o n r p = 0.5[1 − exp(−2d)]

2
Assume the difference between maternal and paternal recombination fractions is ,NowWhen k is very large, it can be expressed as an integral, and the analytical solution is:

Table 1
A rework of the original Table I for HE-reg when the target locus harbors genotype-environment interaction Notes: This table is adapted from Table I from the original HE-reg paper (Haseman and Elston 1972), but we expand the column