Genome-wide discovery for biomarkers using quantile regression at biobank scale

Wang, Chen; Wang, Tianying; Kiryluk, Krzysztof; Wei, Ying; Aschard, Hugues; Ionita-Laza, Iuliana

doi:10.1038/s41467-024-50726-x

Genome-wide discovery for biomarkers using quantile regression at biobank scale

Article
Open access
Published: 31 July 2024

Volume 15, article number 6460, (2024)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue

Genome-wide discovery for biomarkers using quantile regression at biobank scale

Download PDF

815 Accesses
Explore all metrics

Abstract

Genome-wide association studies (GWAS) for biomarkers important for clinical phenotypes can lead to clinically relevant discoveries. Conventional GWAS for quantitative traits are based on simplified regression models modeling the conditional mean of a phenotype as a linear function of genotype. We draw attention here to an alternative, lesser known approach, namely quantile regression that naturally extends linear regression to the analysis of the entire conditional distribution of a phenotype of interest. Quantile regression can be applied efficiently at biobank scale, while having some unique advantages such as (1) identifying variants with heterogeneous effects across quantiles of the phenotype distribution; (2) accommodating a wide range of phenotype distributions including non-normal distributions, with invariance of results to trait transformations; and (3) providing more detailed information about genotype-phenotype associations even for those associations identified by conventional GWAS. We show in simulations that quantile regression is powerful across both homogeneous and various heterogeneous models. Applications to 39 quantitative traits in the UK Biobank demonstrate that quantile regression can be a helpful complement to linear regression in GWAS and can identify variants with larger effects on high-risk subgroups of individuals but with lower or no contribution overall.

Methods and Tools in Genome-wide Association Studies

Genome-wide association studies

Article 26 August 2021

A generalized linear mixed model association tool for biobank-scale data

Article 04 November 2021

Introduction

GWAS are commonly used to identify genetic associations with phenotypes of interest and have been successful in identifying a large number of associations with diverse traits and diseases. The GWAS models used to connect genotypes with phenotypes are based on linear regression (LR) models that can fail to capture more complex and dynamic relationships. Despite awareness of this complexity^1,2, the standard models remain highly simplified, largely because the very large sample sizes required for genetic discovery preclude incorporating known complexities into the analytic framework (e.g. relevant environmental measures may not be available at this scale).

Linear regression models allow one to test whether genetic variants are associated with the mean of a phenotype distribution. However, the effect of a genetic variant can go beyond the mean and can include the lower or upper parts of the phenotype distribution. Several known examples include an earlier study by Yang et al.³ that showed that an SNP in the FTO gene is associated with both the mean and the variance of BMI, as well as several variance quantitative trait loci (vQTL) studies^4,5,6,7,8. Genetic associations can be complex due to underlying heterogeneity in the population and disease model, and the effects of gene-gene and gene-environment interactions^9,10. For eQTLs, several recent studies have shown the dynamic nature of eQTLs depending on the context or environment, providing evidence of widespread gene–environment interactions that can play a role in disease development¹¹. LR, with its focus on the central part of the phenotype distribution, may not capture these more complex associations.

There are several existing methods for detecting genetic variants that control phenotypic variance^{4,12,13,14,15,16,17}. However, the detected variance effects in practice are very few, and reflect only to a limited extent the existing effect heterogeneity, as we show in simulations and applications below. An alternative, natural extension of LR in the context of quantitative traits is based on quantile regression (QR¹⁸). Different from testing moments, quantile regression-based tests, as discussed here, target distribution-wise differences and can lead to the detection of more general distributional differences. In particular, QR allows one to model any quantile of the phenotype distribution and can be used to measure how a genetic variant associates with various quantiles of a trait.

QR techniques have only sporadically been utilized in GWAS studies, most of them of modest sample sizes^{6,19,20,21,22}, despite the similarity to LR, the availability of efficient implementations for genome-wide analysis in standard statistical packages^6,23,24, and some notable advantages as follows. First, QR provides a better characterization of the underlying genotype-phenotype association, even at those associations identified by LR. Second, QR does not have assumptions on the error distribution, which means that it is robust to non-Gaussian errors and outliers. While in linear regression, traits are usually normalized using, for example, rank-based inverse normal transformation, and results (type-1 error and power) can depend on the particular transformation^25,26, with quantile regression results are invariant to monotonic transformations and therefore, no particular trait transformation is required. It is, of course, highly desirable that trait transformations do not affect our conclusions based on the regression results. Furthermore, this invariance property also helps when certain transformations are preferable to get a more natural interpretation of the association patterns.

Furthermore, QR can be helpful in downstream analyses, such as polygenic risk score (PRS) construction. Specifically, QR provides estimates of regression coefficients at specific quantile levels, with a similar interpretation as in LR. Namely, a QR coefficient represents how much a specific quantile of the phenotype distribution changes with a 1-unit shift in the genotype distribution (e.g. for the 0.5 quantile, this would correspond to the change in median). While the PRS derived from LR models in GWAS are based on predicting the conditional phenotypic mean (conditional on genotypes), QR allows PRS construction at different quantile levels, essentially allowing us to predict the entire conditional phenotypic distribution, not just the mean.

We illustrate below the performance of QR in simulations using several homogeneous and heterogeneous models, and compare it with LR and a recent variance-based test, QUAIL¹⁷. We also illustrate the practical benefits in the GWAS setting, by applying it to biomarkers from the UKBB²⁷.

Results

Overview of quantile regression techniques

We describe here briefly the quantile regression model as a complementary strategy to the commonly used linear regression model in GWAS. More details are in the Methods section. We assume that we have n independent samples from a population. We denote by ${\bf{Y}}=({Y}_{1},\cdots \,,{Y}_{n})^{\prime}$ the n × 1 sample phenotype vector, by X the n × p genotype matrix, and by C the n × q matrix for covariates. We denote the τth conditional quantile function of Y as Q_Y(τ∣X, C). Then we can write the conditional quantile regression model for jth variant and for a specific quantile level τ ∈ (0, 1) as

$${Q}_{{\rm {Y}}}(\tau | {{\bf{X}}}_{j},\,{\bf{C}})={{\bf{X}}}_{j}{{\boldsymbol{\beta }}}_{j}(\tau )+{\bf{C}}{\boldsymbol{\alpha }}(\tau ),$$

(1)

where β_j(τ) and α(τ) are quantile-specific coefficients and can be estimated by minimizing the pinball loss function¹⁸ as implemented in the R package quantreg. For testing H₀: β_j(τ) = 0, a commonly used tool in quantile regression is the rank score test^6,28, implemented in the R package QRank. The rank score test statistic for a fixed quantile level τ is defined as

$${{\bf{S}}}_{{\rm{QRank}},j,\tau }={n}^{-1/2}\mathop{\sum }\limits_{i=1}^{n}{{\bf{X}}}_{ij}^{*}{\phi }_{\tau }({{\bf{Y}}}_{i}-{{\bf{C}}}_{i}\hat{{\boldsymbol{\alpha }}}(\tau ))\,{\rm{and}}\,{{\bf{V}}}_{{\rm{QRank}},j}={n}^{-1}\tau (1-\tau ){{\bf{X}}}_{j}^{*^{\prime} }{{\bf{X}}}_{j}^{*}.$$

(2)

where X^* = P_CX, ${{\bf{P}}}_{{\rm {C}}}={\bf{I}}-{\bf{C}}{({\bf{C}}^{\prime} {\bf{C}})}^{-1}{\bf{C}}^{\prime}$, and ϕ_τ(u) = τ−I(u < 0) is the τ-pinball loss. Under the null hypothesis H₀: β_j(τ) = 0, ${{\bf{V}}}_{{\rm{QRank}},j}^{-1/2}{{\bf{S}}}_{{\rm{QRank}},j,\tau } \sim N(0,1)$. Note that the asymptotic distribution of the test statistics is independent of the particular distribution of the phenotype, and hence the rank score test can be applied to any phenotype without requiring a pre-transformation to achieve normality. Throughout this paper, we use nine quantiles spaced equally at 0.1 intervals and combine quantile-specific p-values via Cauchy combination²⁹.

QR is powerful across different homogeneous and heterogeneous models

We perform simulations based on UKBB genotype data on 20,000 white British unrelated individuals and focus on one locus (chr8:19,291,998–20,291,997; GRCh37/hg19) with 4148 variants. We assume only one causal variant selected at random among variants with minor allele frequency (MAF) > 1% and with varying effect size: 0.01 ≤ β ≤ 0.2. We simulate phenotype data using several models, including a homogeneous linear model (with effects being constant across quantiles) with the error distribution being normal or a heavy tail Cauchy distribution; a location-scale model; a local model where effects only exist at the upper quantiles; and a model with both additive and dominance effects. Trait is normalized using rank-based inverse normal transformation for LR only; raw trait values are used for QR and QUAIL. We assess the power of LR, QUAIL, and QR based on 100 Monte Carlo replicates. Note that the type-1 error is well controlled under both normal and Cauchy error distribution at individual quantile levels (Table S1). There is slight inflation at the 0.05 level when using the Cauchy combination method to combine p-values across quantile levels, due to the fact that the Cauchy combination method is known to be most accurate in the tail of the null distribution²⁹, and therefore smaller p-values are approximated better. Given that in GWAS, we are interested in identifying variants at stringent significance levels, the Cauchy combination method should be an appropriate strategy to combine p-values.

Homogeneous linear model with normal or Cauchy error distribution

we assume the phenotype Y follows the model:

$${\bf{Y}}=\beta {\bf{X}}+{\boldsymbol{\epsilon }},$$

(3)

where X is the genotype at the causal variant, ϵ ~ N(0, 1) or ϵ ~ Cauchy(0, 1). For the normal error, LR has a slight power increase over QR (Fig. 1a), while for the Cauchy error (heavy tail distribution), QR has higher power than LR (Fig. 1b). QUAIL has no power in these scenarios.

**Fig. 1: Power results in simulations.**

Location-scale model

we also simulate the phenotype Y from the following model:

$${\bf{Y}}=\beta {\bf{X}}+(1+0.5\beta {\bf{X}}){\boldsymbol{\epsilon }},$$

(4)

where ϵ ~ N(0, 1). In this model, X is associated with both the mean and the variance of Y. QR has the highest power in this scenario, followed by QUAIL, while LR has the lowest power, as expected (Fig. 1c).

Local model with effects only at upper quantiles

we additionally simulate data under a local model with effects only at the upper quantiles τ ∈ [0.7, 1]:

$${Q}_{{\rm {Y}}}(\tau | {\bf{X}})={\boldsymbol{\beta }}(\tau ){\bf{X}}+{{\boldsymbol{\alpha }}}_{0}(\tau ),$$

(5)

where β(τ) = 5β(τ−0.7)/(1−0.7) when τ > 0.7 and β(τ) = 0 otherwise. The error distribution (${{\boldsymbol{\alpha }}}_{0}^{-1}(\tau )$) is assumed to be normal. Such a model can correspond to a gene–environment interaction, where a genetic variant only shows an effect on a phenotype in a certain environment. Under this local model, QR can have significant increase in power over LR (Fig. 1c). QUAIL has similar power as QR.

Model with both additive and dominance effects

finally, we consider a model that includes both additive and dominance effects as follows:

$${\bf{Y}}={\beta }^{{\rm {a}}}{{\bf{X}}}^{{\rm {a}}}+{\beta }^{{\rm {d}}}{{\bf{X}}}^{{\rm {d}}}+{\boldsymbol{\epsilon }},$$

(6)

where X^a corresponds to the additive coding 0, 1 and 2, while X^d is coded as $-p/(1-p),\, 1,-\left.(1-p)/p\right)$ with p being the MAF, and ϵ ~ N(0, 1). When the dominance effect is strong enough, QR can indeed be more powerful (Figs. 1d, S1). QUAIL overall has lower power than both LR and QR.

Applications to UKBB GWAS

To illustrate QR in a modern GWAS setting and contrast the results with LR and QUAIL, we use the UKBB data (between 264,854 and 325,825 white British, unrelated/independent individuals and ~8.6 million SNPs), and perform GWAS for 39 quantitative traits (Fig. S2). We adjust for several covariates, including age, sex, batch, and 10 principal components (PC) of genetic variation. We identify significant SNPs using a genome-wide significance threshold of 5 × 10⁻⁸ for each trait (Fig. S4–S13). We define a locus as a 1Mb segment around the top p-value SNP in a region. LR detects a larger number of significant associations relative to QR (2,039,518 genome-wide significant SNPs and 9964 loci vs. 1,517,669 SNPs and 7297 loci). Most of the QR discoveries are shared with LR, but QR also identifies new loci that LR misses (Fig. 2a, S2, and Supplementary Data 1). In contrast, QUAIL detects a few significant associations across these 39 biomarkers, most of which are already detected by QR (Fig. S3). Specifically, on average, only 9% (ranging from 0% to 33%) of the QR associations were also detected by QUAIL.

**Fig. 2: Number of genome-wide significant loci identified by LR and QR for each of 39 traits, effect size heterogeneity, and functional score comparison.**

We replicate the genome-wide significant associations above in an independent dataset of European individuals from UKBB (n ≈ 34,000–39,000). Given the much smaller sample sizes for the replication datasets relative to the discovery datasets, we report replication at nominal significance level (p-value < 0.05). One challenge with QR is that the result is a composite of results across multiple quantile levels and the p-value is obtained by Cauchy combination of individual quantile-level p-values which makes it more difficult to take the sign of the associations into account when considering replication. To overcome this challenge, we used the sign of the estimated effect β(τ) for the quantile with the lowest QR p-value in the discovery dataset and compared it to the sign of the estimated effect for the same quantile level in the replication dataset. QR has replication rates at least as high as LR (68% vs. 62% on average across the 39 traits, Fig. S14, Table S2). For both LR and QR, sign consistency tends to be very high (≈96%). Note also that most of the lead QR variants at loci only identified by QR still have suggestive signals in LR, suggesting that many of them could be identified in larger GWAS (Fig. S15).

Since QR allows us to estimate quantile specific effects across a whole range of quantiles, we can quantify the degree of heterogeneity in effects across the quantiles for individual SNPs. Specifically, for a SNP we compute a heterogeneity index as the log transformed ratio between the standard deviation and the absolute mean of the estimated quantile coefficients β(τ)’s: $\log | {\rm{sd}}(\beta (\tau ))/{\rm{mean}}(\beta (\tau ))|$. The higher the value, the more heterogeneity in the quantile effects. As expected, SNPs identified by QR but missed by LR (QR p-value < 5e−08, LR p-value > 5e−08) are the most heterogeneous (Fig. 2b). Furthermore, they have significantly higher functional scores relative to those only identified by LR (Eigen-PC³⁰ p = 0.031, Fig. 2c). Together with the higher replication rates and the high sign consistency mentioned above, these results suggest that the new QR-LR loci are at least as reliable as the loci identified only by LR.

Next, we focus on two traits, BMI and LDL, to provide more specific results. Both traits have a right-skewed distribution. For BMI, QR identifies 188 loci, of which 172 are shared with LR, and 16 are unique to QR (Figs. 3a and S16, Table S3). Variants identified by QR but missed by LR have heterogeneous effects across quantile levels as expected, with larger effects at the upper quantiles (Fig. S17), likely due to the right skewness of the BMI distribution. For LDL, QR identifies 93 loci, with 87 of them being shared with LR, and 6 being unique to QR (Fig. 3b and S18, Table S3). As with BMI, variants identified by QR but missed by LR have larger effects at the upper quantiles (Fig. S17). Moreover, variants identified by QR but missed by LR have more heterogeneity (Fig. S19a) and higher functional scores (Fig. S19b) relative to those only identified by LR.

**Fig. 3: Manhattan sunset plot for BMI and LDL.**

Next, we illustrate the advantages of QR when there is effect heterogeneity by showing results at two loci detected by QR but missed by LR for two traits, BMI and LDL. For BMI, we show the locus corresponding to the lead QR SNP rs12037905 (Fig. 4a), which falls within LYPLAL1-antisense RNA1 (LYPLAL1-AS1), a long noncoding RNA gene, which has been found previously to be associated with BMI³¹ and lipid traits such as high-density lipoprotein (HDL) cholesterol and triglycerides (TG) levels³². Yang et al. described a mechanism by which long noncoding RNA regulates adipogenic differentiation of human mesenchymal stem cells and suggested that LYPLAL1-AS1 may serve as a novel therapeutic target for preventing and combating diseases related to abnormal adipogenesis, such as obesity³³.

**Fig. 4: Examples of QR–LR and QR& LR significant loci for BMI and LDL.**

For LDL, we show the lipoprotein lipase (LPL) locus with lead QR SNP rs326 (Fig. 4b), which is an intron variant within LPL. LPL, encoding lipoprotein lipase, plays an important role in the clearance of TG and the maturation of HDL. A higher concentration of serum LPL is associated with a lower/higher level of TG/HDL cholesterol and contributes to a reduced risk of coronary artery disease (CAD)^34,35. rs326 at LPL is a pleiotropic variant associated with CAD³⁶ and lipid metabolism³⁷. Large lipid genetic studies have shown that the rs326 G allele is associated with increased HDL cholesterol levels and decreased TG, suggesting a protective effect for CAD risk³⁷. In our QR analyses, the rs326 G allele is significantly associated with decreased LDL at upper quantiles (Fig. 4b), suggesting a protective effect against CAD risk among individuals with higher levels of LDL.

The advantage of QR is not restricted to identifying novel loci, missed by LR. QR can provide additional insights even for loci detected by LR. We show here two examples corresponding to well known associations: the FTO locus and BMI, and the PCSK9 locus and LDL. These loci have strong associations from the conventional LR approach, yet QR provides additional information on the heterogeneity of effects across different quantile levels beyond what one can get from LR (Fig. 4c, d). In particular, the rs11642015 T allele at the FTO locus (intron variant) has higher effects at the upper quantiles of the BMI distribution, while the rs11591147 T allele at the PCSK9 locus (missense variant) has stronger negative (protective) effect at the upper quantiles of the LDL distribution. Both variants have been reported before as likely to play a functional role^38,39. In particular, the FTO rs11642015 resides in an enhancer that regulates the expression of the neighboring IRX3/IRX5 cluster through long-range interactions with the IRX3/IRX5 gene promoters, and was shown to repress mitochondrial thermogenesis in adipocyte precursor cells⁴⁰.

Heterogeneous associations identified by QR could be the result of gene-environment interactions. Even though gene–environment interactions are believed to be common, detecting them is challenging due to several factors, including lower statistical power relative to detecting main effects, the low prior probability of specific gene–environment interactions, etc. Furthermore, the relevant environmental exposures can be missing from the study, and even when measured, accurate measurements of exposure may not be available. The significant SNPs identified by QR can be subsequently tested for interactions with particular environmental factors using conventional gene–environment tests. Note that this is a two-stage design: first, we identify significant SNPs from a QR scan (only for SNP effect, without any environmental factor in the model), and second, we test for interactions between these SNPs and particular environmental factors using well established gene–environment tests with Bonferroni method to adjust for the number of gene–environment tests in the second stage (note that the two stages are independent).

We selected five environmental factors that are available in the UKBB, including sex, age, physical activity (PA), sedentary behavior (SB) and smoking, and tested for interactions with top SNPs detected by QR. Overall, using lead QR SNPs at 7297 loci across 39 traits we identify 139 significant gene-environment interactions (significant after multiple testing adjustment for tests for each phenotype), 58 of which are significant at the study-wide level (p < 0.05/(7297 × 5) = 1.37e−06), after adjusting for all gene–environment tests across all phenotypes (Table S4 and Fig. 5a). To test the enrichment of gene–environment interaction effects among the lead QR variants, we repeated the gene–environment tests using the same number of randomly selected significant variants (LR or QR) across 39 traits. Over 1000 replicates, we observed, on average, 16 significant gene–environment interaction effects, significantly lower than the 58 observed interactions with leading QR variants (Fig. 5b). If we focus on QR–LR loci, only two study-wide significant interactions are detected among 259 × 5 tests: ARHGEF3 with age for platelet distribution width (p = 9.09e−16), and RSPO3 with sex for waist circumference (p = 7.49e−07). Note that this is still an 11-fold enrichment over the 2 study-wide significant interactions detected at the LR-QR loci among 2926 × 5 tests (p = 0.03). Similar results hold if we restrict to BMI and LDL (Fig. S20), for which most of the significant interactions are with age or sex. For example, for LDL we detected significant interactions between variants in PCSK9, APOB, CELSR2, LPA, and age. For BMI, we identified significant interactions between a variant in FTO and physical activity, and a variant in FAIM2 and sedentary behavior.

**Fig. 5: Significant gene–environment interaction effects of the lead QR variants for 39 traits.**

We attempted to replicate the study-wide significant gene-environment interactions above in an independent dataset of European individuals from UKBB (Table S4). There is strong correlation among estimated effect sizes from the discovery and replication (r = 0.77) with almost perfect sign consistency with 56 out of 58 interaction effects showing the same sign (p < 2.2e − 308). In terms of p-value, 25 out of 58 tests are nominally significant (p < 2.2e − 308), with 6 out of 58 being significant after adjusting for 58 tests, including variants in TCF7L2 and age for Glycated hemoglobin (HbA1c), variants in SMOX and sex for High light scatter reticulocyte count, Reticulocyte percentage, and Immature reticulocyte fraction, variants in CSP1 and sex for Mean platelet (thrombocyte) volume. Similar results are obtained when relaxing the list of discovered interactions to the ones that are significant for each trait individually Supplementary Data 2.

As mentioned above, one highly significant interaction was detected with rs1354034, a SNP with heterogeneous quantile effects that was completely missed by LR (Fig. 6). Interestingly, a strong non-additive association was previously observed for platelet distribution width at the same variant rs1354034, associated with increased expression of ARHGEF3 in platelets, the association being highly specific to platelets and not detected in other major blood cell types^41,42. Platelet distribution width describes how similar the platelets are in size, with high values meaning great variation is size and being associated with vascular disease or certain cancers. The interaction of rs1354034 with age may be indicative of increased platelet activation with age which leads to more variable platelet width⁴³. Specifically, at younger age there are low levels of platelet activation and the risk genotype may be silent. As platelet activation increases with age, the genotype expresses itself, which leads to stronger age-phenotype associations in the presence of the risk genotype. rs1354034 is a putative functional variant and has been studied before in connection with the platelet-related phenotypes. In particular, rs1354034 disrupts a conserved GATA motif and resides in a regulatory element that has previously been shown to affect the transcription of nearby ARHGEF3⁴⁴. Additionally, rs1354034 is associated in trans with the expression of von Willebrand factor (VWF) and other key platelet/megakaryocyte genes found at other loci⁴⁵.

**Fig. 6: Association of rs1354034 at the *ARHGEF3* locus with Platelet distribution width.**

Discussion

In this paper we demonstrate the usefulness of QR in the context of GWAS for quantitative phenotypes, and how it can complement conventional GWAS based on LR. Although QR has been employed before in the context of genetic studies, previous applications were limited to small GWAS studies of single phenotypes^19,20 or small simulation studies²². The current paper expands significantly on the existing work by providing applications of QR at biobank scale and for many phenotypes. Identifying heterogeneous associations as made possible by QR should be of wide interest to genetic association studies of medically relevant traits. In particular, it is often the case that high-risk subgroups (e.g. high risk for heart disease) have either high or low values for continuous risk factors, such as BMI or LDL cholesterol. Furthermore, QR is robust to the non-normality of traits (including heavy-tailed distributions) and outliers, a common practical problem when conducting quantitative trait analyses in large, heterogeneous GWAS studies. Although we have focused our discussion on GWAS with independent subjects, it is possible to extend QR to studies with related individuals by including random effects whose covariance structure depends on the genetic relatedness matrix⁴⁶. However, these existing methods are not computationally feasible at the biobank scale, and we plan to work on developing scalable methods in this setting in the future.

SNPs with heterogeneous effects that are detected by QR can arise due to various factors, including gene-gene or gene-environment interactions, time-dependent effects, or measurement errors. Similarly, genetic nurturing effects⁴⁷, where sequence variants in the parental genomes that are not transmitted to a child interact with the child’s genotype to affect the phenotype, can also lead to heterogeneous effects. Elucidating the precise source of heterogeneity is difficult in general but may be possible by, for example, post-hoc incorporation of explicit interactions (when available) in the model as we have done here, and can suggest particular contexts in which GWAS-discovered variants manifest their effects.

An appealing aspect of QR relative to other non-LR tests in the literature is its parallel to widely used LR techniques. Like LR, QR is a regression framework and inherits its advantages, such as the ability to do covariate adjustment and to estimate effect sizes of genetic variants. However, whereas LR only produces regression coefficients that quantify the genetic effect on the phenotypic mean, QR produces quantile-specific regression coefficients across a range of quantile levels which characterize the genetic effects on the conditional quantiles instead. Such rich information allows not only better characterization of the genotype-phenotype relationship but also allows one to generalize the phenotypic prediction based on genetic variants, from predicting the conditional mean to predicting the entire conditional distribution of the phenotype. Specifically, in the usual polygenic risk scores, we aim to predict the conditional mean of a phenotype, given the genotype profile of an individual: E(Y∣X)⁴⁸. QR provides the opportunity to generalize this phenotypic prediction to the entire conditional distribution: Q_Y(τ∣X) for any quantile level τ ∈ (0, 1). That allows us to, for example, compute for a given genotype profile, the probability that an individual with that profile has PRS above a given threshold, or the proportion of individuals with that profile expected to be in a high risk group. Similar to confidence intervals for the conventional PRS, we can construct quantile prediction intervals using the predicted conditional quantiles. It will be of interest to investigate this direction in detail in future work.

There are several potential limitations of QR. As discussed, QR is able to provide quantile-specific p-values, and type-1 error is well controlled at the individual quantile levels. There are different ways to combine correlated quantile-specific p-values into one single p-value. Here, we illustrated one such approach based on the Cauchy combination method, which provides an analytical approximation to the combined p-value. Note that this approximation works best for very small p-values (in the tails) as one can see also from the empirical results presented, and may show slight inflation at larger significance levels (e.g. 0.05). Given that we are interested in identifying SNPs significant at stringent significance levels, this approximation is a reasonable method to combine p-values across quantile levels. Another limitation of QR is that heterogeneous effects, as discussed above, can arise due to various factors, and hence care needs to be taken when interpreting the results. Careful investigation in post-hoc analyses (beyond QR) needs to be undertaken to understand the source of heterogeneity.

QR is computationally more expensive than LR due to iterative interior point methods used to optimize the loss function, but it can be scaled up to biobank-sized datasets, as illustrated here. To give a specific example, for BMI (with 325,306 individuals and 8,572,925 variants), LR with PLINK2⁴⁹ took 92 h of CPU time, and QR (nine quantile-specific tests) with the QRank package in R took 9534 h of CPU time. This runtime includes both the estimation of effect sizes at every SNP and the rank score testing at nine quantile levels. It is possible to speed up the time for QR analyses by focusing on genome-wide testing, while only estimating effects for regions of interest, such as genome-wide significant SNPs or loci. This way, the runtime can be reduced to a total of 1946 h of CPU time, i.e. an average of 216 h per quantile level. Moreover, both LR and QR can be accelerated by performing tests in parallel on small genomic regions. In our implementation, the genome-wide variant set was split into 1,000 segments of ~8.57K variants, with each segment taking about 0.1 h for LR and 9.5 hours for QR (1.9 hours if only performing testing) with a single CPU core. With this data scatter and parallel strategy, the elapsed wall-clock time with 1000 CPU cores became reasonable (~1.9 h for nine quantile levels). It is also possible to use more computationally efficient optimization techniques for the loss function in QR based on stochastic gradient descent⁵⁰ to reduce the computational requirements, and this will be the subject of future work.

In summary, we showed that QR is a useful yet underutilized complement to LR in GWAS of biomarkers and quantitative traits, and can lead to biologically meaningful discoveries. QR bears many similarities to LR and can be run efficiently using available R packages at biobank scale. We recommend adding QR to the toolbox when performing biomarker GWAS studies.

Methods

We first provide some basic details on the linear and quantile regression models as they relate to the GWAS setting.

Linear regression (LR)

We assume that we have n independent samples from a population, and for the ith subject, we denote by Y_i a continuous (quantitative) phenotype, by ${{\bf{X}}}_{i}=({X}_{i1},\cdots \,,{X}_{ip})^{\prime}$ the genotypes at p genetic variants, and by ${{\bf{C}}}_{i}=({C}_{i1},\cdots \,,{C}_{iq})^{\prime}$ covariates we want to adjust for, such as age, gender and principal components of genetic variation. Furthermore, we denote by ${\bf{Y}}=({Y}_{1},\cdots \,,{Y}_{n})^{\prime}$ the n × 1 sample phenotype vector, by X the n × p genotype matrix, and by C the n × q matrix for covariates. We assume that Y is normalized.

First, we consider fitting all p variants in a genetic region using a joint linear model:

$${\bf{Y}}={\bf{X}}{{\boldsymbol{\beta }}}_{{\rm{all}}}+{\bf{C}}{\boldsymbol{\alpha }}+{\boldsymbol{\epsilon }},$$

(7)

where α ∈ R^q and β_all ∈ R^p are coefficients, and ${\boldsymbol{\epsilon }} \sim N(0,{\sigma }_{Y}^{2})$. We define the error projection matrix ${{\bf{P}}}_{{\rm {C}}}={\bf{I}}-{\bf{C}}{({\bf{C}}^{\prime} {\bf{C}})}^{-1}{\bf{C}}^{\prime}$. We project Y and X onto the orthogonal complement of the column space of C, then the estimate for β_all and its variance can be obtained as

$${\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}}={({\bf{X}}^{\prime} {{\bf{P}}}_{{\rm {C}}}{\bf{X}})}^{-1}{\bf{X}}^{\prime} {{\bf{P}}}_{\rm {{C}}}{\bf{Y}}\,{\rm{and}}\,{\rm {var}}({\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}})={({\bf{X}}^{\prime} {{\bf{P}}}_{{\rm {C}}}{\bf{X}})}^{-1}{\sigma }_{Y}^{2}.$$

(8)

The z score test statistic is defined as

$${{\bf{Z}}}_{{\rm{all}}}=\frac{{\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}}}{{\rm {se}}({\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}})}={\sigma }_{Y}^{-1}{({\bf{X}}^{\prime} {{\bf{P}}}_{{\rm {C}}}{\bf{X}})}^{-1/2}{\bf{X}}^{\prime} {{\bf{P}}}_{{\rm {C}}}{\bf{Y}}.$$

(9)

In GWAS, the typical approach is to perform marginal (unconditional) testing for each variant j at a time. Therefore the model becomes:

$${\bf{Y}}={{\bf{X}}}_{j}{{\boldsymbol{\beta }}}_{j}+{\bf{C}}{\boldsymbol{\alpha }}+{\boldsymbol{\epsilon }}.$$

(10)

Similarly, as above, we have

$${\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,j}}}={({\bf{X}}^{\prime} {{\bf{P}}}_{{\rm {C}}}{{\bf{X}}}_{j})}^{-1}{\bf{X}}^{\prime} {{\bf{P}}}_{{\rm {C}}}{\bf{Y}}\,{\rm{and}}\,{\rm {var}}({\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,j}}})={({\bf{X}}^{\prime} {{\bf{P}}}_{{\rm {C}}}{{\bf{X}}}_{j})}^{-1}{\sigma }_{Y}^{2}.$$

(11)

We denote by ${\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS}}}=({\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,1}}},\cdots \,,{\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,p}}})^{\prime}$ and the vector of z scores as

$${{\bf{Z}}}_{{\rm{GWAS}}}=\left(\frac{{\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,1}}}}{{\rm {se}}({\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,1}}})},\cdots \,,\frac{{\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,p}}}}{se({\hat{{\boldsymbol{\beta }}}}_{{\rm{GWAS,p}}})}\right)^{\prime} .$$

(12)

Then, the following is true:

$${{\bf{Z}}}_{{\rm{GWAS}}}={{\bf{D}}}^{*1/2}{{\bf{Z}}}_{{\rm{all}}},$$

(13)

where D^* is the correlation matrix of X^* ≔ P_CX. When X is independent of C, X^* = X and D^* is simply the correlation (LD) matrix among p SNPs. Under the null hypothesis H₀: β_all = 0, Z_all ~ N(0, I_p) whereas Z_GWAS ~ N(0, D^*).

Quantile regression (QR)

We denote the τth conditional quantile function of Y_i as ${Q}_{{Y}_{i}}(\tau | {{\bf{X}}}_{i},{{\bf{C}}}_{i})$. Then in linear quantile regression we can write the conditional quantile regression model for p joint variants and given a specific quantile level τ ∈ (0, 1) as

$${Q}_{{Y}_{i}}(\tau | {{\bf{X}}}_{i},\,{{\bf{C}}}_{i})={{\bf{X}}}_{i}{{\boldsymbol{\beta }}}_{{\rm{all}}}(\tau )+{{\bf{C}}}_{i}{\boldsymbol{\alpha }}(\tau ),$$

(14)

where β_all(τ), and α(τ) are quantile-specific coefficients and can be estimated by solving the following optimization problem¹⁸:

$$({\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}}(\tau ),\, \hat{{\boldsymbol{\alpha }}}(\tau ))=\arg \min E{\rho }_{\tau }({{\bf{Y}}}_{i}-{{\bf{X}}}_{i}{{\boldsymbol{\beta }}}_{{\rm{all}}}(\tau )-{{\bf{C}}}_{i}{\boldsymbol{\alpha }}(\tau )),$$

(15)

where ρ_τ(u) = u(τ−I(u < 0)) is the pinball loss function and I(⋅) is the indicator function. When the error ϵ is independent and identically distributed (iid), we have the asymptotic normality for ${\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}}(\tau )$²⁸ such that

$$\sqrt{n}\left\{{\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}}(\tau )-{{\boldsymbol{\beta }}}_{{\rm{all}}}(\tau )\right\}\to N\left(0,\frac{\tau (1-\tau )}{{f}_{i}^{2}({\xi }_{i}(\tau ))}{{\bf{D}}}_{0}^{-1}\right),$$

(16)

where ${{\bf{D}}}_{0}: \!=\mathop{\lim }\limits_{n\to \infty }{n}^{-1}{\sum }_{i}{{\bf{X}}}_{i}^{\top }{{\bf{X}}}_{i}$, ${\xi }_{i}(\tau ): \!={F}_{{Y}_{i}}^{-1}(\tau | {{\bf{X}}}_{i},{{\bf{C}}}_{i})={Q}_{{Y}_{i}}(\tau | {{\bf{X}}}_{i},\, {{\bf{C}}}_{i})$, and f(⋅) represents the density function. Thus, the asymptotic distribution of ${\hat{{\boldsymbol{\beta }}}}_{{\rm{all}}}$ depends on the behavior of the unknown conditional density of the response Y in a neighborhood of the conditional quantile model.

For quantile regression, a commonly used hypothesis testing tool is the rank score test²⁸. The rank score test statistic for a fixed quantile level τ is defined as:

$${{\bf{S}}}_{{\rm{all,\tau }}}={n}^{-1/2}\mathop{\sum }\limits_{i=1}^{n}{{\bf{X}}}_{i}^{*^{\prime} }{\phi }_{\tau }({{\bf{Y}}}_{i}-{{\bf{C}}}_{i}\hat{{\boldsymbol{\alpha }}}(\tau )),$$

(17)

where ϕ_τ(u) = τ−I(u < 0). The covariance matrix of S_n,τ is

$${{\bf{V}}}_{{\rm{all}}}={n}^{-1}\tau (1-\tau ){{\bf{X}}}^{*\top }{{\bf{X}}}^{*},$$

(18)

where X^* = P_CX is the same as defined previously. If we fit a marginal model for the jth SNP (i.e., QRank⁶), we have

$${Q}_{{Y}_{i}}(\tau | {{\bf{X}}}_{ij},{{\bf{C}}}_{i})={{\bf{X}}}_{ij}{{\boldsymbol{\beta }}}_{{\rm{QRank}},j}(\tau )+{{\bf{C}}}_{i}{\boldsymbol{\alpha }}(\tau ),$$

(19)

and

$${{\bf{S}}}_{{\rm{QRank,j,\tau }}}={n}^{-1/2}\mathop{\sum }\limits_{i=1}^{n}{{\bf{X}}}_{ij}^{*}{\phi }_{\tau }({{\bf{Y}}}_{i}-{{\bf{C}}}_{i}\hat{{\boldsymbol{\alpha }}}(\tau ))\,{\rm{and}}\,{{\bf{V}}}_{{\rm{QRank,j}}}={n}^{-1}\tau (1-\tau ){{\bf{X}}}_{j}^{*^{\prime} }{{\bf{X}}}_{j}^{*}.$$

(20)

Under the null hypothesis H₀ : β_all,τ = 0, ${{\bf{V}}}_{{\rm{all}}}^{-1/2}{{\bf{S}}}_{{\rm{all,\tau }}} \sim N({\bf{0}},{{\bf{I}}}_{{\rm {p}}})$ and ${{\bf{V}}}_{{\rm{QRank}},j}^{-1/2}{{\bf{S}}}_{{\rm{QRank}},j,\tau } \sim N(0,1)$, where V_QRank,j is the jth diagonal element of V_all. We note that the asymptotic distribution of the test statistics is independent of the particular distribution of the phenotype. Hence it can be applied to any phenotype without requiring a pre-transformation to achieve normality.

Denote ${{\bf{Z}}}_{{\rm{QRank,\tau }}}=\left(\frac{{{\bf{S}}}_{{\rm{QRank,1,\tau }}}}{\sqrt{{{\bf{V}}}_{{\rm{QRank,1}}}}},\cdots \,,\frac{{{\bf{S}}}_{{\rm{QRank,p,\tau }}}}{\sqrt{{{\bf{V}}}_{{\rm{QRank,p}}}}}\right)^{\prime}$ and ${{\bf{Z}}}_{{\rm{all,\tau }}}={{\bf{V}}}_{{\rm{all}}}^{-1/2}{{\bf{S}}}_{{\rm{all,\tau }}}$, then we have

$${{\bf{Z}}}_{{\rm{QRank,\tau }}}={{{\bf{D}}}^{*}}^{1/2}{{\bf{Z}}}_{{\rm{all,\tau }}},$$

(21)

where D^* is the correlation matrix of X^*.

Hence, we conclude that the rank score test statistics from a marginal quantile regression model can be adjusted to the ones from a joint quantile regression model by the correlation matrix D^*, the same as the z scores in the linear regression models above. Finally, we combine quantile-specific p-values from the marginal tests via the Cauchy combination²⁹.

UK Biobank analyses

We used the imputed genotype data of 488,377 participants from the UK Biobank (UKBB). The UKBB samples were directly genotyped with UK BiLEVE and UK Biobank Axiom arrays. Genotype imputation was carried out by using the Haplotype Reference Consortium panel and the merge reference panel from UK10K and 1000 Genomes phase 3 with the IMPUTE4 software²⁷. We selected unrelated white British individuals by excluding third-degree or closer relatives (kinship coefficient > 0.0442 estimated with KING⁵¹) as described in refs. ^27,52 for discovery analysis and unrelated individuals with non-British European ancestry for replication. The numbers of post-QC samples and variants vary depending on the availability of phenotypic data (Fig. S2). We retained variants with minor allele frequency (MAF) > 1%, Hardy–Weinberg equilibrium (HWE) test p-value > 1 × 10⁻⁶, and imputation quality (INFO) score > 0.8. Positions of the genetic variants are given in the coordinates of the GRCh37 human reference genome. All analyses were done while adjusting for age, sex, and 10 principal components of genetic variation. For linear regression, traits were normalized using rank-based inverse normal transformation for LR only; raw trait values are used for quantile regression. We analyzed 39 quantitative phenotypes in the UK Biobank, including three body size measures, eight blood biochemistry markers (clinical markers related to disease risk factors, diagnostic measures, and renal and liver function), and 28 hematological markers (blood counts and related parameters commonly used in clinical practice). The body size measures were collected by the UK Biobank assessment centre, and the biomarkers were measured by the UK Biobank central laboratory. UKBB field IDs for the phenotypes analyzed in this study are available in Supplementary Data 1. Details on each phenotype can be accessed by searching the field ID in the UKBB online showcase (https://biobank.ndph.ox.ac.uk/ukb/).

Gene–environment interactions

We tested for interactions with top SNPs detected by QR using the following model:

$${\bf{Y}}={{\bf{X}}}_{{\rm {g}}}{\beta }_{{\rm {g}}}+{{\bf{X}}}_{{\rm {e}}}{\beta }_{{\rm {e}}}+{{\bf{X}}}_{{\rm {g}}}{{\bf{X}}}_{{\rm {e}}}{\beta }_{{\rm {ge}}}+{\boldsymbol{\epsilon }}.$$

(22)

We tested β_ge = 0 using linear regression in PLINK2 and adjusted for multiple testing using Bonferroni method. Note that we also performed the robust GxE interaction tests with the Huber-White sandwich variance estimator and the results were consistent with those from the naive variance estimator. (Fig. S21).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The individual-level genotype and phenotype data are available to approved researchers through the UK Biobank web portal at https://www.ukbiobank.ac.uk/. The LR and QR GWAS summary statistics for the biomarkers reported here are publicly available at https://doi.org/10.5281/zenodo.11095249. Significant QR-LR loci for an additional 22 quantitative traits (hence a total of 61 traits) are reported in Supplementary Data 3.

Code availability

Scripts for QR GWAS are available at https://github.com/Iuliana-Ionita-Laza/QRGWASand also deposited at https://doi.org/10.5281/zenodo.11642489. The quantile regression was performed using the R package quantreg (https://cran.r-project.org/package=quantreg), and the rank score test was performed using the R package QRank (https://cran.r-project.org/package=QRank). The quantile-specific p-values were integrated with the Cauchy combination using the R package ACAT. The linear regression model was performed using the PLINK 2.0 (https://www.cog-genomics.org/plink/2.0/).

References

Chun, S. et al. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 49, 00–605 (2017).
Article CAS Google Scholar
Umans, B. D., Battle, A. & Gilad, Y. Where are the disease-associated eQTLs? Trends Genet. 37, 09–124 (2021).
Article Google Scholar
Yang, J. et al. FTO genotype is associated with phenotypic variability of body mass index. Nature 490, 267–272 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Brown, A. A. et al. Genetic interactions affecting human gene expression identified by variance association mapping. eLife 3, e01381 (2014).
Article PubMed PubMed Central Google Scholar
Pare, G., Cook, N. R., Ridker, P. M. & Chasman, D. I. On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women’s Genome Health Study. PLoS Genet. 6, e1000981 (2010).
Article PubMed PubMed Central Google Scholar
Song, X. et al. QRank: a novel quantile regression tool for eQTL discovery. Bioinformatics 33, 2123–2130 (2017).
Article CAS PubMed PubMed Central Google Scholar
Wang, H. et al. Genotype-by-environment interactions inferred from genetic effects on phenotypic variability in the UK Biobank. Sci. Adv. 5, aaw3538 (2019).
Article ADS Google Scholar
Westerman, K. E. et al. Variance-quantitative trait loci enable systematic discovery of gene–environment interactions for cardiometabolic serum biomarkers. Nat. Commun. 13, 3993 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Manchia, M. et al. The impact of phenotypic and genetic heterogeneity on results of genome wide association studies of complex diseases. PLoS ONE 8, e76295 (2013).
Article ADS CAS PubMed PubMed Central Google Scholar
van Ijzendoorn, M. H. et al. Gene-by-environment experiments: a new approach to finding the missing heritability. Nat. Rev. Genet. 12, 881–881 (2011).
Article PubMed Google Scholar
Findley, A. S. et al. Functional dynamic genetic effects on gene regulation are specific to particular cell types and environmental conditions. Elife 10, e67077 (2021).
Article CAS PubMed PubMed Central Google Scholar
Levene, H. Robust tests for equality of variances. In: Contributions to probability and statistics: essays in honor of Harold Hotelling (ed. Olkin, I.) (Stanford University Press) 278–292 (1960).
Brown, M. B. & Forsythe, A. B. Robust tests for the equality of variances. J. Am. Stat. Assoc. 69, 364–367 (1974).
Article Google Scholar
Smyth, G. K. Generalized linear models with varying dispersion. J. R. Stat. Soc. B 51, 47–60 (1989).
Article MathSciNet Google Scholar
Rönnegård, L. & Valdar, W. Detecting major genetic loci controlling phenotypic variability in experimental crosses. Genetics 188, 435–447 (2011).
Article PubMed PubMed Central Google Scholar
Dumitrascu, B., Darnell, G., Ayroles, J. & Engelhardt, B. E. Statistical tests for detecting variance effects in quantitative trait studies. Bioinformatics 35, 200–210 (2019).
Article CAS PubMed Google Scholar
Miao, J. et al. A quantile integral linear model to quantify genetic effects on phenotypic variability. Proc. Natl Acad. Sci. USA 119, e2212959119 (2022).
Article CAS PubMed PubMed Central Google Scholar
Koenker, R. & Bassett, G. J. R. Regression quantiles. Econometrica 46, 33–50 (1978).
Article MathSciNet Google Scholar
Pozarickij, A., Williams, C., Hysi, P. G. & Guggenheim, J. A. UK Biobank Eye and Vision Consortium. Quantile regression analysis reveals widespread evidence for gene–environment or gene–gene interactions in myopia Development. Commun. Biol. 2, 167 (2019).
Article PubMed PubMed Central Google Scholar
Nascimento, M. et al. Quantile regression for genome-wide association study of flowering time-related traits in common bean. PLoS ONE 13, 0190303 (2018).
Article Google Scholar
Wang, T., Ionita-Laza, I. & Wei, Y. Integrated Quantile RAnk Test (iQRAT) for gene-level associations. Ann. Appl. Stat. 16, 423–44 (2022).
Article MathSciNet Google Scholar
Oliveira, G. F. et al. Population size in QTL detection using quantile regression in genome-wide association studies. Sci. Rep. 13, 585 (2023).
Article Google Scholar
Beyerlein, A. Quantile regression-opportunities and challenges from a user’s perspective. Am. J. Epidemiol. 180, 330–331 (2014).
Article PubMed Google Scholar
Wei, Y. & Terry, M. B. Re: “Quantile regression-opportunities and challenges from a user’s perspective". Am. J. Epidemiol. 181, 152–153 (2015).
Article PubMed Google Scholar
Beasley, T. M., Erickson, S. & Allison, D. B. Rank-based inverse normal transformations are increasingly used, but are they merited? Behav. Genet. 39, 580–95 (2009).
Article PubMed PubMed Central Google Scholar
McCaw, Z. R., Lane, J. M., Saxena, R., Redline, S. & Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 262–72 (2020).
Article MathSciNet Google Scholar
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Koenker, R. Quantile Regression Vol. 38 (Cambridge University Press, 2005).
Liu, Y. & Xie, J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 115, 393–402 (2020).
Article MathSciNet CAS PubMed Google Scholar
Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet 48, 214–220 (2016).
Article CAS PubMed PubMed Central Google Scholar
Huang, J. et al. Genomics and phenomics of body mass index reveals a complex disease network. Nat. Commun. 13, 973 (2022).
Article ADS Google Scholar
Klarin, D. et al. Genetics of blood lipids among 300,000 multi-ethnic participants of the Million Veteran Program. Nat. Genet. 50, 1514–1523 (2018).
Article CAS PubMed PubMed Central Google Scholar
Yang, Y. et al. Long noncoding RNA LYPLAL1-AS1 regulates adipogenic differentiation of human mesenchymal stem cells by targeting desmoplakin and inhibiting the Wnt/β-catenin pathway. Cell Death Discov. 7, 05 (2021).
Article CAS Google Scholar
Myocardial Infarction Genetics and CARDIoGRAM Exome Consortia Investigators. Coding variation in ANGPTL4, LPL, and SVEP1 and the risk of coronary disease. N. Engl. J. Med. 374, 1134–44 (2016).
Article PubMed Central Google Scholar
Khera, A. V. et al. Association of rare and common variation in the lipoprotein lipase gene with coronary artery disease. JAMA 317, 37–946 (2017).
Article Google Scholar
Tcheandjieu, C. et al. Large-scale genome-wide association study of coronary artery disease in genetically diverse populations. Nat. Med 28, 679–1692 (2022).
Article Google Scholar
Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
Article CAS PubMed PubMed Central Google Scholar
Cohen, J. C., Boerwinkle, E., Mosley, T. H. Jr & Hobbs, H. H. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N. Engl. J. Med 354, 1264–72 (2006).
Article CAS PubMed Google Scholar
Sobreira, D. R. et al. Extensive pleiotropism and allelic heterogeneity mediate metabolic effects of IRX3 and IRX5. Science 372, 1085–1091 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Claussnitzer, M. et al. FTO Obesity Variant Circuitry and Adipocyte Browning in Humans. N. Engl. J. Med. 373, 895–907 (2015).
Article CAS PubMed PubMed Central Google Scholar
Palmer, D. S. et al. Analysis of genetic dominance in the UK Biobank. Science 379, 1341–1348 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Kolberg, L., Kerimov, N., Peterson, H. & Alasoo, K. Co-expression analysis reveals interpretable gene modules controlled by trans-acting genetic variants. eLife 9, e58705 (2020).
Yun, S. H., Sim, E. H., Goh, R. Y., Park, J. I. & Han, J. Y. Platelet activation: the mechanisms and potential biomarkers. Biomed. Res. Int. 2016, 9060143 (2016).
Article PubMed PubMed Central Google Scholar
Guo, M. H. et al. Comprehensive population-based genome sequencing provides insight into hematopoietic regulatory mechanisms. Proc. Natl Acad. Sci. USA 114, E327–E336 (2017).
Article CAS PubMed Google Scholar
Zhang, X. et al. Genetic associations with expression for genes implicated in GWAS studies for atherosclerotic cardiovascular disease and blood phenotypes. Hum. Mol. Genet. 23, 782–795 (2014).
Article CAS PubMed Google Scholar
Geraci, M. & Bottai, M. Linear quantile mixed models. Stat. Comput. 24, 461–479 (2014).
Article MathSciNet Google Scholar
Kong, A. et al. The nature of nurture: effects of parental genotypes. Science 359, 424–428 (2018).
Article ADS CAS PubMed Google Scholar
Chatterjee, N., Shi, J. & García-Closas, M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 17, 392–406 (2016).
Article CAS PubMed PubMed Central Google Scholar
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Article PubMed PubMed Central Google Scholar
Bottou, L., Curtis, F. & Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018).
Article MathSciNet Google Scholar
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Article CAS PubMed PubMed Central Google Scholar
Sinnott-Armstrong, N. et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 53, 185–194 (2021).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

This study was supported by the National Institutes of Health grants MH095797 (I.I.-L., C.W.) and AG072272 (I.I.-L., Y.W.). The funders had no role in study design, data collection and analysis, the decision to publish or the preparation of the manuscript. We thank Dan Belsky for useful discussions. This research has been conducted using the UK Biobank Resource under Application Number 41849.

Author information

Authors and Affiliations

Department of Biostatistics, Columbia University, New York, NY, USA
Chen Wang, Ying Wei & Iuliana Ionita-Laza
Division of Nephrology, Department of Medicine, Vagelos College of Physicians and Surgeons, Columbia University, New York, NY, USA
Chen Wang & Krzysztof Kiryluk
Colorado State University, Fort Collins, CO, USA
Tianying Wang
Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, France
Hugues Aschard
Department of Statistics, Lund University, Lund, Sweden
Iuliana Ionita-Laza

Authors

Chen Wang
View author publications
You can also search for this author in PubMed Google Scholar
Tianying Wang
View author publications
You can also search for this author in PubMed Google Scholar
Krzysztof Kiryluk
View author publications
You can also search for this author in PubMed Google Scholar
Ying Wei
View author publications
You can also search for this author in PubMed Google Scholar
Hugues Aschard
View author publications
You can also search for this author in PubMed Google Scholar
Iuliana Ionita-Laza
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

I.I.-L. conceived the study. C.W. and I.I.-L. performed the analyses. T.W., Y.W., and I.I.-L. consulted on the statistics. K.K. and H.A. consulted on the GWAS and GxE analyses. C.W., T.W., and I.I.-L. drafted the manuscript. All authors reviewed and critically revised the article, and approved the final version of the manuscript.

Corresponding author

Correspondence to Iuliana Ionita-Laza.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Doug Speed and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, C., Wang, T., Kiryluk, K. et al. Genome-wide discovery for biomarkers using quantile regression at biobank scale. Nat Commun 15, 6460 (2024). https://doi.org/10.1038/s41467-024-50726-x

Download citation

Received: 06 July 2023
Accepted: 18 July 2024
Published: 31 July 2024
DOI: https://doi.org/10.1038/s41467-024-50726-x
Springer Nature Limited

Genome-wide discovery for biomarkers using quantile regression at biobank scale

Abstract

Similar content being viewed by others

Introduction

Results

Overview of quantile regression techniques

QR is powerful across different homogeneous and heterogeneous models

Homogeneous linear model with normal or Cauchy error distribution

Location-scale model

Local model with effects only at upper quantiles

Model with both additive and dominance effects

Applications to UKBB GWAS

Discussion

Methods

Linear regression (LR)

Quantile regression (QR)

UK Biobank analyses

Gene–environment interactions

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation