Pathway-based approach using hierarchical components of rare variants to analyze multiple phenotypes

Lee, Sungyoung; Kim, Yongkang; Choi, Sungkyoung; Hwang, Heungsun; Park, Taesung

doi:10.1186/s12859-018-2066-9

Pathway-based approach using hierarchical components of rare variants to analyze multiple phenotypes

Research
Open access
Published: 08 May 2018

Volume 19, article number 79, (2018)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

Pathway-based approach using hierarchical components of rare variants to analyze multiple phenotypes

Download PDF

Sungyoung Lee¹,
Yongkang Kim²,
Sungkyoung Choi¹,
Heungsun Hwang³ &
…
Taesung Park^1,2

12k Accesses
7 Citations
Explore all metrics

Abstract

Background

As one possible solution to the “missing heritability” problem, many methods have been proposed that apply pathway-based analyses, using rare variants that are detected by next generation sequencing technology. However, while a number of methods for pathway-based rare-variant analysis of multiple phenotypes have been proposed, no method considers a unified model that incorporate multiple pathways.

Results

Simulation studies successfully demonstrated advantages of multivariate analysis, compared to univariate analysis, and comparison studies showed the proposed approach to outperform existing methods. Moreover, real data analysis of six type 2 diabetes-related traits, using large-scale whole exome sequencing data, identified significant pathways that were not found by univariate analysis. Furthermore, strong relationships between the identified pathways, and their associated metabolic disorder risk factors, were found via literature search, and one of the identified pathway, was successfully replicated by an analysis with an independent dataset.

Conclusions

Herein, we present a powerful, pathway-based approach to investigate associations between multiple pathways and multiple phenotypes. By reflecting the natural hierarchy of biological behavior, and considering correlation between pathways and phenotypes, the proposed method is capable of analyzing multiple phenotypes and multiple pathways simultaneously.

Pathway analysis of rare variants for the clustered phenotypes by using hierarchical structured components analysis

Article Open access 11 July 2019

Hierarchical structural component model for pathway analysis of common variants

Article Open access 24 February 2020

An evaluation of approaches for rare variant association analyses of binary traits in related samples

Article Open access 04 February 2021

Background

In the past decade, genome-wide association studies (GWAS) have played a key role in identifying genetic associations between Single Nucleotide Variants (SNVs) and many complex biological pathologies, including type 2 diabetes (T2D), heart disease, and schizophrenia [1,2,3]. However, large-scale genetic analyses continue to suffer from incomplete association, of single nucleotide variants (SNVs), with distinct phenotypes (“missing heritability”), and difficulties of biological interpretation [4].

Among many proposed solutions to solve the missing heritability problem, many researchers have focused on “rare variants”. Methods for rare variants analysis arose from extending individual variant-level approaches to those at the gene-level [5, 6], and extending those at the gene level, to multiple phenotypes [7,8,9].

As the number of publicly available biological resources is increasing, recent methods for analyzing rare variants utilize pathway knowledge as a priori information. Since most biological behaviors manifest from a complex interaction of biological pathways [10, 11], analyzing pathway information for identifying rare variants has several advantages. In contrast to variant-level analysis, the number of statistical tests is substantially smaller in pathway analysis, resulting in less strict multiple testing corrections. Moreover, since pathways explain curated biological behaviors with multiple genes, it is easier to interpret statistically significant pathways than variant- or gene-level analyses. In this respect, many pathway-based approaches have been proposed especially using the pathway databases, which resulted in improvement of the interpretation of discoveries [12, 13].

Another effort to enhance the power of rare variants is to develop multivariate analysis methods. In general, many complex diseases arise from multiply correlated traits. For example, according to American Diabetes Association guidelines, diabetic status is diagnosed based on four traits: fasting glucose, two hours after plasma glucose, random plasma glucose, and HbA1c [14]. In that regard, simultaneous analysis of those correlated traits offer two substantial advantages over univariate analysis. First, multivariate analysis can elevate statistical power to identify additional causal biomarkers, which are not discovered by single phenotype analysis. Second, by analyzing multiple traits at once, the required number of statistical tests can be reduced, compared to those of univariate analysis. Those advantages have been well documented in past studies of large-scale sequencing datasets [15, 16].

There have now been many applications of multivariate analysis to large-scale datasets. In particular, for variant- and gene-level analysis, many multivariate methods, for common and rare variants, have been proposed [8, 15]. Despite those efforts, only a number of pathway-based multivariate analyses have been deemed feasible. Recently, three multivariate approaches, for region-level analyses, were proposed: MARV, aSPU, and MURAT. MARV [17] uses a statistical approach, reverse regression, to investigate associations between genetic regions and multiple phenotypes, by treating phenotypes as independent variables, hence enabling rapid multivariate analysis of large-scale datasets. On the other hand, aSPU [18], extends an original concept, data-adaptive sum of powered score test, to multivariate analysis, using summary statistics from single SNVs. For multivariate extension of powerful gene-based tests, MURAT (Multivariate Rare-variant Association Test) extended the original SKAT (sequence kernel association test) method to multiple phenotypes [19]. However, it might not be adequate to apply SKAT-based methods to pathway-based analysis, as we have previously demonstrated [20]. Moreover, none of the above methods are available for multivariate pathway-based association tests for rare variants with multiple pathways. Since the established pathway databases have substantial overlap among their pathways, they may ignore significant correlations between pathways, leading to misleading biological interpretations [21, 22].

In this report, we introduce a new method, “PHARAOH-multi” (Pathway-based approach using HierArchical component of collapsed RAre variants Of High-throughput sequencing data), for analyzing multiple phenotypes. Previously, we proposed a component-based hierarchical model for analysis of multiple pathways with a single model [20]. Here, while keeping the advantages of our previous approach, we extend it to enable analysis of multiple traits using hierarchical components of genetic variants. In addition, the proposed model can identify associations between multiple phenotypes and multiple pathways, with a single model, in the presence of subsequent genes within pathways, as a hierarchy.

Methods

Exome sequencing dataset for discovery study

To demonstrate the validity of the proposed method for examining large-scale datasets with multiple phenotypes, in real (biological) data analysis, we analyzed whole-exome sequencing (WES) data from a Korean population study. In brief, the dataset consists of next generation sequencing of 1087 individuals’ genomes, using the Illumina HiSeq2000 platform (Illumina, Inc., San Diego, CA), selected by the Korean Association REsource (KARE) study [23], as a part of the T2D-GENES consortium. For pathway-gene mapping, we retrieved pathway information from MSigDB [24], and mapped the genes to 217, 186 and 674 pathways extracted from the Biocarta, KEGG [25] and Reactome [26], respectively.

Exome chip dataset for replication study

For replication of the identified pathways from the discover study, an independent cohort from Koreans, the Health Examinee shared control study (HEXA), was used. HEXA is a part of the KoGES population based cohort, initiated in 2001 [27]. In total, genotypes of 3445 individuals were acquired using the HumanExome BeadChip v1.1 (Illumina, Inc., San Diego, CA). With same quality control criteria, 24,474 rare variants were used in the analysis.

PHARAOH-multi method

Our ultimate goal was to find an association between Q phenotypes and K pathways, each of whose number of genes was T₁, …, T_K, under the presence of distinct parameters for ridge penalization. The proposed method is based on Generalized Structural Component Analysis (GSCA) [28], and an exemplary structure of the model is shown in Fig. 1.

Let Y = [y₁₁…y_1Q; …; y_N1…y_NQ] be the matrix of phenotypes for N samples, where y_iq is the observation of the i^th sample on the q^th phenotype, and let X be the matrix of gene-level collapsed variables generated by summing rare variants according to their gene variant-gene mapping. Let g_ij ∈ {0, 1, 2} be the number of minor alleles for the j^th genetic variant of the i^th sample. Regarding the elements of X, x_ikt is a gene-level summary of rare variants which is defined as weighted sum of the i^th sample’s rare variants in the t^th gene of the k^th pathway, denoted by $ {x}_{ikt}={\Sigma}_{j\in {M}_{kt}}{\omega}_j{g}_{ij} $, where M_kt is an index set that defines which rare variants are mapped onto the t^th gene in the k^th pathway. Several weighting parameters, ω_j, can be used, as previously described in [20]. By imposing two penalty parameters λ_G and λ_P on the genes-pathway and pathways-phenotype, we sought to address potential multicollinearity problems, in both genes and pathways, in the proposed method. Such problems may adversely affect the estimation of weights and coefficients. The proposed model assumes that the phenotype, y_iq, arises from the multivariate normal distribution with mean μ and covariance Σ (i = 1,…,Q and j = 1,…,N). Then we define the proposed PHARAOH-multi model as.

$$ {y}_{iq}={\beta}_{0q}+\sum \limits_{k=1}^K\left(\sum \limits_{t=1}^{T_k}{x}_{ik t}{w}_{tk}\right){\beta}_{kq}+{\tilde{\epsilon}}_i={\beta}_{0q}+\sum \limits_{k=1}^K{f}_{ik}{\beta}_{kq}+{\tilde{\epsilon}}_i={F}_i{\tilde{\beta}}_q+{\tilde{\epsilon}}_{i.} $$

(1)

Here, $ {f}_{ik}={\sum}_{t=1}^{T_k}{x}_{ik t}{w}_{tk} $ and F_i indicate the i^th observation’s score of the k^th pathway, and its matrix form across Q phenotypes, respectively. Moreover, $ {\overset{\sim }{\beta}}_q=\left[{\beta}_{0q}\ {\beta}_{1q}\cdots {\beta}_{Kq}\right] $ is a vector of coefficients for the q^th phenotype, and $ {\overset{\sim }{\epsilon}}_i=\left[{\epsilon}_{i1}\cdots {\epsilon}_{iQ}\right] $ is a vector of residuals for the i^th sample.

Parameter estimation

The proposed model seeks to associate pathways and phenotypes. The effect of the k^th pathway, on multiple phenotypes, can be determined by testing all coefficients of the pathways simultaneously (H₀ : β_k1 = … = β_kQ = 0).

Moreover, by its nature, the proposed method can further assess three more associations: (1) the effect of a gene on multiple phenotypes conditioned by a given pathway; (2) the effect of a gene on a phenotype conditioned by the pathway; and (3) the effect of a pathway on a phenotype. Detailed characteristics of the proposed model (PHARAOH-multi), including relationships and coefficients, are shown in Table 1.

Table 1 Parameters related to specific relationships for the proposed model

Full size table

Let B is a matrix of $ {\overset{\sim }{\beta}}_1,\cdots, {\overset{\sim }{\beta}}_Q $. From the above model, we seek to maximize the penalized log-likelihood function, to estimate the parameters w_tk and β_kq, subject to the conventional scaling constraint $ \sum \limits_{i=1}^N{f}_{ik}^2=N $ [29]. The penalized log-likelihood function is expressed to

$$ \mathrm{\ell}\left(B\kern0.5em ,\kern0.5em W,\sum \left|{Y}_i,X\right.\right)=-\frac{NQ}{2}\log\ \pi -\frac{N}{2}\log\;\det \sum -\frac{1}{2}\sum \limits_{i=1}^N{\left({Y}_i-{B}^{\hbox{'}}{F}_i\right)}^{\hbox{'}}{\sum}^{-1}\left({Y}_i-{B}^{\hbox{'}}{F}_i\right)-\frac{1}{2}{\lambda}_G\sum \limits_{k=1}^K\sum \limits_{t=1}^{T_k}\left\Vert {w}_{tk}\right.\left\Vert {}_2\right.-\frac{1}{2}{\lambda}_G\sum \limits_{q=1}^Q\sum \limits_{k=1}^k\left\Vert {\beta}_{kq}\left\Vert {}_2\right.\right. $$

(2)

where λ_G and λ_P are the penalty parameters for each specific gene and pathway, respectively, and ‖w_tk‖₂ and ‖β_kq‖₂ are the ridge penalty function.

We previously introduced an iteratively reweighted least square (IRLS) method to minimize an univariate version of (2) under the presence of ridge penalties [20], which is similar to the alternating regularized least-squares algorithm [30]. Here we extend the previous algorithm to multivariate analysis. Let R_i be a “column-trimmed” matrix of GSCA [30], defined by F_i ⊗ I_K, where ⊗ is Kronecker product, and I_K is K × K identity matrix. Maximization of (2) in respect of B and W is equivalent to minimizing the following least-square functions:

$$ {\displaystyle \begin{array}{l}{\phi}_B=\sum \limits_{i=1}^N{\left({Y}_i-{B}^{\hbox{'}}{F}_i\right)}^{\hbox{'}}{\sum}^{-1}\left({Y}_i-{B}^{\hbox{'}}{F}_i\right)+{\lambda}_p\sum \limits_{q=1}^Q\sum \limits_{k=1}^K\parallel {\beta}_{kq}{\parallel}_2\\ {}\kern1em =\sum \limits_{i=1}^N{\left({Y}_i-{R}_i vec(B)\right)}^{\hbox{'}}{\sum}^{-1}\left({Y}_i-{R}_i vec(B)\right)+{\lambda}_p vec{(B)}^{\hbox{'}} vec(B)\\ {}\kern1em ={\left( vec(Y)- Rvec(B)\right)}^{\hbox{'}}\left( vec(Y)- Rvec(B)\right)+{\lambda}_p vec{(B)}^{\hbox{'}} vec(B)\end{array}} $$

(3)

$$ {\displaystyle \begin{array}{l}{\phi}_w=\sum \limits_{i=1}^N{\left({Y}_i-{B}^{\hbox{'}}{X}_iW\right)}^{\hbox{'}}{\sum}^{-1}\left({Y}_i-{B}^{\hbox{'}}{X}_iW\right)+{\lambda}_G\sum \limits_{K=1}^K\sum \limits_{t=1}^{T_k}\parallel {w}_{tk}{\parallel}_2\\ {}\kern1em =\sum \limits_{i=1}^N{\left({Y}_i-{\Phi}_iW\right)}^{\hbox{'}}{\sum}^{-1}\left({Y}_i-{\Phi}_iW\right)+{\lambda}_G\sum \limits_{k=1}^Kw\underset{k}{\hbox{'}}{w}_k\\ {}\kern1em ={\left( vec(Y)-\Phi W\right)}^{\hbox{'}}\left( vec(Y)-\Phi W\right)+{\lambda}_G\sum \limits_{k=1}^Kw\underset{k}{\hbox{'}}{w}_k\end{array}} $$

(4)

These least-square functions are subject to diag(R^′R) = NI_NQ, where Φ_i is a column-trimmed matrix of B^′ ⊗ X_i [30], and vec(·) is a vectorization operator. Then, it can be easily shown that the covariance matrix Σ is not related to the above equations since the PHARAOH-multi model uses multivariate linear model. An estimation of Σ can be done after convergence of B and W, by minimizing the first derivate of (2) with respect to Σ, as:

$$ \widehat{\Sigma}=\frac{1}{N}{\left(Y- Rvec(B)\right)}^{\hbox{'}}\left(Y- Rvec(B)\right) $$

(5)

Similarly, B and W can be updated by equating (3) and (4) to zero. This then gives the estimating equation of B and W as:

$$ vec\left(\widehat{B}\right)={\left({R}^{\prime }R\right)}^{-1}{R}^{\prime } vec(Y) $$

(6)

$$ vec\left(\widehat{w}\right)={\left({\Phi}^{\prime}\Phi \right)}^{-1}{\Phi}^{\prime } vec(Y) $$

(7)

Taken together, the overall procedure of the proposed algorithm is as follows:

1.
Let t = 1.
2.
Assign random initial values to W, which are then represented by W₍₀₎.
3.
Calculate F_(t), using W_{(t − 1)}.
4.
Update B_(t), using F_(t).
5.
Update W_(t), using F_(t) and B_(t).
6.
Repeat until the sum of the differences |W_(t) − W_{(t − 1)}| + |B_(t) − B_{(t − 1)}| converges the threshold.

Finally, we determine the values of λ_G and λ_P, before applying the parameter estimation procedure. To that end, we can implement k-fold cross-validation (CV) to determine the values of λ_G and λ_P. First, we construct a two-dimensional grid of different λ_G and λ_P values. Then we compute the deviance of each model with the given λ_G and λ_P, for all CV fold values. Finally, λ_G and λ_P are selected by their average deviance, which is minimized.

Significance testing

To assess the significance of genes or pathways, resampling methods can be used to test the statistical significance of the estimated effects of all pathways on the phenotype. In the proposed method, we utilize a permutation test to obtain p-values. By permuting the given phenotype, our method first generates null distributions for both pathways and gene coefficients. By computing the quantile of estimated pathway and gene coefficients, from the non-permuted dataset in each empirical null distribution, we can obtain an empirical p-value for any specific pathway and gene.

The testing of joint effects, between multiple phenotypes, is crucial. As shown in Table 1, PHARAOH-multi provides the individual effects of a pathway on each phenotype through β_k1, ..., β_kQ. The global effect of a pathway, on all phenotypes, can be evaluated by jointly testing β_k1, ..., β_kQ. Here, we introduce two different schema for determining a joint p-value for the k^th pathway, from multiple phenotypes.

Our first approach was to combine the individual p-values (referred as “P_K”). Since there are considerations among the estimated coefficients β_k1, ..., β_kQ, these correlations should be accounted for combining multiple p-values. Let the p-values from the k^th pathway be denoted by P_k1, …, P_kQ. The simplest way to combine those p-values is to use Fisher’s method, which is denoted by $ {\Psi}_k=-2{\sum}_{i=1}^Q\log {P}_{ik} $ under the independence assumption. Then, the statistic, Ψ_k, follows the χ² distribution, with the degrees of freedom, 2Q, under the null hypothesis. An extended version of Fisher’s method, Brown’s method, can combine dependent p-values using a rescaled χ²distribution and covariance of p-values [31]. However, an analytical computation of the covariance is not feasible for large-scale datasets, due to their computational complexity. A solution for this problem [32] introduced an approximation using a third-order polynomial for the covariance, denoted by $ \operatorname{cov}\left(-2\log {P}_i,-2\log {P}_j\right)\approx 3.263{\rho}_{ij}+0.71{\rho}_{ij}^2+0.027{\rho}_{ij}^3 $. To that end, Kost’s approach has been shown to be one of the best working methods for combining p-values [33]. Here, we adopt Kost’s method by substituting ρ to the empirical correlation of estimated coefficients, β_k1, ..., β_kQ, and derive the statistic for joint effect between the k^th pathway and multiple phenotypes, as follows:

$$ {P}_{kost,k}=1-{\Phi}_{2{d}_k}\left({\Psi}_k/{c}_k\right), $$

(8)

where c_k, d_k and $ {\Phi}_{2{d}_k} $ are the scale parameter, the re-scaled degree of freedom, and the cumulative distribution function of χ², with the degree of freedom 2d_k for the k^th pathway, respectively [32].

Our second approach was to construct a single statistic that combines all Q coefficients (referred as “P_M”). Here, we define a Wald-type statistic, T, as below, and utilize T for the following permutation testing scheme:

$$ T=\tilde{\beta}\underset{k}{\hbox{'}}{\operatorname{cov}}^{-1}\left({\tilde{\beta}}_k\right){\tilde{\beta}}_k $$

(9)

Then, the estimated covariance $ \mathit{\operatorname{cov}}\left(\widehat{{\overset{\sim }{\beta}}_k}\right) $ can be directly estimated using (6) with equation $ \mathit{\operatorname{cov}}\left(\widehat{\overset{\sim }{\beta }}\right)={\left({F}^{\prime }F+{\lambda}_PI\right)}^{-1}{F}^{\prime }F{\left({F}^{\prime }F+{\lambda}_PI\right)}^{-1}\otimes \widehat{\Sigma} $ [34], or can be altered by calculating sample covariance of $ {\overset{\sim }{\beta}}_k $, from permutations.

Multiple testing correction

Since the number of pathways is far less than those of genes or genetic variants, the “multiple testing problem” remains. While Bonferroni correction can be a straightforward approach for adjusting for multiple testing, it may impose an adjustment that is too stringent, especially for correlated results [35]. To overcome this issue, we applied two types of multiple testing corrections.

First, PHARAOH-multi corrects p-values using the Westfall & Young permutation procedure [36], which can be easily adopted, since PHARAOH-multi already uses a permutation scheme. Let T₍₀₎ be a vector of the statistics calculated using observed, unpermuted phenotypes, and let T_(j) be those from the j^th permutation. First, we rank the values of T₍₀₎ in ascending order, and let the rank of the k^th pathway, and the k^th index, be r_k and r_(k), respectfully. Then, for each permutation j = 0, 1, ⋯, J, let $ {T}_{(j)}^{\prime } $ be $ {T}_{(j){r}_{(1)}},\cdots, {T}_{(j){r}_{(K)}} $, to define $ {T}_{(j)}^M $ as a cumulative maximum of $ {T}_{(j)}^{\prime } $. Let I_{j, k} be an indicator function that resolves to 1.0, if $ {T}_{(0){r}_k}^{\prime }<{T}_{(j){r}_k}^M $, or 0.0, if that condition does not hold. The adjusted p-value for the k^th pathway, by the Westfall & Young procedure, is then defined as:

$$ {P}_k^{adj}=\frac{1+{\sum}_{j=1}^J{I}_{j,k}}{1+J} $$

(10)

Second, PHARAOH-multi provides False Discovery Rate (FDR) adjustment, by calculating q-values [37]. Here, we first obtain K as the number of permutation p-values, and from those, we can derive q-values, using the Benjamini-Hochberg step-up procedure.

Simulation study

To evaluate the performance of the proposed method, we conducted simulation studies, under various scenarios. For generating rare variants, we first produced a pool of genetic variants, using SimRare [38], a rare variant simulator with well-established genetic assumptions. A pool was then generated, with default settings and gene lengths of 1Kbp. Next, we generated a simulation dataset of 10 pathways, with 1000 samples, for each replicate. All simulation scenarios were evaluated, using 1000 replicates. Based on the genotypes, the simulated phenotypes were generated by the following model, with an assumption that only the first pathway is causal to the phenotypes:

$$ {y}_{iq}={\beta}_{1q}{\tilde{f}}_{i1}+{\in}_{iq}={\beta}_{1q}\sum \limits_{t=1}^{H_1}{w}_{1t}{x}_{i1t}+{\in}_{iq}={\beta}_{1q}\sum \limits_{t=1}^{H_1}\left({w}_{1t}\sum \limits_{j=1}^{M_{1t}}{\gamma}_{1 tj}{g}_{i1 tj}\right)+{\in}_{iq} $$

(11)

This is then subject to diag(F^′F) = NI_K, where H₁ is the number of causal genes in the first pathway, and M_1t is the number of rare variants in the t^th gene of the first pathway (i.e., causal pathway).

In the above model (eq. 11), γ_1tj denotes the effect of the j^th genetic variant, of the t^th gene, set to |log₁₀MAF_tj|, such that ϵ_iq denotes the residual and follows MVN(0, Σ). In our simulation, the settings q = 1,2, and H₁ = 1, 2, 5, 10, were used. For each replicate, all rare variants were collapsed into genes.

Results

For the simulation and the analysis of the real dataset, a workstation system with two Intel Xeon E5–2620 CPUs, with a combined RAM of 128GiB, were used. Note that the aSPU and MARV analyses were performed using the default settings, except that aSPU was performed without “genetic variant pruning” capability, as we observed that aSPU raises “unrecoverable error” with that capability. For our proposed method and aSPU, the number of permutations was 5000, to prevent possible lower bound limitation.

Simulation study

For PHARAOH-multi, we selected the tuning parameters λ_G and λ_P, based on three-fold CV for each replicate, using two-dimensional grids of λ_G and λ_P, with six different starting points of ridge parameters, ranging from 10¹ to 10⁶ (on a logarithmic base 10 scale). In the simulation, we considered the following conditions: number of effective genes in the causal pathway (H₁), gene-level effect (w_t1), pathway-level effect (β_1q), and residual correlation (ρ). In our simulation, H₁, w_t1, β, and ρ were assumed to be 1, 2, and 5; 0.1 and 0.2; 0.1, 0.15, and 0.2; and 0, 0.25, 0.5, respectively, with evaluation of their exhaustive combinations. Other parameters, Q, K and T_K, were fixed to 2, 10, and 10, respectively.

We first compared the type 1 error rates of the proposed method vs. the traditional methods. Here, type 1 error rate was computed as a proportion of p-values for the pathways with no effect, and was less than the significance level, across 1000 replicates of permuted phenotypes. As shown in Fig. 2, we evaluated the type 1 errors using two significance levels, 0.01 and 0.05. As a result, type 1 errors were controlled well in the traditional methods, but PHARAOH-multi showed a moderately deflated type 1 error rate (P_M), while the inflated rate is P_K, when ρ = 0. In contrast, the quantile-quantile (Q-Q) plots in Fig. 3 show no inflation or deflation pattern, in all the methods, except for P_K, with no correlation between phenotypes.

It was also worthwhile to assess the gain of power in the multivariate analysis, as compared to univariate analysis. In this respect, our simulation study was conducted to compare the power gain from multivariate methods, and between multivariate and univariate analyses.

First, we checked whether PHARAOH-multi with multiple phenotypes boosts power compared to PHARAOH with a single phenotype, under the same scenarios of the power simulation. As a result, we observed that the power of PHAROH-multi was at least 2.52 times larger than PHARAOH, and this difference becomes even larger, as w and β increase (data not shown).

Second, we assessed the statistical power of PHARAOH-multi and the compared methods, defined as the proportion of the adjusted q-value of the simulated causal pathway (the first pathway) being less than the significance threshold, e.g., 0.05. Despite the proposed method supporting the Westfall-Young permutation procedure, it was not considered in the simulation study, due to the absence of corresponding adjustments in the compared methods.

Figures 4 and 5 show comparison results of statistical power simulation from 1000 replications. Each row in the grid of plots represents the same settings of w and β, with different numbers of causal genes in the causal pathway, and each column represents the same number of causal genes, with different effect settings.

In most scenario comparisons, the two proposed statistics obtained by PHARAOH-multi (P_M) and p-value aggregation (P_K) showed greater power than the other two approaches, aSPU and MARV. However, this did not hold when 50% of the genes were causal for a specific pathway, with effect sizes of w = 0.1 and β = 0.1. In order to investigate whether or not there are significant differences among powers, we performed paired t-tests between a pair of methods. In Fig. 4 for the case of w = 0.1, the p-values were 3 × 10^− 7 for comparing powers of P_M and aSPU, and 4.2 × 10^− 6 for comparing those of P_M and MARV. In Fig. 5 for the case of w = 0.2, the same pairwise comparison for the p-values were 7.3 × 10^− 7 and 4.2 × 10^− 6, respectively. In overall scenarios, powers of P_M were larger up to 18%p compared to aSPU in H₁ = 5, w = 0.2 and β = 0.2, and were larger up to 83%p compared to MARV. Generally, P_K exhibited smaller power than P_M, and showed comparable or slightly smaller power, than aSPU.

Here, we observed three interesting patterns in the results. First, the proposed P_M and aSPU methods showed lower power, when the proportion of causal genes increase, compared to MARV. Second, the power rapidly increased, as β increased, as shown in Fig. 4. Third, the contribution of w to the power was relatively moderate, compared to β, as shown in corresponding scenarios of Figs. 4 and 5. The reason for the occurrence of those two patterns is that the model(s) generate phenotypes for power simulation, and eq. (11) requires the constraint of the so-called “latent variable,” in GSCA (see Methods). While both PHARAOH-multi and aSPU construct hierarchies of genes and pathways, MARV essentially treats a pathway as a large set of SNVs, since the motivation of MARV is for region- vs. pathway-based tests. The simulation setting and its overall effect on phenotypes is summarized, first at the gene-level, and then by the expression of a linear combination of those genes. In this respect, the results of PHARAOH-multi and aSPU were more plausible than those of MARV, because those two methods more properly reflected the simulation settings.

To confirm the above hypothesis, we performed an additional comparison using the same dataset, except that the phenotypes were generated without the constraint. As shown in Fig. 6, PHARAOH-multi (“P_M”) showed larger power than in the previous simulation, while the power of MARV increased as the number of causal genes increased. However, in contrast to the previous results, the powers of PHARAOH-multi and aSPU also increased.

Finally, we investigated whether or not the statistical power changes by M_kt. For simplicity, we split 1000 simulation datasets into two groups: the first group where the number of variants is small and the second where it is large. Then, we compared the power of each method between two groups using t-test. As a result, the p-values were 0.097 for aSPU, 0.684 for MARV, and 0.825 for PHARAOH-multi. Thus, we concluded that M_kt is unlikely to affect the simulation result regardless of the methods.

Real data discovery from whole-exome sequencing dataset

To evaluate the practical performance of PHARAOH-multi, we conducted a discovery study using a large-scale sequencing dataset. Many studies suggest that the major underlying risk factors for metabolic disorders include high density lipoprotein (HDL), blood pressure (SBP, DBP), waist circumference (WAISTC), fasting glucose (FAST_GLU), and triglycerides (TG). In this regard, we conducted a multivariate analysis of metabolism-related traits, using a large-scale sequencing dataset, obtained from the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in multi-Ethnic Samples (T2D-GENES) Consortium, comparing our proposed (PHARAOH-multi) and other common methods. In detail, we analyzed a dataset consisting of 1086 samples selected from the Korean Association REsource (KARE) study [23].

After removal of samples with any missing observations of the aforementioned six phenotypes, we included 1085 samples for analysis. The quality controls with genotype call rates were < 95%, or for Hardy-Weinberg Equilibrium (HWE) test P < 10^− 5, the minor allele frequency was < 5% and the minor allele count was > 2, resulted in 198,761 variants. The final dataset was then mapped to genes, using the human genome-19 (hg19) reference genome coordinates, with 10Kbp flanking regions. The gene range of hg19 reference, was extracted from RefSeq track of UCSC Table Browser, as of October 2014. Finally, the gene-level collapsed variable was generated using Workbench for Integrated Superfast Association study with Related Data (WISARD), with beta-transformation weighting, as suggested in [5], with the number of genes being 4388.

Next, we compared our multivariate and univariate analysis results, using PHARAOH-multi and PHARAOH. As shown in Figs. 7 and 8, Q-Q plots of the results showed no substantial inflation or deflation pattern for either the multivariate or univariate results. However, with regard to pathway discovery, the results did show significant differences.

These comparisons clearly support the one advantage of multivariate analysis that we discussed above: elevation of statistical power. As with multivariate analysis, we calculated q-values for each univariate result. Interestingly, no univariate analysis identified significant pathways, except for SBP with KEGG, which identified three pathways (drug metabolism cytochrome P450, glutathione metabolism and progesterone-mediated oocyte maturation). As shown in Table 2, only one pathway, glutathione metabolism, was identified in the univariate analysis, and the q-values of univariate analyses for pathways identified by multivariate analysis, were not significant.

Table 2 Significant pathways of PHARAOH-multi and MARV, and their q-values of multivariate and univariate analyses

Full size table

Second, we compared the result of multivariate analyses, using PHARAOH-multi, aSPU and MARV. As shown in Fig. 8, PHARAOH-multi exhibited generally acceptable p-value trends, despite the result from KEGG being modestly deflated, due to the optimization of lambda. The Q-Q plots of MARV look similar to PHARAOH-multi. In contrast, aSPU showed unacceptably inflated patterns of Q-Q plots, regardless of the pathway databases, which were not used in the simulation study. This could possibly be due to substantial overlap of existing pathway databases.

As shown in Table 2, the multivariate analysis successfully identified eight pathways from three pathway databases, with Benjamini-Hochberg q-value < 0.1. Interestingly, PHARAOH-multi identified glutathione-related pathways in both KEGG and Reactome pathway databases, which supports the result of PHRAOH-multi. As shown in Fig. 8, the quantile-quantile plots of aSPU for the real dataset are highly inflated (i.e., their p-values are very small). As a result, 57.7% (Reactome), 29.5% (Biocarta) and 71.5% (KEGG) of the tested pathways by aSPU were statistically significant (q-value < 0.1). Unfortunately, these pathways are highly false positives. In this respect, we included the results of significant pathways identified by either PHARAOH-multi or MARV.

The identified pathways suggested evident relationships with metabolic syndrome. Since the peroxisome pathway elucidates peroxisome biogenesis, which contributes to fatty acid oxidation and biosynthesis of ether lipids, many studies have discussed interrelationship between peroxisomes and metabolic processes [39, 40]. Likewise, identification of the GABA pathway can also be explained by the relationship between GABA and peroxidation, and putative relationship of obesity [41, 42]. Moreover, identification of glutathione metabolism, and its conjugation, explain that PHARAOH-multi successfully captured a key process of metabolic disorders [43]. Finally, another report suggested a putative role of adhesion molecules in metabolic diseases, as explained by “cell2cell” pathway [44]. For the two pathways identified by MARV, we found that Caspase pathway has been known to be related to metabolic stress or perturbation [45], but no evidence for D4GDI pathway was found.

Replication study using independent exome chip dataset

We conducted a replication study using exome chip dataset from an independent cohort, using the identified pathways in the discovery study. Despite the insufficiency of detected variants in the exome chip dataset, as a result, we successfully replicated two pathways with p-value < 0.1, the peroxisome pathway in KEGG (p = 0.059) and cell2cell pathway in Biocarta (p = 0.093). As shown in the literature search, the two pathways we replicated have strong relationships with metabolic disorders.

Discussion

Compared to univariate approaches, which analyze each phenotype individually, our real data analysis successfully demonstrated that the multivariate approach could identify pathways commonly associated with specific traits. It is important to construct a systematic analysis that considers the correlation between complex diseases and their underlying biological traits. In addition, our results from two well-established pathway databases were strongly supported by many existing publications, thus demonstrating the advantage of our proposed approach.

Compared to existing multivariate analysis methods, PHARAOH-multi features several advantages. Firstly, by constructing a hierarchical structure of genes-pathways-phenotypes, four types of associations (gene-single phenotype, gene-multiple phenotypes, pathway-single phenotype, and pathway-multiple phenotypes) can be estimated simultaneously. Compared to our proposed method, existing methods of multivariate analysis were limited to gene-level analysis, and hence, the combinatorial effect of multiple genes, via biological pathways, was impossible to estimate. In addition, the proposed method considers the correlation between genes, pathways, and phenotypes, by imposing penalty parameters on the estimation procedure.

Secondly, PHARAOH-multi provides multiple options for correcting for the multiple testing issue. Although Bonferroni correction is simple, and powerfully controls type 1 error, it is a well-known fact that the Bonferroni correction often results in controls that are too stringent, when the tests are correlated. Under such conditions, application of the Westfall-Young permutation procedure can be an appropriate alternative, since its asymptotic optimality under dependence is known [35]. In this respect, the proposed method has the advantage of identifying causal pathways, by considering correlation among pathways.

For analysis times of both simulation and real datasets, MARV was the fastest among all the methods, while PHARAOH-multi ran slightly faster than aSPU. For example, in the analysis of simulation dataset of 100 genes with 1000 samples, the running times of MARV, PHARAOH-multi, and SPU were 13, 67 and 235 s, respectively. The trends of execution time were consistent regardless of simulation parameters or datasets. However, PHARAOH-multi can be further accelerated with multithreading which is not supported by MARV and aSPU. With multithreading of 8 threads, the analysis time of PHARAOH-multi was reduced to 12 s.

At this point, there are a number of subjects we can consider for future research. Our current analysis is limited only to Korean population. In our future study, we apply our method to the whole data of 13,000 WES dataset of T2D-GENES consortium [46] which contains our KARE samples. It would be a challenging work to identify novel pathways across multiple populations. For the methodological aspect, our approach uses gene-level collapsing of multiple rare variants. Although the collapsing method has the advantage that the analysis of very rare variants is possible, it cancels out the effects of variants with opposite direction (e.g, gene upregulation vs. downregulation). Despite such limitations, our method showed great potential in identifying causal genetic structure in the real data analysis. However, further research, on a more sophisticated approach that can consider the effect direction of variants, is needed. Moreover, we plan to improve our proposed multivariate analysis by applying Generalized Estimating Equations (GEE) or Linear Mixed Model (LMM). Our method can be extended to prediction models, rather than association tests, using other types of penalization, such as LASSO or SCAD [47]. Lastly, our method can also be extended to pathway interaction analysis that has been commonly performed in gene expression data analysis [48].

Conclusion

In this study, we proposed a novel statistical approach for multivariate pathway-based analysis of rare variants, from large-scale sequencing datasets. Analyses of multiple phenotypes have been successful in analyzing various complex diseases, including type-2 diabetes (T2D) or hypertension. In general, curated guidelines suggest diagnosing T2D according to traits observed in the individual. Consequently, incorporating multiple correlated traits, to be investigated for association with specific diseases, via multivariate analysis, elevates the statistical power. In this respect, our simulation study reflects the relationship between diseases and their related traits. Throughout the simulation study, PHARAOH-multi outperformed existing multivariate methods. In addition, our proposed method successfully demonstrated several advantages of multivariate analysis, including significantly improving the detection power of causal pathways, as compared to univariate analysis, while also retaining detection power for the individual phenotype. Moreover, we successfully demonstrated that the proposed method is capable of identifying plausible pathways in the real dataset, by identifying eight pathways in the discovery study, and replicating two pathways in the replication study. We firmly believe that the proposed method will assist researchers in understanding the genetic structures that underlie many complex diseases.

References

Schizophrenia Working Group of the Psychiatric Genomics C. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511(7510):421–7.
Article Google Scholar
Lettre G, Palmer CD, Young T, Ejebe KG, Allayee H, Benjamin EJ, Bennett F, Bowden DW, Chakravarti A, Dreisbach A, et al. Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe project. PLoS Genet. 2011;7(2):e1001300.
Article CAS PubMed PubMed Central Google Scholar
McCarthy MI, Zeggini E. Genome-wide association studies in type 2 diabetes. Curr Diab Rep. 2009;9(2):164–71.
Article CAS PubMed PubMed Central Google Scholar
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
Article CAS PubMed PubMed Central Google Scholar
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
Article CAS PubMed PubMed Central Google Scholar
Choi S, Lee S, Cichon S, Nothen MM, Lange C, Park T, Won S. FARVAT: a family-based rare variant association test. Bioinformatics. 2014;30(22):3197–205.
Article CAS PubMed Google Scholar
Wang L, Lee S, Gim J, Qiao D, Cho M, Elston RC, Silverman EK, Won S. Family-based rare variant association analysis: a fast and efficient method of multivariate phenotype association analysis. Genet Epidemiol. 2016;40(6):502–11.
Article PubMed PubMed Central Google Scholar
Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods. 2014;11(4):407–9.
Article CAS PubMed PubMed Central Google Scholar
Lee S, Won S, Kim YJ, Kim Y, Consortium TD-G, Kim BJ, Park T. Rare variant association test with multiple phenotypes. Genet Epidemiol. 2016;
Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
Article CAS PubMed PubMed Central Google Scholar
Hirschhorn JN. Genomewide association studies--illuminating biologic pathways. N Engl J Med. 2009;360(17):1699–701.
Article CAS PubMed Google Scholar
Lee JH, Zhao XM, Yoon I, Lee JY, Kwon NH, Wang YY, Lee KM, Lee MJ, Kim J, Moon HG, et al. Integrative analysis of mutational and transcriptional profiles reveals driver mutations of metastatic breast cancers. Cell Discov. 2016;2:16025.
Article CAS PubMed PubMed Central Google Scholar
Zhao XM, Liu KQ, Zhu G, He F, Duval B, Richer JM, Huang DS, Jiang CJ, Hao JK, Chen L. Identifying cancer-related microRNAs based on gene expression data. Bioinformatics. 2015;31(8):1226–34.
Article PubMed Google Scholar
American Diabetes A. Diagnosis and classification of diabetes mellitus. Diabetes Care. 2014;37(Suppl 1):S81–90.
Article Google Scholar
O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FC, Elliott P, Jarvelin MR, Coin LJ. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS One. 2012;7(5):e34861.
Article PubMed PubMed Central Google Scholar
Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. J Probab Stat. 2012;2012:652569.
Article PubMed PubMed Central Google Scholar
Kaakinen M, Magi R, Fischer K, Heikkinen J, Jarvelin MR, Morris AP, Prokopenko I. A rare-variant test for high-dimensional data. Eur J Hum Genet. 2017;
Kwak IY, Pan W. Adaptive gene- and pathway-trait association testing with GWAS summary statistics. Bioinformatics. 2016;32(8):1178–84.
Article CAS PubMed Google Scholar
Sun J, Oualkacha K, Forgetta V, Zheng HF, Brent Richards J, Ciampi A, Greenwood CM, Consortium UK. A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects. Eur J Hum Genet. 2016;24(9):1344–51.
Article PubMed PubMed Central Google Scholar
Lee S, Choi S, Kim YJ, Kim BJ, Consortium Td-G, Hwang H, Park T: Pathway-based approach using hierarchical components of collapsed rare variants. Bioinformatics 2016, 32(17):i586-i594.
Alexa A, Rahnenfuhrer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006;22(13):1600–7.
Article CAS PubMed Google Scholar
Skarman A, Shariati M, Jans L, Jiang L, Sorensen P. A Bayesian variable selection procedure to rank overlapping gene sets. BMC bioinformatics. 2012;13:73.
Article PubMed PubMed Central Google Scholar
Cho YS, Go MJ, Kim YJ, Heo JY, Oh JH, Ban HJ, Yoon D, Lee MH, Kim DJ, Park M, et al. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet. 2009;41(5):527–34.
Article CAS PubMed Google Scholar
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40.
Article CAS PubMed PubMed Central Google Scholar
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40(Database issue):D109–14.
Article CAS PubMed Google Scholar
Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, Haw R, Jassal B, Jupe S, Korninger F, McKay S, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2016;44(D1):D481–7.
Article CAS PubMed Google Scholar
Kim YJ, Go MJ, Hu C, Hong CB, Kim YK, Lee JY, Hwang JY, Oh JH, Kim DJ, Kim NH, et al. Large-scale genome-wide association studies in east Asians identify new genetic loci influencing metabolic traits. Nat Genet. 2011;43(10):990–5.
Article CAS PubMed Google Scholar
Hwang H, Takane Y. Generalized structured component analysis. Psychometrika. 2004;69(1):81–99.
Article Google Scholar
Takane Y, Hwang H. An extended redundancy analysis and its applications to two practical examples. Computational Statistics & Data Analysis. 2005;49(3):785–808.
Article Google Scholar
Hwang H. Regularized generalized structured component analysis. Psychometrika. 2009;74(3):517–30.
Article Google Scholar
Brown MB. 400: a method for combining non-independent, one-sided tests of significance. Biometrics. 1975;31(4):987–92.
Article Google Scholar
Kost JT, McDermott MP. Combining dependent P-values. Stat Probabil Lett. 2002;60(2):183–90.
Article Google Scholar
Alves G, Yu YK. Accuracy evaluation of the unified P-value from combining correlated P-values. PLoS One. 2014;9(3):e91225.
Article PubMed PubMed Central Google Scholar
Hoerl AE, Kennard RW. Ridge Regression - Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55&.
Article Google Scholar
Meinshausen N, Maathuis MH, Bühlmann P. Asymptotic optimality of the Westfall–young permutation procedure for multiple testing under dependence. Ann Stat. 2011;39(6):3369–91.
Article Google Scholar
Westfall PH, Young SS. Resampling-based multiple testing : examples and methods for P-value adjustment. New York: Wiley. 1993;
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57(1):289–300.
Google Scholar
Li B, Wang G, Leal SM. SimRare: a program to generate and analyze sequence-based data for association studies of quantitative and qualitative traits. Bioinformatics. 2012;28(20):2703–4.
Article CAS PubMed PubMed Central Google Scholar
Azhar S. Peroxisome proliferator-activated receptors, metabolic syndrome and cardiovascular disease. Futur Cardiol. 2010;6(5):657–91.
Article CAS Google Scholar
Hall D, Poussin C, Velagapudi VR, Empsen C, Joffraud M, Beckmann JS, Geerts AE, Ravussin Y, Ibberson M, Oresic M, et al. Peroxisomal and microsomal lipid pathways associated with resistance to hepatic steatosis and reduced pro-inflammatory state. J Biol Chem. 2010;285(40):31011–23.
Article CAS PubMed PubMed Central Google Scholar
Deng Y, Xu L, Zeng X, Li Z, Qin B, He N. New perspective of GABA as an inhibitor of formation of advanced lipoxidation end-products: it's interaction with malondiadehyde. J Biomed Nanotechnol. 2010;6(4):318–24.
Article CAS PubMed Google Scholar
Ma YH, Hu JH, Zhou XG, Zeng RW, Mei ZT, Fei J, Guo LH. Transgenic mice overexpressing gamma-aminobutyric acid transporter subtype I develop obesity. Cell Res. 2000;10(4):303–10.
Article CAS PubMed Google Scholar
Wu G, Fang YZ, Yang S, Lupton JR, Turner ND. Glutathione metabolism and its implications for health. J Nutr. 2004;134(3):489–92.
Article CAS PubMed Google Scholar
Wagner O, Jilma B. Putative role of adhesion molecules in metabolic disorders. Horm Metab Res. 1997;29(12):627–30.
Article CAS PubMed Google Scholar
McIlwain DR, Berger T, Mak TW. Caspase functions in cell death and disease. Cold Spring Harb Perspect Biol. 2015;7(4)
Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536(7614):41–7.
Article CAS PubMed PubMed Central Google Scholar
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its Oracle properties. J Am Stat Assoc. 2001;96(456):1348–60.
Article Google Scholar
Liu KQ, Liu ZP, Hao JK, Chen L, Zhao XM. Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics. 2012;13:126.
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Not applicable.

Funding

This research was supported by grants of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C2165, HI16C2037), supported by the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the National Research Foundation. The publication of this article was sponsored by the Bio-Synergy Research Project.

Availability of data and materials

An implementation of PHARAOH-multi can be downloaded from the website (http://statgen.snu.ac.kr/software/pharaoh-multi). The KARE exome sequencing dataset is a part of T2D-GENES consortium, and is available upon approval of T2D-GENES project committee. The HEXA exome chip dataset is a part of KoGES, and is available upon approval of the genome center in Korea National Institute of Health.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 4, 2018: Selected articles from the 16th Asia Pacific Bioinformatics Conference (APBC 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-17.

Author information

Authors and Affiliations

Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
Sungyoung Lee, Sungkyoung Choi & Taesung Park
Department of Statistics, Seoul National University, 1 Gwanak-ro Gwanak-gu, Seoul, 08826, Korea
Yongkang Kim & Taesung Park
Department of Psychology, McGill University, Montreal, Canada
Heungsun Hwang

Authors

Sungyoung Lee
View author publications
You can also search for this author in PubMed Google Scholar
Yongkang Kim
View author publications
You can also search for this author in PubMed Google Scholar
Sungkyoung Choi
View author publications
You can also search for this author in PubMed Google Scholar
Heungsun Hwang
View author publications
You can also search for this author in PubMed Google Scholar
Taesung Park
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SL performed all analyses and developed the software implementation. SL and TP conducted the entire study, developed the methodology, and wrote the manuscript. YK and SC helped with the performing of analyses. HH helped developing the methodology. All of the authors have read and approved of the final manuscript.

Corresponding author

Correspondence to Taesung Park.

Ethics declarations

Ethics approval and consent to participate

KARE and HEXA dataset are a part of Korean Genome Epidemiology Study (KoGES). In this study, the exome sequencing data from 1037 samples of KARE study and the exome chip data from 3445 samples of HEXA study were used. KARE dataset was used under the partnership of T2D-GENES. All participants of KARE and HEXA studies provided written informed consent. The study using KARE and HEXA samples was approved by two independent institutional review boards at Seoul National University.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Lee, S., Kim, Y., Choi, S. et al. Pathway-based approach using hierarchical components of rare variants to analyze multiple phenotypes. BMC Bioinformatics 19 (Suppl 4), 79 (2018). https://doi.org/10.1186/s12859-018-2066-9

Download citation

Published: 08 May 2018
DOI: https://doi.org/10.1186/s12859-018-2066-9

Pathway-based approach using hierarchical components of rare variants to analyze multiple phenotypes

Abstract

Background

Results

Conclusions

Similar content being viewed by others

Pathway analysis of rare variants for the clustered phenotypes by using hierarchical structured components analysis

Hierarchical structural component model for pathway analysis of common variants

An evaluation of approaches for rare variant association analyses of binary traits in related samples

Background

Methods

Exome sequencing dataset for discovery study

Exome chip dataset for replication study

PHARAOH-multi method

Parameter estimation

Significance testing

Multiple testing correction

Simulation study

Results

Simulation study

Real data discovery from whole-exome sequencing dataset

Replication study using independent exome chip dataset

Discussion

Conclusion

References

Acknowledgements

Funding

Availability of data and materials

About this supplement

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation