Genome-wide joint analysis of single-nucleotide variant sets and gene expression for hypertension and related phenotypes

Tong, Xiaoran; Wei, Changshuai; Lu, Qing

doi:10.1186/s12919-016-0017-x

Proceedings
Open access
Published: 18 October 2016

Volume 10, article number 36, (2016)
BMC Proceedings

Xiaoran Tong¹,
Changshuai Wei² &
Qing Lu¹

Abstract

Background

With the advance of next-generation sequencing technologies, the study of rare variants in targeted genome regions or even the whole genome has become feasible. Nevertheless, the massive amount of sequencing data brings great computational and statistical challenges for association analyses. Aside from sequencing variants, other high-throughput omic data (eg, gene expression data) have also become available and can be incorporated into association analysis for better modeling and improved power. This motivates the need to develop computationally efficient and powerful approaches to model the joint associations of multilevel omic data with complex human diseases.

Methods

A similarity-based weighted U approach is used to model the joint effect of sequencing variants and gene expression. Using a Mexican American sample provided by Genetic Analysis Workshop 19 (GAW19), we performed a whole-genome joint association analysis of sequencing variants and gene expression with systolic blood pressure (SBP), diastolic blood pressure (DBP), and hypertension (HTN) phenotypes.

Results

The whole-genome joint association analysis was completed in 80 min on a high-performance personal computer with an i7 4700 CPU and 8 GB memory. Although no gene reached statistical significance after adjusting for multiple testing, some top-ranked genes attained small p values and are biologically plausible candidates for hypertension-related phenotypes.

Conclusions

The weighted U approach is computationally efficient for high-dimensional data analysis, and is capable of integrating multiple levels of omic data into association analysis. Through a real data application, we demonstrate the potential benefit of using the new approach for joint association analysis of sequencing variants and gene expression.

Background

Next-generation sequencing technology provides denser genetic profiles than previous microarray-based genotyping technology [1]. It can effectively capture rare variants with low minor allele frequency (MAF). Driven by the advance of sequencing technology and the limited heritability explained by findings from genome-wide association studies (GWAS) [2, 3], the current research focus has shifted toward studying rare variants associated with common complex diseases. Although these studies hold great promise for finding new genetic variants predisposing to human disease, they also face great challenges, for example, low power for detecting rare variants because of their low frequency. The dramatic increase in the number of single nucleotide variants (SNVs) also raises computational and statistical challenges (eg, the multiple testing issue). One practical strategy is to group multiple SNVs according to known functional information (eg, variants in a gene or a pathway) or location (eg, variants in a fixed-size bin [4]), and jointly analyze these SNVs [5, 6]. By grouping and testing multiple SNVs, we are able to aggregate association signals and reduce the number of tests.

Besides SNVs, other omic data, such as gene expression, could also be collected. These intermediate omic data can be integrated into sequencing studies for improved power and better biological interpretation. While the conventional analysis only links SNVs or gene expression to disease phenotypes, the emergence of multilevel data brings the possibility of jointly analyzing SNVs and other omic data. By fully utilizing the information, the joint analysis has great potential to improve power [7]. Nevertheless, how to efficiently analyze the high-dimensional sequencing data and other omic data remains a challenge.

Methods

In this empirical study, we used a similarity-based weighted U approach to jointly model SNVs and gene expression data of 142 unrelated Mexican American samples provided by Genetic Analysis Workshop 19 (GAW19). Using the weighted U approach, we performed a genome-wide joint association analysis, evaluating the association of 17,558 genes with three phenotypes (ie, systolic blood pressure [SBP], diastolic blood pressure [DBP], and hypertension [HTN]).

For the integrative analysis, we extended previously developed nonparametric approaches [8] to handle both SNVs and gene expression. To aggregate the rare variants in a gene, a weighted sum approach is used [8]. Let $p_k$ denote the MAF of the $k$th SNV ($k = 1, 2, \ldots, K$), where $K$ is the total number of SNVs in a gene region. The weight for the $k$th SNV is defined as $ {w}_k=1/\sqrt{p_k\left(1-{p}_k\right)} $, and the weighted sum score for the $j$th sample is obtained by

$$ {a}_j=\frac{{\displaystyle {\sum}_{k=1}^K{w}_k{v}_{jk}}}{2{\displaystyle {\sum}_{k=1}^K{w}_k}}, $$

where $v_{jk}$ is the genotype value of the $k$th SNV for the $j$th sample, coded by the minor allele count (ie, 0, 1, or 2). We then define a weighted U statistic to assess the joint effect of SNVs and gene expression on the disease phenotype,

$$ U=\sum_{i\ne j} f\left({a}_i,{a}_j\right)\, g\left({t}_i,{t}_j\right)\, h\left({y}_i,{y}_j\right), $$

where $f(a_i, a_j)$, $g(t_i, t_j)$, and $h(y_i, y_j)$ measure the similarities of SNVs, gene expression, and phenotypes, respectively. Phenotypic similarity $h(y_i, y_j)$ serves as the U kernel,

$$ h\left({y}_i,{y}_j\right)=\frac{\left({y}_i-E(Y)\right)\left({y}_j-E(Y)\right)}{Var(Y)}, $$

where $y_i$ and $y_j$ are the ranks of the $i$th and $j$th samples' phenotypes. The genetic and gene expression similarities are weight functions, defined based on the Gaussian distance,

$$ f\left({a}_i,{a}_j\right)={e}^{-\frac{{\left({a}_i-{a}_j\right)}^2}{2N}}, \qquad g\left({t}_i,{t}_j\right)={e}^{-\frac{{\left({t}_i-{t}_j\right)}^2}{2N}}, $$

where $a_i$ ($a_j$) and $t_i$ ($t_j$) denote the weighted sum score and the gene expression value of individual $i$ ($j$), respectively.
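The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation (the paper mentions R code that is not shown here). The function names are hypothetical, and because the text does not define the constant $N$ in the Gaussian similarity, the sketch treats it as a user-supplied `scale` parameter, which is an assumption.

```python
import numpy as np


def weighted_sum_scores(genotypes, maf):
    """Collapse an (n_samples, K) minor-allele-count matrix into one score per sample."""
    w = 1.0 / np.sqrt(maf * (1.0 - maf))       # w_k = 1 / sqrt(p_k (1 - p_k))
    return genotypes @ w / (2.0 * w.sum())     # a_j, which lies in [0, 1]


def gaussian_similarity(x, scale):
    """exp(-(x_i - x_j)^2 / (2 * scale)) for every pair (i, j)."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * scale))


def rank_kernel(y):
    """Phenotype similarity h: products of centered ranks divided by the rank variance."""
    # Ranks 1..n; ties are broken by input order (an assumption, not from the text).
    ranks = np.argsort(np.argsort(y, kind="stable")) + 1.0
    z = (ranks - ranks.mean()) / ranks.std()
    return np.outer(z, z)


def weighted_u(a, t, y, scale=1.0):
    """U = sum over i != j of f(a_i, a_j) * g(t_i, t_j) * h(y_i, y_j)."""
    prod = gaussian_similarity(a, scale) * gaussian_similarity(t, scale) * rank_kernel(y)
    np.fill_diagonal(prod, 0.0)                # exclude the i == j terms
    return prod.sum()
```

Here `genotypes` would hold the 0/1/2 codes for one gene region, `maf` the per-SNV minor allele frequencies (all strictly between 0 and 1), `t` the expression values of the matching probe, and `y` the phenotype values.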

Under the null hypothesis of no association, phenotypic similarity is unrelated to genetic or gene expression similarities. Because the phenotypic similarity is centered, that is, $E(h(y_i, y_j)) = 0$, the expectation of the U statistic is 0. Under the alternative, phenotypic similarity increases with genetic or gene expression similarity. Positive phenotypic similarities therefore receive heavier weights and negative phenotypic similarities receive lighter weights, leading to a positive value of U. Because the U kernel satisfies the finite second moment condition, $E(h^2(y_i, y_j)) < \infty$, and is degenerate (ie, $Var(E(h(y_i, y_j) \mid y_j)) = 0$), the limiting distribution of U can be approximated as a linear combination of chi-squared random variables with one degree of freedom [8], and its p value can be obtained by using the Davis method [9].
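The null behavior described above can also be checked empirically. The sketch below swaps in a permutation test in place of the asymptotic chi-squared approximation used in the paper: it shuffles the phenotype ranks, which breaks any link between phenotypic similarity and the weights, and reports how often the shuffled U is at least as large as the observed one. The function names are hypothetical, and covariate adjustment is not handled.

```python
import numpy as np


def u_from_matrices(weights, y_ranks):
    """Weighted U for a precomputed pairwise weight matrix and phenotype ranks."""
    z = (y_ranks - y_ranks.mean()) / y_ranks.std()
    off_diag = weights - np.diag(np.diag(weights))   # drop the i == j terms
    return float(z @ off_diag @ z)


def permutation_p_value(weights, y_ranks, n_perm=999, seed=0):
    """One-sided p value: share of shuffled phenotypes with U >= the observed U."""
    rng = np.random.default_rng(seed)
    observed = u_from_matrices(weights, y_ranks)
    exceed = sum(
        u_from_matrices(weights, rng.permutation(y_ranks)) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)               # add-one correction keeps p > 0
```

The test is one-sided because, as the text notes, association pushes U toward positive values. A permutation test costs far more per gene than the closed-form approximation, so it is better suited to spot checks than to a genome-wide scan.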

The weighted U approach is also flexible for testing other hypotheses. In addition to evaluating the joint effect of genetic markers and gene expression (G + T), it can be used to test the genetic effect (G) alone or the gene expression effect (T) alone. For example, we can test the genetic effect by setting the gene expression similarity to a constant (eg, $g(t_i, t_j) \equiv 1$).
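Switching between the three hypotheses amounts to replacing one similarity matrix with a matrix of ones. A minimal sketch follows, with hypothetical function names and a `scale` argument standing in for the constant $N$ that the text leaves undefined:

```python
import numpy as np


def pairwise_gaussian(x, scale=1.0):
    """Gaussian similarity exp(-(x_i - x_j)^2 / (2 * scale)) for all pairs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * scale))


def hypothesis_weights(a, t, mode="G+T", scale=1.0):
    """Return the pairwise weight matrix f * g for the requested test."""
    if mode not in {"G+T", "G", "T"}:
        raise ValueError(f"unknown mode: {mode!r}")
    ones = np.ones((len(a), len(a)))
    f = pairwise_gaussian(a, scale) if "G" in mode else ones   # genetic similarity
    g = pairwise_gaussian(t, scale) if "T" in mode else ones   # expression similarity
    return f * g
```

The same U statistic and p value calculation can then be applied to whichever weight matrix is returned, so the three tests share all downstream code.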

Results

Genome screening

We applied three tests (ie, G + T, G, and T) to 142 unrelated Mexican American samples from the San Antonio Family Heart Study (SAFHS) and the San Antonio Family Diabetes/Gallbladder Study (SAFDGS). All analyses were based on SNVs on the odd-numbered autosomes and gene expression data provided by GAW19. In this study, we assembled multiple SNVs based on the functional unit (ie, gene) to facilitate the joint modeling of SNVs and gene expression. We obtained primary and alternative assemblies from Genome Reference Consortium release version 38 (GRCh38) and identified 32,436 gene regions corresponding to 17,264 RNA probes. The number of gene regions exceeds the number of probes because multiple assemblies of one gene can share one nucleotide sequence, and hence the RNA probe designed to capture that sequence. SNVs that are not within or near a gene (±5 kb at both ends) were removed. Gene regions with no SNVs or RNA probes were also discarded. SNVs with no variation (ie, MAF = 0) were dropped, as were gene regions containing only such SNVs. A total of 6,956,910 SNVs, corresponding to 17,558 gene regions, remained for the joint analysis. The first, second, and third quartiles of the SNV counts in these regions are 115, 205, and 411, respectively. We used SBP, DBP, and HTN measurements at the first examination year as phenotypes, and age, gender, medication use, and smoking status as covariates. To account for population stratification, we performed principal components (PCs) analysis by using the EIGENSTRAT software [10]. The first 20 PCs were used in the analysis to adjust for potential confounding due to population stratification.
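The SNV filtering rules in this paragraph can be sketched as follows. The `Gene` and `Snv` record types and the `assign_snvs` function are hypothetical names introduced for illustration; the probe-matching step is omitted, and the nested scan is only practical for small inputs (an interval index would be needed at genome scale).

```python
from dataclasses import dataclass

FLANK_BP = 5_000  # +/-5 kb window around each gene, as stated in the text


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class Snv:
    snv_id: str
    chrom: str
    pos: int
    maf: float


def assign_snvs(genes, snvs, flank=FLANK_BP):
    """Map each gene name to the polymorphic SNVs within `flank` bp of it."""
    regions = {}
    for gene in genes:
        hits = [
            s.snv_id
            for s in snvs
            if s.chrom == gene.chrom
            and gene.start - flank <= s.pos <= gene.end + flank
            and s.maf > 0.0                  # drop SNVs with no variation
        ]
        if hits:                             # discard regions left with no SNVs
            regions[gene.name] = hits
    return regions
```

An SNV that falls inside the windows of two overlapping genes is assigned to both, which matches a gene-by-gene test where each region is analyzed separately.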

The whole-genome joint analysis of the three phenotypes was completed in 80 min using a single core of an i7 4700 CPU with 8 GB memory. Table 1 summarizes the top genes from the analysis, which were selected based on the smallest p value of the three tests. In general, we observed that the G + T test attained either the smallest p value or a p value close to the smallest one. After adjusting for multiple testing, none of the genes were significantly associated with the phenotypes. However, at a nominal significance threshold of 0.05, four of the 15 genes were missed by considering SNVs alone (ie, G) and six genes were missed by considering gene expression alone (ie, T), while all 15 genes were captured by the joint association analysis (ie, G + T). This suggests that there are potential advantages to combining genetic and gene expression information in the association analysis. A quantile–quantile (QQ) plot showed no evidence of systematic inflation of the G + T test (Fig. 1).
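For readers who want to reproduce a QQ check like the one in Fig. 1 on their own results, the plot coordinates can be computed as below. This is a generic sketch, not the authors' plotting code; `qq_points` is a hypothetical name, and the expected quantiles use the common (rank − 0.5)/n convention, which is an assumption.

```python
import numpy as np


def qq_points(p_values):
    """Return (expected, observed) -log10 p values for a QQ plot."""
    p = np.sort(np.asarray(p_values, dtype=float))
    n = p.size
    expected = (np.arange(1, n + 1) - 0.5) / n    # uniform quantiles under the null
    return -np.log10(expected), -np.log10(p)
```

Points that follow the diagonal indicate p values consistent with the uniform null distribution, while a systematic departure above the diagonal across the whole range would suggest inflation.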

Table 1 Summary of top 5 genes associated with SBP, DBP, and HTN

Discussion

Further investigation of the top genes suggested that several of them are biologically plausible candidates for blood pressure regulation. For instance, the product of PDE4A hydrolyzes the second messenger cyclic adenosine monophosphate (cAMP), which plays a crucial role in controlling blood pressure [11]. PHOX2A is important for the development of the autonomic nervous system, which controls involuntary functions such as heart rate and blood pressure [12].

The study has certain limitations. Out of 8,348,674 SNVs, 1,391,764 (17%) were unused because they are not in or near any gene. We could group these SNVs by physical location and also incorporate them into the analysis [4]. We found limited evidence of association for single-nucleotide polymorphisms (SNPs) identified in previous GWAS, possibly because of differences in study samples (ie, whites vs. Mexican Americans). Another possibility is that the majority of SNVs in our study are rare (MAF <0.01), whereas previous GWAS mainly focused on common variants (MAF >0.05).
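The count of unused SNVs agrees with the totals reported in the Results, where 6,956,910 SNVs remained after filtering. As a quick check of the arithmetic:

$$ 8{,}348{,}674 - 6{,}956{,}910 = 1{,}391{,}764, \qquad \frac{1{,}391{,}764}{8{,}348{,}674} \approx 0.167, $$

which rounds to the 17% quoted above.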

The analysis of a large number of genes raised the issue of multiple testing, which we addressed with the false discovery rate approach. After this adjustment, none of the genes reached statistical significance. By using biological knowledge and statistical tools, we might be able to further reduce the number of tests and increase our chance of detecting an association. For instance, all assemblies of one gene are highly correlated, so we can either use the primary assembly exclusively or adjust the p values for multiple correlated tests to better address the multiple testing issue.
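The text does not name the specific false discovery rate procedure. Assuming the widely used Benjamini-Hochberg step-up method, the adjustment could be sketched as follows; `benjamini_hochberg` is a hypothetical helper, not code from the paper.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)             # p_(i) * m / i
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]    # enforce non-decreasing values
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(monotone, 1.0)             # cap at 1
    return adjusted
```

Under this procedure, the smallest of m p values has to fall below 0.05/m for any gene to pass a 5% false discovery rate when only the top gene carries signal. If all 17,558 gene regions entered one correction, that would be roughly 2.8 × 10⁻⁶, which illustrates why reducing the number of correlated tests could help.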

Conclusions

Emerging sequencing data and other omic data provide an invaluable resource for the genetic study of human diseases, yet integrating and modeling these high-dimensional data remains a great challenge. By integrating both sequencing variants and gene expression into the association analysis, the weighted U approach provides a powerful and computationally efficient way to screen for disease-associated genes. By applying the approach to the GAW19 data, we showed that the joint analysis of sequencing variants and gene expression could have some advantages over association analysis using only sequence variants or only gene expression.

References

1. Metzker ML. Sequencing technologies—the next generation. Nat Rev Genet. 2010;11(1):31–46.
2. Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001;69(1):124–37.
3. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
4. Agne M, Huang CH, Hu I, Wang H, Zheng T, Lo SH. Identifying influential regions in extremely rare variants using a fixed-bin approach. BMC Proc. 2011;5 Suppl 9:S3.
5. Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zollner S. Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am J Hum Genet. 2010;87(5):604–17.
6. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.
7. Nica AC, Dermitzakis ET. Expression quantitative trait loci: present and future. Philos Trans R Soc Lond B Biol Sci. 2013;368(1620):20120362.
8. Wei CS, Li M, He ZH, Vsevolozhskaya O, Schaid DJ, Lu Q. A weighted U-statistic for genetic association analyses of sequencing data. Genet Epidemiol. 2014;38(8):699–708.
9. Davis CS. The distribution of a linear combination of chi-square variables. Biometrics. 1982;38(1):279.
10. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190.
11. Omori K, Kotera J. Overview of PDEs and their regulation. Circ Res. 2007;100(3):309–27.
12. Wen G, Wessel J, Zhou W, Ehret GB, Rao F, Stridsberg M, Mahata SK, Gent PM, Das M, Cooper RS, et al. An ancestral variant of Secretogranin II confers regulation by PHOX2 transcription factors and association with hypertension. Hum Mol Genet. 2007;16(14):1752–64.

Acknowledgements

The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. The authors wish to thank the editor and two anonymous referees for their helpful comments that improved the manuscript.

Declarations

This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.

Authors’ contributions

XT conducted the data analysis and drafted the manuscript. CW helped write the R code for the joint analysis. QL conceived of the study and helped finalize the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare they have no competing interests.

Author information

Authors and Affiliations

Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, 48824, USA
Xiaoran Tong & Qing Lu
Health Science Centre, University of North Texas, Fort Worth, TX, 76107, USA
Changshuai Wei

Corresponding author

Correspondence to Qing Lu.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Tong, X., Wei, C. & Lu, Q. Genome-wide joint analysis of single-nucleotide variant sets and gene expression for hypertension and related phenotypes. BMC Proc 10 (Suppl 7), 36 (2016). https://doi.org/10.1186/s12919-016-0017-x
