Background

Genome-wide association studies (GWAS) have become a major strategy for genetic dissection of human complex diseases. There is substantial overlap, both phenotypically and in allelic associations, between biomarkers and/or risk factors and between related diseases, and it is becoming important to understand the ways in which polymorphisms affect multiple phenotypes. Many phenotypes may be available from a single study population but current GWAS approaches usually examine them separately within a univariate framework. This strategy ignores potential genetic correlation between different traits.

From the perspective of maximising power for a given size of dataset, it has been shown that joint analyses of correlated traits in linkage analysis have substantially improved power in localizing genes [1–4]. Similarly, multivariate approaches in association studies can theoretically improve the ability to detect genetic variants whose effects are too small to be detected in univariate tests [4]. Multivariate association tests have been proposed for unrelated samples [5] and for family data [6]. Most of these tend to be inefficient and/or computationally intensive, especially at the genome-wide level. The approach proposed by Ferreira and Purcell has been shown to be powerful when traits have moderate to high correlation and efficient when applied to samples of unrelated individuals [7].

Genetically complex (multifactorial) diseases such as cardiovascular disease and type 2 diabetes often have common risk factors. A number of biochemical markers are known to be associated with obesity, pre-diabetic states, or risk of cardiovascular disease. Lipid traits such as triglycerides, and the low-density lipoprotein (LDL) and high-density lipoprotein (HDL) components of cholesterol, are well-known risk factors for cardiovascular disease. Other biochemical markers such as C-reactive protein (CRP) [8], the enzymes used as liver function tests (gamma-glutamyl transferase, GGT [9–11]; alanine aminotransferase, ALT; and aspartate aminotransferase, AST), butyrylcholinesterase (BCHE) [12, 13], serum ferritin [14] and uric acid [15, 16] have also been shown to be associated with the risk of cardiovascular disease, hypertension, obesity, insulin resistance or metabolic syndrome. These biochemical markers are correlated, so we may gain power, insight or both from a multivariate approach. For example, serum GGT is significantly correlated with total or LDL cholesterol, HDL (inversely) and particularly with triglycerides [17, 18]. GGT is also significantly correlated with the other liver enzymes, AST and ALT [17, 19]. Serum triglyceride is correlated with the liver enzymes [17] and uric acid, and is also associated with cardiovascular risk.

The importance of genetic variation has been shown previously through univariate analyses of serum lipids [20], uric acid [21–23], GGT [24], ALT [24] and AST [17, 24], BCHE [25], ferritin [26] and CRP [27, 28]. Nevertheless, little is known about common genetic influences on these variables and joint analysis may reveal whether the same gene influences multiple traits.

The aim of our study is to identify genes and regions associated with multiple biochemical traits related to cardiovascular risk, type 2 diabetes or metabolic syndrome. We used a recently described multivariate association test [7] to perform genome-wide association analysis. This approach was used initially to screen for multivariate trait-SNP association using a subset of unrelated individuals. To confirm findings from the multivariate test, univariate association tests were conducted making use of the full dataset by including all family members.

Methods

Subjects

Biochemical traits were measured in serum samples from twins and their families, and genome-wide SNP markers were genotyped. The study participants comprise:

(1) Adolescent twins and their non-twin siblings living in south-east Queensland (Australia) who had participated in the Brisbane Longitudinal Twin study [29–32]. Full details are described in Middelberg et al. [33]. A total of 2548 participants (1317 females and 1231 males; mean age of 14.8 years) were genotyped.

(2) Adult twins and their family members who participated in studies of: (i) alcohol and nicotine dependence and metabolic risk for alcoholic liver disease [34]; (ii) anxiety and depression [35]; and (iii) endometriosis [36]. A total of 9145 individuals (5703 females and 3442 males; mean age of 46.2 years) were genotyped.

Combining these studies, 20,230 individuals had biochemical measurements and 11,683 (from 4986 families) had both genotype and phenotype data. Of the 11,683, only 1483 (from 1015 families) had data for all 13 traits. Where multiple measurements of the same trait in an individual were available, the average of the values was used.

For each of these studies, participants (and, for subjects aged < 18 years, their parents) gave informed consent to the questionnaire, interview, and blood collection, and all studies were approved by the QIMR Human Research Ethics Committee.

Laboratory measurements

Serum was separated from the blood samples and stored at -70°C until analyzed. Serum cholesterol, HDL cholesterol, triglycerides, BCHE, glucose, uric acid, ferritin, CRP, AST, ALT and GGT were measured using Roche methods on a Roche 917 or Modular P analyzer (Roche Diagnostics, Basel, Switzerland). LDL cholesterol was calculated using the Friedewald equation. Insulin was measured on an Abbott Architect. BMI was calculated from measured or self-reported weight and height for the adults and from measured weight and height for the adolescents.
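The Friedewald calculation mentioned above can be sketched as follows. This is a minimal illustration rather than the laboratory's actual procedure: the function name, the choice of units, and the exclusion of samples with high triglycerides (the conventional validity limit of the equation) are assumptions, since the text does not state which units or cut-offs were applied.

```python
def friedewald_ldl(total_chol, hdl, triglycerides, units="mmol/L"):
    """Estimate LDL cholesterol with the Friedewald equation.

    LDL = total cholesterol - HDL - triglycerides / k,
    where k = 2.2 for mmol/L and k = 5 for mg/dL.
    Returns None when triglycerides reach the conventional validity
    limit (4.5 mmol/L or 400 mg/dL); applying this limit is an
    assumption, not something the study reports.
    """
    if units == "mmol/L":
        divisor, tg_limit = 2.2, 4.5
    elif units == "mg/dL":
        divisor, tg_limit = 5.0, 400.0
    else:
        raise ValueError("units must be 'mmol/L' or 'mg/dL'")
    if triglycerides >= tg_limit:
        return None  # equation is not considered reliable here
    return total_chol - hdl - triglycerides / divisor
```

For example, `friedewald_ldl(200, 50, 150, units="mg/dL")` gives 120.0 mg/dL.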

Genotyping

DNA was extracted from blood samples using standard methods and genotyped with Illumina 610K, 317K or 370K chips at CIDR or deCODE Genetics. Data cleaning for SNP genotypes included checking the expected relationships between individual family members and resolving Mendelian errors [37]. Imputed genotypes for non-typed HapMap SNPs were generated using the MACH 1.0 program (http://www.sph.umich.edu/csg/abecasis/mach/index.html) [38–40]. Any imputed SNP with r² ≥ 0.3 was included in the genotype data.

Statistical Analysis

Distributions of all biochemical variables were examined. Serum AST, ALT, GGT, CRP, triglycerides and BMI were log-transformed. For each trait, individuals who were more than five standard deviations from the mean of that trait were excluded. Results for glucose and insulin in adults were adjusted for fasting time based on the reported time of last meal and time of blood collection. Prior to genetic association analysis, the variables were also adjusted for the effects of age, squared age (age²), sex, sex × age and sex × age². Standardized residuals were obtained and used in the association analysis. All data pre-processing and descriptive analyses were done using STATA version 7.0 [41] and SPSS version 17.0.2 (Mar 11, 2009).

Multivariate association analysis was performed using the PLINK (v1.07) implementation of the multivariate test described by Ferreira and Purcell [7]. When applied to family data, this test is too computationally intensive to be efficient for genome-wide analysis, so the analysis was performed in two stages. First, we selected one individual per family (using the person with data for the greatest number of phenotypes) from each of the 4986 families and applied the multivariate test as a screening tool. Next, for each locus with a multivariate p-value of less than 5 × 10⁻⁸, we identified the traits that showed evidence for association with that locus (that is, with an absolute canonical correlation weight > 0.2) and confirmed each specific trait-SNP association with a univariate association test using all relatives in each family. The univariate association test was performed using "fastassoc" in MERLIN 1.1.2 [42], which takes the average of the two results in MZ twin pairs.
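The selection logic of the two-stage design can be sketched in a few lines. This is an illustration only, not the code used in the study: the data layout (dictionaries with `id`, `family` and `traits` keys), the function names, and the tie-breaking rule (keep the first person encountered) are all assumptions, and the association tests themselves are left to PLINK and MERLIN.

```python
def pick_one_per_family(individuals):
    """Stage 1: keep, per family, the person with the most observed traits.

    `individuals` is a list of dicts with hypothetical keys 'id', 'family'
    and 'traits' (trait name -> value, or None when missing).
    Ties keep the first person encountered (an assumption).
    """
    best = {}
    for person in individuals:
        n_observed = sum(v is not None for v in person["traits"].values())
        current = best.get(person["family"])
        if current is None or n_observed > current[0]:
            best[person["family"]] = (n_observed, person)
    return [person for _, person in best.values()]


def traits_to_confirm(weights, threshold=0.2):
    """Stage 2: traits whose absolute canonical weight exceeds the threshold."""
    return [trait for trait, w in weights.items() if abs(w) > threshold]


def loci_for_follow_up(results, alpha=5e-8):
    """Map each genome-wide significant SNP to the traits to test univariately.

    `results` is a list of dicts with hypothetical keys 'snp', 'p'
    (multivariate p-value) and 'weights' (trait name -> canonical weight).
    """
    return {
        r["snp"]: traits_to_confirm(r["weights"])
        for r in results
        if r["p"] < alpha
    }
```

Each trait returned by `loci_for_follow_up` would then be tested on its own in the full family sample.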

Results

General Characteristics

Means and standard deviations of all the traits for males and females in adolescent and adult genotyped cohorts are listed in Additional file 1, Table S1. Generally, the means of the biochemical traits are lower in the adolescents than the adults, as expected. Phenotypic correlations between each pair of age-corrected traits separately for males and females in the combined sample are shown in Additional file 1, Table S2. The strongest correlations (r > 0.5) observed in males were between glucose and insulin (0.53), between AST and ALT (0.66) and between GGT and ALT (0.57). In the females, the strongest correlations observed were between glucose and insulin (0.59), between BCHE and glucose (-0.59), between BCHE and CRP (-0.53) and between AST and ALT (0.63). Given that most of the other pair-wise correlations (Additional file 1, Table S2) are low to moderate (r < 0.3), the multivariate approach is expected to provide comparable or slightly improved power to detect pleiotropic loci when compared to univariate analysis followed by correction for the number of traits tested [7].

Genome-wide association analyses

The multivariate analysis identified a total of 766 SNPs in 11 independent (r² < 0.1) loci associated with biochemical traits with a p-value of less than 5 × 10⁻⁸ (Table 1 and Figure 1). Of these, there are eight loci potentially associated with more than one trait (Table 1). Three loci (on chromosomes 8, 12 and 19) showed strong or close to genome-wide significant evidence of associations with more than one trait in the all-subject univariate analyses.

Table 1. Summary of SNP associations (based on multivariate p-value of < 5 × 10⁻⁸)

Figure 1. Manhattan plots for multivariate QTL analysis in unrelated-subject data (N = 4986) for the 13 traits. Genomic position is on the x-axis and the -log10 of the association p-value is on the y-axis. Points with p-values below 5 × 10⁻⁸ are shown in red.

The most strongly associated SNP at the chromosome 8p21.3 locus was rs17091905 (multivariate p = 2.8 × 10⁻¹³). HDL, CRP, triglycerides and BCHE had absolute trait loadings greater than 0.2. To confirm the multivariate result, we individually tested each of these traits using a univariate test in the full sample of 11,683 individuals. The univariate tests confirmed the association with HDL (p = 5.7 × 10⁻¹²) and triglycerides (p = 5.1 × 10⁻¹⁵), whereas the association with CRP was only nominally significant (p = 0.008) and that with BCHE was not significant (p = 0.069). This variant is in strong or partial LD with previously reported variants for HDL or triglycerides [43–45] (Additional file 1, Table S3).

The second variant, rs3213545 (multivariate p-value = 3.9 × 10⁻¹⁴), which is located on chromosome 12q24.2 (OASL), was confirmed to be significantly associated with GGT (p = 3.6 × 10⁻¹⁵) [46] and also showed moderately strong significance for LDL (p = 2.9 × 10⁻⁵) and CRP (p = 8.8 × 10⁻⁵) (Table 1).

The third variant was rs2075650 (multivariate p-value = 5.7 × 10⁻¹⁰) located on chromosome 19q13.32 (TOMM40/APOE-C1-C2-C4 gene cluster), where HDL, LDL, CRP and triglycerides had absolute trait loadings greater than 0.2. Significant univariate associations were observed for LDL (p = 1.6 × 10⁻¹⁴) and CRP (p = 4.2 × 10⁻⁸) and close to genome-wide significant univariate associations were seen for HDL (p = 8.1 × 10⁻⁸) and triglycerides (p = 9.6 × 10⁻⁷) (Table 1 and Figure 2). This SNP has previously been reported to be associated with LDL [43], LDL buoyancy [47] and CRP [48], and there is an association between LDL (or TG or HDL) and rs4420638, which is in partial LD (r² = 0.4) with this SNP (Additional file 1, Table S3).

Figure 2. Radar chart of polymorphisms on chromosome 8 (a), chromosome 12 (b) and chromosome 19 (c). Each dot on the plot represents the standardized beta (1-unit change per copy increment of the minor allele) of each trait from univariate testing.

To determine whether there are any further unreported genes/regions to be detected by multivariate analysis, a less stringent threshold of multivariate p < 9 × 10⁻⁵ was used. No new loci were found, but a further six previously reported loci were replicated, as listed in Additional file 1, Table S4.

The Q-Q plot from the multivariate analysis was also examined closely to determine whether the multivariate analysis detected any excess association signals that had not already been detected by univariate analysis. SNPs located in regions (genes) that were significant in univariate analyses were removed from the plot (Figure 3). The Q-Q plot with these SNPs excluded showed no excess of significant p-values, indicating that there are no additional loci beyond those already detected by univariate analysis.
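The construction of such a Q-Q comparison can be sketched as below. This is a generic illustration, not the authors' plotting code: the tuple layout `(chrom, pos, p)`, the region format `(chrom, start, end)`, both function names, and the plotting position `(rank - 0.5) / n` for the expected quantiles are assumptions.

```python
import math


def exclude_regions(snps, regions):
    """Drop SNPs that fall inside any (chrom, start, end) region.

    `snps` is a list of (chrom, pos, p) tuples; both layouts are hypothetical.
    """
    def in_region(chrom, pos):
        return any(c == chrom and start <= pos <= end
                   for c, start, end in regions)
    return [s for s in snps if not in_region(s[0], s[1])]


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, most significant first.

    Under the null hypothesis p-values are uniform, so the expected value
    for the rank-th smallest of n p-values is taken as (rank - 0.5) / n.
    p-values must be strictly positive.
    """
    ordered = sorted(p_values)
    n = len(ordered)
    return [(-math.log10((rank - 0.5) / n), -math.log10(p))
            for rank, p in enumerate(ordered, start=1)]
```

Points lying on the 45° line (observed equal to expected) indicate no excess of small p-values; the "excluded" curve is obtained by calling `qq_points` on the output of `exclude_regions`.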

Figure 3. Q-Q plot of multivariate analysis. Black points correspond to SNPs included in the analyses. The 45° line refers to no significant association. The dotted line corresponds to p-value of 5 × 10⁻⁸. "Excluded" line is where SNPs that were found in significant regions (genes) in univariate analyses were removed.

Examination of the directions of the allelic effects on the different phenotypes showed unexpected results. At LPL on chromosome 8, the minor allele A at rs17091905 increased HDL-C and decreased triglycerides, but the direction of the nominally significant effect on CRP was to increase it. At the chromosome 12 locus the minor allele A at rs3213545 tended to increase LDL-C but it significantly decreased GGT and tended to decrease CRP. Similarly at the chromosome 19 locus, the effect of the minor allele (G for rs2075650) was to increase LDL-C and triglycerides and to decrease HDL-C, consistent with an increase in cardiovascular risk, but to decrease CRP, again suggesting opposite allelic effects on the markers of different aspects of cardiovascular risk.

Discussion

We have applied a multivariate approach to identify variants associated with more than one trait, initially using 4986 unrelated individuals across 13 biochemical traits. Univariate testing of the significant or near-significant loci, on the full sample of 11,683 individuals, was then used to confirm these findings. We are interested firstly in the usefulness of multivariate analysis as a substitute for the more laborious and potentially less powerful approach of conducting multiple univariate analyses and comparing the results, and secondly in the details of the loci which are found to have effects on multiple variables in our data.

Testing one individual per family identified three known loci that were significantly or near-significantly associated with more than one trait, and replicated 11 loci in previously published genes that passed a genome-wide threshold of 5 × 10⁻⁸ for single variables (Table 1). When a less stringent threshold (p < 9 × 10⁻⁵) was used, a further six published loci were also identified (Additional file 1, Table S4). The three loci in previous publications using univariate association analysis (highlighted in Table 1) had evidence of significant or close to significant associations with more than one trait in our data, indicating the benefit of multivariate analysis for detecting pleiotropic loci.

We have identified polymorphisms showing strong evidence of allelic associations with HDL and triglycerides on chromosome 8 (LPL gene MIM 609708); with GGT and possibly LDL and CRP on chromosome 12 (OASL gene MIM 603281); and with HDL and LDL and possibly CRP and triglycerides on chromosome 19 (TOMM40 (MIM 608061) /APOE (MIM 107741)-C1 (MIM 107710)-C2 (MIM 608083)-C4 (MIM 600745) gene cluster). Each gene has been previously recognised in genome-wide association studies concentrating on a few of these variables [49]. The function of these genes is reasonably well-established. LPL plays a key role in lipid metabolism and is responsible for hydrolysis of triglyceride molecules present in circulating lipoprotein. APOE and APOC genes also play a key role in lipid metabolism and cholesterol transport by helping to stabilise and solubilize lipoproteins as they circulate in the blood [50, 51]. Both LPL and APOE polymorphisms have been found to be significantly associated with increases in LDL and decreases in HDL [52]. The functional connection between the OASL gene (2',5'-oligoadenylate synthetase-like, also known as "thyroid hormone receptor interactor") and these phenotypes is unclear. However, nearby genes in linkage disequilibrium with the lead SNP in OASL include HNF1A and c12orf43. HNF1A is expressed in liver, kidney and endocrine pancreas and regulates a number of genes involved in lipoprotein metabolism including apolipoproteins, cholesterol synthesis enzymes and bile acid transporters [53]. HNF1A also has allelic associations with type 2 diabetes [54], CRP [55–57] and coronary heart disease [58]. The findings for the lipids, in particular, were similar to those obtained in previous genetic association studies in the general population. However, the relationships between inflammation (as presumptively measured by CRP) and the traits associated with obesity and cardiovascular risk are of particular interest. CRP was significantly, though not always strongly, correlated with each of the other traits at the phenotypic level and it also showed up in the multivariate association findings.

The multivariate approach helps us to understand the connections between variables. For example, for rs2075650 on chromosome 19, the multivariate approach suggested LDL, CRP, HDL and triglycerides might be associated with this particular SNP. Although the effects on LDL, HDL and triglycerides are consistent with what was expected (that is, the LDL effect is inversely associated with the HDL effect, positively associated with the triglycerides effect, and the HDL effect is inversely associated with that on triglycerides), the effects on CRP are contrary to expectation. The effect direction is opposite to those for LDL and triglycerides, and the same as that for HDL. This suggests that the alleles or haplotype which have risk-increasing effects on lipids have a potentially protective effect on CRP and (so far as the effect on CRP is reflecting the degree of inflammation) on the inflammatory process. The effect estimates of LDL for rs2075650 obtained in our study were similar to those obtained by Aulchenko et al. [43]. The effect of rs2075650 [G] on LDL was estimated as 0.160 ± 0.018 by Aulchenko et al. and 0.153 ± 0.020 in our analysis. An effect estimate for CRP at this SNP was not available from previous studies for comparison. In addition, it was interesting to observe that rs4420638, which is in partial LD (r² = 0.4) with rs2075650, has allelic effects in the opposite direction on LDL [52, 59–61] and CRP [55]. Similarly on chromosome 12, rs3213545 affects LDL and CRP (and GGT) but not HDL or triglycerides. Again, the allelic effects on LDL and CRP are in opposite directions. This shows the usefulness of the multivariate approach to help understand the connections between several trait-SNP associations, which can then be modelled and evaluated in more detail.
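The similarity of the two LDL effect estimates quoted above can be checked with simple arithmetic. The sketch below assumes that the ± values are standard errors and that the two samples are independent; neither is stated explicitly in the text, and the function name is hypothetical.

```python
import math


def z_for_difference(beta1, se1, beta2, se2):
    """z-statistic for the difference between two independent estimates.

    Assumes se1 and se2 are standard errors and the samples do not overlap.
    """
    return (beta1 - beta2) / math.sqrt(se1 ** 2 + se2 ** 2)


# Effect of rs2075650 [G] on LDL: Aulchenko et al. versus the present study.
z = z_for_difference(0.160, 0.018, 0.153, 0.020)
```

Under these assumptions z is about 0.26, far below the conventional 1.96 cut-off, which is consistent with describing the two estimates as similar.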

Our study differs from previous investigations as it examines a large number of correlated biochemical traits, initially using unrelated individuals and following up the findings in other members of the families. It confirms some published associations and identifies new ones. As our cohort consisted of adolescents and adults, results from adults, adolescents and combined (adults and adolescents) cohorts were examined and compared. Because of the larger number of adults studied, results from adults were similar to the combined data. Results from the adolescents were not notably different from the combined data.

One main limitation of our approach is that only a subset of the data (from unrelated subjects) can be used for the initial multivariate analysis. Although it would add power, it is too computationally intensive to use all the available data (that is, taking account of the family structure) in genome-wide multivariate analysis. Although a subset of data was used, the method applied in our study was very efficient and easy to perform. A more specific limitation in our data is that the glucose and insulin measurements were not made on fasting blood samples. In adults, we made adjustments for the time since the last meal but in adolescents we had to rely on the fact they were seen at the same time of day and blood was taken around three hours after the expected time of breakfast.

Another set of limitations is related to the use of biomarkers of risk or, for CRP, of systemic inflammation. It seems that some loci may affect HDL-C or triglycerides without affecting cardiovascular risk [49], and it is possible that some loci might affect serum CRP without affecting inflammation. Nevertheless the divergence between allelic effects on risk factors deserves further examination.

Conclusion

Our study demonstrated that it is useful to examine multiple phenotypes jointly in order to better understand the connections between them and to make the distinction between common and unique genetic effects. Our efficient approach (a combination of multivariate and univariate analysis) was able to identify three possible loci that might affect multiple traits, and validated 17 loci that have previously been reported. It highlighted anomalous effects on CRP, which is increasingly recognised as a marker of cardiovascular risk as well as of inflammation. Confirmation and extension of our findings will require studies which measure multiple phenotypes in each genotyped subject, and will benefit from combination of data from multiple studies to achieve sufficient power.