Background

In the past decade, genetic locus discovery for human traits and diseases has been advanced via genome-wide association studies (GWAS). Recent technological improvements that make genotype data generation highly cost- and time-effective, together with powerful, easy-to-use software tools, have played a major role in these advances, facilitating fast analysis of constantly increasing amounts of data. Clearly, the next advances in the field of genomics will be based on large-scale sequencing and other high-dimensional omics data. A key challenge for successful utilisation of these data lies, once again, in the availability of powerful methods and user-friendly software tools that enable researchers to make rapid discoveries [1].

Large-scale sequencing efforts, such as the 1000 Genomes Project [2] and, more recently, the UK10K Project [3] and the Haplotype Reference Consortium [4], have enabled better characterisation of variation in the human genome, especially in the low-frequency and rare variant range. Here, we denote all variants with minor allele frequency (MAF) < 5% as rare variants (RVs). Imputation based on the variants detected by these projects yields high-quality genotype data even down to 0.01% allele frequency [5]. Large-scale sequencing data generation encourages method and software development for elucidating RV effects, since traditional single-variant methods are underpowered to detect RV associations. Several methods and related software tools have been proposed, including burden tests using collapsing techniques, variance-component tests and combinations of the two [6].

There has also been increasing interest in analysing high-dimensional phenotypic and omic data, such as metabolomics, in relation to human genome variation. Multi-phenotype analysis (MPA), i.e. joint analysis of multiple phenotypes, is an example of recent developments in the field. Several methods and related software for single-variant MPA, including Bayesian and frequentist approaches, have recently been published [7]. The MPA approach is motivated by several factors: 1) it boosts power for locus discovery [8–11]; 2) it provides more precise parameter estimates [12]; and 3) it offers biological advantages, such as the possibility to identify multi-phenotype effects, including pleiotropy [13], where one locus affects multiple phenotypes. The power improvement offered by MPA is especially relevant from a computational point of view: to enable the discovery of further loci for complex traits, analyses will need to be based on hundreds of thousands of individuals, such as those available from the UK Biobank and other new large-scale sequencing-based efforts. Storage and computational load for such amounts of data will pose a challenge, and alternative strategies for boosting power for locus discovery, other than increasing sample size, therefore bring an enormous advantage.

We propose MARV, a novel tool for RV MPA that enables joint analysis of large-scale, high-dimensional genomic and phenotypic data. It extends the burden test for RVs to high-dimensional phenotypic data by applying the MPA approach. Methods designed for MPA of RVs have recently been proposed [14–16], but these have several limitations regarding scalability and the ability to combine continuous and discrete phenotypes; more importantly, the associated software either lacks an easy user interface or is computationally inefficient – key features for facilitating fast discoveries. MARV enables analysis of both continuous and binary phenotypes, as well as genotyped, imputed or sequenced data, and is computationally efficient for large-scale data. From a user's point of view, it accepts the standard data formats used by other GWAS software, and analyses are run through a command line interface familiar from widely used GWAS software such as Plink [17] and SNPTEST [18], enabling researchers to transition quickly and effortlessly from standard single-variant, single-phenotype GWAS to region-based analysis of multiple phenotypes.

Implementation

The method on which MARV is based is briefly introduced in Methods and is extensively described, including power simulations, elsewhere [19]. MARV is written in C++ and has a command line user interface. A single run of MARV consists of just one step; the required input files, commands and resulting output files are described below.

Data input and commands

MARV requires three files for a successful run: sample, genotype and genomic region input files (Fig. 1). The sample and genotype input files should be in SNPTEST v2 [18] format. The genomic region file should contain three columns: the name of the region, and the start and end positions of the region. It is important that the positions in this file correspond to the positions in the genotype file, i.e. the same genome build should be used for both (Fig. 2).

Fig. 1

Workflow of a MARV run including required files, commands and resulting output files

Fig. 2

Examples of the required input file formats for MARV and the resulting output files

The user then needs to specify the phenotypes to be analysed (--pheno_name), each corresponding to a column name in the sample file, and the method to use for the analysis, i.e. whether to analyse the genotype dosages derived by the software from the imputation probabilities (--method expected) or the thresholded genotypes based on a pre-defined cut-off (--method threshold, with a default cut-off of 0.9 that can be changed with the --call_thres option) (Fig. 1). Additionally, the user may specify several other options, such as individuals or SNPs to extract or exclude from the analysis. It is important to specify the minor allele frequency threshold (--rare_thresh, by default 0.05, meaning that only variants with MAF < 5% will be included in the analysis). All options available in the latest version of MARV can be found in the online manual of MARV.
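To make the distinction between the two methods concrete, the sketch below shows how per-variant genotypes could be derived from SNPTEST-style genotype probabilities. This is a conceptual Python illustration only (MARV itself is implemented in C++); the function and variable names are ours, not MARV internals, and the assumption that MARV flips to the minor allele where needed is ours.

```python
import numpy as np

def expected_dosage(probs):
    """Expected allele count from imputation probabilities (--method expected).

    probs: array of shape (n_individuals, 3) with P(AA), P(AB), P(BB) per
    individual for one variant (SNPTEST-style ordering).  Returns the expected
    count of the coded (second) allele; flipping to the minor allele based on
    MAF is assumed to happen elsewhere.
    """
    probs = np.asarray(probs, dtype=float)
    return probs[:, 1] + 2.0 * probs[:, 2]          # dosage in [0, 2]

def threshold_genotype(probs, call_thres=0.9):
    """Hard-called genotype (--method threshold): the most likely genotype if
    its probability reaches the call threshold (default 0.9), else missing."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)                      # 0, 1 or 2 copies
    called = probs.max(axis=1) >= call_thres
    geno = best.astype(float)
    geno[~called] = np.nan
    return geno

# Example: two individuals, one variant
p = [[0.02, 0.95, 0.03],      # confidently heterozygous
     [0.50, 0.40, 0.10]]      # too uncertain to call at threshold 0.9
print(expected_dosage(p))      # [1.01 0.6 ]
print(threshold_genotype(p))   # [ 1. nan]
```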

Data analysis

MARV proceeds across the genome by going through the specified gene regions one by one. Based on the gene boundaries and the desired rare variant cut-off, it calculates, for each individual, the proportion of minor alleles carried at RVs within the region [20]. After this calculation has been performed for all individuals, a linear regression is fitted using the proportion as the outcome and the listed phenotypes as its predictors. The likelihood contribution of each individual is weighted by the number of successfully genotyped or imputed RVs in the region of interest. For each genomic region, the weighted linear regression is fitted for all phenotype combinations; for example, if a user specifies phenotypes pheno_a and pheno_b, three models for the proportion are fitted with the following predictor combinations: 1) pheno_a + pheno_b, 2) pheno_a, 3) pheno_b. MARV calculates the Bayesian Information Criterion (BIC) for each model to help the user identify the best-fitting phenotype combination.
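As a guide to what this per-region computation involves, the following Python sketch mirrors the steps described above: the burden (proportion of minor alleles) is computed per individual, a weighted linear regression is fitted for every non-empty phenotype combination with the number of non-missing RVs as weights, and a likelihood ratio P-value and BIC are recorded for each model. It is a conceptual sketch based on the description here and in Methods, not MARV's C++ implementation; all names are ours.

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def rare_allele_proportion(geno):
    """Per-individual rare-variant burden for one region.

    geno: (n_individuals, n_rare_variants) array of minor-allele counts
    (0, 1 or 2), with NaN where a genotype could not be called.  Following
    the model in Methods, r_i is the number of minor alleles carried and
    n_i the number of RVs with data for individual i; the outcome is
    r_i / n_i and n_i also serves as the regression weight.  (Individuals
    with no called RVs would need special handling.)
    """
    r = np.nansum(geno, axis=1)
    n = np.sum(~np.isnan(geno), axis=1)
    return r / n, n

def fit_all_models(geno, phenos):
    """Fit the weighted regression for every non-empty phenotype combination.

    phenos: dict mapping phenotype name -> 1D array (one value per individual).
    Returns (combination, P-value, BIC) for each fitted model.
    """
    y, w = rare_allele_proportion(np.asarray(geno, dtype=float))
    null = sm.WLS(y, np.ones_like(y), weights=w).fit()   # intercept-only null model
    results = []
    names = list(phenos)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            X = sm.add_constant(np.column_stack([phenos[c] for c in combo]))
            fit = sm.WLS(y, X, weights=w).fit()
            lr_stat = 2.0 * (fit.llf - null.llf)         # likelihood ratio vs. beta = 0
            p_value = chi2.sf(lr_stat, df=len(combo))    # chi^2 with K degrees of freedom
            results.append((combo, p_value, fit.bic))
    return results
```

With K phenotypes this fits 2^K − 1 models per region (e.g. 3 models for two phenotypes, as in the example above).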

Output files

MARV produces three files by default: a .error file, a .log file and a .result file (Fig. 1). The error file will be empty if the run completed successfully; otherwise, details of problems encountered during the run are reported (Fig. 2). The log file gives specific details of the analysis, including the number of samples in the sample and genotype files and the number of phenotypes used in the analysis. It also lists the variants included in the analysis of each genomic region, along with their MAFs. The results file includes one row of results per genomic region. If the user requests printing of all possible model combinations (--print_all), there will be as many rows per gene as there were model combinations fitted. This file reports the log-likelihood and BIC of each model as well as its P-value. We note that the P-value is not corrected for multiple testing. If the user is interested in the effect estimates and their standard errors for each model member, i.e. each phenotype included in the fitted model, a separate .betas file can be requested (--betas) (Figs. 1 and 2). A complete list of the columns in the output files, with their meanings, is provided in the online tutorial of MARV.

Case study

To illustrate the use of MARV across the genome, we applied it to data from the Northern Finland Birth Cohort 1966 (NFBC1966), which covers over 96% of all births in the two northernmost provinces of Finland in 1966 (N = 12,068 live-born children) [21]. We included data from 4,721 cohort members who had participated in the 31-year clinical examination and had genetic data as well as data on triglycerides (TG), fasting insulin (FI) and waist-to-hip ratio (WHR). The Ethics Committees of the University of Oulu and the Northern Ostrobothnia Hospital District approved the study, and the individuals included in the analyses provided written informed consent.

Motivation for the selection of these traits comes from a common variant single-trait GWAS, which showed an enrichment of FI associations among SNPs preselected on the Metabochip for TG and waist phenotypes [22]. For the selected traits, we applied the following criteria: 1) FI: exclude non-fasting individuals and/or those with type 1 or 2 diabetes mellitus, on diabetes treatment, with fasting blood glucose ≥ 7 mmol/l, and/or pregnant; 2) TG: exclude non-fasting individuals and/or those known to be on lipid-lowering medication. We regressed each trait on sex, body mass index and the first three principal components derived from the genetic data to control for potential population structure. An inverse normal transformation was further applied to the residuals of WHR and TG to reduce skewness.
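For readers wishing to reproduce this phenotype preparation, the Python sketch below illustrates one way to adjust a trait for covariates and apply a rank-based inverse normal transformation to the residuals. The column names are hypothetical placeholders, and the (rank − 0.5)/n offset is our assumption; the paper does not specify which variant of the transformation was used.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

def adjusted_residuals(df, trait, covariates):
    """Residuals of a trait regressed on the listed covariates (complete cases assumed)."""
    X = sm.add_constant(df[covariates])
    return sm.OLS(df[trait], X).fit().resid

def inverse_normal(values):
    """Rank-based inverse normal transformation using (rank - 0.5) / n."""
    values = np.asarray(values, dtype=float)
    ranks = pd.Series(values).rank().to_numpy()
    return norm.ppf((ranks - 0.5) / len(values))

# Hypothetical column names for the NFBC1966 phenotype table:
# covars = ["sex", "bmi", "pc1", "pc2", "pc3"]
# tg_for_marv = inverse_normal(adjusted_residuals(pheno, "tg", covars))
```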

DNA was extracted from blood samples drawn after an overnight fast at the 31-year clinical examination. Genotyping was performed on the Illumina HumanCNV370DUO Analysis BeadChip platform at the Broad Institute, USA, with the BeadStudio algorithm used for genotype calling. Detailed genotyping and sample quality control (QC) of the first set of data have been reported previously [23]. Additional samples were genotyped later, resulting in 5,402 subjects and 324,896 SNPs available for analysis. The 1000 Genomes Project “all ancestries” reference panel (March 2012) was used for imputation, resulting in ~38 M SNPs for analysis.

We analysed the transformed residuals in MARV using the method “threshold” (option -m threshold), i.e. genotypes with a probability of 0.95 or higher were considered called, whilst all others were considered missing. The gene list from the University of California Santa Cruz (UCSC, NCBI genome sequence build 37, hg19) [24] was used to define gene regions, and a significance level of 1.67 × 10⁻⁶ was adopted based on a Bonferroni correction for 30,000 genes. We analysed all variants across the autosomal chromosomes, irrespective of their annotation, using the following cut-offs: MAF < 5% and imputation quality > 0.4.
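The adopted significance level corresponds to a Bonferroni correction across the roughly 30,000 gene regions tested, assuming the conventional genome-wide α = 0.05 as the nominal level (our assumption):

$$ \alpha_{\text{region}} = \frac{0.05}{30{,}000} \approx 1.67 \times 10^{-6}. $$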

Results and discussion

Case study

The three selected phenotypes, FI, TG and WHR, were modestly correlated with each other (r(FI,TG) = 0.37, r(FI,WHR) = 0.18, r(TG,WHR) = 0.19). The multi-phenotype analysis of the three phenotypes revealed genome-wide significant associations covering two gene regions on chromosome 11: the APOA5 (apolipoprotein A-V) and ZNF259 (zinc finger protein 259, also known as ZPR1) genes (Table 1, Figs. 3 and 4). Besides the full model, MARV also provides parameter estimates and association tests for each phenotype combination, including the single-phenotype models. We were therefore able to compare the results of the joint analysis against those of traditional single-phenotype analyses. Additionally, the BIC provided by MARV for each sub-model was used to select the phenotype combination providing the best fit. At APOA5, the best-fitting model according to the BIC contained TG only (P = 2.0 × 10⁻⁷), while at ZNF259, the model with FI and TG gave the lowest BIC and hence support for the best fit (P = 1.8 × 10⁻⁹) (Table 1). The model with FI and TG provided a lower P-value than those obtained from the univariate models (P(TG) = 6.5 × 10⁻⁸ and P(FI) = 0.27), suggesting that at least the association with FI would have been missed in univariate analyses. The effects of TG and FI on the rare allele load were in opposite directions: while an increase in TG levels was associated with a greater proportion of minor alleles within ZNF259, the opposite was true for FI. This also held for the univariate models (Table 1). All results files output by MARV are available as Additional files 1, 2, 3 and 4.

Table 1 Results for loci reaching genome-wide significance in the multi-phenotype rare variant analysis of NFBC1966 (N = 4,721). Regression coefficients with their standard errors (SE) are reported, followed by the P-value and the Bayesian Information Criterion (BIC) for the analysed model. TG, triglycerides; ln(FI), natural logarithm transformed fasting insulin; WHR, waist-to-hip ratio
Fig. 3

QQ-plot of MARV analysis results on triglycerides, fasting insulin and waist-to-hip ratio in the NFBC1966

Fig. 4

Manhattan plot of MARV analysis results on triglycerides, fasting insulin and waist-to-hip ratio in the NFBC1966. Genes reaching genome-wide significance (P < 1.67 × 10⁻⁶) are annotated

Common variants in these two genes have previously been associated with TG, total cholesterol, high-density lipoprotein, low-density lipoprotein, apolipoproteins A1 and B, coronary heart disease, coronary artery disease, plasma viscosity, Lp-PLA2 activity, prostate cancer, and circulating vitamin E levels [25–33]. A recent large-scale GWAS also reported RV associations at ZNF259 with triglyceride levels [34]. Our analysis pointed to multi-phenotype effects involving TG and FI. A recent study in Japanese individuals showed evidence of association between variation in ZNF259 and type 2 diabetes [35], making this locus of interest for further investigation of the pathogenesis of the disease. Interestingly, in our MPA the effects of TG and FI on the rare allele load at ZNF259 were in opposite directions, contrary to our expectations, since elevated TG levels usually correlate with elevated rather than decreased FI levels.

Running time and memory

We measured the running time and memory usage of MARV by performing additional analyses on the NFBC1966 data with different numbers of individuals and phenotypes and on chromosomes of different sizes. For these analyses, we used 2,405 and 4,809 (i.e. roughly double) individuals with complete data on eight continuous phenotypes. We analysed combinations of two, four and eight continuous phenotypes and used 1000 Genomes-imputed data for chromosomes 1 and 22 in the association analyses. All analyses were run, and their performance data collected, on the Imperial College HPC cluster, with compute nodes equipped with Intel(R) Xeon(R) E5-2620 v2 CPUs @ 2.10 GHz.

The results are summarised in Table 2. The size of the genomic region to be analysed notably affects the running time. However, the relationship between the number of phenotypes and the computation time required is not linear: for example, the increase in time for chromosome 1 is just under 3 h (17% over the original time) even when the number of phenotypes is doubled from four to eight, which increases the number of models to be fitted from 2⁴ − 1 = 15 to 2⁸ − 1 = 255. Doubling the sample size roughly triples the runtime.

Table 2 Computational time and peak memory usage of MARV by varying sample size, chromosomal size and number of phenotypes

The memory usage of MARV depends more on the size of the genetic data and the number of individuals analysed than on the number of phenotypes. In our example, peak memory usage was almost constant across all chromosome 1 and chromosome 22 analyses when the sample size remained the same, independent of the number of phenotypes in the model (Table 2). However, considering the size difference between these two chromosomes (Table 2), we note that the increase in memory usage is also not linear.

Conclusions

Our novel tool MARV allows RV analysis of multiple phenotypes in a computationally efficient and user-friendly manner. The data input formats and the command line interface, familiar from widely used GWAS software, offer researchers a quick setup for their analyses. Moreover, the ability to analyse all phenotype combinations within one run, together with the calculation of the BIC to aid model selection, will pave the way for rapid discoveries and novel insights into the biology of complex traits.

Methods

Statistical model

MARV is based on a so-called “reverse regression” approach: compared with standard GWAS, in which the phenotype is the outcome and the genotype the predictor, this scenario is reversed in MARV. By using the genetic data as the outcome, we enable assessment of associations with multiple phenotypes simultaneously through simple linear regression. While the “reverse regression” approach has previously been proposed for single genetic variants, with the risk allele count or allele dosage as the outcome [36, 37], MARV uses the mutational load (burden) of risk alleles at RVs as the outcome. That is, the outcome is the proportion of RVs at which minor alleles are carried by an individual within a genomic region. This proportion is then modelled as a linear combination of K phenotypes. Mathematically, if r_i is the number of minor alleles carried at RVs and n_i is the total number of RVs, the model becomes:

$$ r_i\, n_i^{-1} = \alpha + \boldsymbol{\beta}\, \mathbf{y}_i + \varepsilon_i, $$

where r_i n_i⁻¹ is the proportion of minor alleles for the ith individual, y_i is the vector of phenotype data for individual i, with corresponding regression coefficients β = [β_1, …, β_K], and ε_i ~ MVN(0, σ²), σ² being a covariance matrix. Weighted linear regression is applied to allow weighting by the number of successfully genotyped or imputed RVs within the region of interest. Significance testing is based on a likelihood ratio test that compares the weighted likelihood of the fitted model against that of a null model in which β = 0. The likelihood ratio test statistic has an approximate χ² distribution with K degrees of freedom.
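In symbols, writing ℓ̂₁ for the maximised weighted log-likelihood of the fitted model and ℓ̂₀ for that of the null model with β = 0, the test statistic described above is

$$ \Lambda = 2\left(\hat{\ell}_1 - \hat{\ell}_0\right) \;\sim\; \chi^2_K \quad \text{(approximately)}, $$

and the reported P-value for a model with K phenotypes is the upper tail probability of this χ²_K distribution.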

The type I error rate and power of the method have been tested under various scenarios with simulated phenotype and genotype data, and the results from these analyses are described in detail elsewhere [19].

Availability and requirements

Project name: MARV

Project home page: https://github.com/ImperialStatGen/MARV

Operating system(s): UNIX

Programming language: C++

Other requirements: Standard Linux/UNIX build tools to compile the program.

License: BSD 3-Clause License

Any restrictions to use by non-academics: None