From quantitative trait locus mapping and linkage analysis to genome-wide association studies (GWASs), genetic markers have been used to locate causal genes underlying Mendelian and complex traits with impressive success: the molecular basis for nearly 3,000 Mendelian disorders is known [1] and over 4,500 single nucleotide polymorphisms (SNPs) have been associated with a variety of human traits and complex diseases [2]. These studies rely on linkage with the disease-causing variant and, by their very nature, indirect genetic marker studies have limitations. The causal variant or gene remains unknown for the majority of the 4,500 SNPs associated with complex disease and for over 3,500 Mendelian disorders. New sequencing-based studies have emerged and are poised to change genetic mapping fundamentally by enabling the direct identification of causal sequence variants in a single experiment. We will no longer have to rely on linkage with the disease-causing variant; instead, by obtaining full sequence data for all genes we can now directly test for association with disease. As we have learned in the past few years, however, there is a great deal of human genetic variation [3] and finding the causal variant among thousands of candidates can be difficult.

Here we review the computational and statistical approaches that have emerged for managing these data in this rapidly exploding field. First, we briefly review the process for identifying variants in next-generation sequencing (NGS) studies and then discuss strategies for identifying the causal variant in Mendelian disorders among the total number of variants identified. We also discuss strategies for identifying the causal gene(s) in complex diseases among all genes in the genome, before outlining some challenges facing current exome sequencing studies.

Variant discovery in exome sequencing projects

NGS methods have been developed that harness massively parallel DNA sequencing [4] and enable large-scale sequencing projects that have applications ranging from cataloging genetic diversity on a population level [3] to identifying a disease-causing variant in a single individual, which might lead to directed therapy [5]. Most large-scale medical sequencing projects so far have focused on the protein-coding region of the genome (the 'exome'). This has been driven in part by cost (whole genome sequencing is still relatively expensive for large sample sizes), biology (most known examples of disease-causing variants alter the protein sequence), and practical considerations (there is currently little consensus on interpreting non-coding genetic variation).

Various methods have been developed to select a subset of the genome for sequencing, but only solid-phase hybridization [6] and liquid-phase hybridization [7] have been commercially applied for selecting the entire human exome as the target for sequencing. After target enrichment, sequencing is performed using various NGS technologies, including reversible terminator reactions, sequencing by ligation, pyrosequencing and real-time sequencing [8]. These generate millions of short sequence copies, or reads, tiled across the portions of the reference genome that were targeted. Although numerous algorithms have been developed to align NGS reads to the reference genome (Bowtie, Short Oligonucleotide Analysis Package (SOAP) and Blat-like Fast Accurate Search Tool (BFAST), among others [9]), most sequencing projects use Mapping and Assembly with Qualities (MAQ) [10] or the Burrows-Wheeler Aligner (BWA) [11] because of computational efficiency and multi-platform compatibility. The resulting aligned sequence is then inspected for positions that vary from the human reference sequence; these positions are identified as SNPs.

As with alignment tools, many algorithms have been developed to identify a high-quality set of variants in NGS projects. Most current SNP discovery tools rely on the calculation of genotype likelihoods at each position [10], defined as the probability of observing the given sequencing data (base calls and base quality scores) at that position given a set of underlying genotypes. Bayesian posterior probabilities can then be calculated for each potential genotype [12]. Two popular tools for SNP discovery in NGS data that are easily incorporated into data-processing pipelines are SAMtools [13] and the Genome Analysis Toolkit UnifiedGenotyper [14, 15]. Other tools have been developed to exploit aspects of specific types of NGS technologies (optimizing base quality estimates from pyrosequences, for example) [16–18] or low-coverage sequencing data [18, 19].
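The genotype-likelihood calculation described above can be illustrated with a minimal sketch. It is not the algorithm of any particular tool: it assumes a single biallelic site, independent reads, sequencing errors spread evenly over the three other bases, and illustrative prior probabilities. The function names and the example priors are our own assumptions.

```python
import math

def genotype_likelihoods(bases, quals, ref, alt):
    """Return P(observed bases | genotype) for ref/ref, ref/alt and alt/alt.

    bases: observed base calls at one position (one per read)
    quals: Phred-scaled base quality scores, same length as bases
    """
    likelihoods = {}
    for name, alt_copies in (("ref/ref", 0), ("ref/alt", 1), ("alt/alt", 2)):
        frac_alt = alt_copies / 2  # chance that a read samples the alt allele
        log_l = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # Phred score -> error probability
            # Probability of the observed base if the read came from each allele
            p_if_ref = (1 - err) if base == ref else err / 3
            p_if_alt = (1 - err) if base == alt else err / 3
            log_l += math.log((1 - frac_alt) * p_if_ref + frac_alt * p_if_alt)
        likelihoods[name] = math.exp(log_l)
    return likelihoods

def genotype_posteriors(likelihoods, priors):
    """Combine likelihoods with prior genotype probabilities (Bayes' rule)."""
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

# Eight reads, four supporting each allele, all with base quality 30.
lik = genotype_likelihoods("AAAAGGGG", [30] * 8, ref="A", alt="G")
# Illustrative priors only; real callers derive these from population data.
post = genotype_posteriors(lik, {"ref/ref": 0.998, "ref/alt": 0.0015, "alt/alt": 0.0005})
print(max(post, key=post.get))
```

Working in log space until the final step keeps the product of many small per-read probabilities from underflowing at moderate coverage; at very high coverage a production implementation would stay in log space throughout.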

By applying the appropriate tool one can identify a set of positions in the sequencing data that are different from the reference sequence along with an indication of genotype quality. Typically 15,000 to 20,000 variants are discovered per exome, with the variation in this number occurring from different exome target definitions [20–23] (a target set with fewer genes or exons would be expected to have fewer total variants) and ancestry (individuals of African ancestry have more variants per exome than individuals of European ancestry [3], for example). By contrast, about 3 million SNPs per genome are discovered using whole-genome sequencing [24] because of the larger sequencing target (whole genome sequencing targets about 3 Gb, whereas the typical exome target is about 33 Mb). To facilitate the processing and sharing of these large datasets, the Variant Call Format (VCF) text file format [3] is emerging as the accepted format for reporting sequence variation from NGS projects, and the SAM/BAM file format is routinely being used for storing and sharing raw NGS data [13].

Challenges for variant discovery in exome sequencing projects

Because even a single base-pair change can be associated with disease, SNP discovery algorithms must robustly distinguish true variation from sequencing errors. This challenge is magnified in exome sequencing projects, in which discovering rare variants is often the goal. NGS has an inherently higher per-base error rate than Sanger sequencing [25] but is generally thought to compensate for these errors with much higher coverage (most NGS experiments for disease-association generate an average of greater than 20- to 30-fold coverage). Despite this degree of coverage, however, the higher error rate of NGS can introduce false-positive associations if cases and controls have differential coverage depths [26]. In large-scale sequencing projects aimed at discovering rare variants associated with complex disease, differential coverage between cases and controls should be one of the quality control metrics (of potentially many); however, a standardized quality control approach to NGS data has not yet emerged.

Applying exome sequencing to Mendelian disorders

Exome sequencing has been successfully used to find the causal variant in several Mendelian disorders, such as Miller syndrome [27] (a rare autosomal recessive disorder characterized by craniofacial abnormalities), Kabuki syndrome [28] (an autosomal dominant form of mental retardation with facial abnormalities), and many others [29]. It is emerging as an attractive method for disease-gene mapping in Mendelian traits when linkage studies have been inconclusive or impossible [23] (often owing to low numbers of affected individuals) or when looking for causal de novo mutations [20, 28]. Successful studies have typically analyzed fewer than ten individuals and often only affected individuals have been sequenced. These small studies are underpowered for detecting association using currently available association tests and use a different analytic approach for novel gene discovery compared with methods developed for the analysis of complex diseases.

Identifying causal variants: filtering

Various heuristic filtering methods have been used to narrow the search for the causal variant from about 20,000 to often a single variant, or to a single gene (with several independent variants; Figure 1). In general these heuristic filters rely on four main assumptions: (1) the causal variant will alter the protein coding sequence; (2) it will be extremely rare (often assumed to be shared only by cases in one family); (3) every carrier of a putative disease-causing variant will have the phenotype (complete penetrance); and (4) every individual with the disorder will carry the putative disease-causing variant (that is, complete detectance, or 100% probability of observing a genotype given the phenotype). Functional annotation can divide variants into synonymous variants (those that do not change the amino acid sequence), missense variants (those that introduce an amino acid change), and loss-of-function variants (those that prematurely truncate proteins and those disrupting protein splicing). Approximately 50 to 75% of variants can be removed from consideration by focusing only on nonsynonymous (protein-altering) changes [30, 31]. Some studies further divide variants into different classes on the basis of the predicted effects of the protein alterations (most commonly using PolyPhen [32], SIFT [33], GERP [34] or PhyloP [35]). Under the assumption that variants responsible for Mendelian disorders will not be present in publicly available databases of human genetic variation, investigators have removed variants for further consideration if they are found in HapMap [36], 1000 Genomes Project [3], dbSNP [37], and privately available variants from other exome sequencing projects (typically shared controls or cases for other phenotypes sequenced locally). Restricting the search to nonsynonymous variants not present in available databases currently reduces the list of putative causal variants to approximately 200 to 500 [23, 27, 38].
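The first two filters described above (protein-altering effect and absence from variant databases) can be sketched as follows. The `Variant` record, its field names and the effect labels are hypothetical conveniences for illustration; real pipelines take these annotations from dedicated annotation tools and database lookups.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    gene: str
    position: int
    effect: str         # "synonymous", "missense" or "loss_of_function"
    in_public_db: bool  # seen in HapMap, 1000 Genomes, dbSNP or control exomes

# Effect classes assumed to alter the protein coding sequence.
PROTEIN_ALTERING = {"missense", "loss_of_function"}

def heuristic_filter(variants):
    """Keep protein-altering variants that are absent from variant databases."""
    return [
        v for v in variants
        if v.effect in PROTEIN_ALTERING and not v.in_public_db
    ]

example = [
    Variant("GENE_A", 101, "synonymous", False),       # removed: synonymous
    Variant("GENE_A", 202, "missense", True),          # removed: already catalogued
    Variant("GENE_B", 303, "missense", False),         # kept
    Variant("GENE_C", 404, "loss_of_function", False), # kept
]
print([v.gene for v in heuristic_filter(example)])
```

Both filters encode assumptions rather than facts about the disorder, so relaxing either one (for example, replacing the absence requirement with a frequency threshold) changes only the predicate inside the list comprehension.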

Figure 1

Typical heuristic filtering applied to exome sequencing projects aimed at novel gene discovery for Mendelian disorders, along with key assumptions at each step. Each individual carries approximately 3 million SNPs. Sequential filters shown here can be applied to reduce the number of potential disease-associated variants.

Finding causal variants under a recessive model

To further narrow the search, investigators have imposed a recessive model of disease when the pedigree suggests this mode of inheritance, requiring a putative causal variant to be present in a homozygous state for all individuals (while absent in public databases), or for individuals to be compound heterozygotes in the putative gene (carrying two separate variants in the same gene), which can reduce the list to a single variant or gene [20, 22, 23]. This has been successfully performed in at least 11 studies of recessive disorders with various numbers of individuals down to as few as one, in which a single individual with Perrault syndrome (ovarian dysgenesis with sensorineural deafness) was found to have two separate non-synonymous variants in HSD17B4, a gene that is involved in peroxisomal fatty acid β-oxidation.
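As a sketch of the recessive filter just described, the function below keeps genes in which every affected individual carries either a homozygous variant or at least two heterozygous variants. The input layout is an assumption made for illustration, and the sketch treats any two heterozygous variants in a gene as a compound heterozygote; in practice, phase information or parental genotypes are needed to confirm that the two variants lie on different copies of the gene.

```python
def recessive_candidate_genes(individuals):
    """Return genes compatible with a recessive model in all affected individuals.

    individuals: list of dicts, one per affected individual, mapping
    gene name -> list of alternate-allele counts (1 = heterozygous,
    2 = homozygous), one entry per variant that survived earlier filtering.
    """
    def compatible(allele_counts):
        homozygous = any(count == 2 for count in allele_counts)
        # Assumption: two heterozygous variants are on different haplotypes.
        compound_het = sum(1 for count in allele_counts if count == 1) >= 2
        return homozygous or compound_het

    if not individuals:
        return set()
    per_person = [
        {gene for gene, counts in person.items() if compatible(counts)}
        for person in individuals
    ]
    return set.intersection(*per_person)

person_1 = {"GENE_A": [1, 1], "GENE_B": [2], "GENE_C": [1]}
person_2 = {"GENE_A": [2], "GENE_B": [1], "GENE_C": [1]}
print(recessive_candidate_genes([person_1, person_2]))
```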

These simple filtering techniques may not be sufficient, however, and additional approaches might be needed to further narrow the search. An example of this was the use of an identity by descent analysis in a sequencing study to discover the cause of hyperphosphatemia mental retardation syndrome [39]. After common variants were excluded from the list of shared variants among three affected individuals, 14 candidate genes were left; of these, however, only two were found in regions of the exome that were inferred to be identical by descent. PIGV (encoding phosphatidylinositol glycan class V), a gene that is involved in the synthesis of glycosyl-phosphatidylinositol, was identified as the causal gene after the final two candidate genes were sequenced in additional families. Our guess is that after the 'low-hanging fruit' are found, additional novel methods incorporating techniques from population and statistical genetics will be needed to identify causal genes in sequencing projects in which the answer is not immediately apparent.

Finding causal variants under a dominant model

In contrast to the autosomal recessive model of disease, there have been fewer published examples of novel gene association with autosomal dominant disorders (only four have yet been published [29]), perhaps highlighting the relative difficulty in finding such causal genes with exome sequencing. The general approach in the dominant model also relies on filtering a list of nonsynonymous variants to exclude those previously identified in either public databases or shared control exomes, and it requires affected individuals to be heterozygous for the same variant [31] or to be heterozygous for different variants in the same gene [28]. As a proof of principle for exome sequencing in gene discovery for Mendelian disorders, the exomes of four individuals with Freeman-Sheldon syndrome (a rare autosomal dominant disorder previously known to arise from mutations in myosin heavy chain 3, MYH3) were sequenced in one of the first publications detailing exome sequencing of multiple individuals [22]. MYH3 was identified as the only gene containing non-synonymous variants in all four individuals while being absent from dbSNP and other control exomes.
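The two sharing criteria used under the dominant model (the same variant in all affected individuals, or different variants in the same gene) can be sketched with set intersections. The data layout and identifiers below are hypothetical, and the input is assumed to contain only heterozygous variants that already passed the functional and database filters.

```python
def dominant_candidates(individuals):
    """Return (shared_variants, shared_genes) across affected individuals.

    individuals: list of sets of (gene, variant_id) pairs, one set per
    affected individual.
    """
    if not individuals:
        return set(), set()
    # Criterion 1: the identical variant is present in every individual.
    shared_variants = set.intersection(*individuals)
    # Criterion 2: every individual has some variant in the gene,
    # not necessarily the same one.
    gene_sets = [{gene for gene, _ in person} for person in individuals]
    shared_genes = set.intersection(*gene_sets)
    return shared_variants, shared_genes

person_1 = {("GENE_X", "v1"), ("GENE_Y", "v2")}
person_2 = {("GENE_X", "v3"), ("GENE_Y", "v2")}
variants, genes = dominant_candidates([person_1, person_2])
print(variants, genes)
```

Because a single heterozygous variant suffices, far more genes pass the gene-level criterion than under a recessive model, which is consistent with the greater difficulty noted above.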

Challenges for exome sequencing for Mendelian disorders

All exome sequencing studies for gene discovery in Mendelian disorders have relied on the assumption of complete penetrance. Under this assumption, they exclude variants from consideration if present in public catalogs of human genetic variation or unpublished datasets. As these databases expand, however, disease-causing variants might appear in one or more publicly available datasets. The limitation of requiring absence from these datasets is also apparent when one allows for a genetic model of incomplete penetrance (that is, if the phenotype is present in only some fraction of carriers). In the future such a filtering strategy might need to specify a minor allele frequency threshold in such datasets as opposed to requiring complete absence. The converse of penetrance (the probability of observing a phenotype given a genotype) is detectance (the probability of observing a genotype given a phenotype), and almost all exome sequencing studies for Mendelian disorders have relied on a model of complete detectance. The causal gene for Kabuki syndrome, however, was found only after allowing for incomplete detectance [28], and might not have been identified as MLL2 (mixed lineage leukemia 2) if the discovery panel had not been so enriched for carriers (90% of the discovery panel carried a loss-of-function variant in MLL2 compared with 60% of the replication panel). In the future, better tests will be needed that incorporate incomplete penetrance and detectance. However, it is clear that integration of gene length will be critical, as longer genes will dominate the results given the greater numbers of variants due to their size.
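Penetrance and detectance, as defined above, are linked by Bayes' rule, which the following sketch makes explicit. The parameter names are our own, and `phenocopy_rate` (the probability that a non-carrier is affected, for example because of another locus) is an assumption introduced for illustration.

```python
def detectance(penetrance, carrier_frequency, phenocopy_rate):
    """P(carrier | affected), computed from P(affected | carrier) by Bayes' rule.

    penetrance:        P(affected | carrier)
    carrier_frequency: P(carrier) in the population
    phenocopy_rate:    P(affected | non-carrier), hypothetical parameter
    """
    affected_carriers = penetrance * carrier_frequency
    affected_noncarriers = phenocopy_rate * (1 - carrier_frequency)
    return affected_carriers / (affected_carriers + affected_noncarriers)

# Complete detectance requires that no non-carriers are affected.
print(detectance(penetrance=1.0, carrier_frequency=0.001, phenocopy_rate=0.0))
# Any affected non-carriers (for example, locus heterogeneity) lower it.
print(detectance(penetrance=1.0, carrier_frequency=0.001, phenocopy_rate=0.0005))
```

The sketch shows that complete penetrance does not imply complete detectance: the two coincide only when no affected individuals arise from other causes.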

Applying exome sequencing to complex disease

GWASs have been performed for many complex traits and have identified associations with thousands of common variants (minor allele frequency typically over 5%), each conferring a modest increase in risk among carriers (with odds ratios rarely above 1.3 [40]). These 'risk alleles' are typically not causal and are associated with the phenotype of interest because of linkage disequilibrium with the causal variant. Exome sequencing studies fundamentally differ from GWASs because, in theory, they enable unbiased variant discovery and allow for direct association between phenotype and causal variant. The driving hypothesis behind complex disease exome-sequencing studies, motivated by the results of early sequencing studies [41–44], is that multiple rare variants in protein-coding genes contribute to the trait of interest. Focusing on rare genetic variation is also supported by studies predicting that numerous functional and deleterious variants segregate in the population at frequencies (0.5 to 5%) too low to be detected by GWASs [45–47]. These rare variants pose an analytical challenge, however, because they are present in so few individuals that there is low power to detect an association. Although we are still awaiting the results of the first exome sequencing studies for complex diseases, we review (below and in Figure 2 and Additional file 1) the available tests for rare variant association, some of which are likely to be applied in ongoing projects (such as the Exome Sequencing Project from the National Heart Lung and Blood Institute [48]).

Figure 2

An illustration of rare variant association tests. Cases and controls from a hypothetical complex disease exome sequencing project are depicted. The horizontal bars indicate aligned exome sequences for individuals; stars indicate the presence of a non-reference allele. Variants 1 and 4 represent low-frequency variants with predominance in cases, Variant 2 represents a singleton, Variant 3 represents a common variant, and Variant 5 represents a low-frequency variant exclusive to controls. For simplicity, these variants are displayed with similar frequency, although very rare variants represent the majority of variation in real sequencing studies. As illustrated, the specific genetic architecture underlying the complex phenotype of interest is expected to have a large role in which test is most powerful for detecting an association. Collapsing methods may be best if a burden of rare variants drives the phenotype, whereas aggregation methods may be more powerful if the full allelic spectrum is contributory. Finally, for genes harboring both risk and protective alleles, bidirectional tests may be most appropriate. See Additional file 1 for examples of methods of each type. MAF, minor allele frequency.

Single variant tests

The simplest approach to analyzing variants from exome sequencing data is to examine each one individually for association with the given phenotype. For example, dichotomous traits (myocardial infarction, diabetes, schizophrenia, and so on) can be analyzed using the χ² test for contingency tables, Fisher's exact test, Cochran-Armitage test for trend, or logistic regression [49]. These methods test for an enrichment of the 'risk' allele in cases or controls (if seen more frequently in controls, it would be deemed a 'protective' allele). An example would be finding a variant present in 3% of cases but only 1% of controls. Whether this overrepresentation is statistically significant depends on the total number of individuals in the study and the required level of statistical stringency. Quantitative traits (such as blood lipid levels, body mass index or height) can be analyzed by linear regression [49]. By definition, rare variants have low population frequency, and the statistical power to detect association with a phenotype is low for modestly sized studies. For example, assuming 10% disease prevalence, in a study with 1,000 cases and 1,000 controls, there is 2% power to detect an association for a rare variant (minor allele frequency of 0.5%), with a threefold effect at the genome-wide significance level of 5 × 10⁻⁸.
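To make the 3% versus 1% example concrete, the sketch below implements a two-sided Fisher's exact test for a 2 × 2 table of carriers and non-carriers using only the standard library. The sample size of 1,000 cases and 1,000 controls is an assumption chosen for illustration, and the two-sided P-value is defined here as the sum of the probabilities of all tables no more likely than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are cases and controls; columns are carriers and non-carriers.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the cases
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables as extreme as, or more extreme than, the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# 30 of 1,000 cases (3%) and 10 of 1,000 controls (1%) carry the variant.
p_value = fisher_exact_two_sided(30, 970, 10, 990)
print(p_value, p_value < 5e-8)
```

Running the example shows a P-value that would be considered nominally significant yet is far from the genome-wide threshold, which illustrates why single variant tests lack power for rare variants at these sample sizes.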

Multiple variant tests

Groups of variants can be analyzed together in an attempt to improve power. In whole genome sequencing, a sliding window can be used to group variants, whereas in exome sequencing the natural unit of grouping is one gene. Alternative splicing can complicate this analysis, however, as a single variant might belong to multiple transcripts of the same gene with different functional effects (a variant might be classified as synonymous for one transcript and missense for another, for example). To extend the single variant tests above, single-SNP P-values from multiple variants can be combined by Fisher's [50] or Stouffer's [51] methods. Variants can also be combined in multiple logistic or linear regression models. However, because these simple approaches still essentially test each variant separately and then combine evidence from multiple variants, the results must be adjusted for many degrees of freedom, which will limit the power of these approaches.
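Fisher's and Stouffer's methods for combining single-variant P-values can be sketched with the standard library alone. Both versions below assume that the P-values are independent (which linkage disequilibrium between variants can violate), and the Stouffer version additionally assumes one-sided P-values and equal weights; these are simplifications for illustration.

```python
import math
from statistics import NormalDist

def fisher_combined(p_values):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k df."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the test statistic
    # Closed-form chi-square survival function for an even number (2k) of df:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def stouffer_combined(p_values):
    """Stouffer's method: average the z-scores, then convert back to P."""
    normal = NormalDist()
    z = sum(normal.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1 - normal.cdf(z)

single_snp_p = [0.04, 0.10, 0.03, 0.20]  # hypothetical P-values in one gene
print(fisher_combined(single_snp_p), stouffer_combined(single_snp_p))
```

With a single P-value both functions return that P-value unchanged, which is a convenient sanity check.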

Given the large amount of human genetic variation, it would not be surprising to find neutral variants in a causal gene. Therefore, selecting a subset of variants for regression can improve the power to detect an association. For example, synonymous variants are typically discarded because they are less likely to be causal. Shrinkage and regularization regression methods such as LASSO [52], ridge regression [53], and stepwise regression have been proposed for association studies. In these methods, the regression model is fitted while accounting for the cost of adding each additional variable to the model. Other approaches, such as logic regression [54] and the method proposed by Han and Pan [55], use data-driven combinations of variants to select variables for regression.

Collapsing methods

Another approach to increasing power is to collapse multiple rare variants together for analysis. The framework of these tests involves collapsing all variants across a unit (each gene being a unit, for example) together so that even if variants are individually rare, they might be jointly present in sufficient frequency to be used in a univariate test. When used for dichotomous traits, collapsing methods test whether the overall burden of rare variants is higher in cases than controls. For example, CAST [56] examines the differences in the number of individuals with one or more rare variants between cases and controls, and the CMC test [57] is based on comparison of non-synonymous rare variants between cases and controls. These tests rely on designating a set of variants as 'rare' for inclusion, and it is not surprising that altering this definition can greatly influence the association results. Unfortunately there is little guidance in this area and allele frequency thresholds of 1% or 5% are commonly (and arbitrarily) chosen. An alternative approach has been developed that uses the data to select the best variants. The variable-threshold test [58] finds the frequency threshold that best discriminates cases from controls. Similarly, RareCover [59] aims to find the optimal set of variants to collapse together. Although there have been no published complex-disease exome sequencing studies, these tests have been applied to candidate gene sequencing results [58, 60].
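The collapsing idea can be sketched as follows: variants below a frequency threshold are merged into a single carrier indicator per individual, and carriers are counted in cases and controls. This is a simplified illustration in the spirit of a burden test rather than a faithful implementation of CAST or CMC; it assumes the alternate allele is the minor allele and estimates frequencies from the pooled sample.

```python
def collapse_rare_carriers(genotypes, is_case, maf_threshold=0.01):
    """Collapse rare variants in one gene into a 2 x 2 carrier table.

    genotypes: rows = individuals, columns = variants; entries are
               alternate-allele counts (0, 1 or 2)
    is_case:   one boolean per individual
    Returns [[case carriers, case non-carriers],
             [control carriers, control non-carriers]].
    """
    n = len(genotypes)
    n_variants = len(genotypes[0])
    # Select variants observed at least once and no more often than the threshold.
    rare_columns = [
        j for j in range(n_variants)
        if 0 < sum(row[j] for row in genotypes) / (2 * n) <= maf_threshold
    ]
    table = [[0, 0], [0, 0]]
    for row, case in zip(genotypes, is_case):
        carrier = any(row[j] > 0 for j in rare_columns)
        table[0 if case else 1][0 if carrier else 1] += 1
    return table

# Toy data: 4 cases then 4 controls, 3 variants (the second one is common).
genotypes = [
    [1, 1, 0], [0, 1, 0], [1, 2, 0], [0, 0, 0],
    [0, 1, 0], [0, 1, 1], [0, 2, 0], [0, 0, 0],
]
is_case = [True] * 4 + [False] * 4
# A loose threshold is used only because the toy sample is tiny.
print(collapse_rare_carriers(genotypes, is_case, maf_threshold=0.2))
```

The resulting table can be analyzed with any univariate test for 2 × 2 tables. Changing `maf_threshold` changes which columns are collapsed, which is exactly the sensitivity to the definition of 'rare' noted above.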

Aggregation methods

An alternative to the collapsing methods involves aggregation, which aims to summarize the information from many variants while appropriately weighting the contribution of each variant. Although collapsing methods discard variants that are considered unlikely to be causal, aggregation methods aim to include the full frequency spectrum of alleles (rare and common) in the association test. The weighted-sum statistic [61] weights variants according to allele frequency (rare variants are given stronger weighting) because of an assumption that functional variants of large effect are kept at a low population frequency by purifying selection. Weighting variants by apparent effect size is also effective and is implemented in KBAC [62] and the test described by Ionita-Laza et al. [63]. These tests have been applied to candidate gene sequencing results [58].
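Frequency-based weighting can be sketched as below. The weights are patterned on our reading of the weighted-sum statistic [61] (allele frequency estimated in controls with a pseudocount, and each allele weighted by the inverse of its estimated standard deviation), but the sketch departs from the published test in one respect that we flag as an assumption: it permutes the raw sum of case scores rather than a rank sum, purely for brevity.

```python
import math
import random

def weighted_scores(genotypes, is_case):
    """Per-individual score: sum of allele counts, up-weighting rare variants."""
    n = len(genotypes)
    n_controls = sum(1 for case in is_case if not case)
    weights = []
    for j in range(len(genotypes[0])):
        alt_in_controls = sum(
            row[j] for row, case in zip(genotypes, is_case) if not case
        )
        # Pseudocounts keep the frequency estimate strictly between 0 and 1.
        q = (alt_in_controls + 1) / (2 * n_controls + 2)
        weights.append(math.sqrt(n * q * (1 - q)))
    return [sum(g / w for g, w in zip(row, weights)) for row in genotypes]

def permutation_p_value(scores, is_case, n_perm=1000, seed=0):
    """Permute case/control labels to assess the sum of case scores."""
    rng = random.Random(seed)
    n_cases = sum(1 for case in is_case if case)
    observed = sum(s for s, case in zip(scores, is_case) if case)
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if sum(shuffled[:n_cases]) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

genotypes = [[1, 0], [1, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
is_case = [True, True, True, False, False, False]
scores = weighted_scores(genotypes, is_case)
print(scores, permutation_p_value(scores, is_case))
```

In the toy data the first variant is absent from controls and therefore receives a larger weight per allele than the second, common variant.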

Extensions to these methods

Accounting for covariates

The association of genotype with phenotype can be confounded by various factors such as ancestry, age and sex. Methods that can directly account for such covariates can be advantageous in discerning the causal effect of genetic variants. When a test does not directly accommodate covariates, regressing the genotype and phenotype on the covariate and using the residuals for the association analysis can remove the effect of the covariate on the phenotype.
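The residual approach described above can be sketched for a single covariate using ordinary least squares. The function name and the toy values are illustrative assumptions; with several covariates a multiple regression would be fitted instead, and the same function would be applied to the genotype values as well as the phenotype.

```python
def residuals(values, covariate):
    """Residuals of `values` after simple linear regression on `covariate`."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate is constant; nothing to adjust for")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]

age = [35.0, 42.0, 50.0, 61.0, 68.0]           # hypothetical covariate
ldl = [110.0, 118.0, 131.0, 140.0, 152.0]      # hypothetical phenotype
adjusted_ldl = residuals(ldl, age)
print(adjusted_ldl)
```

By construction the residuals sum to zero and are uncorrelated with the covariate, so any remaining association with genotype cannot be attributed to a linear effect of that covariate.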

Accounting for risk and protective alleles together

The effects of genetic variants can be neutral, protective or detrimental for a given disease trait. Many existing methods test for a frequency differential of variants between cases and controls and a mixture of positive and negative effects will adversely affect these tests. For example, PCSK9 (encoding proprotein convertase subtilisin/kexin type 9), a gene associated with cholesterol levels and coronary artery disease, contains both risk-lowering loss-of-function variants and gain-of-function variants that increase risk [64]. Testing for a difference in the aggregate of these alleles in either cases or controls would not be expected to yield significant results as cases will be enriched for risk variants and controls will be enriched for protective variants, effectively canceling each other out in the sum total. Methods that account for a mixture of directions of effects can be more powerful in such scenarios, and several tests explicitly account for bidirectionality of effects (Additional file 1). The prevalence of genes with variants having bidirectional effects is currently unknown but loss-of-function variants are expected to be more abundant in the general population and this bidirectional effect may be less apparent for sequencing studies not focusing on phenotypic extremes. Regardless, it is likely that multiple genes in a common pathway would have alleles with bidirectional effects, and if a collapsing method is used to group variants across a pathway, these tests may become increasingly useful.

Incorporating functional annotations

Several studies have shown that using functional information improves the power to detect association [58, 65–67]. For protein-coding variants this can include the predicted effect on protein function, using programs such as SIFT [33, 68], PolyPhen [32, 69], Panther [70, 71], MutationAssessor [72], SNAP [73] and PupaSuite [74]. For non-coding variants, evolutionary conservation and functional effects can be assessed using programs such as PhyloP [75], PhastCons [76], SCONE [77] and SiPhy [78].

Statistical power

The statistical power of the methods to test for association with rare variants has not been systematically analyzed. Although articles that describe novel association tests usually provide power comparisons to previous methods, these calculations are often performed under specific assumptions about the genetic architecture of the trait that favor the test being introduced and might not be representative of human traits in general [79, 80]. Extending the results from theoretical studies [81] and early sequencing studies of candidate genes [41, 42, 82] suggests that approximately 10,000 exomes are needed to achieve genome-wide significance for complex traits (in which a Bonferroni-corrected P-value for 20,000 genes would require P < 2.5 × 10⁻⁶). Even the most powerful of the methods available for analyzing sequencing data will not lower these requirements substantially. It would not be surprising, then, if the first exome sequencing association studies are underpowered and their findings need to be replicated with additional sequencing or genotyping (or both) [83].
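The gene-based significance threshold quoted above follows from a Bonferroni correction, sketched below. The family-wise error rate of 0.05 is the conventional choice and is assumed here; the number of tests is the approximate gene count given in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the family-wise error rate."""
    return alpha / n_tests

# One test per gene, about 20,000 genes, family-wise error rate of 0.05.
gene_based_threshold = bonferroni_threshold(0.05, 20_000)
print(gene_based_threshold)
```

Because the threshold scales with the number of tests, running several association tests per gene tightens it further, which is one cost of applying a large battery of tests to the same data.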

Which test(s) should be used?

The decision regarding the use of specific tests will depend on many factors, including study design (if the trait is quantitative or dichotomous), the assumption of the underlying genetics (whether only rare variants or both rare and common variants are expected to contribute to disease, whether protective and risk variants are expected), and pragmatic considerations (which test is available for use). Most importantly, different tests are powered to detect associations for different aspects of genetic architectures (number of affected loci, associated population frequencies, or associated effect sizes and directions) [79, 84, 85]. Currently, no software suite contains more than a small number of tests and input formats vary between available software packages, which complicates applying multiple tests to the same study. In the future we expect multiple tests to be implemented in available software suites.

Challenges for exome sequencing applied to complex disease

Numerous tests have been developed for analyzing sequencing data (Additional file 1). Running a large battery of these tests comes at the cost, however, of having to penalize multiple hypothesis testing, as well as potential confusion over inconsistent results (a gene can be highly ranked in one test and not significant in another, for instance). Regardless of the test, unless rare variants have a surprisingly large phenotypic effect on complex diseases, achieving sufficient statistical power will require large studies. DNA sequencing costs will continue to decrease, however, and adequately sized studies might soon be performed (simulations suggest that 5,000 cases and 5,000 controls would provide adequate power to detect association for rare variants with modest effect [81]). Combining results from different studies on the same phenotype is an attractive intermediate option (as has been seen with increasingly larger GWAS meta-analyses). This will probably prove more challenging than GWAS meta-analysis, however, as differences in results from multiple sequencing centers (perhaps with different sequencing technologies or different exome target definitions, for example) can introduce significant technical artifacts. Once putative variants have been discovered, the replication strategy for exome studies will depend on the genetic architecture discovered in the analysis. Disease-associated low-frequency polymorphisms can be verified with follow-up genotyping. If the phenotype is caused by a collection of singleton variants, however, further sequencing in additional individuals will be needed and might prove expensive (especially if multiple genes are being considered or if genes are large or have many exons).

Prospects for the future

The growing number of exome sequencing studies demonstrates the power of this approach in mapping genes involved in Mendelian phenotypes. The overall success rate of this approach is uncertain, however, as publication bias makes it unclear how many studies fail to identify a causal locus by exome sequencing. Non-allelic heterogeneity, regulatory variation and structural variation underlying phenotypes all pose challenges for sequencing-based discovery of Mendelian genes. It is possible that new statistical and computational methods will increase the already impressive success rate of exome sequencing studies for Mendelian disorders.

Although we are still awaiting the completion of the first exome sequencing studies focusing on complex phenotypes, the early studies will probably be underpowered because current sequencing costs prohibit the adequately sized samples discussed above (10,000 samples). Owing to this lack of power, the first studies may not result in the discovery of numerous novel loci involved in traits of medical relevance. We believe that the enthusiasm for sequencing studies should not be diminished, however, because this technology has already shown great promise in the field of Mendelian disorders and sequencing costs will continue to decline, leading to adequately powered studies for complex traits. Technology already allows for the complete characterization of genetic diversity. The success of complex trait genetic research will now be determined by our ability to interpret the data and assemble sufficiently large well-phenotyped clinical populations.