Current Psychiatry Reports, 15:349

Evaluating Rare Variants in Complex Disorders Using Next-Generation Sequencing

Authors

  • Matthew Ezewudo
    • Department of Human Genetics, Emory University
  • M.E. Zwick
    • Department of Human Genetics, Emory University
Genetic Disorders (JF Cubells and EB Binder, Section Editors)

DOI: 10.1007/s11920-013-0349-4

Cite this article as:
Ezewudo, M. & Zwick, M.E. Curr Psychiatry Rep (2013) 15: 349. doi:10.1007/s11920-013-0349-4
Part of the following topical collections:
  1. Topical Collection on Genetic Disorders

Abstract

Determining the genetic architecture of liability for complex neuropsychiatric disorders like autism spectrum disorders and schizophrenia poses a tremendous challenge for contemporary biomedical research. Here we discuss how genetic studies first tested, and rejected, the hypothesis that common variants with large effects account for the prevalence of these disorders. We then explore how the discovery of structural variation has contributed to our understanding of the etiology of these disorders. We next focus on the rise of fast and inexpensive oligonucleotide sequencing and methods of targeted enrichment, and on their influence on the search for rare genetic variation contributing to complex neuropsychiatric disorders. Finally, we consider the technical challenges and future prospects for the use of next-generation sequencing to reveal the complex genetic architecture of complex neuropsychiatric disorders in both research and clinical settings.

Keywords

Human genetics; Genomics; Genetic architecture; Complex traits; Next-generation sequencing; Targeted enrichment; Single nucleotide variants; SNVs; Single nucleotide polymorphisms; SNPs; Structural variation; Copy number variants; CNVs; Complex neuropsychiatric disorders; Schizophrenia; Autism; Genetic disorders; Psychiatry

Introduction

The individual and societal burdens of common, complex neuropsychiatric disorders are truly profound [1]. One of the major goals of contemporary biomedical research is to elucidate those disease mechanisms that underlie complex neuropsychiatric disorders like autism spectrum disorder (ASD) or schizophrenia (SZ). The hope is that an understanding of the pathogenesis of these disorders will enable the development of new treatments for those patients already affected, and new preventatives for those who are not. Because susceptibility to neuropsychiatric disorders is influenced by variation in both genes and environmental exposures, both genetic and epidemiological studies can help uncover novel disease mechanisms.

Our review focuses on what genetic studies of complex neuropsychiatric diseases have revealed about their genetic architecture, with a particular emphasis on studies of schizophrenia and autism. We divide the review into four main sections that reflect the technologies, experimental designs, and hypotheses tested in both recent and ongoing genetic studies of complex neuropsychiatric disorders. The first section discusses the recent history of human genetic studies, which have focused on the contribution of common variation to the risk of complex diseases. The second examines the contributions of exceptionally rare variants with large effects on disease risk. The third section addresses how the rapid development of next-generation sequencing and the targeted enrichment of eukaryotic genomes are contributing to studies of complex traits. Finally, the last section focuses on challenges facing the application of next-generation sequencing in both research and clinical translational applications.

Common Genetic Variation and Complex Disease

For human genetic studies, the decade after the initial sequencing and analysis of a human reference genome has been a revolutionary one [2, 3]. The scaffold that a reference genome provided allowed us to catalog one class of human genetic variation: single nucleotide polymorphisms (SNPs). Subsequent studies dramatically reduced the cost of genotyping genome-wide collections of hundreds of thousands of SNPs, while at the same time developing a map of the patterns of statistical correlation among common SNP variants, referred to as linkage disequilibrium, in the HapMap project [4, 5]. Furthermore, theoretical predictions suggested that a classic experimental design derived from epidemiology, a case–control association study, would have more statistical power than traditional genetic family-based linkage studies [6]. With these technologies in hand, genome-wide studies of complex disorders became feasible.

From this point, the conceptual framework and the types of experiments pursued were driven by both the availability of high-throughput genotyping platforms developed during the HapMap project and assumptions about the genomic architecture of common complex disorders [7]. The initial application of these technologies focused on an experimental design called the genome-wide association study (GWAS). The human genetic GWAS “industry” set out to test the hypothesis that common variants (those with >5 % frequency in the human population) with large effects contributed significantly to the risk of disease. The essential idea was that common disease-causing variants, which were expected to be found at an elevated frequency in cases as compared to matched controls, would either be genotyped directly or be in linkage disequilibrium with common SNPs, thereby allowing them to be discovered. This approach was successful in identifying many novel loci that contribute to a wide variety of complex diseases [8–10].
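The case–control comparison described above can be illustrated with a minimal sketch. It assumes that allele counts for a single SNP have been tabulated in a 2×2 table (alternate versus reference allele, cases versus controls) and computes an allelic odds ratio with a Pearson chi-square test. The function names and the counts are hypothetical, chosen only for illustration; actual GWAS pipelines add quality control, covariate adjustment, and correction for population structure that are not shown here.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (odds_ratio, chi_square) for a 2x2 table of allele counts."""
    # Odds of carrying the alternate allele in cases relative to controls.
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)

    # Pearson chi-square for a 2x2 table (1 degree of freedom,
    # no continuity correction).
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = (
        (case_alt + case_ref)
        * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt)
        * (case_ref + ctrl_ref)
    )
    return odds_ratio, numerator / denominator


def chi_square_p_value_1df(chi_square):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2.0))


# Hypothetical counts: 1,000 cases and 1,000 controls (2,000 alleles each).
odds_ratio, chi_square = allelic_association(600, 1400, 500, 1500)
p_value = chi_square_p_value_1df(chi_square)
print(f"OR = {odds_ratio:.3f}, chi-square = {chi_square:.2f}, p = {p_value:.2e}")
```

In a genome-wide setting this test would be repeated at every genotyped SNP, which is why the significance threshold applied to each individual p-value has to be far more stringent than 0.05.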

Unfortunately, the results of multiple genome-wide association studies of common neuropsychiatric disorders have been far more modest. Studies of schizophrenia (SZ) revealed only a few loci that exceed genome-wide levels of statistical significance, while the effect sizes of the variants uncovered were remarkably small [11–15]. Moreover, only a modest amount of the total heritability of SZ has been accounted for, in contrast to other complex traits, such as human height [16–18]. Similarly for autism, multiple GWAS identified a few loci of very small effect [19–21]. A subsequent meta-analysis suggested that finding any genetic variants with an odds ratio greater than 1.5 for autism is extraordinarily unlikely [22]. For neuropsychiatric disorders, therefore, the effect sizes of the variants identified have been disappointingly small, particularly when compared to GWAS of other complex human disease traits.

It is in fact the small effect size of common variants that is the most striking finding from nearly all genome-wide association studies of complex diseases. Together, these studies have soundly rejected the hypothesis that common variants with large effects underlie the vast majority of complex human diseases. Thus, while one can argue that the GWAS approach has been a success, these studies have revealed that the genetic architecture of most complex diseases is unlike that seen in cystic fibrosis or sickle cell anemia, where common alleles with very large effects account for most of the disease prevalence in human populations. This finding is particularly relevant for complex neuropsychiatric disorders like autism spectrum disorders and schizophrenia. Furthermore, although there are statistically compelling associations between common SNPs and diseases [23, 24], SNPs alone are unable to account for the full heritability of complex traits. The fact that a substantial proportion of the estimated heritability of these traits remains unexplained points to other classes of genetic variants that have yet to be discovered [25–27].

Rare Genetic Variation and Complex Disease

In retrospect, perhaps this outcome should have been less surprising. Theoretically, it has long been recognized that both common and rare variation likely contribute to the genetic architecture of complex traits [7, 28–35]. In addition, genome-wide association studies of common variants were pursued for the simple reason that technological advances made this experiment feasible. Direct sequencing of genomes to identify the contribution of rare variants in large numbers of patient samples faced daunting technological challenges and excessive costs that simply made such studies impractical.

While large-scale genotyping of SNPs for GWAS was underway, similar genome-wide technologies led to the discovery of widespread variations in copy number across the human genome [36–40]. This class of genetic variation, consisting of deletions and duplications larger than 100 kb, was surprisingly frequent. Although clinical geneticists had long recognized that cytologically visible, and usually much larger, chromosomal changes were associated with rare human diseases, the developing technologies allowed the discovery of smaller copy number variants (CNVs) that were not observable using classic cytological approaches. The role of this structural variation in human disease became an immediate focus [41, 42].

Soon after, the discovery of an elevated frequency of CNVs in patients with schizophrenia hinted at an explanation for the great heterogeneity of the disorder [15, 43–48]. Similar findings also came to light for autism [49–54], as well as for both schizophrenia and autism [55]. Nevertheless, the apparently pathogenic CNVs discovered to date are in general large and very rare in the population, which means they alone are unable to explain all of the missing heritability.

Next-Generation Sequencing, Targeted Enrichment, and Complex Disease

Comprehensive sequencing of human genomes would no doubt be better at capturing the allelic architecture of complex diseases than the genotyping of common variants in GWAS or the detection of rare, large CNVs with methods like array comparative genomic hybridization, but even in the recent past this has been cost-prohibitive. Recent advances in next-generation sequencing (NGS), however, have increased throughput while decreasing costs, so this barrier is eroding quickly [56–63] (see review in [64]). Combining NGS with methods that can enrich for portions of complex eukaryotic genomes has made it feasible to pursue other types of genetic variation underlying complex disease traits.

The initial application of these technologies focused on whole-exome sequencing, which involves sequencing the 1 % of the human genome that codes for proteins, in the context of diseases caused by mutations at single loci (so-called Mendelian diseases) [65–67]. Application of these approaches to schizophrenia uncovered a role for de novo mutations in the etiology of the disorder [68, 69]. A more recent study suggested that many of the variants contributing to schizophrenia must be very rare and have yet to be discovered [70]. Whole-exome sequencing studies of autism also point to a role for de novo mutations in autism phenotypes [71–75]. Targeted studies of the X chromosome in males with autism, an attractive target given the 4:1 preponderance of males affected with the disorder, have revealed a number of putative autism susceptibility loci [76–79]. More recently, a combination of targeted enrichment of the X chromosome exome and next-generation sequencing identified the AFF2 locus as having a significantly larger number of rare missense mutations in those with autism versus unaffected controls [80].

The clear message from all these studies is that exome sequencing can detect a broader allelic spectrum of complex neuropsychiatric disorders like schizophrenia and autism. As whole-genome sequencing becomes more and more cost effective, the field is bound to move toward this experimental design, which can reduce biases in ascertainment and make it possible to discover the full diversity of genetic variation.

Challenges Facing Next-Generation Sequencing and Complex Diseases

Next-generation sequencing and genomic enrichment technologies promise to detect both common and rare variants, thereby giving us a better understanding of the genetic architecture of complex diseases, yet there are a number of substantial challenges facing the application of these technologies in both research and, ultimately, the clinic (see reviews in [81, 82]). We believe these challenges fall into three main categories: accurate identification of genetic variation, efficient analysis of next-generation sequencing data, and interpreting the functional effects of genetic variation.

Accurate Identification of Genomic Variation

Next-generation sequencing (NGS) technology platforms (Illumina, Roche 454, ABI SOLiD, Ion Torrent) have higher error rates in individual sequence reads than conventional Sanger sequencing. These errors could be systematic and significant enough to yield false-positive variant calls and associations, as well as obscure actual associations. Making even more errors possible are biases that arise in coverage: GC-rich genomic regions tend to have lower sequencing coverage. Further, enrichment technologies add another layer of possible errors, especially those methods that select sequences by hybridization to a complementary oligonucleotide. The existence of gene families and other repetitive regions implies that multiple genomic regions can be captured and enriched by a single oligonucleotide substrate targeted at a specific region. Finally, errors in mapping sequence against a human genome reference sequence can lead to the misidentification of genetic variants. This is of particular concern because the human genome reference sequence is an idealized genome, and undetected variation among individuals can lead to spurious outcomes in the mapping and identification of genomic variation [83].

The fact that the human genome is very large (~3 × 10^9 base pairs, or 3000 Mb) implies that extraordinary accuracy is necessary to identify variant sites. Even modest error rates of 1 %, for instance, could impugn the validity of association studies [84]. As a simple example, if we consider only SNPs, we expect approximately 3 million variant sites per genome. As shown in Table 1, in a scenario that surely underestimates the possible sources of error, unless algorithms calling variant sites are exceptionally accurate, the result will be an enormous number of false-positive findings.
Table 1

Expected number of errors in human whole-genome sequencing

| Error rate | Expected number of errors (3000-Mb human genome) | Expected number of variant sites (per human genome) | Expected proportion of false-positive variant sites |
|---|---|---|---|
| 1.0E-3 | 3,000,000 | 3,000,000 | 0.5 |
| 1.0E-4 | 300,000 | 3,000,000 | 0.1 |
| 1.0E-5 | 30,000 | 3,000,000 | 0.01 |
| 1.0E-6 | 3,000 | 3,000,000 | 0.001 |
| 1.0E-7 | 300 | 3,000,000 | 0.0001 |
| 1.0E-8 | 30 | 3,000,000 | 0.00001 |
| 1.0E-9 | 3 | 3,000,000 | 0.000001 |
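The arithmetic behind Table 1 can be reproduced with a short sketch. It assumes, as one plausible reading of the table, that every sequencing error produces a spurious variant call and that the false-positive proportion is the number of errors divided by the sum of errors and true variant sites. That reading gives exactly 0.5 for the first row; the remaining published values appear to be rounded to the nearest order of magnitude. The constant and function names are illustrative only.

```python
GENOME_SIZE_BP = 3_000_000_000   # ~3000 Mb, as stated in the text
TRUE_VARIANT_SITES = 3_000_000   # ~3 million SNP sites per genome


def false_positive_summary(error_rate,
                           genome_size=GENOME_SIZE_BP,
                           true_variants=TRUE_VARIANT_SITES):
    """Return (expected_errors, false_positive_proportion) for an error rate.

    Assumption: each error yields one false variant call, so the proportion
    of called sites that are false is errors / (errors + true variants).
    """
    expected_errors = error_rate * genome_size
    proportion = expected_errors / (expected_errors + true_variants)
    return expected_errors, proportion


# Walk through the per-base error rates listed in Table 1.
for exponent in range(3, 10):
    rate = 10.0 ** -exponent
    errors, proportion = false_positive_summary(rate)
    print(f"{rate:.1e}  errors={errors:,.0f}  false-positive proportion={proportion:.6f}")
```

Under this assumption, a per-base error rate of roughly 1.0E-6 or lower is needed before false calls fall to about one in a thousand of all called variant sites.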

Efficient Analysis of Next-Generation Sequencing Data

Assuming we are able to accurately identify variant sites, the next step is functionally annotating those same sites so we can focus on those most likely to contribute to disease. Genomic variation identified with NGS technologies needs to be annotated to establish its type, genomic region, the evolutionary conservation of its site, and whether it has prior characterization. A typical whole-genome association study of a population would yield millions of variants, data that cannot realistically be characterized using public web genome browsers because of the huge effort involved. One early solution to this problem was the open source Sequence Annotator, or SeqAnt [85]. Researchers will have to rely increasingly on such high-performance annotation tools to analyze the sequencing data generated from large sequencing studies [86].

Interpreting the Functional Effects of Genetic Variation

Ultimately, the goal of association studies is to link genetic variants to phenotypes through statistical tests that show significant connections (effect sizes) between the discovered variant and the phenotype of the disorder. In those cases of Mendelian disorders, where single variants can account for a given disease phenotype, interpreting the functional effects of genetic variation is in many cases easier.

For complex neuropsychiatric disorders, on the other hand, the challenge is far more formidable [81]. When performing statistical testing of association of hundreds of thousands of variants in a genome-wide study, one immediately confronts the multiple testing problem, which is simply that, when performing a very large number of tests, the expected number of findings that exceed a nominal threshold of 0.05 will be substantial. Statistical methods like a Bonferroni correction or permutation can be used to control this issue, such that only very significant association signals are selected and false positives are reduced [87]; however, the extent to which false-negative findings are increased by these approaches remains unclear and difficult to determine.
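The scale of the multiple testing problem, and the effect of a Bonferroni correction, can be made concrete with a small sketch. The figure of 500,000 tests is a hypothetical stand-in for the "hundreds of thousands of variants" mentioned above, and the function names are illustrative. The calculation assumes that every tested variant is truly null; Bonferroni controls the family-wise error rate without requiring the tests to be independent, though it becomes conservative when they are correlated.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of null tests that pass a nominal threshold by chance."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that survives Bonferroni correction for len(p_values) tests."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p <= threshold for p in p_values]


n_tests = 500_000  # hypothetical number of variants tested
print("Expected chance findings at 0.05:", expected_false_positives(n_tests))
print("Bonferroni per-test threshold:", bonferroni_threshold(n_tests))
```

The stringency of the corrected threshold is what drives the trade-off noted above: a true association of small effect must reach a very small p-value to be reported, so false negatives become more likely unless sample sizes are large.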

Beyond this, the extent of genomic variation, perhaps much of it having no impact on the patient’s phenotype, provides a stark challenge to interpreting the effects of genetic variation. Figure 1, for example, shows a summary of single nucleotide variants discovered through targeted sequencing of the genomic region containing the FMR1 and AFF2 loci in 144 boys with autism. A striking aspect of the figure is the enrichment for rare variants not seen before in public databases like dbSNP (Fig. 1). Population genetic models predicted that most variants will be rare, and recent genome-wide empirical studies have clearly established the vast excess of rare, previously unseen variants in human populations [88, 89]. As a result, to gain sufficient statistical power to identify genetic variants contributing to complex diseases, very large patient collections, on the order of 10,000 or more, are likely to be required.
Fig. 1

Summary of single nucleotide variant (SNV) and insertion/deletion (indel) variation discovered at the FMR1 and AFF2 loci in males with autism spectrum disorder. The frequency of SNVs and indels (minor alleles) in cases is plotted against their level of evolutionary conservation. Population genetic theory predicts, and empirical data have now confirmed, that most genetic variation is rare. The observation of evolutionary conservation between species at a given site in the human genome implies that genetic variation at this site is deleterious to individual fitness and is therefore rapidly removed from the human population. Hence, it follows that disease-causing mutations are expected to be enriched among the class of rare variants found at highly evolutionarily conserved sites in the human genome. Most common variation has already been discovered and exists in public databases like dbSNP (blue; circles and diamonds). In contrast, most of the rare variation at both loci was not contained in public databases (red; circles and diamonds).

Another approach that could help meet the statistical challenges of association studies is biological network and pathway analysis. Growing knowledge of biological pathways enables statistical methods that use this information to discern associations between patterns of genetic variation and gene networks or pathways. Testing whole pathways and networks of genes across the groups being compared may provide more power to associate genes with a disease phenotype than the single-variant comparison approach [90]. Still, these approaches presuppose knowledge of important pathways, and may not be the best way to uncover novel pathways or the action of mutant alleles that act outside of canonical pathways.
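One simple form of pathway-level testing is over-representation analysis, sketched below. This is offered only as an illustration of the general idea and is not necessarily the method used in [90]. It assumes a list of implicated genes has already been produced and asks whether a predefined pathway contains more of them than expected by chance under a hypergeometric model; all names and counts are hypothetical.

```python
from math import comb


def pathway_enrichment_p(total_genes, pathway_genes, hit_genes, hits_in_pathway):
    """One-sided hypergeometric p-value for pathway over-representation.

    Returns the probability of seeing at least `hits_in_pathway` pathway
    members among `hit_genes` implicated genes drawn without replacement
    from `total_genes`, of which `pathway_genes` belong to the pathway.
    """
    max_overlap = min(pathway_genes, hit_genes)
    outside_pathway = total_genes - pathway_genes
    # Sum the upper tail of the hypergeometric distribution. math.comb
    # returns 0 when k exceeds n, so impossible overlaps contribute nothing.
    tail = sum(
        comb(pathway_genes, k) * comb(outside_pathway, hit_genes - k)
        for k in range(hits_in_pathway, max_overlap + 1)
    )
    return tail / comb(total_genes, hit_genes)


# Hypothetical example: 20,000 genes, a 200-gene pathway, 100 implicated
# genes, 6 of which fall inside the pathway (about 1 expected by chance).
print(pathway_enrichment_p(20_000, 200, 100, 6))
```

Because this test depends entirely on the pathway definitions supplied to it, it shares the limitation noted above: it cannot detect genes acting outside the annotated pathways.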

Finally, it is worth noting that the ultimate demonstration of causation will almost certainly fall beyond purely statistical methodologies. It may become necessary, and important if we are to understand fundamental disease mechanisms, to perform direct functional testing of variants in vitro or in model organisms in vivo. These experiments are far lower throughput than the original sequencing at the present time and likely represent a future bottleneck in our efforts to understand the genetic contribution to complex diseases. Furthermore, as we explore more deeply traits influenced by the actions of many genes, we will also need to more carefully examine the effects of environmental variation on the human traits of interest. In essence, DNA sequence and variation information is context-dependent, and to understand mechanisms of disease, we would ideally perform studies that can take into account both genomic variation information and putative environmental exposures.

Conclusions

The dramatic increase in the whole-exome and whole-genome sequencing of large numbers of individuals has revealed more genetic variation between individuals than was previously suspected, as well as evidence for a higher incidence of rare and private variations in individuals within subpopulations. In a recent review on genetic variability among humans, Olson emphasized that, although a number of different evolutionary and demographic forces act to influence human genomic variation, population genetics studies and, more recently, deep sequencing point to mutation-selection balance as having the greatest impact on the genetic predisposition to disease [91].

GWAS of neuropsychiatric disorders have unequivocally shown that common variants with large effects do not underlie schizophrenia or autism. While statistical analyses of these complex disorders are consistent with the action of a very large number of common alleles of small effect, they are unable to account for the entire estimated heritability of the disorders. At the same time, while rare pathogenic CNVs can account for nearly all Mendelian forms of complex neuropsychiatric illnesses, because they are so rare in the general population, alone they cannot explain all of the missing heritability. Now, with the rapid advances and reduced costs of whole-genome sequencing, human geneticists will finally be able to more comprehensively uncover all classes of human genetic variation in large patient populations. Sifting through these enormous datasets will undoubtedly pose a stiff challenge for human geneticists, particularly given the tremendous heterogeneity and complexity underlying neuropsychiatric illnesses like autism and schizophrenia. There is little doubt that better integration of genomics with collaborative studies in physiology, biochemistry, and epidemiology is vital if we are to truly understand disease mechanisms and develop innovative methods of prevention and treatment for these devastating disorders.

Conflict of Interest

M. Ezewudo: none; M.E. Zwick: grant from National Institutes of Health/National Heart, Lung, and Blood Institute, and consultant to Henry M. Jackson Foundation.

Copyright information

© Springer Science+Business Media New York 2013