Molecular Neurobiology, Volume 49, Issue 1, pp 601–614

Introduction to Deep Sequencing and Its Application to Drug Addiction Research with a Focus on Rare Variants


  • Shaolin Wang
    • Department of Psychiatry & Neurobiology Science, University of Virginia
  • Zhongli Yang
    • Department of Psychiatry & Neurobiology Science, University of Virginia
    • Shanxi Key Laboratory of Environmental Veterinary Science, Shanxi Agricultural University
  • Jennie Z. Ma
    • Department of Public Health Sciences, University of Virginia
  • Thomas J. Payne
    • ACT Center for Tobacco Treatment, Education and Research, Department of Otolaryngology and Communicative Sciences, University of Mississippi Medical Center
    • Department of Psychiatry & Neurobiology Science, University of Virginia

DOI: 10.1007/s12035-013-8541-4

Cite this article as:
Wang, S., Yang, Z., Ma, J.Z. et al. Mol Neurobiol (2014) 49: 601. doi:10.1007/s12035-013-8541-4


Through linkage analysis, candidate gene approaches, and genome-wide association studies (GWAS), many genetic susceptibility factors for substance dependence have been discovered, such as the aldehyde dehydrogenase 2 gene (ALDH2) for alcohol dependence (AD) and nicotinic acetylcholine receptor (nAChR) subunit variants on chromosomes 8 and 15 for nicotine dependence (ND). However, these confirmed genetic factors account for only a small portion of the heritability of each addiction. Among many potential explanations, rare variants in both identified and unidentified susceptibility genes are thought to contribute substantially to the missing heritability. Several studies focusing on rare variants have taken advantage of next-generation sequencing technologies and revealed that some rare variants of nAChR subunit genes are associated with ND in both genetic and functional studies. However, these studies investigated variants in only a small number of genes and need to be expanded to broader regions/genes in larger populations. This review presents an update on recently developed methods for rare-variant identification and association analysis and on studies of rare-variant discovery and function related to addictions.


Rare variants; Next-generation sequencing; Drug addiction


Substance abuse and addiction pose significant threats to public health worldwide. According to World Health Organization estimates, there were 2 billion alcohol users, 1.3 billion tobacco users, and 230 million illicit drug users worldwide in 2004 [1]. Worldwide, the harmful use of alcohol results in 2.5 million deaths each year; in the USA, cigarette smoking accounts for 30 % of deaths from cancers and nearly 80 % of deaths from chronic obstructive pulmonary disease (COPD) [2, 3]. Globally, cigarette smoking kills more than 6 million people each year, and about 0.7 % of the global burden of disease is attributable to illicit drugs, with the social cost of illicit substance use in the region of 2 % of GDP in those countries that have measured it [1].

Epidemiologic studies have found that many individuals become addicted to multiple drugs after initiating use of one drug [4, 5]. Evidence from family, twin, and adoption studies strongly implicates genetic factors in each step of addiction, including vulnerability to initiation, continued use, and propensity to become dependent [6]. A family study has shown that siblings of drug-abusing and drug-dependent probands have an approximately 1.7-fold higher risk of developing marijuana dependence, cocaine dependence, or habitual smoking than siblings of nondependent individuals [7]. Further, twin studies suggest that shared environmental influences contribute more to the availability of and exposure to a substance, as in smoking initiation, whereas genetic factors have greater effects on the progression from smoking to ND [4]. Many large twin studies of alcohol-related behaviors have consistently shown that the heritability of alcohol abuse and dependence ranges from 50 % to 70 % [8]. A meta-analysis of twin studies shows that both genetics and environment are important in smoking-related behaviors, with an estimated average heritability of 0.50 for smoking initiation and 0.59 for ND [9]. Several family studies of illicit drug use estimate heritability at 30 % to 80 % [8, 10, 11]. Finally, it is worth noting that a heritability estimate is specific to the sample under study. Thus, the role of genetic influences may differ across samples, and heritability can be affected by many factors such as sex, age, education, socioeconomic status, and cultural background.

Identifying the Genetic Risk Factors

At least two linkage studies, the Collaborative Study on the Genetics of Alcoholism (COGA), conducted on multigenerational pedigrees densely affected by alcoholism, and the NIAAA linkage study of a homogeneous population from a southwestern Native American tribe, revealed that susceptibility loci on chromosome 4 increase the risk of alcohol dependence (AD) [12–14]. However, the two linkage peaks were not located in the same genomic region. The linkage peak identified in COGA, from the general US population, was located on chromosome 4q close to the alcohol dehydrogenase (ADH) gene cluster, whereas the NIAAA study detected the linkage signal close to the GABRA2–GABRB1 cluster. Both findings were confirmed in other independent studies [6, 15–17]. Several GWAS have been conducted for AD, but only one reported variants that reached genome-wide significance in the combined GWAS and follow-up replication studies: two correlated intergenic single nucleotide polymorphisms (SNPs) on chromosome 2q35 (rs7590720, P = 9.72 × 10⁻⁹; rs1344694, P = 1.69 × 10⁻⁸) [18]. Other investigators did not observe a significant association with AD. The genes encoding alcohol-metabolizing enzymes are the best-studied candidate genes for AD, and coding variants of ADH1B, ADH1C, and ALDH2 are well characterized in relation to the AD phenotype, such as ADH1B Arg48His [19, 20], ADH1C Ile350Val [21], and ALDH2 Glu504Lys [22].

To identify susceptibility loci for ND, more than 20 genome-wide linkage analyses have been conducted in different populations with a variety of ND assessments, including smoking initiation (SI), smoking quantity (SQ), heaviness of smoking index (HSI), Fagerström test for nicotine dependence (FTND), ever-smoking, habitual smoking, cigarettes per day (CPD), and maximum number of cigarettes smoked in a 24-h period [23]. Multiple regions, located on chromosomes 3–7, 9–11, 17, 20, and 22, have shown “significant” or “suggestive” linkage [23, 24]. Four regions, on chromosomes 9q, 10q, 11p, and 17p, have been replicated in at least four independent studies [25–29]. Eight regions, on chromosomes 1, 5, 10, 11, 12, 16, 20, and 22, have been nominated as significant loci for ND-related phenotypes by reaching either genome-wide significance or a theoretical linkage threshold [24, 28, 30–35]. Recent GWAS and candidate gene studies (Table 1) have identified several SNPs in the nAChR subunit genes on human chromosomes 15q (CHRNA5–CHRNA3–CHRNB4) and 8p (CHRNA6–CHRNB3) that influence the risk of ND, as defined by FTND and CPD [36–45]. Further, an important nicotine metabolism gene, CYP2A6, was reported to be significantly associated with CPD in a recent GWAS [41]. In addition, candidate gene-based association studies revealed that DRD2/ANKK1 [46–48], NRXN1 [49, 50], DBH [51], and BDNF [51, 52] showed strong associations with CPD, FTND, or smoking initiation/cessation, results that have been replicated in at least two independent studies.
Table 1 A list of replicated SNPs and genes for drug addictions from GWAS and candidate gene studies

rs7590720, rs1344694
rs1229984 (R48H); AD, esophageal cancer, upper aerodigestive-tract cancer; [19, 20]
rs698 (I350V)
rs671 (E504K); AD, alcoholic liver disease, cirrhosis, pancreatitis; [19, 22]
rs1800497 (Taq1A)
[49, 50]
CPD, FTND, lung cancer; [41, 44, 48]
Smoking cessation; [48, 51]
rs1028936, rs1329650
Smoking initiation; [51, 52]
rs2734849 (Arg/His), rs4938012; [46, 47]
rs4245150, rs17602038
rs588765, rs578776, rs16969968 (D398N), rs6495308, rs55853698, rs2036527; Chronic pulmonary disease; [6, 127, 128]
rs1051730, rs6495308
Chronic pulmonary disease
Chronic pulmonary disease
rs1799971 (A118G), rs1799972 (A6V); Opioid addiction, heroin addiction; [58, 61–63]
Opioid addiction, heroin addiction; [58, 60, 63]

The SNPs listed for AD and ND were reported either in GWAS that reached genome-wide significance or in candidate gene-based association studies that remained significant after correction and were replicated in independent studies

For illicit drugs, few linkage analyses have been conducted, and most of the results have not been replicated. The first linkage study of cannabis dependence was conducted in a linked set of family, twin, and adoption studies and revealed suggestive linkage on chromosomes 3q and 9q for cannabis dependence vulnerability [53]. Two loci, on chromosomes 16 and 19, were linked to a severe cannabis use/antisocial subtype in a Native American community study [54]. For other illicit drugs, significant linkage peaks have been identified on chromosomes 9 and 12 for cocaine dependence [55], on chromosome 17 for a heavy opioid-use cluster-defined trait [56], and on 14q for DSM-IV opioid dependence [57]. Li and Burmeister [6] summarized that regions on chromosomes 2–5, 7, 9–11, 13, 14, and 17 have independent evidence of “suggestive” or “significant” linkage, with regions on chromosomes 4, 5, 9, 10, 11, and 17 receiving the strongest support for harboring susceptibility genes for addictions to multiple drugs. Multiple candidate gene studies revealed OPRM1, OPRD1, and several other genes to be associated with opioid dependence [58–62]. A few GWAS have also been conducted on illicit drug addiction; however, none of their findings reached genome-wide significance [63–65]. Although genetic studies have succeeded in identifying a number of common variants significantly associated with substance abuse, these variants account for only a small portion of the phenotypic variance related to substance abuse, so further investigation is needed to account for the unexplained phenotypic variance.

Common vs. Rare Variants

In recent years, GWAS approaches have been used extensively to study complex human traits; however, most common variants identified with this approach explain only a small proportion of the genetic variation, which has sparked an intense debate about the common disease/common variant hypothesis. For example, in a type 2 diabetes study with a discovery sample of 10,128 and a replication sample of 53,975, the 18 common variants significantly associated with the disease appear to explain only about 6 % of the increased risk of disease among relatives [66, 67]. A meta-analysis of schizophrenia GWAS that included 8,008 cases and 19,077 controls identified only seven significant SNPs, some in high linkage disequilibrium (LD) with each other and each with an odds ratio below 1.3, despite an estimated heritability of 80–85 % for schizophrenia [68]. This “missing heritability problem” suggests that the architecture typically observed in crosses or pedigrees, in which a few dozen loci with moderate effects and intermediate frequencies each explain part of the disease risk, does not hold in populations. Several hypotheses have since been proposed: (1) a large number of small-effect common variants across the entire allele frequency spectrum (the infinitesimal model) [69]; (2) a large number of large-effect rare variants (the rare-allele model) [70]; and (3) some combination of genotypic, environmental, and epigenetic interactions (the broad-sense heritability model) [71, 72].

The rare-variant hypothesis [67, 73] proposes that a significant proportion of the inherited susceptibility to relatively common human chronic diseases is attributable to the summed effects of a series of low-frequency, dominantly and independently acting variants of several genes, each conferring a moderate but readily detectable increase in relative risk [74]. Such rare variants will mostly be population-specific because of founder effects resulting from genetic drift [74]. Furthermore, evolutionary theory predicts that disease alleles should be rare: under purifying selection, disease-promoting variants are prevented from drifting to a higher frequency in the population. Population data from exome sequencing show that non-synonymous coding variants are significantly skewed toward low frequencies, which is consistent with purifying selection. However, the rare-variant hypothesis has its limitations, as there is as yet no evidence that rare variants make a large contribution to the genetic variance not detected by GWAS. Although rare variants may not resolve all the issues, they do provide a strong complement to the functional evidence that cannot be explained by the common variants identified in most GWAS. As more and more subjects undergo whole-genome or whole-exome sequencing, the cumulative excess of rare coding variants with substantial effect sizes remains to be discovered. Considerable resolution of the burden of deleterious rare variants is likely to emerge in the next few years as whole-exome and whole-genome sequencing ramps up [75, 76].

Rare-Variant Discovery and Next-Generation Whole-Genome and Whole-Exome Sequencing

After completion of the HapMap and 1000 Genomes Projects, millions of novel SNPs were deposited in public databases [77–79]. However, more than 95 % of the variants discovered in the 1000 Genomes Pilot Project are common (MAF ≥ 5 %); low-frequency (1–5 %) and rare (<1 %) variants remain poorly characterized [78]. These low-frequency and rare variants are highly enriched for potentially functional mutations such as protein-altering variants. The second phase of the 1000 Genomes Project sequenced more than 1,000 genomes from 14 populations from Europe, East Asia, sub-Saharan Africa, and the Americas and captured as many as 98 % of accessible SNPs with a frequency of 1 % in related populations. These data enable analysis of common and low-frequency variants in individuals from diverse populations, including admixed ones [78]. The combination of extensive low-coverage whole-genome sequencing data, deep-coverage exome sequencing data, and high-density SNP genotyping data allowed the identification of 10,000–50,000 potentially functional variants with MAF <5 % per individual in each population [78].
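The frequency classes used above (common, MAF ≥ 5 %; low-frequency, 1–5 %; rare, <1 %) can be written down as a small helper. This is only an illustrative sketch: the function names are hypothetical, and the convention of assigning a MAF of exactly 1 % to the low-frequency class is an assumption, since the text does not state how boundary values are handled.

```python
def minor_allele_frequency(alt_allele_count: int, n_individuals: int) -> float:
    """MAF from the number of alternate alleles seen in n diploid individuals."""
    freq = alt_allele_count / (2 * n_individuals)
    return min(freq, 1.0 - freq)


def classify_variant(maf: float) -> str:
    """Assign a variant to the frequency classes quoted in the text.

    Assumption: boundary values go to the higher class (1 % -> low-frequency).
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie in [0, 0.5]")
    if maf >= 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"


# Hypothetical allele counts in a sample of 1,000 diploid individuals.
for alt_count in (3, 40, 150):
    maf = minor_allele_frequency(alt_count, 1000)
    print(alt_count, round(maf, 4), classify_variant(maf))
```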

Both whole-genome and whole-exome sequencing have been successful in identifying potentially functional variants of low frequency. Whole-genome sequencing can be conducted readily with the standard library preparation protocols of the available sequencing platforms. Among current next-generation sequencing technologies, the Illumina HiSeq 2000 and Lifetech SOLiD 5500XL are the two most powerful high-throughput sequencing platforms. The Illumina HiSeq 2000 has the highest throughput (600 GB/run) and the lowest price per GB of data (~$100/GB). The Lifetech SOLiD 5500XL has the flexibility to fit small to large whole-genome sequencing projects; however, the cost of high-coverage (~30×) whole-genome sequencing is still an obstacle to processing thousands of individuals. Low-coverage whole-genome sequencing has been adopted because a simulation study showed that, assuming disease-associated variants have a frequency greater than 0.2 %, sequencing 3,000 individuals at 4× depth provides power similar to that of deep sequencing of >2,000 individuals at 30× depth but requires only ~20 % of the sequencing effort [80]. Low-coverage whole-genome sequencing can also be used to build a reference panel for imputing additional samples and improving coverage across the genome to increase power.
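The ~20 % figure follows directly from the product of sample size and depth, as the short calculation below shows. The second function is an added illustration, under the simplifying assumptions that a site is covered by exactly `depth` reads, that each read samples either allele of a heterozygote with probability 0.5, and that there are no sequencing errors; it indicates why individual genotypes are less reliable at 4× than at 30×. All names are hypothetical.

```python
def sequencing_effort(n_individuals: int, depth: float) -> float:
    """Total effort in genome-equivalents (individuals x average depth)."""
    return n_individuals * depth


def prob_alt_allele_seen(depth: int) -> float:
    """P(at least one read carries the alternate allele at a heterozygous site).

    Simplified model: exactly `depth` error-free reads, each drawn from
    either allele with probability 0.5.
    """
    return 1.0 - 0.5 ** depth


low_coverage = sequencing_effort(3000, 4)    # 3,000 individuals at 4x
deep_coverage = sequencing_effort(2000, 30)  # 2,000 individuals at 30x
effort_ratio = low_coverage / deep_coverage

print(f"low-coverage design:  {low_coverage:.0f} genome-equivalents")
print(f"deep-coverage design: {deep_coverage:.0f} genome-equivalents")
print(f"relative effort:      {effort_ratio:.0%}")
print(f"P(alt allele seen) at 4x:  {prob_alt_allele_seen(4):.4f}")
print(f"P(alt allele seen) at 30x: {prob_alt_allele_seen(30):.10f}")
```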

Exome sequencing is an alternative approach to whole-genome sequencing that sequences only the coding and UTR regions. It is intended to identify functional variants likely to change amino acid sequences and gene expression patterns. The cost of exome sequencing is much lower, as it covers less than 2 % of the whole genome. However, exome sequencing requires extra capture enrichment steps prior to sequencing. Currently, there are two capture methods, array-based and probe-based. Array-based capture was the first exome capture technology to be developed; it hybridizes targeted fragments to oligonucleotides synthesized on a microarray. Nimblegen SeqCap is the only array-based exome enrichment kit. The limitations of this technology include the need for expensive hardware as well as the relatively large amount of DNA required (10–15 μg). The probe-based capture method was developed subsequently; it pools custom oligonucleotides and conducts the hybridization in solution, which not only simplifies the hybridization process but also requires much less DNA (~3 μg). The oligonucleotides used for hybridization can be either DNA or RNA. Agilent SureSelect, Illumina TruSeq, and Lifetech TargetSeq are the three available probe-based exome capture kits. The Agilent SureSelect kit is an RNA probe-based capture kit that can be used on most high-throughput sequencing platforms, whereas both Lifetech TargetSeq and Illumina TruSeq are DNA probe-based capture kits that can be used only on their own proprietary platforms.

Recently, Flanigan et al. [81] compared the performance of three commercial exome capture kits on neuromuscular disorder (NMD) gene panels. Although 92 %–94 % of the known NMD exons were included in all three targeted regions, the actual capture results demonstrated that, at best, 60 % of these exons obtained 100 % coverage. The best-performing kit, Agilent SureSelect (v3), captured an average of 92.7 % of the bases in the NMD gene exons, but only 58 % of the NMD isoforms had all exons captured at 100 % (i.e., 42 % of the NMD genes had at least one exon with one base insufficiently captured to make a genotyping call). Illumina TruSeq captured 89.2 % of exons, but only 36.5 % of the genes had 100 % coverage. Nimblegen v2 captured 90.5 % of the genes, but only 46.3 % of the genes had 100 % sequence coverage.

Targeted Resequencing

Although the cost of next-generation sequencing has been substantially reduced within the last few years, the total cost can still be high when thousands of samples need to be sequenced at the genome/exome level to detect low-frequency or rare variants. An alternative approach is targeted resequencing of smaller regions or a few dozen genes, usually 10 kB to 10 MB. Prior to the adoption of next-generation sequencing, amplicon resequencing was conducted on traditional automated capillary-based Sanger sequencing platforms, such as the ABI3730, which produce accurate genotype calls for each individual but with relatively low throughput. Next-generation sequencing, along with a variety of capture enrichment methods, provides a powerful tool for sequencing targeted regions at relatively low cost. Several technologies have been developed that differ in targeted region size, capture enrichment method, and sequencing technology. Multiplex amplicon and DNA/RNA probe-based capture are the two most popular enrichment methods (Table 2). Multiplex amplicon sequencing can be applied to several genes and targeted regions <500 kB at relatively low cost compared with capture enrichment, and three companies offer this technology with different target sizes: Lifetech (<10 kB), Illumina (<500 kB), and RainDance (<10 MB). The RainDance multiplex amplicon can provide targeted resequencing of as much as 10 MB, although the cost is higher than that of other capture enrichment methods. However, RainDance has the advantage that many customized panels are available that have been widely adopted in the genetic diagnostics field, such as for cancer, autism spectrum disorder (ASD), and human leukocyte antigen identification [82, 83]. The sample preparation steps are highly automated and compatible with all sequencing platforms. For probe-based in-solution capture enrichment, the Agilent HaloPlex/SureSelect (1 kB–10 MB), Lifetech TargetSeq (100 kB–10 MB), and Illumina TruSeq (500 kB–10 MB) targeted resequencing kits use the same technology as the corresponding whole-exome capture kits, just with smaller capture sizes.
Table 2 Targeted enrichment/amplicon resequencing platform comparison

Target size (sequencing platform): 1–10 kB; HaloPlex (ILM, ION); Ion Xpress (ION); TruSeq (MSE, HSE); 10–500 kB; TargetSeq (ION, SOL); 200 kB–10 MB; SureSelect (ILM, SOL, ROC)
Probe (<10 MB); Amplicon (<10 kB), Probe (0.1–10 MB); Amplicon (<500 kB), Probe (0.5–25 MB); Amplicon (<10 MB)
Compatible with most high-throughput sequencing platforms, widely adopted; Work only with Ion Torrent or SOLiD system; Work only with MSE or HSE system; Many panels available: cancer, ASD, HLA, pharmacogenetic, etc.

ILM Illumina, SOL SOLiD, ION Ion Torrent, ROC Roche, MSE Illumina MiSeq, HSE Illumina HiSeq

The molecular inversion probe method is similar to in-solution hybridization capture; the only difference is that targets are captured by circularization, which gives it a higher capacity than multiplex amplicon methods [84]. OS-Seq is a new oligonucleotide-selective capture method in which genomic targets are captured and sequenced on the sequencer’s solid-phase support, such as the Illumina flow cell; this overcomes limitations of in-solution and array-based capture such as low capture efficiency. First, target-specific oligonucleotides (40-mers) are synthesized using the same method as for traditional microarrays and then immobilized on the flow cell; these “primer probes” serve as both capture probes and sequencing primers. Second, a single-adaptor library prepared from genomic DNA is added to the flow cell, where the desired targets are captured by the immobilized primer probes. Third, the captured library fragments are prepared for bridge amplification, clustered, and sequenced. The capture efficiency was much higher than with in-solution capture methods (~90 % vs. ~60 %) [84]; however, this technology may require extra front-end effort to optimize probe design, as no commercial platform is available at present.

DNA pooling was proposed as a strategy for reducing the cost of large-scale genotyping-based disease association studies [85, 86]. However, the difficulty of measuring allele frequencies accurately from intensity data has limited the use of this strategy. Unlike pooled genotyping, pooled DNA sequencing not only provides digital allele counts for each variant but also detects novel sequence variants. Several recent studies have demonstrated the potential of pooled sequencing on next-generation platforms for identifying disease-associated rare mutations [87, 88]. The pooling approach can significantly reduce the overall cost by sequencing pooled samples for variant discovery and then verifying the variants by high-throughput genotyping in the larger set of samples.
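The “digital allele counts” mentioned above translate into a frequency estimate in a straightforward way. The sketch below is illustrative only: it treats reads as independent binomial draws from the pool, assumes every individual contributes equal amounts of DNA, and ignores sequencing error; the function names and the example numbers are hypothetical.

```python
import math


def pooled_allele_frequency(alt_reads: int, total_reads: int):
    """Estimate the pool allele frequency and its binomial standard error."""
    freq = alt_reads / total_reads
    std_err = math.sqrt(freq * (1.0 - freq) / total_reads)
    return freq, std_err


def expected_singleton_reads(n_individuals: int, depth: float) -> float:
    """Expected alternate reads for an allele carried on 1 of 2N chromosomes."""
    return depth / (2 * n_individuals)


# Hypothetical pool of 50 diploid individuals sequenced to 1,000x at one site.
freq, std_err = pooled_allele_frequency(alt_reads=12, total_reads=1000)
print(f"estimated frequency: {freq:.4f} (SE {std_err:.4f})")
print("expected reads for a singleton:", expected_singleton_reads(50, 1000))
```

Under these assumptions, a singleton in a pool of 50 individuals is expected to appear in about 1 % of reads, which is close to typical per-base error rates; this is one reason dedicated pooled-variant callers and follow-up genotyping are used.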

In Silico Rare-Variant Identification Using Genotype Imputation

Another route to the initial identification of rare variants is genotype imputation: many large-scale genotyping and sequencing projects have released information on millions of variants to public databases, which provides an enormous resource for rare-variant discovery by imputation. Genotype imputation allows association testing at variants that are not directly genotyped, by inferring them from the variants genotyped in each individual and a reference population from HapMap or the 1000 Genomes Project. Imputation is useful not only for combining results from studies conducted on different genotyping platforms but also for increasing the power of individual scans in both traditional GWAS and sequencing-based association analysis. Unlike common variants, rare variants tend to be of recent origin and to share shorter haplotype stretches, which results in a much higher discordance rate during imputation. The proportion of missing genotypes also increases significantly [89]. One of the largest human exome sequencing projects, from the National Heart, Lung, and Blood Institute (NHLBI), reported that approximately 73 % of all protein-coding variants and approximately 86 % of variants predicted to be deleterious arose in the past 5,000–10,000 years. Because of the nature of rare variants, the reference panel used for imputation needs to be much larger than for common-variant imputation. A recent review reported that across all imputation panels and genotyping chips, the imputation error rate increases as the minor allele frequency decreases, which is in line with previous observations that rare SNPs are more difficult to tag than common SNPs [90]. A reference panel phased with trio information gives better imputation performance than one phased without trio information, and the combined CEU + YRI + JPT + CHB reference panel can improve imputation performance and accuracy across all populations when imputing genotypes at SNPs with MAF < 5 % [89].

Although rare-variant imputation is a feasible approach, in addition to sequencing for rare-variant discovery, genotyping/sequencing validation is necessary after imputation. Multiple imputation programs have been developed. Detailed information about these programs is provided in the next section.

Rare-Variant Analysis and Statistical Methods

Sequencing Data Analysis, Variant Calling, and Imputation

After high-throughput sequencing, variant calling is essential to retrieve rare variants accurately from either high-coverage exome sequencing or low-coverage whole-genome sequencing data. The variant-calling pipeline has several components, including base calling, alignment, post-alignment processing, variant and genotype calling, and candidate variant filtering. Base calling converts the raw image data to sequence data and is usually performed by the sequencing platform’s own software. A number of alignment, post-alignment processing, and variant-calling programs are available. The Burrows–Wheeler Aligner (BWA) provides fast and accurate alignment and supports gapped alignment [91]. The Genome Analysis Toolkit (GATK) provides post-alignment processing, including local realignment around indels and base quality recalibration [92]. Both the GATK UnifiedGenotyper and SAMtools [93] have built-in probabilistic approaches that facilitate variant calling at different coverage levels. Variant calling can be carried out in multiple individuals simultaneously. Targeted resequencing data can be processed with the same variant-calling pipeline as whole-exome/genome sequencing data. However, if the targeted resequencing is carried out on pooled samples, the pipeline requires different programs for variant calling after alignment and post-alignment processing. The Syzygy program [88] was developed especially for pooled-sample resequencing; it computes the likelihood that a position contains a non-reference allele, using Bayes’ rule and a parameter that specifies the number of chromosomes in the pool.
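The Bayes’ rule calculation described for pooled samples can be illustrated with a toy model. The following is a minimal sketch and not Syzygy’s actual implementation: the symmetric per-base error rate, the prior probability that a site is variant, the uniform prior over the number of non-reference chromosomes, and all function and parameter names are assumptions made for illustration.

```python
import math


def posterior_pool_has_variant(alt_reads, depth, n_chromosomes,
                               error_rate=0.01, prior_variant=0.001):
    """P(pool holds >= 1 non-reference chromosome | read counts), toy model.

    Assumptions (hypothetical, for illustration only):
      * k of the n_chromosomes carry the non-reference allele;
      * reads are binomial with alt-read probability
        f = (k/n)(1 - e) + (1 - k/n) e for a symmetric error rate e;
      * prior P(k = 0) = 1 - prior_variant, rest spread evenly over k >= 1.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    log_posterior = []
    for k in range(n_chromosomes + 1):
        frac = k / n_chromosomes
        f = frac * (1.0 - error_rate) + (1.0 - frac) * error_rate
        # The binomial coefficient is identical for every k and cancels out.
        log_lik = alt_reads * math.log(f) + (depth - alt_reads) * math.log(1.0 - f)
        prior = (1.0 - prior_variant) if k == 0 else prior_variant / n_chromosomes
        log_posterior.append(log_lik + math.log(prior))
    top = max(log_posterior)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - top) for v in log_posterior]
    return 1.0 - weights[0] / sum(weights)


# Hypothetical site: 100 chromosomes in the pool, 1,000 reads, 1 % error rate.
for alt in (10, 30, 60):
    print(alt, "alt reads ->", round(posterior_pool_has_variant(alt, 1000, 100), 4))
```

With 10 alternate reads out of 1,000, the count is fully explained by the assumed 1 % error rate and the posterior stays near zero; larger counts shift the posterior toward a true variant.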

Moreover, genotype imputation can make full use of the sequencing data to increase the number of variants available for association analysis, enhance statistical power, and facilitate the combination of results across studies by meta-analysis [89]. Multiple imputation programs have been developed. MaCH [94], MaCH-Admix [95], IMPUTE [96], and IMPUTE2 [97] are based on extensions of the hidden Markov model (HMM) originally developed as part of importance sampling schemes for simulating coalescent trees, modeling LD, and estimating recombination rates [89]. MaCH works by successively updating the phase of each individual’s genotype data conditional on the current haplotype estimates of all the other samples. MaCH-Admix, which is based on MaCH, lets the user choose either an integrated model in which the program estimates the parameters (recombination rate and error rate) or fixed parameters derived from the reference panel only or from both reference haplotypes and target genotypes [95]. In contrast, IMPUTE and IMPUTE2 use fixed estimates of mutation rates and recombination maps. Furthermore, MaCH uses a random subset of sample haplotypes as templates, whereas IMPUTE2 uses a subset of haplotypes selected to be similar to the haplotypes of the individual currently being estimated. The IMPUTE2 strategy appears to permit greater improvement in accuracy as sample size increases and model complexity (the number of states) is held constant [98]. Minimac is a low-memory, computationally efficient implementation of the MaCH algorithm for genotype imputation that is designed to work on phased genotypes and can handle reference panels with hundreds or even thousands of haplotypes, such as those of the 1000 Genomes Project [99]. A similar approach has been implemented in IMPUTE2. BEAGLE is based on a graphical model of a set of haplotypes. This method works iteratively by fitting the model to the current set of estimated haplotypes and then resampling new estimated haplotypes for each individual based on the fitted model [100, 101]. This program can be used for both case–control and family-based study designs.
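The HMM idea shared by these programs, in which a target haplotype is modeled as an imperfect mosaic of reference haplotypes, can be sketched on a toy scale. The code below is a simplified copying model with a forward–backward pass and is not the algorithm of MaCH, IMPUTE2, Minimac, or BEAGLE: it handles a single phased haplotype, uses one constant switch probability between sites, and the `switch` and `mismatch` parameters and all names are assumptions.

```python
def impute_haplotype(reference, target, switch=0.05, mismatch=0.01):
    """Impute missing alleles (None) in `target` from reference haplotypes.

    Hidden state: which reference haplotype is being copied at each site.
    Returns observed alleles unchanged and, for each missing site, the
    posterior probability that the allele is 1.
    """
    n_ref, n_sites = len(reference), len(target)

    def emission(h, s):
        if target[s] is None:           # missing sites carry no information
            return 1.0
        return 1.0 - mismatch if reference[h][s] == target[s] else mismatch

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: stay on the same template or jump uniformly to any one.
    forward = [normalize([emission(h, 0) / n_ref for h in range(n_ref)])]
    for s in range(1, n_sites):
        prev = forward[-1]
        jump = switch * sum(prev) / n_ref
        forward.append(normalize(
            [emission(h, s) * ((1.0 - switch) * prev[h] + jump)
             for h in range(n_ref)]))

    # Backward pass with the same transition model.
    backward = [[1.0] * n_ref for _ in range(n_sites)]
    for s in range(n_sites - 2, -1, -1):
        contrib = [emission(h, s + 1) * backward[s + 1][h] for h in range(n_ref)]
        jump = switch * sum(contrib) / n_ref
        backward[s] = normalize([(1.0 - switch) * c + jump for c in contrib])

    result = []
    for s in range(n_sites):
        if target[s] is not None:
            result.append(target[s])
            continue
        posterior = normalize([forward[s][h] * backward[s][h] for h in range(n_ref)])
        result.append(sum(p for h, p in enumerate(posterior) if reference[h][s] == 1))
    return result


# Hypothetical reference panel (3 haplotypes, 5 sites); site 2 is untyped.
reference = [[0, 0, 1, 0, 1],
             [1, 1, 0, 1, 0],
             [0, 0, 1, 1, 1]]
target = [1, 1, None, 1, 0]
imputed = impute_haplotype(reference, target)
print(imputed)
```

In this example the target matches the second reference haplotype at every typed site, so the imputed probability of allele 1 at the untyped site is close to 0. For a rare allele, very few reference haplotypes carry it, which is one way to see why larger reference panels are needed.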

Rare-Variant Association Analysis Methods

Unlike in GWAS of common variants, single-marker analysis has poor power in rare-variant association analysis because of the extremely low allele frequencies [102]. In response to growing interest, burden tests have been proposed for rare variants; these methods typically pool or collapse multiple rare variants within a region, weight and/or prioritize rare variants based on functional and other criteria, or use distribution-based approaches [102–105]. Recent evidence suggests that multiple rare variants often act collectively on disease risk [87, 103, 106, 107], so the effects of low-frequency variants need to be aggregated to increase power.

The cohort allelic sum test (CAST) was the first method developed to collapse information on all rare variants within a region (for example, the exons of a gene) into a single dichotomous variable for each subject, indicating whether the subject carries any rare variant within the region, and then to apply a univariate test [103]. However, this method does not easily accommodate covariates, does not weight the variants, and is not compatible with quantitative traits. The combined multivariate and collapsing (CMC) test [102] extends the CAST by collapsing variants in subgroups according to allele frequency and combining these subgroups using Hotelling’s T² test, which controls type I error well. The disadvantage of this approach is that the allele-frequency threshold is not easy to select in a biologically meaningful way. Whereas the CAST assigns similar effects to all rare variants, weighting methods assign higher priority to alleles on the basis of their frequency in the control population, potential functional changes predicted by PolyPhen/SIFT, or other criteria. The weighted sum statistic (WSS) was the first weighting-based method; it groups variants according to function and permutes disease status among affected and unaffected individuals to test for an excess of variants in the affected individuals [104]. The weighting method has the limitation that it applies much higher weights to very rare variants in some scenarios. Besides collapsing or weighting variants, the variable threshold (VT) test uses a variable allele frequency threshold instead of a fixed one and then assesses statistical significance by permutation testing [108]. All of the above burden tests require either specification of thresholds for collapsing or the use of permutation to estimate the threshold. Permutation tests are computationally expensive, especially on the whole-genome scale, and are difficult to adjust for covariates because permutation requires independence of the genotype from the covariates.
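The collapsing idea behind the CAST can be illustrated with a minimal Python sketch. It is not the published implementation: the function names, the 1 % rarity cutoff, the toy genotypes, and the choice of a Fisher exact test as the univariate test are all assumptions made for illustration.

```python
from math import comb


def collapse_carriers(genotypes, mafs, maf_threshold=0.01):
    """CAST-style collapsing: 1 if a subject carries any rare allele, else 0.

    genotypes: one list per subject of minor-allele counts (0, 1, or 2).
    mafs: minor allele frequency of each variant, in column order.
    maf_threshold: hypothetical cutoff defining "rare" (an assumption here).
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def cast_test(genotypes, mafs, is_case, maf_threshold=0.01):
    """Collapse rare variants per subject, then test carrier status against case status."""
    carrier = collapse_carriers(genotypes, mafs, maf_threshold)
    pairs = list(zip(carrier, is_case))
    a = sum(1 for c, y in pairs if c and y)          # carrier cases
    b = sum(1 for c, y in pairs if c and not y)      # carrier controls
    c_ = sum(1 for c, y in pairs if not c and y)     # non-carrier cases
    d = sum(1 for c, y in pairs if not c and not y)  # non-carrier controls
    return fisher_exact_two_sided(a, b, c_, d)


# Toy data (made up): 8 subjects, 2 rare variants, first 4 subjects are cases.
toy_genotypes = [[1, 0], [0, 1], [1, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0]]
toy_mafs = [0.004, 0.003]
toy_is_case = [1, 1, 1, 1, 0, 0, 0, 0]
print(cast_test(toy_genotypes, toy_mafs, toy_is_case))
```

Because each subject is reduced to a single carrier indicator, the sketch also makes the limitations noted above visible: there is no place for covariates, per-variant weights, or a quantitative phenotype.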

The C-alpha test is a non-burden test that is robust to the direction and magnitude of effect. It compares the expected variance with the observed variance of the distribution of effect in case–control data, which improves power relative to burden-based tests [109]. However, the C-alpha test is not easy to adjust for covariates, for example to control for population stratification. Kernel-based methods such as the sequence kernel association test (SKAT) are also non-burden tests; with SNP effects modeled linearly, SKAT aggregates weighted individual variant score test statistics instead of aggregating the variants themselves. SKAT extends kernel machine-based tests to rare variants with more accurate asymptotic approximations in the tail of the distribution [110], and it is powerful when a genetic region contains both protective and deleterious variants or many non-causal variants. Because SKAT weights and sums squared per-variant score statistics, effects in opposite directions do not cancel, which avoids this source of power loss. However, in an extreme case, SKAT may have less power than burden tests if all variants in a gene or region are truly causal and affect the phenotype in the same direction [110, 111]. SKAT also produces conservative type I error rates in small-sample case–control sequencing association studies, a situation often encountered in current exome sequencing studies, which can lead to power loss [110, 112]. SKAT-O [113] improves on this by adaptively using the data to optimally combine the burden test and SKAT so as to maximize power; it is computationally efficient and can easily be applied to GWAS. Recently, family-based SKAT (famSKAT) was developed within the linear mixed-effects model framework to extend SKAT to rare-variant association analysis of quantitative traits in family data [114]. Moreover, several popular multivariate tests for GWAS have been evaluated for rare-variant association, including the minimum of univariate P-values (UminP) [115], sum score [116], sum of squared score (SSU) [117], and weighted sum of squared score (SSUw) tests [111]. Although more methods are being developed, no single method fits all scenarios for rare-variant association analysis. In practice, several methods should be tried.
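The contrast between a burden statistic and a SKAT-type statistic can be made concrete with a small Python sketch. It assumes no covariates and computes only the test statistics, not P-values (SKAT obtains those from a mixture of chi-square distributions, which is omitted here). The function names and toy data are hypothetical, and the Beta(1, 25) density is included only as one commonly used choice of frequency-based weight.

```python
from math import gamma


def score_statistics(genotypes, phenotype):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y)), with no covariates."""
    ybar = sum(phenotype) / len(phenotype)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] * (y - ybar) for row, y in zip(genotypes, phenotype))
        for j in range(n_variants)
    ]


def burden_statistic(scores, weights):
    """Burden-type statistic: weighted scores are summed first, then squared."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def skat_statistic(scores, weights):
    """SKAT-type statistic: weighted scores are squared first, then summed."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


def beta_density_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density at the MAF; up-weights rarer variants when a=1, b=25."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * maf ** (a - 1) * (1 - maf) ** (b - 1)


# Toy data (made up): variant 1 is seen only in cases, variant 2 only in controls.
toy_genotypes = [[1, 0], [1, 0], [0, 0], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
toy_phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = score_statistics(toy_genotypes, toy_phenotype)
flat_weights = [1.0, 1.0]

print("burden:", burden_statistic(toy_scores, flat_weights))
print("SKAT-type:", skat_statistic(toy_scores, flat_weights))
```

In the toy data one variant looks deleterious and the other protective, so their scores have opposite signs: they cancel in the burden statistic but both contribute to the SKAT-type statistic. When all variants act in the same direction, the burden statistic accumulates the signal more efficiently, which is the situation in which SKAT can lose power.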

Example of Rare-Variant Discovery in Addiction Studies: ND

Discovery of Rare Variants Related to ND

GWAS have identified common variants in several nAChR subunit genes/clusters, such as CHRNA5/CHRNA3/CHRNB4 and CHRNA6/CHRNB3, that contribute to ND. However, the role of rare variation in these nicotinic receptor genes in the risk of ND has not been well studied. Wessel et al. [118] sequenced 11 nAChR subunit genes using an amplicon approach; a total of 44 common and 129 rare SNPs (MAF < 5 %) were identified and tested for association with the FTND score using data from 430 individuals, 18 of whom were excluded because of a reduced completion rate. The CAST and the weighted sum statistic were used to test rare variants only, and multivariate distance matrix regression was used for both common and rare SNPs. Significant association with the FTND score was observed for common and rare SNPs/SNVs in CHRNA5 and CHRNB2 and for rare SNVs in CHRNA4. Thus, both common and rare SNPs/SNVs from multiple nAChR subunit genes were associated with the FTND score in this sample of treatment-seeking smokers. A follow-up CHRNA4 rare-variant study resequenced exon 5 in more than 2,000 individuals from both European American (EA) and African American (AA) populations [119], and the association test suggested that rare variants in CHRNA4 confer protection against ND. More recently, a rare-variant study used pooled sequencing of the coding regions and flanking sequence of CHRNA5, CHRNA3, CHRNB4, CHRNA6, and CHRNB3, again through an amplicon approach, in AA and EA nicotine-dependent smokers and smokers without symptoms of dependence [120]. The carrier status of individuals harboring rare missense variants at conserved sites in each of these genes was then compared between cases and controls to test for an association with ND. Missense variants at conserved residues in CHRNB4 were associated with a lower risk of ND in both AA and EA, with two variants (T375I and T91I) contributing most to this association [120].

The above rare-variant studies used amplicon approaches with Sanger sequencing, next-generation sequencing, or both, and screened only a handful of genes. However, linkage, candidate gene, and GWAS studies have also identified several other possibly important genes beyond the nAChR subunits. We recently developed a customized targeted capture panel of 32 genes (Supplementary Table 1) that includes nAChR subunit genes and several neurotransmitter receptor and metabolism genes reported to be associated with ND in various studies [6, 121]. The Agilent SureSelect capture panel (250 kb) includes the coding regions, UTRs, and flanking sequence of these genes. A total of 400 samples (200 sib pairs) were selected from the Mid-South Tobacco Family study and divided into eight pools (50 samples per pool) according to ethnic group (EA and AA), smoking status (smoker and nonsmoker), and FTND (light and heavy smokers). The concentrations of the individual DNA samples were first measured using the Quant-iT™ dsDNA assay kit (Life Technologies), and the samples were pooled in equimolar amounts as suggested by the manufacturers. Library preparation, targeted capture, and high-throughput sequencing (72 bp paired-end) were then conducted, followed by data analysis that included alignment with BWA against the hg19 assembly and base quality recalibration. Variant calling was conducted using Syzygy, which is designed for pooled samples.
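The equimolar pooling step can be sketched as a simple calculation. The sketch below is not taken from the assay kit documentation; it assumes that, for genomic DNA from a single species, equal molar amounts correspond to equal DNA mass, and the function name, the 100 ng target per sample, and the example concentrations are hypothetical.

```python
def pooling_volumes(concentrations_ng_per_ul, ng_per_sample=100.0):
    """Volume (in µL) of each sample that contributes the same DNA mass to a pool.

    concentrations_ng_per_ul: measured concentration of each sample.
    ng_per_sample: hypothetical target mass contributed by every sample.
    """
    if any(c <= 0 for c in concentrations_ng_per_ul):
        raise ValueError("every sample needs a positive measured concentration")
    return [ng_per_sample / c for c in concentrations_ng_per_ul]


# Three made-up samples measured at 50, 25, and 100 ng/µL.
print(pooling_volumes([50.0, 25.0, 100.0]))
```

Equal representation matters because allele frequencies are later estimated from read counts, so a sample that contributes more DNA than the others would bias the pooled estimate toward its own genotype.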

Overall, 62 GB (868 million reads) of raw sequencing data were produced, with an average of 108 million reads per pool. More than 80 % of the raw reads mapped to the human genome (hg19) after filtering. A total of 147 million reads mapped to the targeted regions, about 20 % of the total mapped reads, and the targeted regions were 100 % covered, with a median coverage of 106× for each individual in the pool. The distribution of reads across the genome is shown in Fig. 1. The minor allele frequency (MAF) of several common variants (MAF ≥ 0.05) was calculated from the pooled sequencing data and compared with our previous genotyping results from the ABI TaqMan assay; the correlation between the two methods was 0.97 for AA samples and 0.90 for EA samples (Fig. 2). Variants were selected if they had a minimum MAF of 0.75 % and a minimum sequencing read count of 500. After removal of intronic and synonymous variants, a total of 430 putative functional rare variants were identified from the eight pools and ranked according to their PolyPhen and SIFT scores and variant frequency. Table 3 summarizes the putative functional variants identified from each pool, which include 28 premature stop codons, 212 damaging variants, and 190 tolerated variants. Several predicted-functional rare variants were selected for further validation in the Mid-South Tobacco Case–Control samples [45] and showed significant association with ND (Table 4), which is consistent with the previous report from Haller et al. [120].
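The selection thresholds quoted above (MAF of at least 0.75 % and at least 500 reads) can be related to the pool size with a short Python sketch. This is not how Syzygy calls variants; it only estimates a pooled allele frequency from read counts and applies the two filters. The function names and read counts are hypothetical, and the pool depth of about 5,300× is a rough figure obtained by multiplying 50 individuals by the 106× median per-individual coverage.

```python
def singleton_frequency(n_individuals, ploidy=2):
    """Allele frequency of a variant carried on a single chromosome in the pool."""
    return 1 / (n_individuals * ploidy)


def pooled_allele_frequency(alt_reads, total_reads):
    """Estimate the pooled allele frequency as the fraction of reads with the allele."""
    return alt_reads / total_reads


def passes_filters(alt_reads, total_reads, min_maf=0.0075, min_depth=500):
    """Apply the depth and MAF thresholds quoted in the text to one site."""
    if total_reads < min_depth:
        return False
    freq = pooled_allele_frequency(alt_reads, total_reads)
    maf = min(freq, 1 - freq)
    return maf >= min_maf


# A single heterozygous carrier among 50 diploid individuals is 1 chromosome in 100.
print(singleton_frequency(50))

# Made-up read counts at a pool depth of roughly 50 x 106 = 5,300 reads.
print(passes_filters(alt_reads=53, total_reads=5300))
print(passes_filters(alt_reads=20, total_reads=5300))
```

With 50 diploid individuals per pool, the smallest true allele frequency is 1 %, so a 0.75 % cutoff sits just below the signal expected from a single carrier, while lower read fractions are more likely to reflect sequencing error.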
Fig. 1 Distribution of mapped reads of each sequenced gene on 22 human chromosomes

Fig. 2 Correlation of SNP allele frequencies of all sequenced genes detected by genotyping on the TaqMan OpenArray genotyping system and pooling targeted resequencing in AA and EA samples

Table 3

Discovery of putative rare variants from ND candidate gene resequencing

Library/pooled DNA group
African American heavy smokers
African American heavy smokers: nonsmoker sibling controls
African American light smokers
African American light smokers: nonsmoker sibling controls
European American heavy smokers
European American heavy smokers: nonsmoker sibling controls
European American light smokers
European American light smokers: nonsmoker sibling controls

The number of variants shown in each row may overlap

Table 4

Significantly associated rare variants in nACh receptor subunit genes

Column headings: African American; European American; MAF (%)

Several low-frequency and rare variants within CHRNA5/CHRNA3/CHRNB4 discovered by targeted resequencing were selected for further validation on the TaqMan® OpenArray® genotyping system (Life Technologies). The samples used in the replication study were from the Mid-South Tobacco Case–Control Study (MSTCC) population, which consists of 4,548 smokers and nonsmokers aged 18 years or older of either African American (AA, N = 3,161) or European American (EA, N = 1,387) origin, who were recruited primarily from the city of Jackson, MS, during 2005–2011 [45]. Although questionnaires assessing various smoking-related behaviors were administered to each participant, only the Fagerström test for nicotine dependence (FTND) and indexed CPD data were analyzed for this report. The association analysis was performed in PLINK [130] using a linear regression model for FTND scores and indexed CPD, with age and sex as covariates. The FTND scores were continuous, ranging from 0 to 10, and indexed CPD categories were defined as 1 (≤10 CPD), 2 (11–20 CPD), 3 (21–30 CPD), and 4 (≥31 CPD), as in our previous studies [38, 45]. Non-smokers were excluded from the regression model.
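The indexed CPD coding described in this note can be written as a small Python helper. The function name and the input handling are hypothetical; only the four category boundaries come from the text, and non-smokers are assumed to have been removed before the function is applied.

```python
def indexed_cpd(cigarettes_per_day):
    """Map cigarettes per day (CPD) in a smoker to the four-level indexed CPD.

    1: <=10 CPD, 2: 11-20 CPD, 3: 21-30 CPD, 4: >=31 CPD.
    """
    if cigarettes_per_day < 0:
        raise ValueError("cigarettes per day cannot be negative")
    if cigarettes_per_day <= 10:
        return 1
    if cigarettes_per_day <= 20:
        return 2
    if cigarettes_per_day <= 30:
        return 3
    return 4


# Made-up CPD values for five smokers.
print([indexed_cpd(c) for c in (5, 10, 11, 30, 40)])
```

The resulting index is then treated as the outcome of the linear model in the same way as the FTND score.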

Functional Studies of Rare Variants in ND

Most of the low-frequency and rare variants change the amino acid sequence, which offers the opportunity to study their function and may explain the phenotypic variance they cause. For nAChR subunit genes, in vitro electrophysiology has commonly been used to study the function of variants. The function of two CHRNB4 variants (T375I and T91I) and a missense variant in CHRNA3 (R37H) that is in strong LD with T91I was examined in vitro by electrophysiology in HEK293 cells. The minor allele of each polymorphism increased the cellular response to nicotine (T375I, P = 0.01; T91I, P = 0.02; R37H, P = 0.003), but the largest effect on in vitro receptor activity was seen in the presence of both CHRNB4 T91I and CHRNA3 R37H (P = 2 × 10⁻⁶) [120].

In vitro functional studies have their limitations; several knockout (KO) mouse lines provide a chance to study the function of variants in vivo. KO mice with modified CHRNA5 [122] and CHRNB4 [123] subunits, as well as heterozygous CHRNA3 KO mice [124], facilitate the functional study of rare variants in the CHRNA5/A3/B4 gene cluster. Remarkably, A5 subunit knockdown in the medial habenula (MHb) did not alter the rewarding effects of nicotine but abolished the inhibitory effects of higher nicotine doses on brain reward systems [125]. Because the MHb projects almost exclusively to the interpeduncular nucleus (IPN), IPN activation in response to nicotine was diminished in CHRNA5-knockout mice, and disruption of IPN signaling in rats increased nicotine intake.

Taking the in vivo functional study one step further, Hong et al. [126] used a resting-state functional connectivity (rsFC) approach and reported that the Asp398Asn variant in CHRNA5 is associated with the dorsal anterior cingulate-ventral/extended amygdala circuit, in which it decreases the intrinsic resting functional connectivity strength. Although Asp398Asn is a common variant, this work extended functional study from in vitro to in vivo. Xie et al. [119] examined the effect of rare variants on in vivo nAChR binding using single-photon emission-computed tomography (SPECT) in a subsample of 139 subjects. One of the rare variants was associated with substantially greater nAChR availability in the brain than was seen in four age-matched individuals, suggesting that the variant alters nAChR availability. All the recent rare-variant functional studies related to ND focus on nAChR subunit genes; however, the rare variants of nAChR subunit genes may not explain all the phenotypic variance in ND. Several other genes that have been reported in ND GWAS [6, 121, 127, 128] and/or candidate gene studies need further investigation for potential functional rare variants associated with ND.


To date, all reported rare-variant studies related to substance abuse have targeted only a handful of candidate genes. Almost all the recent studies were conducted on ND, not on AD or dependence on illicit drugs. Although these genes have been well investigated and their strong association with drug addiction confirmed by GWAS, candidate gene studies, or both, other genes and variants associated with drug addiction may still be unidentified. Whole-exome sequencing or large-scale targeted resequencing of candidate genes is necessary for studying unknown or less well-studied genes and variants, especially low-frequency and rare variants. However, rare-variant discovery requires extensive sequencing of much larger populations than are needed for common-variant identification with the case–control study design used for GWAS. Family-based designs may have some advantage over case–control studies when exploring potential functional rare variants that change amino acid coding or gene expression patterns, because rare variants may be enriched in a few families, which may increase the statistical power. Family-based designs are also efficient for whole-genome sequencing because the sequence of non-founders can be imputed. The linkage peaks identified in family studies are usually not consistent with GWAS results, which suggests that those peaks harbor rare variants that GWAS cannot detect because of their low frequency and the resulting lack of power. Taken together, these findings indicate that rare variants may have great potential to elucidate the unexplained contribution to the phenotypic variance for complex traits, but confirming this requires even greater efforts.
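The sample-size point above can be illustrated with a short calculation of how likely a rare allele is to be seen at all. The Python sketch below assumes unrelated diploid individuals and independent chromosomes, so the probability of observing at least one copy among n individuals is 1 − (1 − MAF)^(2n). The function names and example values are hypothetical, and the calculation concerns discovery only, not the power of an association test.

```python
from math import ceil, log


def prob_observed(maf, n_individuals):
    """Probability that at least one copy of the allele is present in the sample."""
    return 1 - (1 - maf) ** (2 * n_individuals)


def individuals_needed(maf, target_prob=0.95):
    """Smallest sample size at which the allele is observed with the target probability."""
    return ceil(log(1 - target_prob) / (2 * log(1 - maf)))


# Made-up allele frequencies spanning common to very rare variants.
for maf in (0.05, 0.01, 0.001):
    print(maf, individuals_needed(maf), round(prob_observed(maf, 500), 3))
```

The required sample size grows roughly in inverse proportion to the allele frequency, which is one reason designs that enrich for carriers, such as family-based sampling, are attractive for rare variants.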


The preparation of this review was supported by NIH grant DA-012844 to MDL. The authors thank Dr. David Bronson for his excellent editing.

Supplementary material

ESM 1 (DOCX 39 kb)

Copyright information

© Springer Science+Business Media New York 2013