Background

Until recently, the majority of methodological approaches for the analysis of common single-nucleotide variants (SNVs; common SNVs have a minor allele frequency of at least 5%) involved analysis of microarrays using either single-marker or multiple-marker approaches, with single-marker approaches by far the more common of the 2 in practice. Typically, large genome-wide association studies will conduct hundreds of thousands (or millions) of single-marker analyses. Although less common in practice, considerable methodological development has taken place in the area of multiple-marker analysis, in which signals from multiple common SNVs are aggregated into a single test of association for the set of SNVs of interest (see Refs. [1, 2] for recent reviews).

Of particular interest in the development of multiple-marker tests is the blurring of the lines that has recently taken place between what were traditionally considered "gene-based" tests of association and "pathway-based" tests of association. Historically, gene-based tests of association aggregated a small number of SNVs within a gene into a single test statistic, whereas pathway-based tests operated in 2 stages [1, 2]. In the traditional approach to pathway analysis, researchers first generate a statistic for each gene. In a second stage, researchers combine multiple gene-level statistics into a pathway statistic. Recently, however, a single-stage approach was advocated whereby all the SNVs in the pathway are simultaneously aggregated into a single statistic [1]. Effectively, this single-stage approach could be considered a "SNV-set" approach in which the set of SNVs can be defined in any biologically plausible manner. Although this relatively new approach is gaining in popularity in the literature, little concrete evidence of its performance relative to the 2-stage approach is available. It has been suggested, however, that the 2-stage approach may be optimal when there are fewer causal variants in each gene, but each with strong risk, and that the single-stage approach may be better suited for cases where there are more causal variants, but with weaker effects [3, 5].

With the rapid growth in access to next-generation sequencing (NGS) technology, there has been a tidal wave of methodological developments to analyze such data. At the heart of most analytic approaches for NGS data is the desire to appropriately handle rare variants. Because statistical tests of individual rare variants lack power and multiple testing penalties quickly become unwieldy because of the preponderance of rare variants, most recently proposed NGS data analysis approaches attempt to aggregate multiple SNV association signals so as to increase power. These rare SNV-set methods can be viewed as a special type of multiple-marker test that is particularly useful when SNVs are rare. For a broad overview and classification of these rare variant approaches for case-control studies see Liu et al [6].

There are few family-based tests of rare variant association that are appropriate for analyzing dichotomous phenotypes in complex pedigrees (see Hainline et al [7] for more extensive discussion). In this article, we apply several of these recently proposed approaches to very large sets of SNVs; the very large sets are created by combining SNVs from multiple genes into pathway-based sets of SNVs. The goal of our analysis is to evaluate the appropriateness of these methods for use with massive SNV sets when the proportion of noncausal variants may be very large, but the set may also contain multiple causal variants. We evaluate both single-stage and traditional 2-stage approaches, in which stage 1 computes a test statistic for each gene in the pathway, and stage 2 combines the gene-based statistics into a single statistic for the pathway.

Methods

Sample and genes

There were 849 individuals in complex pedigrees with simulated phenotype data available. We classified each of the 849 individuals as hypertensive if systolic blood pressure exceeded 140 mm Hg or diastolic blood pressure exceeded 90 mm Hg at any of up to 4 measurements (waves), and as not hypertensive otherwise. For our analysis of the simulated data we focused on simulated phenotype 1. We had knowledge of the simulation answers for our analysis. SNVs were mapped to genes using a custom version of ANNOVAR, where a SNV is assigned to a gene if its physical location is within the start-stop position of the gene [8].
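As an illustration, the classification can be expressed in a few lines of R; the data frame `bp` and its wave-specific column names are hypothetical, not part of the distributed data.

```r
# A minimal sketch of the phenotype definition. `bp` is a hypothetical data
# frame with one row per individual and columns SBP_1..SBP_4 and DBP_1..DBP_4
# holding the (possibly missing) blood pressure measurements at each wave.
sbp <- as.matrix(bp[, paste0("SBP_", 1:4)])
dbp <- as.matrix(bp[, paste0("DBP_", 1:4)])

# Hypertensive if SBP > 140 mm Hg or DBP > 90 mm Hg at any observed wave.
hypertensive <- apply(sbp > 140 | dbp > 90, 1, any, na.rm = TRUE)
```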

Creation of gene sets

To evaluate the performance of different approaches to the analysis of very large sets of SNVs from multiple genes (eg, pathways) we created 800 sets of 5 genes each, where 200 of the sets contained no genes with causal variants, 200 of the sets contained 1 gene with causal variants, 200 of the sets contained 3 genes with causal variants, and 200 of the sets had all 5 genes containing causal variants. Sets were created by randomly choosing genes, without replacement, from lists of genes that were known (based on the simulation model) to contain or not contain causal SNVs. The average number of SNVs in the 800 sets was 880 (SD = 468; minimum = 58; maximum = 2451).
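For concreteness, the set construction can be sketched in R as follows; `causal_genes` and `noncausal_genes` are hypothetical names for the gene lists derived from the simulation model.

```r
# Build n_sets gene sets of 5 genes each, with a fixed number of causal
# genes per set; genes are drawn without replacement within each set.
make_sets <- function(n_sets, n_causal, n_genes = 5) {
  replicate(n_sets, c(
    sample(causal_genes, n_causal),
    sample(noncausal_genes, n_genes - n_causal)
  ), simplify = FALSE)
}

set.seed(1)  # reproducibility of the random draws
sets <- c(make_sets(200, 0), make_sets(200, 1),
          make_sets(200, 3), make_sets(200, 5))
```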

Previously proposed statistical tests

We applied 5 different family-based tests of association considered by Zhu and Xiong [9] to sets of SNVs. All tests were conducted in R using software functions written by Zhu and Xiong and custom scripts. More details on the tests are available elsewhere [7, 9].

The methods considered by Zhu and Xiong utilize a correction factor, $p_{corr}$, which summarizes the additional correlation in the sample that occurs as a result of the complex pedigree structure. $p_{corr}$ is a function of the estimated kinship matrix (see Hainline et al [7] for details) and is used to adjust the standard error of the test statistics for the additional correlation contained in the pedigree structure.

Zhu and Xiong [9] propose a generalized multivariate $T^2$ test comparing the mean allele counts across $n$ variants (eg, SNVs within a gene) between the cases and controls; Hotelling's $T^2$ test is a multivariate version of the 2-sample t-test. Alternatively, Zhu and Xiong also consider a version that collapses rare variants below a threshold before applying the $T^2$ test (combined multivariate and collapsing [CMC]) or uses eigenvectors from the genotype matrix to reduce matrix dimensionality (functional principal component analysis [FPCA]; see Hainline et al [7] for details). In our implementation of CMC we used minor allele frequency cutoffs of 5% and 0.5%. Briefly stated, in all cases, $T^2$ is first computed as if there were no pedigree structure in the data ($T^2_{initial}$). The pedigree-adjusted statistic is then computed as $T^2_{initial}/p_{corr}$. Finally, Zhu and Xiong investigate an approach that applies a single-marker test (pedigree-adjusted single-marker $\chi^2$ test of association) to all SNVs in a set and then uses the minimum p value within the set as the p value for the entire set. Thus, there are 5 test statistics in total: $T^2$, CMC5%, CMC0.5%, FPCA, and $\chi^2_{min}$, where we refer to all methods except $\chi^2_{min}$ as $T^2$-based approaches, because they all rely on Hotelling's $T^2$ statistic.
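To make the adjustment and collapsing steps concrete, the following R sketch illustrates them under stated assumptions; it is not the Zhu and Xiong implementation, and `t2_initial`, `p_corr`, `geno`, and `maf` are hypothetical inputs whose full construction follows Refs. [7, 9].

```r
# Pedigree adjustment: `t2_initial` is assumed to be the Hotelling T^2
# statistic computed as if individuals were unrelated, and `p_corr` the
# kinship-based correction factor of Ref. [7].
pedigree_adjusted_p <- function(t2_initial, p_corr, df) {
  t2_adjusted <- t2_initial / p_corr
  # Assumes the adjusted statistic is referred to an asymptotic chi-square law
  pchisq(t2_adjusted, df = df, lower.tail = FALSE)
}

# CMC-style collapsing: SNVs with minor allele frequency below `cutoff`
# (5% or 0.5% in our analysis) are pooled into one carrier indicator before
# the T^2 test is applied. `geno` is an individuals x SNVs dosage matrix.
collapse_rare <- function(geno, maf, cutoff) {
  rare <- maf < cutoff
  if (!any(rare)) return(geno)
  cbind(geno[, !rare, drop = FALSE],
        collapsed = as.numeric(rowSums(geno[, rare, drop = FALSE]) > 0))
}
```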

Application of the tests to both 1-stage and 2-stage pathway analysis

In our analysis, we applied the 5 methods described above in 2 different ways. First, we directly applied each of the 5 methods to all 800 sets of SNVs (1-stage pathway analysis). All 1-stage pathway analysis tests are of the null hypothesis that no SNVs in the set are associated with the phenotype, with an alternative hypothesis that at least 1 SNV in the set is associated with the phenotype.

Second, we applied each of the 5 methods to all 5 individual genes contained within each of the 800 sets. We then combined the 5 gene p values for each test-set combination using Fisher's combined probability test (2-stage pathway analysis). Fisher's combined probability test is defined as $\text{Fisher}_p = \sum_{i=1}^{5} -2\log(p_i)$, where $p_i$ is the p value from the $i$th gene in the set and, under the null hypothesis, $\text{Fisher}_p \sim \chi^2_{10}$, yielding 1 Fisher's combined probability test p value for each of $T^2$, CMC5%, CMC0.5%, FPCA, and $\chi^2_{min}$, as described in the previous section. Fisher's combined probability test is one of many choices for 2-stage pathway analysis [5], but has the convenient advantage of having a known null distribution when the p values being combined are independent (in our case, this means that there is no linkage disequilibrium between the randomly chosen genes in the sets, which is a reasonable assumption). All 2-stage analyses have a null hypothesis that none of the genes in the set contain any SNVs associated with the phenotype, with an alternative hypothesis that at least 1 gene in the set contains at least 1 SNV associated with the phenotype.
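As a concrete sketch, the combination step takes only a few lines of R; the gene-level p values in the example are hypothetical.

```r
# Fisher's combined probability test: combine k independent gene-level
# p values into one pathway p value (k = 5 genes per set in our analysis).
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))                              # Fisher_p statistic
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)  # chi-square, 2k df
}

# Example with 5 hypothetical gene-level p values (so df = 10):
fisher_combine(c(0.02, 0.40, 0.15, 0.73, 0.008))
```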

Modified implementation of the Zhu and Xiong methods

When the Zhu and Xiong approaches are applied to the 800 sets of SNVs, a dramatic inflation of the type I error rate is observed. In particular, among the 5 single-stage approaches, type I error rates ranged from 43.5% to 99.5%, and among the 5 2-stage approaches, type I error rates ranged from 7.5% to 90.5%. Although we do not provide detailed results here, we note that the magnitude of the type I error rate increased as the number of SNVs in the set increased; in particular, the inflated type I error rates only occurred for sets containing more than 200 SNVs. We determined that a potential cause of the inflated type I error rate was the approach taken to estimating the SNV genotype covariance matrix [10]. When analyzing large sets (>200 SNVs) given the relatively small sample size (n = 849), prior research shows that the use of maximum likelihood estimates (MLEs) in covariance matrix estimation may yield unstable results, and that a shrinkage covariance estimator may perform better [10].

We modified the covariance estimation procedure for all methods, replacing the MLEs with the shrinkage estimator. Although this substitution invalidates the analytic null distribution derived by Zhu and Xiong, we initially evaluated the shrinkage-based statistics against that null distribution and found that this approach produced increasingly overconservative results as the set size increased (detailed results not shown). Nonetheless, we still expect that large values of the test statistic should lead to rejection of the null hypothesis.
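For reference, a minimal sketch of the substitution in R, using the shrinkage estimator implemented by the authors of Ref. [10] in the corpcor package; `geno` is a hypothetical individuals x SNVs genotype matrix.

```r
library(corpcor)

# Sample (ML-type) covariance estimate: unstable when the number of SNVs
# approaches or exceeds the sample size (n = 849 here).
sigma_ml <- cov(geno)

# Schafer-Strimmer shrinkage estimate [10]: well-conditioned even when the
# set contains many more SNVs than there are individuals.
sigma_shrink <- cov.shrink(geno, verbose = FALSE)
```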

To account for the effect of set size on the behavior of the statistic, we estimated the empirical cumulative distribution function of the p values for the "null" sets (sets containing no causal SNVs) separately as a function of the number of SNVs in the set for each of the test statistics. In particular, $F_{m,n}(t) = \frac{1}{k}\sum_{i=1}^{k} 1\{p_{i,m} \le t\}$, where $p_{i,m}$, $i = 1, \dots, k$, are the p values for test statistic $m$ (eg, $T^2$, CMC) when the set size is $n$. To obtain robust estimates of $F(t)$, we binned sets of similar set size; there were 5 bins in total, representing quintile breaks in the null set size distribution. To provide appropriate control of the type I error rate and allow evaluation of the different methods, we computed a modified p value, $p^{(modified)}_{i,m} = F_{m,n}(p_{i,m})$, for all null and nonnull (containing at least 1 causal SNV) sets. These modified p values necessarily control the type I error rate for null sets, allowing us to get a sense of the performance of the different methods if we had an appropriate null distribution for the new test statistics. Although in practice we do not have knowledge of the null set of pathways, this serves as an exploratory analysis to motivate derivation of a closed-form null distribution for the modified Zhu and Xiong statistic.
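A brief R sketch of this empirical correction under our assumptions; `null_p` and `null_size` (the p values and SNV counts of the 200 null sets for one test statistic) are hypothetical names.

```r
# Quintile breaks of the null set-size distribution define the 5 bins.
breaks <- quantile(null_size, probs = seq(0, 1, 0.2))
bin_of <- function(n) cut(n, breaks, include.lowest = TRUE)

# One empirical CDF, F_{m,n}(t), per bin, estimated from the null sets.
null_bin <- bin_of(null_size)
F_hat <- lapply(levels(null_bin), function(b) ecdf(null_p[null_bin == b]))
names(F_hat) <- levels(null_bin)

# Modified p value for any set with raw p value `p` and `n` SNVs.
p_modified <- function(p, n) F_hat[[as.character(bin_of(n))]](p)
```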

Results

We applied each of the 10 tests described earlier to all 800 sets of genes. Table 1 shows the percent of significant sets (α = 0.05), with the 800 sets stratified into groups of 200 by the number of causal and noncausal genes in the set.

Table 1 Percent significant sets at α = 0.05 by number of causal and noncausal genes in the set using modified p values

By design, sets containing no causal genes have empirical type I error rates of approximately 5%. Some deviation from 5% in Table 1 occurs because of a combination of estimating the empirical null distribution from only 200 null sets and the binning procedure used (see Methods). In general, modest increases in the percent of significant sets are observed as more causal genes (genes containing at least 1 causal SNV) are added to the set for the single-stage and 2-stage methods based on FPCA and $\chi^2_{min}$ and for single-stage CMC0.5%, whereas the other methods showed little difference in the percent of significant sets between null and nonnull sets.

Discussion

Few analyses have considered methods to analyze pathways (sets of genes; large sets of SNVs) for association with rare variants in family studies. Previous research showed that, depending on the underlying genetic architecture, either single-stage or 2-stage approaches to pathway testing may provide a powerful testing approach. However, this result has not been rigorously established across the large class of pathway testing approaches, especially for rare variant tests in pedigree data. In this article, we considered both 1-stage and 2-stage approaches to pathway analysis, either applying the methods proposed by Zhu and Xiong directly to very large multigene SNV sets (1-stage) or combining gene-level results from the methods of Zhu and Xiong using Fisher's combined probability test (2-stage).

As with most gene-based testing approaches, the authors of the primary methods considered here only evaluated their methods on "small" sets of SNVs (189 SNVs) [9]. However, many authors of gene-based testing approaches assume that the methods can be applied to large sets of SNVs that may contain SNVs from multiple genes (see, eg, Madsen and Browning [3]). This is both a reasonable and natural assumption, although one that has rarely been examined explicitly in the literature.

In our consideration of 1-stage approaches, we found a substantially inflated type I error rate, which increased as the number of SNVs increased. Further analysis pinpointed the issue to the use of MLEs when estimating the genotype covariance matrix (as in the Zhu and Xiong code). Schäfer and Strimmer [10] showed that a shrinkage estimator provides robust covariance estimates as the number of SNVs increases relative to the sample size. When we implemented the shrinkage estimator, the results were overly conservative (the empirical type I error rate was substantially less than the nominal rate) when using the null distribution derived for the MLEs by Zhu and Xiong. This pattern of findings also held for 2-stage approaches when any gene in the set was large (containing more than 200 SNVs). To address these limitations, we applied an empirical correction factor to the p values, which allowed us to examine the performance of the 10 methods.

Conclusions

Further research is necessary to develop alternative covariance estimation procedures and corresponding null distributions for large (more than 200 SNVs) sets. We note that, in both a companion paper [7] and here, use of the MLE estimation approach yields well-controlled type I error rates for SNV sets with fewer than 200 SNVs. Permutation approaches to control the type I error rate should also be explored; however, given the complex pedigree structure present in the data, these will require gene dropping or related approaches that may limit their practical utility.

Assuming that appropriate finite sample null distributions can be derived when applying SNV-set methods to very large sets of SNVs, the true underlying genetic disease architecture will play a significant role in determining which statistical methods perform best in practice. A recent article evaluating the differences in rare variant tests of association illustrated that, for sets of SNVs where the proportion of noncausal SNVs is very large, as will likely be the case in 1-stage approaches for rare variants, tests like $\chi^2_{min}$ (equivalent to an $L_\infty$ norm) will perform better than $T^2$-based ($L_2$ or $L_1$ norm) tests by empirically upweighting the strongest SNV-phenotype associations (see Liu et al [6] for details).

Few methods exist for the analysis of potential relationships between binary phenotypes and rare genetic variation; fewer still have considered how such methods will perform on very large sets of SNVs that may span multiple genes. Our analysis identified substantial inflation of the type I error rate using a standard approach, and so we implemented an empirical approach to evaluate the relative performance of different methods. Further research is necessary to explore robust test statistics and analytic strategies for large SNV sets in complex pedigrees across a wide variety of genetic architectures.