Background

Use of prior or additional information may improve the power of single-nucleotide polymorphism (SNP)-disease association analysis. In particular, genome-wide linkage scans can provide information complementary to genome-wide association studies (GWAS). The weighted false-discovery rate control (WFDR) method incorporates genome-wide linkage study results by converting the linkage scores into SNP-specific weights and then re-scaling the association p-value for each SNP [1]. The stratified FDR control (SFDR) method [2] prioritizes genomic regions according to the available linkage scores. WFDR requires the investigator to choose and assign weights to SNPs, whereas in SFDR the stratum-specific weights are derived internally from the choice of strata and the distribution of the data [3]. SFDR uses prior information to assign SNPs to strata that are more or less likely to include true-positive associations; like WFDR, it can improve the power of GWAS, but it is more robust than WFDR to uninformative or even misleading prior information [3]. We applied these two FDR methods, along with the original FDR method, to the North American Rheumatoid Arthritis Consortium (NARAC) study data provided for Genetic Analysis Workshop 16 (GAW 16), using previously reported linkage study results for rheumatoid arthritis (RA) [4]. We also performed genome-wide linkage and association analyses of the Framingham Heart Study (FHS) data and applied the three FDR methods. We compared the regions of association identified by the different methods.

Methods

Samples, phenotypes, and genotypes

RA data

The NARAC data provided for GAW 16 included a set of 868 cases and 1,194 controls with information on a binary outcome (RA affection status) and on 545,080 genome-wide SNP genotypes from the Illumina 550 k chip, as well as the physical locations of the SNPs. In our association analysis, RA affection status was defined as positive for anti-cyclic citrullinated peptide antibody (anti-CCP).

FHS data

SNP genotyping data were provided based on the GeneChip Human Mapping 500 k Array and 50 k Human Gene Focused Panel. We analyzed a combined sample of the Offspring Cohort (N1 = 2,584) and the Generation 3 Cohort (N2 = 3,811) for association with each of the blood lipid measures, high-density lipoprotein (HDL), low-density lipoprotein (LDL), and triglyceride (TG), and included all family members in the sample who had been genotyped and phenotyped. The mean of lipid measures over multiple exams was adjusted for age, sex, body mass index (BMI), alcohol intake, and cigarette smoking. The phenotype measures of treated people were imputed using the methods in Kathiresan et al. [5] and Levy et al. [6].

Quality control of SNP data

We excluded SNPs with Hardy-Weinberg equilibrium p-value ≤ 10⁻⁹ in controls, missing genotype rate > 5%, or minor allele frequency < 0.01. Samples were also filtered by individual missing call rate > 5%, duplication, and relatedness. Within autosomes, 490,915 SNPs remained in the RA data and 430,292 SNPs in the FHS data after applying this set of quality control criteria using the computer program PLINK [7].
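
The SNP-level exclusion criteria above can be illustrated with a minimal Python sketch. It is not the PLINK implementation used in the analysis; the record type `SnpQc`, its field names, and the example values are hypothetical and serve only to show how the three thresholds combine.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP summary statistics (hypothetical field names)."""
    snp_id: str
    hwe_p_controls: float  # Hardy-Weinberg equilibrium p-value in controls
    missing_rate: float    # fraction of missing genotypes for this SNP
    maf: float             # minor allele frequency


def passes_qc(s, hwe_min=1e-9, max_missing=0.05, min_maf=0.01):
    """Return True if the SNP survives all three exclusion criteria."""
    if s.hwe_p_controls <= hwe_min:   # excluded: HWE p-value <= 1e-9
        return False
    if s.missing_rate > max_missing:  # excluded: missing rate > 5%
        return False
    if s.maf < min_maf:               # excluded: MAF < 0.01
        return False
    return True


# Made-up example records: only the first one passes every filter.
snps = [
    SnpQc("snp_a", 0.40, 0.01, 0.20),
    SnpQc("snp_b", 1e-12, 0.01, 0.20),   # fails the HWE criterion
    SnpQc("snp_c", 0.40, 0.08, 0.20),    # fails the missingness criterion
    SnpQc("snp_d", 0.40, 0.01, 0.005),   # fails the MAF criterion
]
kept = [s.snp_id for s in snps if passes_qc(s)]
```

The sample-level filters (call rate, duplication, relatedness) are not covered by this sketch.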

Genome-wide association analysis

RA data

Each SNP was tested for association using the 1-df allelic chi-squared test assuming an additive genotype model, implemented in PLINK [7].
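
For illustration, a 1-df allelic chi-squared test can be sketched in Python as below. This is a sketch under stated assumptions, not the PLINK code: it assumes genotype counts are given as (AA, AB, BB) triples for cases and controls, applies no continuity correction, and requires both alleles to be observed so that the denominator is non-zero. The function name `allelic_chisq` is hypothetical.

```python
import math


def allelic_chisq(case_counts, control_counts):
    """1-df allelic chi-squared test from genotype counts (AA, AB, BB)."""
    def allele_counts(genotypes):
        aa, ab, bb = genotypes
        return 2 * aa + ab, 2 * bb + ab  # counts of allele A and allele B

    a_case, b_case = allele_counts(case_counts)
    a_ctrl, b_ctrl = allele_counts(control_counts)
    n = a_case + b_case + a_ctrl + b_ctrl

    # Pearson chi-squared statistic for the 2x2 allele-by-status table.
    num = n * (a_case * b_ctrl - b_case * a_ctrl) ** 2
    den = ((a_case + b_case) * (a_ctrl + b_ctrl)
           * (a_case + a_ctrl) * (b_case + b_ctrl))
    stat = num / den

    # Upper tail of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Identical allele frequencies in cases and controls give no evidence
# of association (statistic 0, p-value 1).
stat_null, p_null = allelic_chisq((10, 20, 10), (10, 20, 10))
```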

FHS data

SNP association was evaluated using adjusted residual mean values obtained from the generalized estimating equation model for familial correlation. We split families unconnected in the Offspring and Generation 3 Cohorts using the R package "kinship" [8]. Generalized estimating equation fitting was performed using the SAS GENMOD procedure, assuming an exchangeable working correlation matrix.

Genome-wide linkage analysis

RA data

Results of the NARAC linkage study of RA using 642 Caucasian families and high-density SNP genotyping, as reported in Amos et al. [4], were used as the available prior linkage information, based on RA status (anti-CCP positivity) as the phenotype. The linkage scores at SNP markers across the genome, publicly available as supplementary information, were the nonparametric linkage (NPL) scores computed by Amos et al. [4] using SNP genotypes from which linkage disequilibrium (LD) had been eliminated. For the chromosomes with large centromeres (1, 3, 9, 11, 16, and 19), they assumed zero recombination across the centromeric regions.

FHS data

We performed a genome-wide linkage scan for each of the three lipid phenotype values (i.e., covariate-adjusted residuals), using 8,545 individuals from 1,349 FHS families (3,928 founders, 4,617 non-founders; 4,363 females and 4,182 males; family size ranging from 3 to 19). We selected 5,102 SNPs for the linkage scan according to the following criteria: minor allele frequency (MAF) > 0.45, Hardy-Weinberg equilibrium (HWE) test p-value in founders > 0.05, individual genotype missing rate < 5%, SNP missing rate < 2%, Mendelian error rate < 5%, and LD measure r² < 0.05 between SNPs. Genome-wide linkage scans were performed using the regression method of MERLIN-REGRESS (version 1.1.2) [9, 10]: the identical-by-descent (IBD) allele-sharing status for all relative pairs was regressed on the squared differences and squared sums of the pairs' trait values. This method requires specification of the population trait mean, variance, and heritability. We therefore estimated the heritability using the variance-components (VC) option in MERLIN. Because no genetic map was available, we interpolated genetic positions for the 5,102 selected SNPs from the deCODE map on the Affymetrix, Inc. website.

SNP-specific linkage scores

The linkage score corresponding to each of the ~550 k GWA SNPs, Z_i, i = 1, ..., M, was interpolated from the linkage scores of the available neighboring markers according to the relative distance between markers.
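
One way to carry out such an interpolation is sketched below in Python. The sketch assumes simple linear interpolation between the two flanking linkage markers on the same chromosome, and it assigns the score of the nearest end marker to SNPs lying outside the marker range; both choices are assumptions made for illustration, and the function name `interpolate_score` is hypothetical.

```python
from bisect import bisect_right


def interpolate_score(pos, marker_pos, marker_score):
    """Linearly interpolate a linkage score at `pos` from flanking markers.

    `marker_pos` must be sorted in ascending order (one chromosome).
    Positions outside the marker range take the score of the nearest
    end marker (an assumption of this sketch).
    """
    if pos <= marker_pos[0]:
        return marker_score[0]
    if pos >= marker_pos[-1]:
        return marker_score[-1]
    j = bisect_right(marker_pos, pos)   # first marker strictly right of pos
    x0, x1 = marker_pos[j - 1], marker_pos[j]
    y0, y1 = marker_score[j - 1], marker_score[j]
    frac = (pos - x0) / (x1 - x0)       # relative distance from left marker
    return y0 + frac * (y1 - y0)


# Made-up linkage markers at positions 10, 20, 40 with scores 1.0, 3.0, 2.0.
markers = [10.0, 20.0, 40.0]
scores = [1.0, 3.0, 2.0]
z_mid = interpolate_score(15.0, markers, scores)   # halfway between 1.0 and 3.0
```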

False-discovery rate control methods

False-discovery rate (FDR) control was performed by computing q-values using the method suggested by Storey [11]. If the q-value of a single SNP analysis was less than the chosen FDR control threshold value, the hypothesis of no association between the SNP and the disease was rejected.
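
The q-value computation can be illustrated with a simplified Python sketch. It estimates the proportion of true null hypotheses at a single fixed tuning value (`lam = 0.5`), which is a simplification of the smoothing-based estimate available in Storey's approach [11]; the function name `storey_qvalues`, the default `lam`, and the lower guard on the estimate are choices made for this sketch only.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Simplified q-values with pi0 estimated at a single fixed lambda."""
    m = len(pvalues)
    # Estimated proportion of true null hypotheses.
    pi0 = sum(p > lam for p in pvalues) / (m * (1.0 - lam))
    # Keep the estimate in (0, 1]; the lower guard is specific to this sketch.
    pi0 = min(1.0, max(pi0, 1.0 / m))

    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


# Made-up p-values; SNPs with q-value below the threshold are rejected.
pvals = [0.001, 0.01, 0.02, 0.6, 0.9]
qvals = storey_qvalues(pvals)
rejected = [i for i, q in enumerate(qvals) if q < 0.05]
```

In this toy example the first three hypotheses are rejected at an FDR threshold of 0.05.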

Stratified FDR

Based on the linkage scan results, high- and low-linkage regions were determined using an NPL threshold value of C = 1.64 for the NARAC RA data and a LOD threshold value of C = 0.5 for the FHS data. SNPs that fell into a high-linkage region were grouped as Stratum 1 (Z_i ≥ C) and SNPs that fell into a low-linkage region were grouped as Stratum 2 (Z_i < C). FDR control was applied separately to the SNPs in Stratum 1 and Stratum 2 [11].
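
The stratification step can be sketched in Python as follows. To keep the example self-contained, the helper `bh_qvalues` fixes the proportion of true nulls at 1, which is a simplification of the q-value method cited above; the function names and example values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Step-up q-values with pi0 fixed at 1 (a simplification)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m * pvalues[i] / rank)
        q[i] = running_min
    return q


def stratified_qvalues(pvalues, linkage_scores, threshold):
    """q-values computed separately in Stratum 1 (Z_i >= C) and Stratum 2 (Z_i < C)."""
    q = [None] * len(pvalues)
    high = [i for i, z in enumerate(linkage_scores) if z >= threshold]
    low = [i for i, z in enumerate(linkage_scores) if z < threshold]
    for members in (high, low):
        if not members:
            continue  # an empty stratum contributes nothing
        stratum_q = bh_qvalues([pvalues[i] for i in members])
        for i, qi in zip(members, stratum_q):
            q[i] = qi
    return q


# Made-up data: the first two SNPs lie in a high-linkage region (C = 1.64).
pvals = [0.001, 0.04, 0.001, 0.04, 0.5, 0.9]
z = [2.0, 2.0, 0.1, 0.3, 0.0, 1.0]
q_sfdr = stratified_qvalues(pvals, z, threshold=1.64)
```

In this toy example, the SNP with p = 0.04 receives a q-value of 0.04 in the small Stratum 1 but 0.08 in the larger Stratum 2, which shows how the stratum determines the effective multiple-testing burden.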

Weighted FDR

The weight of each SNP was obtained as w_i = exp(B·Z_i)/v, where v = (1/M) Σ exp(B·Z_i), summed over i = 1, ..., M, normalizes the weights to average 1, and B = 1 (exponential weighting [1]). FDR control was applied to the weighted p-values, p_i/w_i, and the corresponding q-values were computed.
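
A minimal Python sketch of the exponential weighting is given below. It assumes that v is the mean of exp(B·Z_i) over all SNPs, so that the weights average to 1, following the exponential weighting scheme of [1]. The q-value helper fixes the proportion of true nulls at 1 for brevity, and all function names are hypothetical.

```python
import math


def exponential_weights(linkage_scores, b=1.0):
    """Weights w_i = exp(B * Z_i) / v, with v the mean of exp(B * Z_i)."""
    raw = [math.exp(b * z) for z in linkage_scores]
    v = sum(raw) / len(raw)   # normalizing constant (assumed to be the mean)
    return [r / v for r in raw]


def weighted_pvalues(pvalues, weights):
    """Weighted p-values p_i / w_i (these may exceed 1 for small weights)."""
    return [p / w for p, w in zip(pvalues, weights)]


def bh_qvalues(pvalues):
    """Step-up q-values with pi0 fixed at 1 (a simplification)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m * pvalues[i] / rank)
        q[i] = running_min
    return q


# Made-up data: equal p-values, increasing linkage scores.
z = [0.0, 1.0, 2.0]
w = exponential_weights(z, b=1.0)
q_wfdr = bh_qvalues(weighted_pvalues([0.01, 0.01, 0.01], w))
```

With equal p-values, the SNP with the highest linkage score receives the largest weight and therefore the smallest weighted p-value.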

Results

Results of the RA data analysis

The SNPs on chromosome 6 (MHC region) showed very strong association, with p-values less than 10⁻¹⁰⁰, and also very high linkage scores (NPL > 16). To focus on results outside regions of established importance, Table 1 excludes chromosome 6 SNPs and presents results for SNPs with ranks ≤ 10 based on any of the FDR methods or SNPs from genes previously reported to be associated with RA [12–14]. Most of the latter were ranked higher than other SNPs in the genome. For some SNPs, mostly in the stronger linkage regions, either SFDR or WFDR improved the rank more than the other. For example, the original rank of rs1018361 in CTLA4 was 285 using FDR, which changed to 28 and 96 using SFDR and WFDR, respectively. In some weak linkage regions (TRAF1, WDFY4), WFDR retained ranks similar to those of FDR, whereas the SFDR ranks increased. However, q-values for WFDR were generally much higher than those of FDR and SFDR (results not shown). These analyses suggest several new associations with RA (e.g., CNTNAP2 on chromosome 7).

Table 1 Results of FDR, SFDR, and WFDR analyses of selected SNPs for the RA phenotype from the NARAC study

Results of the FHS data analysis

Table 2 presents the results of SNPs selected with ranks ≤ 10 based on any of the three FDR methods or those most significant among genes previously reported for association with TG [5, 15]. All SNPs with rank ≤ 10 resided in the previously reported genes. SNPs in stronger linkage regions showed improvement in rank using SFDR and WFDR (linkage scores in bold). However, some of the gene regions previously reported for TG, HDL, or LDL were not confirmed in the FHS samples.

Table 2 Results of FDR, SFDR, and WFDR analysis of selected SNPs for the TG phenotype from the FHS study

Discussion

The SFDR and WFDR methods can improve the power of genome-wide association analyses when linkage scans are informative [1, 3]. In the RA and FHS studies, using SFDR and WFDR improved the ranks of SNPs in new and previously reported regions, suggesting improved power.

The threshold value separating Strata 1 and 2 in SFDR was somewhat arbitrarily set at 1.64 for the RA data (where the linkage results were NPL scores) and 0.5 for the FHS data (where the linkage results were LOD scores) to keep the proportion of SNPs in Stratum 1 at about 5% when the power of linkage scans is relatively low. The effect of small differences in threshold values for SFDR was insignificant on average in a simulation study [3]. In the RA and FHS data sets, the choice of different thresholds did produce some differences in ranking values, but the effect was minimal.

The power under FDR control depends on the proportion of true alternative hypotheses (true causal SNPs) in a family of tests. By preserving most of the potentially true causal SNPs in a small-sized Stratum 1, we can improve study power [3]. However, how to choose this threshold to optimize power remains an open question. A similar question applies to the choice of the weighting scheme (i.e., the value of the B parameter) for the WFDR method.

SFDR and WFDR control methods using previous linkage results can be extended to multi-marker analysis with fixed or sliding windows, for example, by using the average linkage score within a window as a measure of prior information. Further study is warranted to evaluate extensions to multi-marker analysis settings.
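
As a rough illustration of the window-based prior mentioned above, the following Python sketch averages SNP-specific linkage scores within fixed or sliding windows. It assumes that windows are defined by a number of consecutive SNPs rather than by physical distance, and the function name `window_means` is hypothetical.

```python
def window_means(scores, window, step):
    """Mean linkage score per window of `window` consecutive SNPs.

    step == window gives fixed (non-overlapping) windows;
    step == 1 gives sliding windows. Trailing SNPs that do not
    fill a complete window are ignored in this sketch.
    """
    means = []
    for start in range(0, len(scores) - window + 1, step):
        chunk = scores[start:start + window]
        means.append(sum(chunk) / window)
    return means


# Made-up linkage scores for six consecutive SNPs.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fixed = window_means(z, window=3, step=3)
sliding = window_means(z, window=3, step=1)
```

The resulting window-level scores could then stand in for the SNP-level Z_i when forming strata or weights.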