Background

Genome-wide association studies (GWAS) have provided a powerful tool for identifying disease susceptibility genes. However, analysis of GWAS data has been focused on single-point tests, such as the traditional allele-based chi-squared test or the Cochran-Armitage Trend test [1], which proceed by testing each single nucleotide polymorphism (SNP) independently. As it is likely that the disease variants have not been directly genotyped in a GWAS, tests that account for multiple flanking SNPs in linkage disequilibrium (LD) with the disease variants may increase the power to detect association [2].

Several approaches have been proposed to test for association based on multiple markers, including the haplotype-based approach [3-5] and the multivariate approach [6, 7]. Akey et al. [8] used analytical approaches to demonstrate that multilocus haplotype tests can be more powerful than single-marker tests. For the multivariate approach, tests such as Hotelling's T² test are often used to account for multiple markers jointly [6, 9]. Although statistical power can be increased by such multi-marker approaches, selecting markers for testing is not straightforward. Including all markers in a gene or region may not be feasible because it greatly increases the degrees of freedom of the test, which can reduce power.

Follow-up studies, such as fine mapping and sequencing, are necessary to validate association signals, but they are also challenging [2]. Prioritization of genes or regions for follow-up studies is often decided by a P-value threshold or by the ranking of significant markers [10, 11]. However, many false positives can still exist among the markers classified as significant for follow-up as a result of statistical noise and genome-wide multiple testing. Joint and/or meta-analysis of GWAS data can achieve greater power if these data or P-values are available from different datasets. If P-values from individual and joint analyses are available, it is possible to further increase the power by assigning more weight to markers with replicated association signals in several datasets or to markers that have flanking markers with an association signal.

We propose the use of the GWAS noise reduction (GWAS-NR) approach which uses P-values from individual analyses, as well as joint analysis of multiple datasets, and which accounts for association signals from surrounding markers in LD. GWAS-NR is a novel approach to extending the power of GWAS studies to detect association. Noise reduction is achieved by applying a linear filter within a sliding window in order to identify genomic regions demonstrating correlated profiles of association across multiple datasets. As noise reduction (NR) techniques are widely used to boost signal identification in applications such as speech recognition, data transmission and image enhancement, we expect that GWAS-NR may complement other GWAS analysis methods in identifying candidate loci that may then be prioritized for follow-up analysis or analysed in the context of biological pathways.

Enhancing statistical power is particularly important in the study of complex diseases such as autism. There is overwhelming evidence from twin and family studies for a strong genetic component to autism, with estimates of heritability greater than 80% [12-14]. Autism is generally diagnosed before the age of 4, based on marked qualitative differences in social and communication skills, often accompanied by unusual patterns of behaviour (for example, repetitive, restricted, stereotyped) [15]. Altered sensitivity to sensory stimuli and difficulties of motor initiation and coordination also are frequently present. Identifying the underlying genes and characterizing the molecular mechanisms of autism will provide immensely useful guidance in the development of effective clinical interventions.

Numerous autism candidate genes have been reported based on association evidence, expression analysis, copy number variation (CNV), and cytogenetic screening. These genes involve processes including cell adhesion (NLGN3, NLGN4 [16], NRXN1 [17], CDH9/CDH10 [18, 19]), axon guidance (SEMA5A [20]), synaptic scaffolding (SHANK2, DLGAP2 [21], SHANK3 [22]), phosphatidylinositol signalling (PTEN [23], PIK3CG [24]), cytoskeletal regulation (TSC1/TSC2 [24, 25], EPAC2/RAPGEF4 [26], SYNGAP1 [21]), transcriptional regulation (MECP2 [27], EN2 [28]) and excitatory/inhibitory balance (GRIN2A [29], GABRA4, GABRB1 [30]). However, aside from rare mutations and 'syndromic' autism secondary to known genetic disorders, the identification of specific genetic mechanisms in autism has remained elusive.

Over the past decade, the vast majority of genetic studies of autism (both linkage and focused candidate gene studies) have failed to broadly replicate suspected genetic variations. For this reason, the assumption that autism is governed by strong and pervasive genetic variations has given way to the view that autism may involve numerous genetic variants, each having a small effect size at the population level. This may arise from common variations having small individual effects in a large number of individuals (the common disease-common variant [CDCV] hypothesis) or rare variations having large individual effects in smaller subsets of individuals (the rare variant [RV] hypothesis).

Given the potential genetic heterogeneity among individuals with autism and the likely involvement of numerous genes of small effect at the population level, we expected that the GWAS-NR could improve the power to identify candidate genes for follow-up analysis. We applied GWAS-NR to autism GWAS data from multiple sources and conducted simulation studies in order to compare the performance of GWAS-NR with traditional joint and meta-analysis approaches. These data demonstrate that GWAS-NR is a useful tool for prioritizing regions for follow-up studies such as next-generation sequencing.

Methods

GWAS-NR

The GWAS-NR algorithm produces a set of weighted P-values for use in prioritizing genomic regions for follow-up study. Roeder and Wasserman [31] characterize the statistical properties of such weighting approaches in GWAS, observing that informative weights can improve power substantially, while the loss in power is usually small even if the weights are uninformative. The GWAS-NR algorithm computes a weight at each locus based on the strength and correlation of association signals at surrounding markers and in multiple datasets, without relying on prior information or scientific hypotheses. The weights are applied to the P-values derived from joint analysis of the complete data and the resulting weighted P-values are then used to prioritize regions for follow-up analysis.

Noise reduction methods are frequently applied when extracting a common signal from multiple sensors. The filter used by GWAS-NR is similar to the method proposed by de Cheveigné and Simon [32] for sensor noise suppression in magneto- and electro-encephalograph recordings. Each sensor is projected onto the other sensors and the fitted values from these regressions are used in place of the original values. The fitted values of such regressions retain sources of interest that are common to multiple sensors. As the regression residuals are orthogonal to the fitted values, uncorrelated components are suppressed.

In a genomic context, the 'sensors' take the form of probit-transformed P-values derived from independent datasets, as well as P-values derived from joint analysis of the full dataset. The filter inherently highlights cross-validating associations, by preserving signals that jointly occur in a given genomic region and attenuating spikes that are not correlated across subsets of the data. However, GWAS-NR can achieve no advantage over simple joint analysis when an association signal is restricted to a single marker and flanking markers provide no supplementary information.

We estimate ordinary least-squares regressions of the form

$$Z_{ij} = \alpha_{jk} + \beta_{jk} Z_{ik} + v_{jk}$$

and compute projections

$$\hat{Z}_{ij} = \alpha_{jk} + \beta_{jk} Z_{ik}$$

where $Z_{ij}$ and $Z_{ik}$ are the probits $\Phi^{-1}(1 - p)$ of the P-values at locus $i$ in two datasets $j$ and $k$, and $\Phi^{-1}(\cdot)$ denotes the inverse of the cumulative standard normal distribution. The estimates are computed within a centred sliding window of $w$ markers, and $\beta_{jk}$ is constrained to be nonnegative, which sets $\hat{Z}_{ij}$ to the mean $\bar{Z}_{ij}$ in regions having zero or negative correlation across sensors. As $\beta_{jk}$ is driven by the covariance between probits in datasets $j$ and $k$, probits that demonstrate positive local correlation will tend to be preserved, while probits demonstrating weak local correlation will be attenuated. One local regression is computed for each locus and is used to compute a single fitted value $\hat{Z}_{ij}$ for that locus. The same method is used to compute projections $\hat{Z}_{ik}$.
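As an illustration only, the local projection described above could be sketched in Python as follows. This is not the authors' MATLAB script: the function names `probit` and `filtered_probits`, the clipping constant `eps`, and the truncation of the window at the ends of the series are assumptions made for the sketch. For a simple regression with an intercept, the nonnegativity constraint can be imposed by clipping the ordinary least-squares slope at zero, in which case the fitted value reduces to the local mean of the response.

```python
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()


def probit(p, eps=1e-12):
    """Probit transform Phi^{-1}(1 - p); clipping with eps is an assumption
    made here to avoid infinite values at p = 0 or p = 1."""
    p = min(max(p, eps), 1.0 - eps)
    return _STD_NORMAL.inv_cdf(1.0 - p)


def filtered_probits(z_j, z_k, w):
    """Project z_j onto z_k, one local regression per locus, within a centred
    window of w markers and with the slope constrained to be nonnegative."""
    n = len(z_j)
    half = w // 2
    fitted = []
    for i in range(n):
        # Window is truncated at the ends of the series (assumption).
        lo, hi = max(0, i - half), min(n, i + half + 1)
        y, x = z_j[lo:hi], z_k[lo:hi]
        y_bar, x_bar = mean(y), mean(x)
        s_xx = sum((a - x_bar) ** 2 for a in x)
        s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
        # Clipping the OLS slope at zero gives the constrained solution.
        beta = max(0.0, s_xy / s_xx) if s_xx > 0 else 0.0
        alpha = y_bar - beta * x_bar
        # With beta == 0 the fitted value equals the local mean of z_j.
        fitted.append(alpha + beta * z_k[i])
    return fitted
```

Calling `filtered_probits(z_k, z_j, w)` with the arguments swapped would give the projections in the other direction.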

In order to capture association signals at adjacent loci in different datasets without estimating numerous parameters, the regressor at each locus is taken to be the probit of the lowest P-value among that locus and its two immediate neighbours. Quality control (QC) failure or different genotyping platforms can cause SNP genotypes to be missing in different datasets. Missing genotypes for a locus having no immediately flanking neighbours are assigned a probit of zero. The window width w is calculated as w = 2h + 1, where h is the lag at which the autocorrelation of the probits declines below a pre-defined threshold. In practice, we estimate the autocorrelation profile for each series of probits and use the average value of h with an autocorrelation threshold of 0.20.
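The two preprocessing rules in this paragraph can be sketched as below, under stated assumptions. The names `neighbour_min_p`, `autocorrelation` and `window_width` are hypothetical; missing P-values are represented by `None`; a locus with no usable value is given P = 0.5, whose probit is zero; and the average lag is rounded to the nearest integer, a detail the text does not specify.

```python
from statistics import mean


def neighbour_min_p(p_values):
    """Lowest P-value among each locus and its two immediate neighbours.
    Missing values are None; if nothing is available, return 0.5 (probit 0)."""
    out = []
    for i in range(len(p_values)):
        window = [p for p in p_values[max(0, i - 1):i + 2] if p is not None]
        out.append(min(window) if window else 0.5)
    return out


def autocorrelation(z, lag):
    """Sample autocorrelation of the series z at the given lag."""
    z_bar = mean(z)
    denom = sum((v - z_bar) ** 2 for v in z)
    if denom == 0:
        return 0.0
    num = sum((z[t] - z_bar) * (z[t + lag] - z_bar) for t in range(len(z) - lag))
    return num / denom


def window_width(series_list, threshold=0.20):
    """w = 2h + 1, where h is the first lag at which the autocorrelation falls
    below the threshold, averaged over all series of probits."""
    lags = []
    for z in series_list:
        h = 1
        while h < len(z) - 1 and autocorrelation(z, h) >= threshold:
            h += 1
        lags.append(h)
    h_avg = round(mean(lags))  # rounding rule is an assumption
    return 2 * h_avg + 1
```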

After computing the projections of Z j and Z k , the resulting values are converted back to P-values and a set of filtered P-values is computed from these projections using Fisher's method. The same algorithm is executed again, this time using the probits of the filtered P-values and the P-values obtained from the joint association analysis of the complete data. The resulting Fisher P-values are then treated as weighting factors and are multiplied by the corresponding raw P-values from the joint analysis, producing a set of weighted P-values. To aid interpretation, we apply a monotonic transformation to these weighted P-values, placing them between 0 and 1 by fitting parameters of an extreme value distribution. The GWAS-NR algorithm was implemented as a MATLAB script.
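The combination and weighting steps can be illustrated with the following minimal sketch. The function names are hypothetical, input P-values are assumed to lie in (0, 1], and the final extreme-value rescaling is deliberately omitted because its parameters are not given here. Fisher's statistic for k P-values follows a chi-squared distribution with 2k degrees of freedom under the null, and for even degrees of freedom the tail probability has a closed form, so no external library is needed.

```python
import math
from statistics import NormalDist


def probit_to_p(z):
    """Invert the probit transform z = Phi^{-1}(1 - p)."""
    return 1.0 - NormalDist().cdf(z)


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-squared with 2k degrees of
    freedom under the null. Uses the closed-form tail for even df."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


def weighted_p_values(fisher_p, joint_p):
    """Multiply each raw joint P-value by its Fisher-derived weight."""
    return [f * j for f, j in zip(fisher_p, joint_p)]
```

For example, `fisher_combined_p([probit_to_p(zj), probit_to_p(zk)])` would combine the two filtered values at a single locus.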

Simulations

Although noise reduction has been shown to be useful in other biomedical applications [32], understanding its properties for identifying the true positives in disease association studies is also important. We used computer simulations to compare the performance of GWAS-NR with the joint association in the presence of linkage (APL) analysis and Fisher's method under a variety of disease models. We used genomeSIMLA [33] to simulate LD structures based on the Affymetrix 5.0 chip and performed the sliding-window haplotype APL [34] test to measure association. Detailed descriptions for the simulation settings are provided in Additional File 1 and detailed haplotype configurations can be found in Additional File 2.

An important goal for the proposed approach is to help prioritize candidate regions for follow-up studies such as next-generation sequencing. Top regions or genes ranked by their P-values are often considered priority regions for follow-up studies. In order to investigate the proportion of true positives that occur in the top regions, we treated the association tests as binary classifiers. The markers were ranked by their P-values and markers that occurred in the top k ranking were classified as significant, where k was pre-specified as a cut-off threshold. The markers that were not in the top k ranking were classified as non-significant. We then compared the sensitivity and specificity of GWAS-NR with the joint and Fisher's tests. The sensitivity was calculated based on the proportion of the three markers associated with the disease that were correctly classified as significant. The specificity was calculated based on the proportion of markers not associated with the disease that were correctly classified as non-significant. The sensitivity and specificity were averaged over 1000 replicates.
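For a single simulated replicate, the top-k classification described above could be computed as in this sketch. The function name `top_k_classification` and the representation of disease-associated markers as a set of indices are assumptions; ties in P-values are broken by marker order.

```python
def top_k_classification(p_values, causal, k):
    """Classify the k best-ranked markers as significant and return
    (sensitivity, specificity) relative to the known causal markers."""
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    significant = set(ranked[:k])
    causal = set(causal)
    true_pos = len(significant & causal)
    false_pos = len(significant) - true_pos
    n_null = len(p_values) - len(causal)
    sensitivity = true_pos / len(causal)
    specificity = (n_null - false_pos) / n_null
    return sensitivity, specificity
```

Averaging the two returned values over replicates would give the quantities reported in the simulations.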

Ascertainment and sample description

We ascertained autism patients and their affected and unaffected family members through the Hussman Institute for Human Genomics (HIHG, University of Miami Miller School of Medicine, FL, USA), and the Vanderbilt Center for Human Genetics Research (CHGR, Vanderbilt University Medical Center, TN, USA; UM/VU). Participating families were enrolled through a multi-site study of autism genetics and recruited via support groups, advertisements and clinical and educational settings. All participants and families were ascertained using a standard protocol, which was approved by the appropriate Institutional Review Boards. Written informed consent was obtained from parents, as well as from minors who were able to give informed consent; for individuals unable to give consent due to age or developmental problems, assent was obtained whenever possible.

The core inclusion criteria were as follows: (1) chronological age between 3 and 21 years of age; (2) presumptive clinical diagnosis of autism; and (3) expert clinical determination of autism diagnosis using Diagnostic and Statistical Manual of Mental Disorders (DSM)-IV criteria supported by the Autism Diagnostic Interview-Revised (ADI-R) in the majority of cases and all available clinical information. The ADI-R is a semi-structured diagnostic interview which provides diagnostic algorithms for classification of autism [35]. All ADI-R interviews were conducted by formally trained interviewers who have achieved reliability according to established methods. Thirty-eight individuals did not have an ADI-R and, for those cases, we implemented a best-estimate procedure to determine a final diagnosis using all available information from the research record and data from other assessment procedures. This information was reviewed by a clinical panel led by an experienced clinical psychologist and included two other psychologists and a paediatric medical geneticist - all of whom were experienced in autism. Following a review of case material, the panel discussed the case until a consensus diagnosis was obtained. Only those cases in which a consensus diagnosis of autism was reached were included. (4) The final criterion was a minimal developmental level of 18 months as determined by the Vineland Adaptive Behavior Scale (VABS) [36] or the VABS-II [37] or intelligence quotient equivalent >35. These minimal developmental levels assure that ADI-R results are valid and reduce the likelihood of including individuals with severe mental retardation only. We excluded participants with severe sensory problems (for example, visual impairment or hearing loss), significant motor impairments (for example, failure to sit by 12 months or walk by 24 months) or identified metabolic, genetic or progressive neurological disorders.

A total of 597 Caucasian families (707 individuals with autism) were genotyped at HIHG. This dataset consisted of 99 multiplex families (more than one affected individual) and 498 singleton (parent-child trio) families. A subset of these data had been previously reported [19]. In addition, GWAS data were obtained from the Autism Genetic Resource Exchange (AGRE) [35] as an additional dataset for analysis. The full AGRE dataset is publicly available and contains families with the full spectrum of autism spectrum disorders. From AGRE, we selected only families with one or more individuals diagnosed with autism (using DSM-IV and ADI-R); affected individuals with non-autism diagnosis within these families were excluded from the analysis. This resulted in a dataset of 696 multiplex families (1240 individuals with autism) from AGRE [35].

Genotyping, quality control and population stratification

We extracted DNA for individuals from whole blood using Puregene chemistry (QIAGEN, MD, USA). We performed genotyping using the Illumina Beadstation and the Illumina Infinium Human 1M BeadChip following the recommended protocol, but with a more stringent GenCall score threshold of 0.25. Genotyping efficiency was greater than 99%, and quality assurance was achieved by the inclusion of one CEPH control per 96-well plate that was genotyped multiple times. Technicians were blinded to affection status and quality-control samples. The AGRE data were genotyped using the Illumina HumanHap550 BeadChip with over 550,000 SNP markers. All samples and SNPs underwent stringent GWAS quality control measures as previously described in detail in Ma et al. [19].

Although population substructure does not cause a type I error in family-based association tests, multiple founder effects could result in a reduced power to detect an association in a heterogeneous disease such as autism. Thus, we conducted EIGENSTRAT [38] analysis on all parents from analysed families for evidence of population substructure using the overlapping SNPs genotyped in both the UM/VU and AGRE datasets. In order to ensure the most homogeneous groups for association screening and replication, we excluded all families with outliers, defined by EIGENSTRAT [38] as lying more than four standard deviations from the mean on principal component 1 or 2.
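The exclusion rule can be illustrated with a short sketch that assumes the principal component scores for each parent have already been computed. It performs a single pass with a population standard deviation, which may differ from the outlier-removal procedure that EIGENSTRAT itself applies; the function name and argument layout are hypothetical.

```python
from statistics import mean, pstdev


def families_with_outliers(pc1, pc2, family_ids, n_sd=4.0):
    """Return the family IDs having any parent more than n_sd standard
    deviations from the mean on principal component 1 or 2."""
    flagged = set()
    for scores in (pc1, pc2):
        centre, spread = mean(scores), pstdev(scores)
        if spread == 0:
            continue  # no variation on this component, nothing to flag
        for value, family in zip(scores, family_ids):
            if abs(value - centre) > n_sd * spread:
                flagged.add(family)
    return flagged
```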

Haplotype block definition

We used haplotype blocks to define regions of interest. Significant regions can be used for follow-up analysis such as next-generation sequencing. We applied the haplotype block definition method proposed by Gabriel et al. [39] to the UM/VU dataset. We performed GWAS-NR based on single-marker APL P-values from UM/VU, AGRE and joint tests. We also performed GWAS-NR on P-values obtained from sliding-window haplotype tests with a haplotype length of three markers for the UM/VU, AGRE and joint datasets. Since the true haplotype length is not known, we chose a fixed length of three markers across the genome and used GWAS-NR to sort out true signals from the P-values. Blocks containing the top 5000 markers, as ranked by the minimum values (MIN_NR) of the GWAS-NR P-values obtained from single-marker tests, and the GWAS-NR P-values obtained from tests of three-marker haplotypes, were selected for further analysis.

Combined P-values for haplotype block scoring

In order to test for the significance of the haplotype blocks, we calculated the combined P-value for each block using a modified version of the Truncated Product Method (TPM) [40]. TPM has been shown to have correct type I error rates and more power than other methods combining P-values [40] under different simulation models. Briefly, a combined score was calculated from the markers in each block, based on the product of MIN_NR that were below a threshold of 0.05. We used the Monte Carlo algorithm [40] with a slight modification to test the significance of the combined score. Specifically, a correlation matrix was applied to account for correlation among P-values for the markers in the same block. The null hypothesis is that none of the markers in the haplotype block are associated with the disease. In order to simulate the null distribution for the combined score, we generated two correlated sets of L uniform numbers based on the correlation of 0.67 for CAPL and HAPL P-values, where L denotes the number of tests in the block. The minimum values were selected from each pair in the two sets, which resulted in a vector of L minimum values. Then the correlation matrix was applied to the vector of L minimum values and a null combined GWAS-NR score was calculated for the haplotype block.
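A simplified Monte Carlo version of this block-scoring procedure is sketched below. It is an illustration under several assumptions rather than the modified TPM itself: the pairs of correlated uniforms are generated with a Gaussian copula (so the correlation of 0.67 is imposed on the underlying normal scores), the within-block correlation matrix among markers is not applied, and the function names, `n_sim` and `seed` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def truncated_product_score(p_values, tau=0.05):
    """Log of the product of the P-values at or below tau (0.0 if none)."""
    return sum(math.log(p) for p in p_values if p <= tau)


def tpm_monte_carlo_p(p_values, tau=0.05, rho=0.67, n_sim=10000, seed=0):
    """Monte Carlo P-value for the truncated product of a block of L tests.
    Each null value is the minimum of a pair of correlated uniforms."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    observed = truncated_product_score(p_values, tau)
    n_tests = len(p_values)
    hits = 0
    for _ in range(n_sim):
        null = []
        for _ in range(n_tests):
            z1 = rng.gauss(0.0, 1.0)
            z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            null.append(min(phi(z1), phi(z2)))
        # A smaller (more negative) log product is more extreme.
        if truncated_product_score(null, tau) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

The `(hits + 1) / (n_sim + 1)` form avoids reporting a Monte Carlo P-value of exactly zero.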

Functional analysis

In order to investigate functional relationships among genes in the candidate set, each candidate was manually annotated and cross-referenced, based on a review of current literature, with attention to common functions, directly interacting proteins and binding domains. Supplementary functional annotations were obtained using DAVID (The Database for Annotation, Visualization and Integrated Discovery) version 6.7 [41-43].

Results

Simulations

We present the simulation results for the three-marker haplotype disease models in Figures 1 and 2. Figure 1 presents receiver operating characteristic (ROC) curves to show the sensitivity and specificity of GWAS-NR, the joint APL analysis and Fisher's tests, based on varying cut-off values of ranking for significance. The Fisher's test to combine P-values was used here as a standard meta-analysis approach. The performance of a classification model can be judged based on the area under the ROC curve (AUC). For scenario 1 (identical marker coverage in each dataset), GWAS-NR produced a greater AUC than the joint and Fisher's tests. It can also be observed from the figure that, given the same specificity, GWAS-NR achieved a higher sensitivity for classifying true positives as significant than the joint and Fisher's tests.
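As a sketch of how an AUC of this kind can be obtained from ranked P-values, the following function traces the ROC curve by lowering the rank cut-off one marker at a time and integrates it with the trapezoidal rule. The function name is hypothetical and ties are broken by marker order, which is an assumption.

```python
def roc_auc_by_rank(p_values, causal):
    """Area under the ROC curve obtained by sweeping the rank cut-off k."""
    causal = set(causal)
    n_pos = len(causal)
    n_neg = len(p_values) - n_pos
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]  # (1 - specificity, sensitivity)
    for i in ranked:
        if i in causal:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    # Trapezoidal integration over consecutive ROC points.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```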

Figure 1

Comparative classification rates for genome-wide association studies noise reduction (GWAS-NR), joint analysis and Fisher's test. GWAS-NR has area under the curve (AUC) of 0.703 and the joint and Fisher's tests have AUC of 0.64 and 0.615, respectively, for the recessive model. Also GWAS-NR has AUC of 0.899 and the joint and Fisher's tests have AUC of 0.795 and 0.777, respectively, for the multiplicative model. For the dominant model, AUC for GWAS-NR, the joint and Fisher's tests are 0.981, 0.880 and 0.867, respectively. For the additive model, AUC for GWAS-NR, the joint and Fisher's tests are 0.932, 0.822, and 0.807, respectively.

As independent datasets may have an imperfect overlap of markers, which is true of the UM/VU and AGRE autism data, and the omission of the closest disease-related polymorphism from the data can have substantial negative impact on the power of GWAS [44], we also compared the performance of GWAS-NR with the joint APL tests and Fisher's tests under a range of missing marker scenarios: 20% of the simulated markers in one dataset were randomly omitted for the recessive and multiplicative models and 50% of the simulated markers were randomly omitted in one dataset for the dominant and additive models. This performance is shown in Figure 2. Again, the GWAS-NR produced a greater AUC than the joint and Fisher's tests and achieved a higher sensitivity for classifying true positives at each level of specificity.

Figure 2

Comparative classification rates for genome-wide association studies noise reduction (GWAS-NR), joint analysis and Fisher's test with 20% and 50% missing markers. GWAS-NR has area under the curve (AUC) of 0.689 and the joint and Fisher's tests have AUC of 0.622 and 0.598, respectively, for the recessive model. Also GWAS-NR has AUC of 0.883 and the joint and Fisher's tests have AUC of 0.776 and 0.760, respectively, for the multiplicative model. For the dominant model, AUC for GWAS-NR, the joint and Fisher's tests are 0.961, 0.852 and 0.844, respectively. For the additive model, AUC for GWAS-NR, the joint and Fisher's tests are 0.895, 0.785, and 0.775, respectively.

The results for the two-marker haplotype disease models are shown in Additional File 3 and follow the same pattern: GWAS-NR produced a greater AUC than the joint and Fisher's tests.

We also evaluated the type I error rates of the modified TPM for identifying significant LD blocks using a truncation threshold of 0.05. For the scenario assuming full marker coverage as described in Additional File 1, the modified TPM had type I error rates of 0.035 and 0.004 at the significance levels of 0.05 and 0.01, respectively. For the missing-marker scenario, the type I error rates for the modified TPM were 0.046 and 0.007 at the significance levels of 0.05 and 0.01, respectively.

Autism GWAS-NR results

We applied GWAS-NR to the autism data using the UM/VU, AGRE and joint (UM/VU)/AGRE datasets. A flow diagram of the data analysis process is provided in Additional File 4. The selection of haplotype blocks based on Gabriel's definition resulted in a total of 2680 blocks based on the top 5000 markers. Moreover, 141 of the 5000 markers that were not in any block were also selected. Blocks of LD were scored based on the truncated product of P-values below a threshold of 0.05 and a P-value for each block was obtained through Monte Carlo simulation. The P-values for the 141 markers not in any block were also calculated using the Monte Carlo algorithm to account for the minimum statistics. All of the 141 markers had P-values less than 0.05 and were selected. A total of 725 LD blocks achieved a significance threshold of P ≤ 0.01, and an additional 810 blocks achieved a threshold of P ≤ 0.05. A complete list of these blocks is presented in Additional File 5.

In order to determine what genes reside within the 1535 significant LD blocks, we used the University of California Santa Cruz (UCSC) Genome Browser Table Browser. The 1535 regions were converted into start and end positions based on the SNP positions in the March 2006 (NCBI36/hg18) human genome assembly. These start and end positions were used to define regions in the UCSC Table Browser. We searched each region for overlap with the RefSeq annotation track in the UCSC Browser. This search resulted in 431 unique genes which mapped back to 646 significant LD blocks and 50 single markers. These genes are presented in Additional File 6. For the remaining 839 LD blocks that did not overlap a RefSeq gene, we identified the nearest RefSeq gene using Galaxy [45]. The distance to these nearest genes averaged 417,377 bp with a range from 5296 to 5,547,466 bp. These nearest genes include candidate genes for which strong proximal associations with autism have previously been reported, such as CDH9 [18, 19] and SEMA5A [20]. We considered these genes for follow-up because GWAS-NR, by construction, may capture association information from nearby regions that may not be in strict LD with a given SNP and because these proximal locations may also incorporate regulatory elements. These genes are presented in Additional File 7. Combining these sets resulted in a candidate set of 860 unique genes (presented in Additional File 8). For genes assigned to more than one significant LD block, the lowest P-value among these blocks is used for sorting and discussion purposes.

The most significant LD block we identified is located at 2p24.1 (chr2: 20444539-20446116; P = 1.8E-06) proximal to PUM2. One LD block located within the PUM2 exon also had nominally significant association (P = 0.024). Additional top-ranking candidates, in order of significance, include CACNA1I (P = 1.8E-05), EDEM1 (P = 1.8E-05), DNER (P = 2.7E-05), A2BP1 (P = 3.6E-05), ZNF622 (P = 8.11E-05), SEMA4D (P = 9.09E-05) and CDH8 (P = 9.09E-05). Gene ontology classifications and InterPro binding domains reported by DAVID [41-43] to be most enriched in the candidate gene set are presented in Tables 1 and 2, respectively, providing a broad functional characterization of the candidate genes identified by the GWAS-NR in autism.

Table 1 Common functions of autism candidate genes identified by genome-wide association studies-noise reduction (GWAS-NR)
Table 2 Common binding domains of autism candidate genes identified by genome-wide association studies-noise reduction (GWAS-NR).

Cell adhesion represented the most common functional annotation reported for the candidate gene set, with a second set of common functional annotations relating to neuronal morphogenesis and motility, including axonogenesis and neuron projection development. Given the enrichment scores reported by DAVID [41-43] implicating neurite development and motility, and because numerous cell adhesion molecules are known to regulate axonal and dendritic projections [46, 47], we examined the known functional roles of the individual candidate genes responsible for these enrichment scores. A total of 183 candidate genes were represented among the top 20 functional classifications reported by DAVID [41-43]. Based on annotations manually curated from a review of current literature, we observed that 76 (41.5%) of these genes have established roles in the regulation of neurite outgrowth and guidance. These include 39 (51.3%) of the candidate genes contained in the cell adhesion, biological adhesion, cell-cell adhesion and homophilic cell adhesion pathways.

Gene ontology [48] specifically associates two pathways with the narrow synonym 'neurite outgrowth': the neuron projection development (pathway 0031175); and the transmembrane receptor protein tyrosine kinase activity (pathway 0004714). To further test for functional enrichment of genes related to neurite outgrowth, we formed a restricted composite of these two pathways. Enrichment analysis using the EASE function of DAVID [41-43] rejected the hypothesis that this composite pathway is randomly associated with the autism candidate set (P = 2.07E-05).
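For readers unfamiliar with the EASE score, it is commonly described as a conservative variant of the one-tailed Fisher exact test in which one gene is removed from the list hits before the tail probability is computed. The sketch below follows that description using a hypergeometric upper tail; the function names are hypothetical, and the gene counts underlying the P-value reported above are not given here, so any inputs used with it would be illustrative only.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which carry the annotation."""
    if k <= 0:
        return 1.0
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / total


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Modified Fisher exact P-value: one hit is removed from the gene list
    before computing the upper tail, which penalizes small categories."""
    return hypergeom_upper_tail(list_hits - 1, list_total, pop_hits, pop_total)
```

Because one hit is removed, the EASE score is never smaller than the corresponding one-tailed Fisher exact P-value.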

Although many of the candidate genes identified by the GWAS-NR remain uncharacterized or have no known neurological function, we identified 125 genes within the full candidate set having established and interconnected roles in the regulation of neurite outgrowth and guidance. These genes are involved in diverse sub-processes including cell adhesion, axon guidance, phosphatidylinositol signalling, establishment of cell polarity, Rho-GTPase signalling, cytoskeletal regulation and transcription. Table 3 presents a summary of these genes by functional category. Additional File 9 presents annotations for these 125 candidates. Additional File 10 presents 104 additional candidates which have suggestive roles in neurite regulation based on putative biological function or homology to known neurite regulators but where we did not find evidence specific to neurite outgrowth and guidance in the current literature.

Table 3 Autism candidate genes with known roles in neurite outgrowth and guidance.

Outside of functions relating to neuritogenesis, the most significant functional annotation reported by DAVID for the candidate gene set relates to transmission of nerve impulses (P = 9.02E-04). We identified 40 genes in the candidate set related to neurotransmission (synaptogenesis, neuronal excitability, synaptic plasticity, and vesicle exocytosis) which did not have overlapping roles in neurite regulation. Table 4 presents a summary of these genes by functional category.

Table 4 Autism candidate genes with roles in synaptic function.

In order to investigate how the GWAS-NR results compared with the joint APL tests and Fisher's tests, we examined the lists of top 5000 markers selected based on GWAS-NR, joint APL test and Fisher's test P-values. A total of 3328 markers overlap between the lists for the GWAS-NR and joint APL tests, while 1951 markers overlap between the lists for the GWAS-NR and Fisher's tests. Thus, GWAS-NR had a higher concordance with the joint APL tests than with the Fisher's tests. The results suggest that Fisher's test may have the lowest sensitivity to identify the true positives, which is consistent with our simulation results. Moreover, 120 markers that are not shared between the Illumina Infinium Human 1M BeadChip and the Illumina HumanHap550 BeadChip were among the top 5000 markers selected based on GWAS-NR. Some of the 120 markers are in the significant genes identified by haplotype blocks, such as PUM2, A2BP1, DNER and SEMA4D.

In order to similarly investigate the overlap of candidate genes identified by GWAS-NR and joint APL tests, we repeated the haplotype block scoring method with the top 5000 markers as identified by joint APL: this analysis resulted in 1924 significant LD blocks. Of these, 1257 overlapped with the blocks selected by GWAS-NR analysis. Identification of the RefSeq genes within these 1257 shared regions showed that 380 potential candidate genes were shared by the two methods. In addition, GWAS-NR analysis produced 53 non-overlapping genes while the joint APL analysis produced 349 non-overlapping genes.

As GWAS-NR amplifies association signals that are replicated in multiple flanking markers and across data sets, the method can be expected to produce a reduced list of higher confidence candidate regions for follow-up, compared with standard single-locus methods. At the same time, GWAS-NR does not generate a large number of significant candidates in regions that would otherwise be ranked as insignificant. While it is not possible to exclude a role in autism for the 349 additional candidate genes produced by the joint APL analysis, it is notable that among the top 20 gene ontology pathways reported by DAVID [41-43] for this set of genes, not one is specific to neuronal function (data not shown). This analysis highlights the utility of GWAS-NR to narrow and prioritize follow-up gene lists.

Discussion

We propose the use of GWAS-NR, a noise-reduction method for genome-wide association studies which aims to enhance the power to detect true positive associations for follow-up analysis. Our results demonstrate that GWAS-NR is a powerful method for enhancing the detection of genetic associations. Simulation evidence using a variety of disease models indicates that, when markers are ranked by P-values and candidates are selected based on a threshold rank, GWAS-NR achieves higher classification rates than the use of joint P-values or Fisher's method. In simulated data, GWAS-NR also achieves strong performance when there is imperfect marker overlap across datasets and when the closest disease-related polymorphism is not typed. As Müller-Myhsok and Abel have observed, when less-than-maximum LD exists between a disease locus and the closest biallelic marker, the required sample size to achieve a given level of power may increase dramatically, particularly if there is a substantial difference in allele frequencies at the disease marker and the analysed marker [49].
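For reference, Fisher's method, one of the comparison approaches named above, combines k independent P-values through the statistic X = -2 Σ ln(p_i), which follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. The sketch below illustrates that baseline together with rank-based selection; it is not an implementation of GWAS-NR, and the function names and toy values are hypothetical. It assumes that the P-values are independent and strictly positive.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P-values (each > 0) with Fisher's method.

    For even degrees of freedom 2k, the chi-squared survival function has
    the closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no
    external statistics library is needed.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # running value of (x/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


def select_top_ranked(marker_pvalues, n_top):
    """Rank markers by combined P-value and keep the n_top best."""
    combined = {m: fisher_combined_pvalue(ps) for m, ps in marker_pvalues.items()}
    return sorted(combined, key=combined.get)[:n_top]


# Toy example: per-dataset P-values for three invented markers.
toy = {"rs1": [0.01, 0.02], "rs2": [0.5, 0.6], "rs3": [0.04, 0.9]}
candidates = select_top_ranked(toy, n_top=2)
```

Selecting by threshold rank, as in `select_top_ranked`, corresponds to the classification setting used in the simulations, where a fixed number of top-ranked markers is carried forward for follow-up.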

In the context of allelic association, noise can be viewed as observed but random association evidence (for example, false positives) that is not the result of true LD with a susceptibility or causative variant. Such noise is likely to confound studies of complex disorders, where genetic heterogeneity among affected individuals or complex interactions among multiple genes may result in modest association signals that are difficult to detect. The influence of positive noise components is also likely to contribute to the so-called 'winner's curse' phenomenon, whereby the estimated effect of a putatively associated marker is often exaggerated in the initial findings, compared with estimated effects in follow-up studies [50]. GWAS-NR appears to be a promising approach to address these challenges.

By amplifying signals in regions where association evidence is locally correlated across datasets, GWAS-NR captures information that may be omitted or underutilized in single-marker analysis. However, GWAS-NR can achieve no advantage over simple joint analysis when flanking markers provide no supplementary information. This is likely to be true when a true risk locus is typed directly and a single-marker association method is used, or when a true risk haplotype is typed directly and the haplotype-based analysis examines the same number of markers as the risk haplotype.

Joint analysis generally has more power than individual tests because of the increased sample size. Therefore, GWAS-NR, which uses P-values from individual analyses as well as joint analysis of multiple datasets, is expected to have more power than individual tests. However, if there are subpopulations in the sample and the association is specific to a subpopulation, joint analysis may not be as powerful as an individual test for the subpopulation with the association signal. If samples from multiple populations are analysed jointly, test results for individual datasets should also be carefully examined alongside the GWAS-NR results.

It is common for linear filters to include a large set of estimated parameters to capture cross-correlations in the data at multiple leads and lags. However, in a genomic context, the potentially uneven spacing of markers and varying strength of linkage disequilibrium between markers encouraged us to apply a parsimonious representation that would be robust to data structure. We expect that a larger, well-regularized parameterization may enhance the performance of the noise filter, particularly if the filter is constructed to adapt to varying linkage disequilibrium across the genome. This is a subject of further research.

Our simulation results indicate that applying the modified TPM to select LD blocks based on GWAS-NR can have conservative type I error rates. The original TPM reported by Zaykin et al. [40] produced the expected level of type I error, as a known correlation matrix was used in the simulations to account for correlation among P-values due to LD among markers. However, the true correlation is unknown in real datasets. Accordingly, we estimated correlations in our simulations and analysis by bootstrapping replicates of samples, as well as using the sample correlation between P-values obtained through single marker APL and sliding window haplotype analysis. It is possible that the use of estimated correlations may introduce additional variation in the Monte-Carlo simulations of TPM, which may contribute to conservative type I error rates. As we have demonstrated that GWAS-NR achieves higher sensitivity at each level of specificity, the resulting regions with top rankings can be expected to be enriched for true associations when such associations are actually present in the data, even if the LD block selection procedure is conservative. Overall, the simulation results suggest that GWAS-NR can be expected to produce a condensed set of higher confidence follow-up regions, and that this prioritization strategy can control the number of false positives at or below the expected number in analysis.
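To make the role of the Monte-Carlo step concrete, the sketch below estimates a truncated product method P-value for one block of markers. It is a simplified illustration only: it assumes independent P-values under the null hypothesis and therefore omits the correlation adjustment discussed above, and the function names, the truncation point tau = 0.05 and the number of simulations are hypothetical choices rather than the settings used in this study.

```python
import math
import random


def truncated_log_product(pvalues, tau):
    """Log of the product of all P-values that are <= tau (0.0 if none)."""
    return sum(math.log(p) for p in pvalues if p <= tau)


def tpm_pvalue(pvalues, tau=0.05, n_sim=10000, seed=0):
    """Monte-Carlo TPM P-value under an independence assumption.

    Null P-values are drawn as independent uniforms; with LD among markers
    they would have to be drawn with the appropriate correlation instead.
    """
    rng = random.Random(seed)
    observed = truncated_log_product(pvalues, tau)
    n_markers = len(pvalues)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0).
        null = [1.0 - rng.random() for _ in range(n_markers)]
        if truncated_log_product(null, tau) <= observed:
            hits += 1
    return hits / n_sim


# Toy block with two strong signals and one null marker.
block_p = tpm_pvalue([0.001, 0.002, 0.8], n_sim=2000)
```

A block in which no marker reaches the truncation point has a truncated product of 1 and therefore receives a P-value of 1 in this sketch.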

Autism

Our data identify potential candidate genes for autism that encode a large subset of proteins involved in the outgrowth and guidance of axons and dendrites to their appropriate synaptic targets. Our results also suggest secondary involvement of genes involved in synaptogenesis and neurotransmission which further contribute to the assembly and function of neural circuitry. Taken together, these findings augment existing genetic, epigenetic and neuropathological evidence suggestive of altered neurite morphology, cell migration, synaptogenesis and excitatory-inhibitory balance in autism [49].

Altered dendritic structure is among the most consistent neuroanatomical findings in autism [51, 52] and several other neurodevelopmental syndromes including Down, Rett and fragile-X [53, 54]. Recent neuroanatomical findings include evidence of subcortical, periventricular, hippocampal and cerebellar heterotopia [55] and altered microarchitecture of cortical minicolumns [56], suggestive of dysregulated neuronal migration and guidance. In recent years, evidence from neuroanatomical and neuroimaging studies has led a number of researchers to propose models of altered cortical networks in autism, emphasizing the possible disruption of long-range connectivity and a developmental bias toward the formation of short-range connections [57, 58].

Neurite regulation is a common function of numerous top-ranking candidates. PUM2 codes for pumilio homolog 2, which regulates dendritic outgrowth, arborization, spine formation and filopodial extension of developing and mature neurons [59]. DNER regulates the morphogenesis of cerebellar Purkinje cells [60] and acts as an inhibitor to retinoic-acid induced neurite outgrowth [61]. A2BP1 binds with ATXN2 (SCA2), a dosage-sensitive regulator of actin filament formation that is suggested to mediate the loss of cytoskeleton-dependent dendritic structure [62]. SEMA4D induces axonal growth cone collapse [63] and promotes dendritic branching and complexity in later stages of development [64, 65]. CDH8 regulates hippocampal mossy fibre axon fasciculation and targeting, complementing N-cadherin (CDH2) in the assembly of synaptic circuits [66].

Neurite outgrowth and guidance can be conceptualized as a process whereby extracellular signals are transduced to cytoplasmic signalling molecules which, in turn, regulate membrane protrusion and neuronal growth cone navigation by reorganizing the architecture of the neuronal cytoskeleton. In general, neurite extension is dependent on microtubule organization, while the extension and retraction of finger-like filopodia and web-like lamellipodia from the neuronal growth cone is dependent on actin dynamics. Gordon-Weeks [67] and Bagnard [68] provide excellent overviews relating to growth cone regulation and axon guidance. Figure 3 provides a simplified overview of some of these molecular interactions.

Figure 3

Simplified schematic illustrating molecular mechanisms of neurite regulation. Extracellular events such as cell contact [79], guidance cues [64], neurotransmitter release [80], and interactions with extracellular matrix components [46] are detected by receptors and cell adhesion molecules at the membrane surface and are transduced via cytoplasmic terminals and multidomain scaffolding proteins [47] to downstream signalling molecules [81-83]. Polarity and directional navigation are achieved by coordinating local calcium concentration [84], Src family kinases [85], cyclic nucleotide activation (cAMP and cGMP) [86], and phosphoinositide signalling molecules which affect the spatial distribution and membrane recruitment of proteins that regulate the neuronal cytoskeleton [87]. Chief among these regulators are the small Rho family GTPases RhoA, Rac and Cdc42, which serve as molecular 'switches' to activate downstream effectors of cytoskeletal remodelling [88]. In developed neurons, this pathway further regulates the formation of actin-dependent microarchitecture such as mushroom-like dendritic spines at the postsynaptic terminals of excitatory and inhibitory synapses [89]. This simplified schematic presents components in an exploded format for tractability, and includes an abridged set of interactions. Additional File 9 presents autism candidate genes identified by GWAS-NR having known roles in neurite regulation.
RPTP (receptor protein tyrosine phosphatase); EphR (Eph receptor); FGFR (fibroblast growth factor receptor); PLXN (plexin); NRP (neuropilin); Trk (neurotrophin receptor); ECM (extracellular matrix); NetR (netrin receptor); NMDAR (NMDA receptor); mGluR (metabotropic glutamate receptor); AA (arachidonic acid); PLCγ (phospholipase C, gamma); MAGI (membrane associated guanylate kinase homolog); IP3 (inositol 1,4,5-trisphosphate); DAG (diacylglycerol); PIP2 (phosphatidylinositol 4,5-bisphosphate); PIP3 (phosphatidylinositol 3,4,5-trisphosphate); PI3K (phosphoinositide-3-kinase); nNOS (neuronal nitric oxide synthase); NO (nitric oxide); IP3R (inositol trisphosphate receptor); RyR (ryanodine receptor); GEF (guanine exchange factor); GAP (GTPase activating protein); MAPK (mitogen-activated protein kinase); and JNK (c-Jun N-terminal kinase).

The autism gene candidates identified by GWAS-NR show functional enrichment in processes including adhesion, cell motility, axonogenesis, cell morphogenesis and neuron projection development. Notably, a recent analysis of rare CNVs in autism by the Autism Genome Project Consortium indicates similar functional enrichment in the processes of neuronal projection, motility, proliferation, and Rho/Ras GTPase signalling [21].

We propose that, in autism, these processes are not distinct functional classifications but instead cooperate as interacting parts of a coherent molecular pathway regulating the outgrowth and guidance of axons and dendrites. Consistent with this view, the candidate set is enriched for numerous binding domains commonly found in proteins that govern neuritogenesis. These include immunoglobulin, cadherin, pleckstrin homology, MAM, fibronectin type-III and protein tyrosine phosphatase (PTP) domains [69-71].

The cytoskeletal dynamics of extending neurites are largely governed by the activity of Rho-GTPases, which act as molecular switches to induce actin remodelling. Molecular evidence suggests that dissociation of catenin from cadherin promotes the activation of Rho-family GTPases Rac and Cdc42, resulting in cytoskeletal rearrangement [72]. Guanine nucleotide exchange factors (GEFs) such as DOCK1 [73] and KALRN [74] activate Rho-GTPases by exchanging bound guanosine diphosphate (GDP) for guanosine triphosphate (GTP), while GTPase activating proteins (GAPs) such as SRGAP3 [75] increase the rate of intrinsic GTP hydrolysis to inactivate GTPases. Pleckstrin homology domains, characteristic of several GEFs and GAPs, bind to phosphoinositides to establish membrane localization and may also play a signalling role in GTPase function [76]. Certain GTPases outside of the Rho family, particularly Rap and Ras, also exert an influence on cytoskeletal dynamics and neurite differentiation [76, 77].

Several genes in the candidate set with established roles in neurite formation and guidance have been previously implicated in autism. These include A2BP1 (P = 3.60E-05), ROBO2 (2.00E-03), SEMA5A (2.30E-03), EN2 (4.00E-03), CACNA1G (6.00E-03), PTEN (8.00E-03), NRXN1 (1.10E-02), FUT9 (1.80E-02), DOCK8 (2.10E-02), NRP2 (2.60E-02) and CNTNAP2 (2.70E-02). Other previously reported autism candidate genes with suggestive roles in neurite regulation include PCDH9 (1.76E-03), CDH9 (6.00E-03) and CSMD3 (2.10E-02).

The enriched presence of transcription factors in the candidate set is intriguing, as many of these candidates, including CUX2, SIX3, MEIS2 and ZFHX1B, have established roles in the specification of GABAergic cortical interneurons [76]. Many guidance mechanisms in the neuritogenic pathway, such as Slit-Robo, semaphorin-neuropilin, and CXCR4 signalling, also direct the migration and regional patterning of interneurons during development. Proper targeting of interneurons is vital to the organization of cortical circuitry, including minicolumnar architecture, which is reported to be altered in autism [78]. Thus, the functional roles of the candidate genes we identify may embrace additional forms of neuronal motility and targeting.

Conclusions

We proposed a noise-reduction methodology, GWAS-NR, to enhance the ability to detect associations in GWAS data. By amplifying signals in regions where association evidence is locally correlated across datasets, GWAS-NR captures information that may be omitted or underutilized in single-marker analysis. Simulation evidence demonstrates that under a variety of disease models, GWAS-NR achieves higher classification rates for true positive associations, compared with the use of joint P-values or Fisher's method.

The GWAS-NR method was applied to autism data, with the objective of prioritizing regions of association for follow-up analysis. Gene set analysis was conducted to examine whether the identified autism candidate genes were over-represented in any biological pathway relative to the background genes. The significance of a given pathway suggests that the pathway may be associated with autism due to the enrichment of autism candidate genes in that pathway. We find that many of the implicated genes cooperate within a coherent molecular mechanism. This neuritogenic pathway regulates the transduction of membrane-associated signals to downstream cytoskeletal effectors that induce the directional protrusion of axons and dendrites. This mechanism provides a framework that embraces numerous genetic findings in autism to date, and is consistent with neuroanatomical evidence. While confirmation of this pathway will require additional evidence such as the identification of functional variants, our results suggest that autistic pathology may be mediated by the dynamic regulation of the neuronal cytoskeleton, with resulting alterations in dendritic and axonal connectivity.
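As an illustration of what over-representation relative to background genes means, the sketch below computes a one-sided hypergeometric tail probability for a single pathway. It is a simplification under stated assumptions: DAVID reports a modified Fisher exact statistic (the EASE score), which this plain hypergeometric tail does not reproduce, and the function name and all gene counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background, n_pathway, n_candidates, n_overlap):
    """P(X >= n_overlap) when n_candidates genes are drawn at random.

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_candidates: genes in the candidate list
    n_overlap:    candidate genes annotated to the pathway
    """
    upper = min(n_pathway, n_candidates)
    total = comb(n_background, n_candidates)
    tail = sum(
        comb(n_pathway, x) * comb(n_background - n_pathway, n_candidates - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total


# Invented counts: 40 of 400 candidates fall in a pathway of 500 genes,
# against a background of 20000 genes (20 expected by chance).
p_pathway = enrichment_pvalue(20000, 500, 400, 40)
```

A small tail probability indicates that the candidate list contains more pathway members than expected by chance, which is the sense in which a pathway is called significant above.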