Targeted sequencing and integrative analysis to prioritize candidate genes in neurodevelopmental disorders

Neurodevelopmental disorders (NDDs) are a group of diseases characterized by high heterogeneity and frequently co-occurring symptoms. The mutational spectrum in patients with NDDs is largely incomplete. Here, we sequenced 547 genes from 1102 patients with NDDs and validated 1271 potential functional variants, including 108 de novo variants (DNVs) in 78 autosomal genes and seven inherited hemizygous variants in six X chromosomal genes. Notably, 36 of these 78 genes are the first to be reported in Chinese patients with NDDs. By integrating our genetic data with public data, we prioritized 212 NDD candidate genes with FDR < 0.1, including 17 novel genes. The novel candidate genes interacted or were co-expressed with known candidate genes, forming a functional network involved in known pathways. We highlighted MSL2, which carried two de novo protein-truncating variants (p.L192Vfs*3 and p.S486Ifs*11) and was frequently connected with known candidate genes. This study provides the mutational spectrum of NDDs in China and prioritizes 212 NDD candidate genes for further functional validation and genetic counseling. Supplementary Information The online version contains supplementary material available at 10.1007/s12035-021-02377-y.


Introduction
Neurodevelopmental disorders (NDDs) are complex and heterogeneous disorders characterized by impaired motor function, learning, and verbal and non-verbal communication resulting from dysfunction during brain development [1][2][3][4]. NDDs span a very wide range of neurological and psychiatric disorders [1,5] and affect more than 3% of children worldwide [6]. Although each diagnosis is distinct in clinical settings, in general, NDDs are characterized by high clinical and genetic heterogeneity. Previous studies have estimated the heritability of NDDs, with the highest being up to~80% [7][8][9][10]. Multiple phenotype-genotype correlation studies have revealed that patients carrying deleterious variants in the same Yi Zhang and Tao Wang contributed equally to this work. risk gene have manifested broad clinical phenotypes, including impaired social interaction, developmental delay, regression, and repetitive behavior [11][12][13]. The high level of heterogeneity and similar pathogenetic mechanisms present challenges for clinical diagnosis and treatment.
Recent studies have reported that targeted sequencing is a powerful and cost-effective tool for discovering new genetic risk genes in many diseases, particularly diseases with high genetic heterogeneity [4,14,15]. Both rare inherited and de novo variants (DNVs) have been demonstrated to contribute to NDDs [16][17][18]. Stessman et al. sequenced 208 candidate genes in patients with NDDs and identified 91 risk genes with locus-specific significance for disruptive variants in 5.7% of patients [4]. Wang et al. performed targeted sequencing of 189 autism spectrum disorder (ASD) candidate risk genes in Chinese patients and found that genes identified in European ASD cohorts were highly relevant in Chinese cohorts [19]. Guo et al. expanded the above study to a larger cohort of patients and provided support for a multifactorial model of ASD risk [20]. Some studies have shown that there are multiple risk genes that are shared between neurodevelopmental disorders, and integrating datasets at multiple levels is beneficial for the etiology of NDDs [21,22]. Takata et al. combined published DNV data and found that integrative analyses was conducive to identifying significant genes and extending ASD-related molecular and brain networks [23]. Gonzalez-Mantilla et al. used multilevel data-integration approach and identified novel candidate genes for developmental brain disorders [24]. We have previously demonstrated that DNVs, gene set enrichment analysis, and protein-protein interaction (PPI)/co-expression analysis can provide new insights into the genetic mechanisms underpinning NDDs [25][26][27][28]. However, only a few known pathogenetic genes have been described to explain the genetic causes of NDDs, and the etiology behind NDDs remains unclear.
Here, we used targeted sequencing to examine 547 genes from 1102 Chinese patients with NDDs. We identified potential functional variants and explored the patterns of DNVs in Chinese patients with NDDs. We then integrated our dataset with public datasets of neurodevelopmental disorders from the Gene4Denovo [29] database to prioritize NDD candidate genes and discover novel candidate genes. Finally, we investigated whether novel candidate genes were functionally associated with known candidate genes using multilevel bioinformatics analysis.
A total of 935 unrelated trios (probands and their unaffected parents) and 167 probands without parents were recruited from China. Genomic DNA (1 μg) extracted from whole blood was sheared and assembled into a DNA library prior to targeted sequencing. The Illumina X10 sequencing system (Illumina, San Diego, CA, USA) was used to generate pairedend raw data. This panel resulted in an average depth of 181.46× in target regions, and 98.82% of target bases were covered with depth ≥ 10× on average. This study was approved by the Institutional Review Board of the State Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China. All subjects who participated in this study provided informed consent prior to sample collection.

Variation Detection and Annotation
Quality control of the sequencing data was performed using Cutadapt [33] and FastQC (https://www.bioinformatics. babraham.ac.uk/projects/fastqc/) to remove adapter and unqualified sequences, respectively. BWA-MEM [34] was employed to align the clean reads to the human reference genome (hg19). Samtools [35] utilities were used to mark duplicate reads and generate position-sorted files, while the Genome Analysis Toolkit HaplotypeCaller was used to call variants. Comprehensive annotation of all variants was performed using ANNOVAR [36], including functional implications (gene region, functional effect, mRNA GeneBank accession number, amino acid change, cytoband, etc.), functional predictions for missense variants, and allele frequencies of gnomAD, ExAC, and inhouse data (1113 WES samples and 2469 WGS samples). Deleterious missense variants (Dmis) were predicted by ReVe [37] and missense variants with ReVe ≤ 0.7 were excluded. Only protein-truncating variants (PTVs, including stop-gain, stop-loss, frameshift, and splicing) and Dmis with minor-allele frequencies ≤ 0.1% were defined as potential functional variants and selected for further analysis. Sanger sequencing was used to validate all potential functional variants in our study.

Prioritization of Candidate Genes
Using de novo PTVs, together with Dmis and background DNV rates, we employed the TADA classification tool [38] to prioritize candidate genes, with gene exhibiting a false discovery rate (FDR) < 0.1 defined as candidates. First, DNVs from 935 trios, inherited variants from 1102 probands, and genetic variants from 3582 in-house Chinese controls (Table S6) were applied to TADA model, and we prioritized 17 candidate genes with FDR values < 0.1. To increase the statistical power, we integrated DNVs from 16,807 probands with different types of NDDs and included 3391 controls from the Gene4Denovo database (version 1.0) [39] and 208 candidate genes with FDR values < 0.1 for further analysis. All samples integrated in our study have been carefully deduplicated.
To discover novel candidate genes, we collected known candidate genes from previous studies and related databases, and excluded them from the list of candidate genes prioritized by TADA. Known candidate genes were defined as follows: (1) genes defined as risk genes in any ten recent publications involving large-scale exome, whole-genome, or targeted sequencing (De Rubeis et al. [17], Sanders et al. [40], Lelieveld et al. [41], McRae et al. [42], Stessman et al. [43], Yuen et al. [44], Nguyen et al. [22], Takata et al. [23], Coe et al. [21], and Satterstrom et al. [45]); (2) genes collected from OMIM that were associated with neurodevelopmental disorders, including intellectual developmental disorder, autism, mental retardation, epileptic encephalopathy, and schizophrenia; (3) genes belonging to the category of syndromic gene or score category of 1 or 2 in SFARI Gene [30]. Genes that met any of the above conditions were classified as known candidate genes. In contrast, genes that failed to meet any of the above conditions above and had no obvious genetic evidence in association with NDDs in PubMed were classified as novel candidate genes.

Permutation Test
Spatial and temporal expression data of the human brain were downloaded from BrainSpan (http://www.brainspan.org/). For ethical reasons, we removed samples during fetal stages and selected postnatal cortical samples for further study. We calculated the Pearson correlation coefficients between any two genes based on their expression levels. Gene pairs with |R| > 0.7 were regarded as being co-expressed in the human brain. The permutation test was performed to compare the novel and the known candidate gene sets, in order to evaluate their functional connections. In brief, we compared the number of co-expressed genes within the novel and the known candidate gene sets and their connections with a random gene set of 1,000,000 random iterations.

Functional Network Analysis
We downloaded PPI data from IntAct Molecular Interaction database (https://www.ebi.ac.uk/intact/) for analysis and considered gene pairs with an intact-miscore ≥ 0.45 to be interacting genes. We then constructed a network based on gene pairs selected from our PPI and co-expression analyses. Novel candidate genes were defined as "seed genes" that were directly connected and used to form an interconnected functional network. Known NDD candidate genes directly connected to at least two novel candidate genes were added to the above network. To further investigate the functional pathways of novel candidate genes, we performed GO enrichment analysis using MetaScape (https:// metascape.org). Similar pathways were merged into a single cluster. Network figures were drawn using Cytoscape v.3.7.2.
We then employed probability of loss-of-function intolerance (pLI) [46] analysis from ExAC and residual variation intolerance scores (RVIS) [47] to evaluate the functional impact of genes with multiple DNVs. According to the scores of pLI and RVIS, we ranked them and used percentiles as an indicator of gene intolerance. Genes with low-percentile RVIS or pLI were more likely to be intolerant to genetic variants. As expected, it was shown that genes with multiple DNVs both exhibited significantly lower percentile pLI (p = 4.10 × 10 −10 , two-tailed Wilcoxon rank-sum test) and RVIS (p = 1.55 × 10 −10 , two-tailed Wilcoxon rank-sum test) than background genes (Fig. S2a). In particular, 20 of the 21 multiple DNV genes ranked in the top 50% of pLI and RVIS ( Table 1), suggesting that they are less tolerant of damaging variants.

Inherited X-Linked Hemizygous Variants
We identified seven inherited X-linked hemizygous variants in six genes (Table S5). Among these seven variants, three variants, including two Dmis (p.R270C on PTCHD1 and p.R197H on SLC16A2) and one splicing variant (c.1030-1G>C on SLC9A7), were recorded in the dbSNP or gnomAD database. The remaining four variants, including three Dmis (p.Q153R on ATP6AP2, p.R287C on ARHGEF9, and p.C517Y on PLXNA3) and one frameshift (p.N136Kfs*30 on SLC16A2) are reported here for the first time. Compared with the 805 known candidate genes, we note that five of the above genes are classified as known candidate genes and three (PTCHD1, ATP6AP2, and SLC9A7) are reported for the first time in Chinese patient with NDDs. In addition, we identified a novel candidate gene (PLXNA3) carrying a novel hemizygous missense

Prioritization of Candidate Genes
Previous studies have demonstrated that both DNVs and rare inherited variants (RIVs) contribute significantly to NDDs [49][50][51][52][53][54]. Therefore, we integrated data on DNVs and RIVs and employed TADA [38] to prioritize candidate genes in NDDs (Table S6) Protein-protein interaction, co-expression analysis and functional network analysis The novel candidate genes are significantly interacted with known risk genes.
The novel candidate genes are significantly co-expressed with known risk genes.
The novel candidate genes are significantly enriched NDD-associated pathways as similar as the known risk genes.

Functional Characteristics of Prioritized Candidate Genes
We next employed MetaScape [58] to perform functional enrichment analysis. As expected, these candidate genes were significantly enriched in NDD-associated pathways, such as synapse organization, covalent chromatin modification, head development, behavior, and regulation of ion transport [17,[59][60][61][62][63] (Fig. S3). Interestingly, the novel candidate genes  PTCHD1 and ARHGEF9 were in two groups, including FDR TADA_combined < 0.05 and genes with X-linked hemizygous were also involved in similar biological pathways. For example, MSL2 was implicated in covalent chromatin modification, LRRC4 was involved in chemical synaptic transmission, and PLXNA3 was associated with head development and axonogenesis.
We also performed functional cell-specific enrichment analyses [64,65] to investigate whether candidate genes were associated with specific tissues or cells. By analyzing the mouse transcriptomic profiling datasets of different developmental stages and brain regions, we found that the 212 NDD candidate genes tended to be enriched in the cortex and striatum during the middle fetal stage (Fig. S4a). In the cellspecific enrichment analyses, we observed a highly significant enrichment in Drd1 + and Drd2 + spiny neurons of neostriatum and rods (Fig. S4b), similar to data reported in a previous study [21]. These results suggest that the 212 NDD candidate genes are functionally associated with the etiology of NDDs.

Functional Network Analysis Between Known and Novel Candidate Genes
To investigate correlations between novel and known candidate genes, we performed a permutation test to estimate the relationship between these genes based on co-expression gene pairs identified from the BrainSpan atlas. We found that 15 of the 17 novel candidate genes (p = 2.35 × 10 −3 , permutation test, Fig. S5a) were co-expressed with 285 known candidate genes (p = 3.72 × 10 −4 , permutation test, Fig. S5b), with 523 connections between them (p = 8.80 × 10 −5 , permutation test, Fig. S5c), suggesting that the novel candidate genes are significantly co-expressed with the known candidate genes and are more likely to be related to the pathology of NDDs.
To further investigate the functional relationship between novel and known candidate genes, we constructed a functional network by integrating PPI data from IntAct and brain expression data from BrainSpan. Only known candidate genes directly interacting/co-expressed with at least two novel candidate genes were added to the network. The co-expressed/PPI network encompassed 159 genes, including 11 novel candidate genes and 148 known candidate genes (Fig. 2). These genes were enriched in several biological processes known to be related to NDDs, such as covalent chromatin modification (GO:0016569, p = 2.35 × 10 −12 , Fisher's exact test), chemical synaptic transmission (GO:00072686, p = 7.31 × 10 − 12 , Fisher's exact test), and brain development (GO:0007420, p = 1.23 × 10 −7 , Fisher's exact test). Furthermore, these genes showed a significant enrichment of previously reported gene sets: FMRP targets [66] (p < 2.22 × 10 −16 , Fisher's exact test), and genes essential in mice [67] (p < 2.22 × 10 −16 , Fisher's exact test). It is worth noting that six novel candidate genes (UBR3, PLXNA3, ITSN1, MSL2, LRRC4, and SPAG9) were significantly involved in functional clusters known to be associated with NDDs. For example, MSL2, UBR3, SPAG9, and PLXNA3 are involved in chromatin organization, while ITSN1, SPAG9, and PLXNA3 have been implicated in nervous system development. In addition, we found that nine novel candidate genes showed co-expression/interaction with more than two known candidate genes (Fig. S6). For example, POLR3A, which was the most frequently connected novel gene, was co-expressed/interacted with 99 known candidate genes. Other genes (LRRC4, ITSN1, UBR3, PLXNA3, MSL2, SPAG9, PSD3, and YTHDC1) were connected with more than five candidate genes. These results suggest that novel candidate genes are functionally associated with known risk genes and that these 11 novel candidate genes may have a stronger influence in the etiology of NDDs.

Discussion
Previous studies have provided evidences that targeted sequencing and integrative analyses play a critical role in the discovery of novel candidate genes [21][22][23][24]. Based on targeted sequencing of 547 target genes, we identified 108 DNVs, accounting for~10.91% (102/935) of our Chinese cohort. Among the genes with DNVs in our study, most genes have been reported to be associated with NDDs in previous studies, suggesting that genes identified in our Chinese cohort are relevant in other populations. Among the 108 DNVs, 60 DNVs were not recorded in public databases (including dbSNP, gnomAD, Gene4Denovo [39], and PubMed), consistent with the findings of Georgi et al., who reported that recurrently mutated amino acid sites in genes are rarely detected [68]. These results highlight the role of DNVs in the genetic heterogeneity of NDDs. In addition, we identified 21 genes with multiple DNVs present in 51 Chinese patients, and our analyses suggest that these genes display significant intolerance to damaging variants. Consistent with previous ASD study [19], SCN2A is the most frequent gene with multiple DNVs in the Chinese population. Compared with the known candidate genes, 20 of 21 genes with multiple DNVs were classified as known candidate genes. Notably, this study is the first to report TCF20 (with its three de novo PTVs) as a candidate NDD gene in a Chinese cohort.
By integrating information regarding DNVs from public datasets with our results, we prioritized 212 candidate genes, confirming that combing public NDD datasets is beneficial for the discovery of candidate genes [22][23][24]. Consistent with our analyses of genes with multiple DNVs, the 212 identified candidate genes were more intolerant of damaging variants. In addition, we demonstrated that the 212 candidate genes are closely associated with the etiology of NDDs from the perspective of biological pathway and functional cell-specific enrichment analyses. These results suggest that most of the 212 candidate genes identified in our study truly contribute to NDDs and they are worth validating in genetic functional studies or replicating in cohorts.
In this study, we prioritized 17 novel candidate genes and revealed similar functional characteristics between these novel candidate genes and other known candidate genes. Through functional network analysis, we observed that novel candidate genes frequently interacted/were co-expressed with known candidate genes, and genes in the network were enriched in NDD-associated clusters, as described in previous studies [4,17,23,69,70]. Interestingly, six novel candidate genes were closely connected with known candidate genes and were involved in NDD-associated clusters, suggesting that these novel candidate genes are more likely to be associated with NDDs. For example, ITSN1 (with an FDR value < 0.01 in the TADA analysis) was involved in nervous system development and synapse organization and connected with 64 known candidate genes. Jakob et al. reported that loss of the signaling scaffold intersectin 1 (ITSN1) in mice led to defective neuronal migration and ablates Reelin stimulation of hippocampal long-term potentiation [71]. Our analyses revealed that SPAG9 carried a de novo PTV in our cohort and it is reportedly overexpressed in human astrocytoma which arises from neural progenitor cells in the central nervous system [72]. We also found that SPAG9 was implicated in chromatin organization and co-expressed/interacted with eight known candidate genes, further supporting a role for this gene in NDDs. Among the novel candidate genes, MSL2 was a particularly interesting gene that carried two de novo PTVs in our Chinese cohort. Except for the multiple DNVs in MSL2, it was the gene that frequently co-expressed/interacted with known candidate genes and was involved in the functional cluster of chromatin organization. Iossifov et al. reported a de novo MSL2 PTV in a patient with autism [69], further suggesting that MSL2 may be strongly linked to NDDs. We expect to have a large cohort study or functional experiments to validate this possibility in future studies.
Despite our best efforts to comprehensively integrate public data with our cohort data to discover novel risk genes, our study was still bound by some limitations. First, while we discovered that novel candidate genes identified in our study are functionally associated with known candidate genes at multiple levels, these novel candidate genes lack strong evidences that support their roles in NDDs. We expect that these novel candidate genes could be validated in larger cohorts or through functional genetic experiments. Second, the sample size of our Chinese cohort is not large enough, and future studies need to recruit more volunteers to improve the statistical power and allow for comparisons with variant patterns of other populations. Third, since NDDs are characterized by high clinical and genetic heterogeneity, and multiple risk factors contribute to the etiology of these disorders, candidate gene analyses can only partially elucidate the processes underlying NDDs. Further studies are needed to integrate the impact of distinct influences such as epigenetic, environmental, and genetic factors.  In summary, our study demonstrates that integrating DNVs from multiple NDD-related studies can help in the identification of risk genes. We highlight that the pattern of DNVs in Chinese cohorts is relevant to other populations. Furthermore, we provide evidence at multiple levels that novel candidate genes are functionally associated with known candidate genes. Finally, our study describes new high-confidence risk genes that should aid the study of the NDD etiology and we expect that these genes will be worth analyzing in large cohorts or being validated in genetic functional experiments.

Declarations
Ethics Approval This study was approved by the Institutional Review Board of the State Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China. All subjects who participated in this study provided informed consent prior to sample collection.

Consent for Publication Not applicable
Conflict of Interest The authors declare no competing interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.