Introduction

Parkinson’s disease (PD) is the second most common neurodegenerative disorder associated with a host of motor and non-motor symptoms1. These symptoms include muscle rigidity, resting tremor, bradykinesia, and postural instability that maybe accompanied by autonomic dysfunction, sensory symptoms, fatigue, as well as cognitive and behavioral symptoms. Some patients may also develop depression, visual hallucinations, and dementia2. A definitive diagnosis is only possible postmortem and is based on the presence of two hallmarks:(1) Lewy bodies in surviving neurons from various brain regions, including the substantia nigra; and (2) the progressive degeneration of the nigrostriatal system1,3. Even after many decades of research, PD etiology remains largely unknown. However, research continues to demonstrate the strong genetic component of PD, which is important to understand disease mechanisms4,5.

The majority of reported PD cases are sporadic and only 5–10% are regarded as familial. The latter exhibit a monogenic form of the disease with a classical Mendelian mode of inheritance (reviewed in6). Despite the rarity of familial PD, family-based studies have been instrumental in identifying at least six genes with confirmed causal link to PD6,7,8. Point mutations, exon deletions, and copy number variants (CNV) in SNCA (MIM 163890), LRRK2 (MIM 609007), PARKIN (MIM 602544), PINK1 (MIM 608309), and PARK7 (MIM 602533) genes are described in both PD familial and sporadic cases, suggesting that both forms may share the same defective molecular pathways5,9,10,11,12,13,14,15. While linkage analysis led to the identification of high-penetrance disease-causing mutations, genome wide association studies aided the discovery of lower impact common variants with incomplete penetrance that represent risk factors16,17,18,19.

Although substantial progress has been achieved in the field of PD genetics, a large portion of the disease burden could not be explained by the identified mutations and risk variants. This is a reflection of the recognized genetic heterogeneity of PD, as allelic variability has been observed not only across populations but also within families4,13,20,21,22,23,24. Considering the above, it seems that additional genes have yet to be discovered. This has become more attainable with the emergence of whole exome sequencing (WES) that has been successfully applied in the discovery of novel genes in other complex neurological disorders25,26,27,28,29. In PD, WES has led to the identification of more than 40 candidate genes4,30,31,32,33,34. At present, supporting functional evidence for a role in PD pathophysiology is available only for three of the candidate genes (VPS35,TMEM230 and DNAJC13)32,35,36,37,38.

Thus far, the majority of WES findings in PD are derived from familial cases, and when possible validated in larger replication cohort with the same form of the disease4,30,32,34,38. Such an approach is appealing as rare variants with large effect size tend to aggregate in multi-incident families. However, the more common (sporadic) form of PD remains under-investigated apart from a single attempt in which a large cohort of more than 1000 cases (including non-familial) with early onset PD underwent WES assuming recessive inheritance33.

Some analysis strategies for WES data resulted in successful gene discovery in simple Mendelian diseases. However, these strategies are not suited for complex disorders where allelic heterogeneity and oligogenic inheritance are suspected. Here we set out to identify rare variants with likely pathogenic effect representing candidates worthy of further investigation in future studies. To achieve this, we devised a thorough 3-stage analysis strategy to overcome some of the common challenges encountered in gene discovery of complex disorders (Fig. 1). Our stringent prioritization was carried out to increase the likelihood that the candidate variants discovered in this study have a role in PD.

Figure 1
figure 1

Summary of the 3-stage analysis approach applied in this study. (a) Pre-WES mutation screening of reported genes. (b) WES filtering and validation. (c) Criteria for gene prioritization.

Results

PARKIN exon 3 deletion

A homozygous deletion of PARKIN exon 3 was detected in two early onset cases. One was a familial case (PD-10) with three reported affected siblings, unfortunately; none of the siblings were available for DNA testing. The other case (PD-56) was sporadic with no reported family history of PD. The remaining samples were negative for deletions/duplication in the surveyed genes therefore representing a good sample pool for candidate gene discovery.

Validated rare variants

As our cohort is a mixture of familial and sporadic cases, we searched for homozygous, compound heterozygous and heterozygous putative variants in all samples regardless of the mode of inheritance and consanguinity. At this stage we focused on rare variants in both local and international frequency databases. This inclusive approach offers a number of advantages (1) exploration of interfamilial and intrafamilial heterogeneity, (2) detection of autosomal recessive variants in seemingly “sporadic” cases with uncertain/unknown family history, (3) minimizing variant filtering flaws (inclusion/exclusion) due to inaccurate/incomplete pedigree information or family history and errors in in silico predictions of variant impact.

Our analysis workflow yielded a total of 125 Sanger validated rare variants in 51/60 (85%) cases with 1–6 variants/sample. Of these variants 90 were missense and 34 were loss of function (LoFs) (Fig. 2). Variants were identified in 117 genes, 13 of which were observed in 2 or more cases (NOTCH4, BCOR, FAM174A, EIF4G1, DNAJB9, RABEP1, EPRS, BRINP2, HEPHL1, PUM1, GAMT, SH3TC2 and SEC22A). Two genes were previously reported in PD (ATP13A2 and EIF4G1)39,40,41 (Fig. 3 and Supplementary Table S1).

Figure 2
figure 2

Breakdown of genetic alterations identified in this study. Pie chart illustrating the type and the number of all the validated genetic alterations (SNVs and CNVs) identified in this study.

Figure 3
figure 3

Distribution of variants harbouring genes across the genome. Distribution of the identified genes with validated variants across chromosomes.

Among all the identified variants 39 had a pLI ≥ 0.9, 82 had a positive Z score and 34 had both pLI ≥ 0.9 and a positive Z score. While these gene constraint metrics are useful (in combination with other in silico tools) in predicting variant deleteriousness, they are not individually sufficient to infer or exclude pathogenicity42. This is supported by the fact that many of the well-established PD genes have high tolerance for missense and/or LoFs (Supplementary Table S2). Therefore, these metrics were neither considered in the filtering process nor included in the subsequent variant prioritization. On the other hand, CADD and PredictSNP2 prediction results were taken into account for variant prioritization. It is noteworthy, that none of the pre-screened genes in our cohort43 were flagged by our filtering pipeline. This provides reassurance that we did not overlook any candidate or known disease-causing variants in these genes.

In addition to rare variants, we interrogated our PD cohort dataset for shared variants with a deleterious prediction (CADD ≥ 20) these variants were common in the general Saudi population but absent in the international databases. We compiled a list of variants enriched in our cohort and unique to the Saudi population (Supplementary Table S3). Although this list is derived from a small sample, it has the potential to inform variant selection for association studies if empowered with results from a larger cohort in future research.

Prioritised rare variants

Next, we further restricted our list of candidate variants on the basis of fulfilling at least 1 out of 5 proposed conditions (Fig. 1c). Our aim was to discriminate variants that are most likely to contribute to the phenotype. These variants would represent promising candidates that warrant further functional investigation. This strategy produced a list of 22 prioritized single nucleotide variants (SNVs) in 20 genes (Table 1). Five genes carried multiple variants in unrelated patients. A nonsense variant (p.R151X) in FAM174A was detected in two familial cases and their affected relatives (PD-19 and PD-34). Both affected siblings in PD-19 were heterozygous for PARIKIN exon3 deletion43. The index case in PD-34 was homozygous for this variant while his affected siblings and offspring were heterozygous. This is unsurprising, since inter and intrafamilial differences were previously observed in PD cases22,23. A splice-donor variant (c.2591 + 3 G > A) occurring in PUM1 was present in two familial cases (PD-32 and PD-46). The third recurring variant was a missense (p.K715R) within EIF4G1 that was found in two sporadic cases (PD-56 and PD-62) and a familial case (PD-23).

Table 1 List of Prioritized Variants.

Among the recurrent genes was EPRS, in which two distinct missense variants (p.R838H and p.Y791C) were detected in two sporadic cases (PD-43 and PD-57), respectively. The former case presented with late onset, while the latter had an early onset form. NOTCH4, also harboured two distinct LoF variants (c.2865 + 2T > C and p.Q1257X) in two sporadic cases (PD-1 and PD-64). The former had an early age of onset (41 years) but that of the later was not reported. To our knowledge, none of the shortlisted genes, except for ATP13A2 and EIF4G1, have been previously reported in PD39,40,41,44.

Biological processes and networks over-represented in our genes set

The assessment of the resulting genes list with Ingenuity Pathway Analysis (IPA) discovered “Cell Death and Survival”, “Cellular Assembly and Organization” and “Cell Morphology” to be among the top ranking molecular and cellular functions. As for system level functions, “Nervous System Development and Function”, “Tissue Morphology” and “Embryonic Development” were among the top 5 (Supplementary Table S4). On the other hand, Gene Ontology (GO) analysis revealed significant enrichment in terms pertaining to the nervous system. Terms like “Nervous system development (GO:0007399)”,“Neuron part (GO:0097458)”, “Postsynaptic density (GO:0014069)”, “Postsynaptic specialization (GO:0099572)”, “Asymmetric synapse (GO:0032279)”, and “Neuron to neuron synapse (GO:0098984)”, were all over-represented (Supplementary Table S5). Both approaches identified categories specific or related to the nervous system. In addition, differential brain expression was also determined from public databases (Supplementary Table S6).

Discussion

Next Generation Sequencing (NGS) is widely used to identify disease causing variants in monogenic and complex disorders45,46,47. Both exome and targeted sequencing are used as molecular research tools for gene discovery in neurological disorders25,26,48,49. The success of WES in identifying causative mutations in different disorders has encouraged its application in clinical settings26,50. And with the implementation of precision medicine and through incorporation of molecular screening as part of routine clinical practice, identification of rare variants in complex disorders becomes a research priority.

PD-NGS oriented studies focused on small sets of familial cases or trios, with a few exceptions where large cohorts were used to identify de novo and inherited mutations. Interestingly, the majority of the identified mutations represent private changes restricted to a single family or case, which is expected considering the heterogeneity of the disease4,33. These results suggested that infrequent low penetrant mutations in PD patients could be a major cause of the disease4,51. Hence, searching for a prevalent causative gene/mutation is unreasonable especially that mutations in some of the previously known PD genes with established disease related functional roles are rare21,30,32,35.

Identifying rare variants with minor allele frequencies (MAF) (≤%1) in genes expressed in the brain and/or in molecular pathways linked to a neurological disorder could contribute to our understanding of the genetic basis of PD. We complied a cohort of both familial and sporadic cases that were subjected to WES and analysed the data regardless of consanguinity or mode of inheritance using a strict multistage filtering. The analysis yielded a general list of rare variants in 51/60 cases and prioritized list of potential disease related variants in 25 index cases (Table 1, Supplementary Table S1).

In total 20 genes were shortlisted with potential disease related variants; all these genes were not previously linked to PD except for EIF4G1 and ATP13A2, where there are conflicting reports on their role in familial PD39,41,44. Eleven variants were absent in our local control database and 4 out of these were recorded in international databases, the remaining had a MAF of less than 1% in local and/or international databases. Although the majority of the SNVs identified in this study were unique events in a single family or a sporadic case; our pipeline identified multiple variants in 5 genes. Two different variants were detected in 2 of these genes (NOTCH4 and EPRS), whereas PUM1, FAM174A and EIF4G1 had a single recurrent variant (Table 1, Supplementary Table S1).

All the cases selected for WES in this study were negative for point mutations and confirmed CNV changes in the known genes43, with the exception of one familial and a sporadic case each carried PARKIN exon3 deletion and an additional WES identified variant. The two affected siblings from PD-19 were heterozygous for the PARKIN exon3 deletion and both had a truncating variant in FAM174A, this may represent a case of digenic mechanism for disease progression52. The sporadic case PD-56, homozygous for PARKIN exon3 deletion, had a heterozygous rare missense variant. Although PARKIN exon3 deletion is a confirmed disease-causing mutation13,53,54, these additional variants may play a role in disease progression and may contribute to its phenotypic heterogeneity.

Assigning disease causality of newly identified variants requires rigorous functional assessment and segregation analysis, however, when feasible; replication cohorts provide evidence that may support novel findings4,33. One of the limitations of our study is the lack of unaffected family members or parents for segregation analysis and since we did not have access to a powerful PD disease-related exome dataset; we further investigated the occurrence of our identified variants in Saudi Human Genome Program (SHGP) patient database, which contained cases with overlapping neurodegenerative phenotypes. We found that 7 of our variants were also present in patients with overlapping phenotypes (Supplementary Table S7), which further supports the role of these genes in the development of PD.

To gain an insight into the impact of the identified genes on the central nervous system (CNS), we surveyed the Mouse Genome Informatics (MGI; http://www.informatics.jax.org) database for any available transgenic mice for our candidate genes with a neurological phenotype55. The knockout mouse models of 13 of the prioritized genes presented with different neurological and behavioral phenotypes, including aging related phenotypes, tremors, impaired limb coordination, abnormal gait, abnormal synaptic vascular formation and other specific neurological phenotypes. Of these, NOTCH4 and CLSTN3 showed extensive regional brain anomalies. Tremors and/or abnormal balance were present in SPF7, ATP13A2, CLSTN3 and NEFH transgenic mice, while involuntary movement and limb gasping were observed characteristics in the CELSR1 and SNAP91 transgenic mice, respectively (Supplementary Table S8).

In addition to mining mouse models databases, both GO and IPA analysis of our selected genes list have identified significant enrichment of biological processes/terms related to the nervous system and neuronal function (Supplementary Tables S4 and S5). This enrichment further supports the hypothesis that these genes may influence key neuronal functions and contribute to the disorder.

Taken together, from the 18 genes prioritized here, NOTCH4 is the strongest PD candidate gene with a unique truncation and a splice site variant identified in two sporadic cases; the splice site variant was also present in a familial case in the SHGP pandp (Supplementary Table S7) with an overlapping phenotype. There are conflicting reports about the linkage of NOTCH4 variants to neurological conditions including schizophrenia and Alzheimer’s disease but none linking it to PD56,57,58,59. Several Notch4-allele targeted and knockout mouse models were developed, including reporters and inducible transgenics. Interestingly, one of these transgenes, Tg(tetO-Notch4*)1Rwng (MGI:5502689), contains DNA encoding amino acids 1411–1964 of Notch4 placed under the control of the tetracycline response element and is used to model arteriovenous malformations of the human brain (DOID:0060688). This inducible transgenic mouse exhibits multiple brain abnormalities and increased neuron apoptosis. These mice also present with ataxia and seizures60,61 (Supplementary Table S8). The Notch hetero-oligomer contains 6 characterized domains; a large extracellular domain (ECD), with 10–36 tandem Epidermal Growth Factor (EFG)-like repeats which participate in ligand interactions; a negative regulatory region, containing three cysteine-rich Lin12-Notch Repeats (LNR); a single transmembrane domain (TM); a small intracellular domain (ICD), which includes a RAM (RBPjk-association module); in addition to six ankyrin repeats (ANK), involved in protein-protein interactions; and a PEST domain62,63. The p.Q1257X variant, identified in PD-64 falls in the (LNR) repeat involved in receptor regulation, while the splice variant (c.2865 + 2T > C) affects an exon that falls within Calcium-binding EFG-like domain. It is well established that Notch signalling is an evolutionarily conserved pathway involved in a wide variety of developmental processes, including adult homeostasis, stem cell maintenance, cell proliferation and apoptosis62,64.

In summary, we used a combination of gene dosage and NGS analysis to screen for changes in both familial and sporadic PD cases. Using this approach we identified at least a single potential disease related genetic event in 51/60 cases studied. Our strict filtering and prioritizing criteria retained at least one single rare variant in (26/51) 50% of the cases studied; some of these variants are in strong candidate genes with known brain related functions while others may represent low penetrance risk alleles. Failure to identify potential candidate variants in the remaining cases could be attributed to a number of reasons, including missing variants in poorly covered regions or variants in non-coding or regulatory regions. There is also the possibility of gene dosage alterations existing in genes not included in this study. This approach is suitable for a complex heterogeneous disorder with different molecular mechanisms at play.

Conclusions

We have used a stringent multistage filtering of WES data to identify potential disease related variants in a cohort of both sporadic and familial cases. We identified a number of interesting variants in genes not previously linked to PD. Our data is consistent with previous WES studies where the majority of variants and candidate genes identified represent private events with very low or no rate of replication. The integrative approach employed here generated a useful catalogue of rare potentially deleterious PD candidate variants for further genetic replication and functional assessment studies.

Methods

Patients and samples

We assembled a cohort of 60 Saudi patients (19 familial and 41 sporadic) whom all presented with a consultation of PD symptoms (Supplementary Tables S9, S10 and Fig. S1). Pathogenic point mutations in PD major genes were previously ruled out in these patients43.

Multiplex ligation-dependent probe amplification (MLPA)

Gene dosage alterations were assessed using two commercially available MLPA kits: SALSA MLPA probemix P051-D1 and P052-D1 Parkinson (MRC Holland, The Netherlands) as described65 (Fig. 1a). Together, the probemixes contained MLPA probes covering all exons of the following PD related genes: PARKIN, SNCA, PINK1, PARK7, UCLH1, and GCH1, as well as selected exons of LRRK2, ATP13A2, CAV1, and CAV2. Different probes covering all exons of PARKIN were included in both kits permitting cross verification of any detected changes. Rearrangements detected in PARKIN were verified using P052-D1 MLPA assay and/or by conventional PCR using primers flanking the deleted exon as previously described43.

Whole exome sequencing, data processing and primary analysis

Whole exome sequencing and subsequent data analysis for all samples were performed as previously described48. Briefly, 100 ng of genomic DNA from each sample was sequenced on the Ion Proton platforms using the whole exome AmpliSeq kit (Life Technologies, Carlsbad, CA, USA). A maximum of 17 Gb of DNA sequence was generated for each sequencing run/sample. First, reads were subjected to quality control (QC) checks to eliminate any low quality reads, then were mapped and aligned to UCSC Human reference genome (hg19) (http://genome.ucsc.edu/) using tmap, which is part of the Torrent Suite package. All variants were called using Torrent Suite Variant Caller (Life Technologies, Carlsbad, CA, USA) and annotated with ANNOVAR (http://annovar.openbioinformatics.org).

Filtering and validation

Extensive genetic and allelic heterogeneity, incomplete penetrance and the possible presence of phenocopies, all have often been observed in complex disorders such as PD30,33,34,35,36. With this in mind, we decided to filter for variants with homozygous and heterozygous transmission for both familial and sporadic cases- regardless of the observed/predicted mode of inheritance and consanguinity. This is to allow for interfamilial and intrafamilial heterogeneity and to avoid missing autosomal recessive variants in sporadic cases66,67. Only genes with positive brain expression in publicly available databases (as listed in Gene cards) were selected. Of the selected genes, only functional variants (LoF and missense) were retained before applying the allele frequency filter. We used a stringent minor allele frequency filter of (MAF ≤ 1%) in the SHGP local “ethnically matching” database and/or international (ExAC and 1000 Genomes)42,68. The resulting variants were then validated using Sanger sequencing (Fig. 1b).

Local databases and controls

The SHGP database constitutes NGS data from exomes and 13 targeted gene panels including a panel specific for neurological disorders48,69. At the time of the analysis, the exome database included 2379 local control exomes (termed here SHGP Exdb). The neurological disorders gene panel comprised 1863 patients diagnosed with different neurological disorders including neurodegeneration and neuropathy (this patient database is referred to as SHGP pandb). These databases are from a population with high endogamy and consanguinity and is enriched with rare recessive disease causing mutations, some with founder effect69. We used the SHGP Exdb as the ethnically matched controls and the SHGP pandb as our local replication cohort.

Variant prioritization

Because WES generates a large number of variants even after the initial filtration, we created a list of prioritized variants worthy of further investigation in future studies. It is important to point out that we deliberately avoided using the American College of Medical Genetics (ACMG) variant classification system which is only intended for Mendelian disorders and is considered unsuitable approach for complex disorders70. We therefore, have devised a prioritization strategy in which only variants predicted to be strongly deleterious (CADD > 20 and PredictSNP2 classification of “deleterious” for splice site variants) were considered. Of note, only splice site variants affecting exons not subject to alternative splicing were considered71,72. Candidate variants were further prioritized on the bases of meeting at least 1 of 5 strict criteria (1) presence in a gene that was observed in 2 or more cases, (2) presence in a gene previously associated with PD, (3) the gene harbouring the variant has a mouse model with documented neurological or behavioural deficits, (4) same variant was found in additional affected family members (when available), and finally (5) same variant was observed in our local replication cohort (SHGP pandb) (Fig. 1c).

IPA and GO-enrichment analysis

To identify the functional categories enriched in our genes set (genes containing variants with CADD > 20 and PredictSNP2 classification of “deleterious” for splice site variants), we used two independent web-based applications; Gene Ontology (GO) enrichment analysis (http://geneontology.org/)73 and the Ingenuity Pathway Analysis software core analysis function (IPA®,v01-08,QIAGEN, Redwood City, www.qiagen.com/ingenuity). Our genes set was analyzed for any significant (p < 0.05 and Bonferroni corrected) over-represented GO terms under the three main categories (molecular function, cellular component and biological process). For the IPA core analysis, we first uploaded the genes accession numbers into the software before running either “Expression” or “Variant effect” core analysis (both gave identical results). The analysis was set using the “Ingenuity Knowledge base” as a reference set. The pre- analysis filtering included all “Data sources”, “Tissues and cell lines” and “Mutations findings from the knowledge base”. IPA uses right-tailed Fisher’s exact test to calculate the statistical significance of the resulting functions, pathways and networks. Only the top 5 terms under each category are listed in this study. As for networks, only those with a score >5 were considered.

Ethics, consent and permissions

We declare that informed consents were obtained from all participants in adherence with the declaration of Helsinki and according to KFSHRC IRB and Research Advisory Committee (RAC) rules and regulations under the following approved project (RAC# 2110035).