Background

Neurodevelopmental disorders (NDDs) are clinically defined as “a group of conditions with onset in the developmental period (…) characterized by developmental deficits that produce impairments of personal, social, academic, or occupational functioning” [Diagnostic and Statistical Manual of Mental Disorders, 5th Edition – DSM-5]. Intellectual disability (ID), formerly known as “mental retardation”, is characterized by incomplete mental development, leading to substantial limitations in general mental abilities, intellectual functioning, adaptive behaviour and functional skills, in comparison with individuals of the same age, gender and socio-cultural background [1]. These limitations can be observed in many domains, including communication, personal care, self-governance and functional academic skills [1,2,3].

ID can appear as an isolated feature (non-syndromic ID, NSID), or associated with facial dysmorphic features, other morphological anomalies or multisystemic disorders (syndromic ID, SID) [4], or with multiple neuropsychiatric and/or neurobehavioral problems, such as autism or epilepsy, and neuromuscular features, e.g. ataxia, spastic paraplegia, sensory or motor neuropathy, and muscular dystrophy [5,6,7]. Previously, ID classification was based on intelligence quotient (IQ) scores: mild (IQ 50–69, 85.0% of ID cases), moderate (IQ 35–49, 10.0% of ID cases), severe (IQ 20–34, 3.5% of ID cases) and profound (IQ < 20, 1.5% of ID cases) [1, 8,9,10,11]. Nowadays, ID diagnostic criteria include (i) deficits in intellectual functioning confirmed by clinical evaluation and standard IQ testing; (ii) deficits in adaptive functioning that result in failure to meet developmental and sociocultural standards for personal independence and social responsibility; and (iii) onset of deficits during the developmental period. The severity of ID is based on the level of adaptive functioning deficits of an individual in the conceptual, social and practical domains, which determines the level of support needed [1]. Under the age of 5 years, the term Global Developmental Delay (GDD) is used [2, 12, 13]. GDD is characterized by the failure to accomplish the developmental milestones expected for a given age range in two or more developmental domains, including gross or fine motor skills, speech and language, cognition, personal-social skills and activities of daily living. ID and GDD are evaluated and clinically followed by the same medical specialties, in particular paediatrics, psychiatry, neurology/epilepsy and rehabilitation medicine [14]. Of note, not all children with GDD will show ID in adulthood [15].

ID affects between 1 and 3% of individuals worldwide, although with some regional differences [16]. Mild ID is believed to affect 0.7–1.3% of the general population [17], while severe and profound ID have an estimated prevalence of less than 0.5%. ID represents an important public health problem that affects families and society and burdens health systems, with direct costs estimated at 43.3 billion euro per year in Europe [18]. Non-genetic or environmental factors, such as socio-cultural determinants and infections, can contribute to ID, although most cases of severe or profound ID are known to have a monogenic origin [2, 7, 19, 20].

Technological advances in the last decade led to the identification of novel ID genes, bringing new insights into the molecular diagnosis of ID and the underlying biological mechanisms [6]. Establishing the genetic aetiology of ID is mandatory for proper diagnosis, prognosis and disease management, and plays a key role in genetic counselling. Based on the disease recurrence risk and the availability of a specific preimplantation or prenatal test, couples can be offered support in planning future pregnancies [21]. Currently, ID is rarely treatable, but molecular diagnosis is crucial to guide patients and families through the process of accepting the disease and adjusting expectations, and it allows liaison with patient organisations and associations. The ragbag of ID classifications, diagnostic methodologies and functional studies demands constant updating and systematization to improve ID diagnostic and investigational strategies. Here, we review seminal works in ID, focusing particularly on massive parallel sequencing applications and the functional validation of genetic variants, with the aim of guiding ID diagnostic investigation.

Intellectual disability is genetically and clinically extremely heterogeneous

Genetic diagnosis of ID can be dated back to 1959, with the identification of trisomy 21 in Down syndrome [22], which remains the most frequent chromosome disorder and the most common single cause of ID [23]. Conventional cytogenetics, namely karyotyping and fluorescence in situ hybridization (FISH), allows the identification of numerical and structural chromosome abnormalities, which are responsible for about 15% of ID [24]. Recurrent microdeletions and microduplications have been identified by chromosomal microarray analysis (CMA) in patients affected with ID-related disorders, including Williams, DiGeorge, Prader-Willi, Angelman, Wolf-Hirschhorn and Cri du Chat syndromes [6, 25]. DNA copy-number variants (CNVs), containing a few to hundreds of genes, have increasingly been identified as causes of ID [26]. CNVs occur mostly de novo and are responsible for about 10–14% of ID cases [26,27,28,29]. Research studies in cohorts of patients carrying recurrent CNVs allowed the identification of new disease genes and dosage-sensitive dominant genes [30, 31].

Most monogenic ID cases are caused by single nucleotide variants (SNVs) and small insertions or deletions (indels) in genes that code for proteins operating in key biological processes such as neurogenesis, synaptogenesis or synaptic plasticity. The development of Sanger sequencing in 1975 [32], and its further automation and commercialization in the 1980s, were key for the detection of this type of variant [33,34,35].

Non-Mendelian ID disorders are a challenge in diagnosis, genetic counselling and recurrence risk estimation. A special group comprises those caused by dynamic mutations occurring in tri-, tetra- and pentanucleotide repetitive regions. The first report of ID pathogenic variants caused by repeat expansions dates from 1991. This study described the identification of a trinucleotide repetitive region, a CGG repeat tract located in the 5′ untranslated region of the FMRP translational regulator 1 gene (FMR1), implicated in Fragile X syndrome (FXS) [36]. FXS is the most common cause of inherited ID and, despite its cause having been identified three decades ago, there is no effective treatment and knowledge of the disease mechanisms is scarce [37]. To date, more than 40 inherited diseases caused by repeat expansions and affecting the central nervous system have been identified [38,39,40,41,42].

Also, DNA methylation or DNA imprinting, well-known epigenetic disease mechanisms, do not follow a Mendelian inheritance pattern [43]. Imprinting diseases are implicated in ID, growth impairment, development and metabolism defects, associated with disturbance of the regulation, dosage and genomic sequence of imprinting loci [44]. The identification of consistent and significant methylation aberrations in multiple unrelated but phenotypically similar patients [43, 45, 46] is still challenging. The expression pattern of imprinted genes is monoallelic and parental origin dependent [47]. To date there are eight well-characterized imprinting disorders: Prader-Willi [48], Angelman [49], Silver-Russell [50], Beckwith-Wiedemann [51], Temple [52] and Kagami-Ogata syndromes [53], Transient Neonatal Diabetes [54] and Pseudohypoparathyroidism type 1B [55].

Another group of heterogeneous non-Mendelian genetic diseases comprises those caused by pathogenic variants in the mitochondrial genome (mtDNA), which are also known to be involved in ID [56]. Mitochondrial disorders are characterized by deficient oxidative phosphorylation, with an estimated prevalence among adults of 2.9 cases per 100,000 individuals for disease caused by nuclear mutations and 9.6 cases per 100,000 individuals for disease caused by mtDNA mutations [57]. Approximately 1 in 200 healthy individuals carries a pathogenic mtDNA variant at low levels of heteroplasmy, with implications for the offspring [58]. Leigh syndrome, caused by molecular defects in nuclear and mtDNA genes, and Mitochondrial DNA Depletion syndrome 4A (Alpers syndrome) are two examples of childhood-onset mitochondrial neurodegenerative disorders [59, 60].

The large genetic heterogeneity intrinsic to ID-related disorders, as well as the absence of a specific inheritance pattern, especially when there is only one affected family member, can hamper the selection of the gene to target. The development of genome-wide sequencing approaches, such as massive parallel sequencing, was therefore essential to interrogate a large number of genes in a single step and to tackle the majority of ID causes, including SNVs, indels, CNVs and even structural chromosome abnormalities.

Massive parallel sequencing - a milestone towards ID-gene identification

Massive parallel sequencing, commonly named next generation sequencing (NGS), is a fast, accurate, efficient and cost-beneficial screening strategy, representing a milestone in the identification of novel ID genes [61, 62]. Non-targeted NGS, a “genotyping driven” gene identification approach, unveiled the complexity of genotype-phenotype correlations, especially in heterogeneous disorders, where pathogenic variants in some ID-related genes can be implicated in “atypical” phenotypes [63, 64]. For instance, variants in the CHD2, SETD2 and SLC6A1 genes are known to cause autism in some cases and severe ID without autistic features in others [27]. With reverse phenotyping, clinicians return to patients to confirm or refute a molecular result, even in cases of rare genetic occurrences. Describing new features associated with well-known phenotypes has expanded the phenotypic spectrum of a given gene/disease, with an impact on ultra-rare disorders with atypical phenotypes [65].

The first study using exome sequencing (ES) to uncover the genetic basis of a monogenic disorder, Miller syndrome, was published in 2010 [66]. In the last decade, new genes were rapidly associated with other autosomal dominant syndromes [6] and the number of autosomal recessive ID (ARID) genes more than doubled [67, 68]. Concurrently, more than 2500 ID genes were identified, including primary and candidate genes (Fig. 1) [4].

Fig. 1 ID genes identified over time. ID – intellectual disability; ARID – autosomal recessive intellectual disability; ADID – autosomal dominant intellectual disability; XLID – X-linked intellectual disability; MtID – mitochondrial intellectual disability. Reproduced from Vissers et al. [6] and updated with information from the SysID database [4]

According to the SysID database, there are 1500 primary ID genes, causing 1797 ID related disorders, and 1248 ID candidate genes. ID related genes can be gathered based on their ontology, or biological function (Fig. 2). The gene ontology-based analysis shows the large heterogeneity of ID, as well as the biological pathways involved. Gene cluster analysis shows 270 genes and 415 diseases associated with metabolism [4]. Phenylketonuria and galactosemia, caused by molecular deficits in PAH and GALT genes respectively, are examples of such disorders, representing 1–5% of ID causes [69]. A significant number of ID genes/diseases are also involved in transport (214/342), nervous system development (200/339), RNA metabolism (179/273) and transcription (152/245) [4].

Fig. 2 ID genes and diseases according to the corresponding ontology. Number of genes (dark grey) and related diseases (light grey) grouped by the biological pathway implicated. MT – mitochondrial; BMP – Bone morphogenetic protein; TOR – Target of rapamycin. Adapted from SysID database [4]

Common features: from library preparation to sequencing reactions

Four sequencing platforms have been developed to date, sharing common basic features such as library preparation and template generation. The sequencing reaction is specific to each methodology, and the signal resulting from the amplification is detected as fluorescence, light or a change in ionic potential, depending on the underlying principle: sequencing by synthesis, pyrosequencing, sequencing by ligation and ion semiconductor sequencing (Fig. 3) [70, 71].

Fig. 3 Overview of the NGS techniques. Schematic representation of the common features (1 and 2) and different particularities (3). APS – Adenosine 5′ phosphosulfate; PPi – Inorganic pyrophosphate; ATP – Adenosine triphosphate; P1 – Primer 1. Reproduced by permission of Applied Biological Materials Inc. (abm)

Sequencing by synthesis is based on the step-by-step incorporation of nucleotides, each attached to a single fluorescent molecule. The error rate is low, although it increases with read length [72]. In pyrosequencing, a pyrophosphate molecule is released when the polymerase incorporates a nucleotide, and light is generated after a cascade of chemical reactions. This results in longer read lengths, but with high costs and a high error rate over homopolymers of 6 or more nucleotides [73]. In sequencing by ligation, the reaction is based on fluorescent 8-mer oligonucleotide probes, resulting in very short read lengths [74]. In ion semiconductor sequencing, the nucleotide sequence is determined by pH changes. Overall, this is the most cost-effective and time-efficient method, despite the high error rate in large homopolymers [75].

Targeted-NGS is effective on clinically recognizable forms of ID

Targeted-NGS (TNGS) has been widely used in ID diagnostic settings, either with panels of genes involved in common pathways or by studying an entire chromosome. Najmabadi et al. [76] identified putative disease-causing variants in 78 out of 136 consanguineous families (57%), resulting in the identification of 50 candidate ARID genes and of variants in known syndromic-ARID genes in 26 families. Tzschach et al. [77] sequenced 107 XLID genes in 50 patients with a family history suggestive of XLID and in 100 sporadic ID patients, and identified pathogenic variants in 13 (26%) and five (5%) patients, respectively. After sequencing 745 genes in 405 families, Hu et al. [5] identified seven novel XLID genes (CLCN4, CNKSR2, FRMPD4, KLHL15, LAS1L, RLIM and USP27X) and previously characterized ID pathogenic variants in 74 families (18%). The diagnostic yield is biased towards the targeted regions and influenced by the clustering of genetic errors, which typically occurs in geographic regions with high homozygosity due to inbreeding, such as Iran [78,79,80]. In well-characterized patients with dysmorphic, neurological or systemic features, the low sequencing costs, high coverage, completeness and low rate of incidental findings of TNGS result in a shorter diagnostic turnaround time. As knowledge evolves, e.g. when new disease-associated genes are identified, panel updates are needed, which can be laborious and time-consuming and can increase TNGS costs [81,82,83].
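The diagnostic yields quoted for these cohorts are simple proportions of solved cases. The short Python sketch below recomputes them from the counts given in the text; the function name and the dictionary labels are illustrative only and do not come from the cited studies.

```python
def diagnostic_yield(solved: int, total: int) -> float:
    """Return the diagnostic yield as a percentage of the cohort."""
    if total <= 0:
        raise ValueError("cohort size must be positive")
    return 100.0 * solved / total


# (solved, total) pairs taken from the cohorts cited in the text.
cohorts = {
    "Najmabadi et al., consanguineous families": (78, 136),
    "Tzschach et al., familial XLID": (13, 50),
    "Tzschach et al., sporadic ID": (5, 100),
    "Hu et al., XLID families": (74, 405),
}

# Round to the nearest whole percent, as reported in the text.
yields = {name: round(diagnostic_yield(*counts)) for name, counts in cohorts.items()}

for name, pct in yields.items():
    print(f"{name}: {pct}%")
```

Keeping the raw counts next to the percentage makes it easier to compare cohorts of very different sizes and ascertainment strategies.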

Exome-sequencing improves the diagnostic yield in syndromic NDDs

Exome sequencing (ES) has been shown to be a powerful, robust, and scalable methodology in ID diagnosis. Trio-ES analysis (i.e. proband and parents) led to the identification of a significant number of de novo variants in patients with sporadic ID [84]. De Ligt et al. [2] performed a trio-ES study in 100 families and identified 70 de novo variants in 53 patients, with an overall diagnostic yield of 53%. Rauch et al. [27] identified 87 de novo variants, of which 16 were in known ADID genes, in 45 out of 51 patients after a negative CNV screening. Considering the six loss-of-function variants identified in six novel ADID genes and assumed to be pathogenic, a diagnostic yield of 88% is achieved. The Deciphering Developmental Disorders (DDD) study recruited families from all regional genetics services around the United Kingdom (UK) and Ireland. Around 2000 families with undiagnosed developmental disorders were included in the first year of the study, increasing to 8000 within 3 years. After genome-wide microarray and trio-ES studies focusing on 1133 complete trio-families, de novo and segregating variants in known developmental disorder genes were identified, representing a diagnostic yield of 27% [85]. In 2018, the data were reanalysed in light of new molecular and clinical knowledge and a diagnosis was attained in a total of 454 families, representing a diagnostic yield of 40% [86]. In 2019, a meta-analysis gathering information on 30 NDD studies published between January 2014 and June 2018 concluded that the ES yield is 36% for NDDs overall, 31% for isolated NDDs and 53% for syndromic NDDs [87].

Genome-sequencing: a complete approach

Genome sequencing (GS) provides homogeneous coverage, improving the detection of SNVs, CNVs, and balanced translocations [88], as well as the detection of mosaicism when coverage depth is sufficient (e.g. a mean coverage of 130 ×) [80, 89]. In the study by Schluth-Bolard et al. [90], balanced chromosomal rearrangements with inversions and translocations were identified in three patients. Gilissen et al. [67] identified 84 de novo CNVs and 82 SNVs in a cohort of 50 patients previously undiagnosed after ES, reaching a conclusive diagnosis in 21 patients (42%). Based on previously published data from large cohorts, these authors estimated the cumulative diagnostic yield of GS at 62%, including de novo SNVs (39%), de novo CNVs (21%) and recessive variants (2%) [67]. In a cohort of 244 children with ID/developmental delay (DD), Bowling et al. [91] identified 44 pathogenic and 10 likely pathogenic SNV/indel variants, and 5 pathogenic and 1 likely pathogenic CNVs, resulting in a diagnostic yield of 25%. Wang et al. [92] tested low-coverage whole genome sequencing to detect CNVs, and medical exome sequencing (MES), i.e. exome analysis of known ID disease-causing genes, to identify SNVs, in 95 patients with a negative CNV screening. Nineteen pathogenic CNVs in 16 patients (17%) and ten pathogenic SNVs in 8 patients (8%) were found [92]. GS is the most comprehensive genetic test, as it interrogates the whole genome [67]; however, improvements in the bioinformatics algorithms for variant detection and interpretation are needed. These improvements, together with a decrease in the associated costs, are crucial for the routine implementation of GS in diagnostic settings [93].

Variant filtering

Massive parallel sequencing raw data are usually generated in the FASTQ format. For each read, the file contains a sequence identifier, the nucleotide sequence and the corresponding quality values [94]. Reads are usually mapped to the hg19/GRCh37 or GRCh38 versions of the human reference genome, and the alignment results are typically reported in binary alignment map (BAM) format. BAM files contain information on the possible location of each read in the human genome [95]. After sequence alignment, variant calling identifies differences between the read sequences and the reference genome. Variants are usually reported in a variant call format (VCF) file. VCF files are composed of several lines, each corresponding to a genomic position [96]. Sophisticated algorithms are used to screen the information generated by genome sequencing, which raises inherent data storage and interpretation issues. Due to the intrinsic heterogeneity of ID, the use of guidelines is important. Figure 4 represents a simplified workflow to guide variant filtering.
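To make the "one line per genomic position" structure of a VCF file concrete, the following minimal Python sketch parses a single data line into its main fields. It assumes the standard eight tab-separated fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the record type, the function name and the example line are invented for illustration, and in practice a dedicated library would normally be used instead.

```python
from typing import NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: list   # one entry per alternate allele
    info: dict   # INFO key -> raw string value, or True for flag entries


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, sep, value = entry.partition("=")
        # Entries without '=' are flags (presence means true).
        info_dict[key] = value if sep else True
    return VcfRecord(chrom, int(pos), ref, alt.split(","), info_dict)


# Hypothetical data line, written only for this example.
example = "1\t12345\t.\tG\tA,T\t50\tPASS\tDP=42;AF=0.5,0.1;DB"
record = parse_vcf_line(example)
print(record.chrom, record.pos, record.ref, record.alts, record.info["DP"])
```

The INFO values are deliberately left as strings here, because their types are declared in the VCF header, which this sketch does not read.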

Fig. 4 Variant filtering flowchart. SNP – single nucleotide polymorphism; DGV – database of genomic variants; SNVs – single nucleotide variants; CNVs – copy number variants; SVs – structural variants; CSAS – canonical splicing acceptor site; CSDS – canonical splicing donor site; SAS – splicing acceptor site; SDS – Splicing donor site; Q-PCR – quantitative PCR

Variant coverage

Variants present in at least 20% of the reads, at positions with a minimum coverage of ten reads, should be considered, to reduce the prioritization of sequencing artefacts [93, 97]. Nevertheless, variants occurring in less than 1% of the reads can be identified when sufficient coverage is attained (e.g. 30–60 × for genome) [97]. The study by Rohlin et al. [98] suggests a high mosaicism detection rate when compared with other molecular techniques, although dependent on coverage levels. Using high-coverage targeted sequencing panels, Jamuar et al. [99] identified mosaic pathogenic variants in peripheral blood DNA from patients with brain malformations, the majority of which were undetected by conventional Sanger sequencing.
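The read-support rule described above can be expressed as a small filter. The sketch below is only an illustration of that rule: the function name and parameters are hypothetical, and the 20% figure is interpreted here as a lower bound on the fraction of reads carrying the variant.

```python
def passes_read_support(alt_reads: int, total_reads: int,
                        min_fraction: float = 0.20,
                        min_coverage: int = 10) -> bool:
    """Keep a variant only if the position is sufficiently covered and
    the variant is seen in a large enough fraction of the reads."""
    if total_reads < min_coverage:
        return False  # too few reads to trust the call
    return alt_reads / total_reads >= min_fraction


# Hypothetical read counts: (reads carrying the variant, total reads).
print(passes_read_support(6, 20))   # well covered, 30% of the reads
print(passes_read_support(3, 20))   # well covered, but only 15% of the reads
print(passes_read_support(5, 8))    # high fraction, but coverage below ten
```

A search for low-level mosaicism would need looser thresholds and much deeper coverage, so the defaults are exposed as parameters rather than hard-coded.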

Variant frequency

Variants causing uncommon and severe conditions are usually rare in the general population; therefore, variants with a frequency ≥ 1% can be excluded from further analysis, based on SNP resources (Ensembl [100], dbSNP [101] and gnomAD [102]) for SNVs and small indels, on the Database of Genomic Variants (DGV) [103] or DECIPHER [104] for CNVs, and on in-house databases. Exceptions are variants involved in rare oligogenic diseases, whose frequency can exceed 18% [105], and common variants (minor allele frequency, MAF ≥ 5%), generally located in non-coding regions [106]. Niemi et al. [107] studied a cohort of 6987 children with severe NDDs and showed that inherited common variants were responsible for 7.7% of risk variance. Databases focusing on regulatory elements in non-coding regions have emerged, such as ENCODE (http://www.encodeproject.org) [108] and Orion (http://www.genomic-orion.org) [109].
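As a minimal sketch of the frequency filter, the Python snippet below keeps variants whose population frequency is below 1% or that are absent from the reference database. The variant identifiers and frequencies are invented for the example, and the lookup of real frequencies (e.g. from gnomAD) is assumed to have happened beforehand.

```python
from typing import Optional


def is_rare(population_af: Optional[float], max_af: float = 0.01) -> bool:
    """Return True when a variant is rare enough to keep for analysis.

    None means the variant was not found in the population database,
    which is treated as rare.
    """
    return population_af is None or population_af < max_af


# Hypothetical annotated variants.
variants = [
    {"id": "var_a", "af": 0.00002},
    {"id": "var_b", "af": 0.12},
    {"id": "var_c", "af": None},
]

rare = [v["id"] for v in variants if is_rare(v["af"])]
print(rare)
```

Because of the exceptions mentioned in the text (oligogenic disease and common non-coding variants), a filter of this kind is better treated as a prioritization step than as a definitive exclusion.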

Variant percentage among reads

The inclusion of the putative ID Mendelian inheritance in the filtering strategy and variant prioritization may help to organize information and to reduce the number of candidate variants [110, 111]. For instance, homozygous variants are often associated with consanguinity, and therefore more common in inbred populations, and ID sporadic cases are frequently caused by autosomal dominant de novo pathogenic variants [78]. Ancestry is therefore relevant information to consider before prioritization [78, 79]. Homozygous variants usually show a > 80% variant allele frequency (VAF), whereas compound heterozygous variants show a VAF varying from 20 to 80% among reads.
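The VAF ranges above can be turned into a simple labelling rule. The following Python sketch applies the thresholds stated in the text; the function name and the label for values below 20% are assumptions made for the example, since the text does not name that category.

```python
def classify_by_vaf(alt_reads: int, total_reads: int) -> str:
    """Label a variant call from its variant allele frequency (VAF)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    vaf = alt_reads / total_reads
    if vaf > 0.80:
        return "homozygous"
    if vaf >= 0.20:
        return "heterozygous"
    # Below the 20% threshold: possible artefact or low-level mosaicism.
    return "below_threshold"


# Hypothetical read counts at three positions.
for alt, total in [(95, 100), (48, 100), (5, 100)]:
    print(alt, total, classify_by_vaf(alt, total))
```

Two heterozygous calls in the same gene are only candidates for compound heterozygosity; parental testing or phasing is still required to show that they lie on different alleles.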

Variant review

Candidate variants should be reviewed manually, using suitable software such as the Integrative Genomics Viewer (IGV) [112]. Although this is still debated, gold standard methodologies may be used to confirm variants [113, 114], such as Sanger sequencing for SNVs and genomic quantitative PCR (Q-PCR) for CNVs.

Variant deleteriousness categorization

We suggest sequential steps for assessing the functional impact of variants in ID, leading to variant classification in five categories: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign, according to the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) recommendations (Fig. 5) [115].

Fig. 5 Variant classification flowchart. ID – intellectual disability; CNS – central nervous system; MAF – minor allele frequency; LoF – loss of function. Adapted from Schuurs-Hoeijmakers et al. [116]

Known pathogenic variants in well-recognized ID genes, based on the data published in the ClinVar [117], ClinGen [118], OMIM [119, 120], and SysID [4] databases, should be prioritized first. Other aspects of the affected gene should then be considered: (i) implication in other disorders with central nervous system (CNS) impairment; (ii) levels of expression in the CNS/brain; (iii) interaction with other proteins implicated in ID; or (iv) biochemical function similar to that of other ID genes (Table 1). Variants predicted to seriously disrupt protein function (e.g. loss of function, LoF), with a MAF ≤ 1% and present in > 50% of the gene isoforms, should be considered next. When available, familial studies are used to confirm the segregation of each suitable candidate variant with the phenotype.
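A rough Python sketch of this ordering is given below. It is not the published flowchart: the three-tier scheme, the class and field names, and the example candidates are assumptions introduced only to show how the stated criteria (known pathogenic first, then LoF with MAF ≤ 1% affecting more than 50% of isoforms) could be combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str                 # placeholder gene label
    is_lof: bool              # predicted loss-of-function variant
    maf: float                # minor allele frequency (0.01 means 1%)
    isoforms_affected: int    # isoforms that contain the variant
    isoforms_total: int       # all annotated isoforms of the gene
    known_pathogenic: bool = False  # reported as pathogenic in a known ID gene


def priority_tier(c: Candidate) -> int:
    """Return 1 (review first), 2 (review next) or 3 (remaining variants)."""
    if c.known_pathogenic:
        return 1
    in_most_isoforms = c.isoforms_affected * 2 > c.isoforms_total  # > 50%
    if c.is_lof and c.maf <= 0.01 and in_most_isoforms:
        return 2
    return 3


# Hypothetical candidates.
candidates = [
    Candidate("GENE_A", is_lof=True, maf=0.0001, isoforms_affected=3, isoforms_total=4),
    Candidate("GENE_B", is_lof=False, maf=0.002, isoforms_affected=1, isoforms_total=1,
              known_pathogenic=True),
    Candidate("GENE_C", is_lof=True, maf=0.0001, isoforms_affected=1, isoforms_total=4),
]

ranked = sorted(candidates, key=priority_tier)
print([(c.gene, priority_tier(c)) for c in ranked])
```

The gene-level criteria (i) to (iv) and segregation data are not modelled here; they would refine the order within each tier.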

Table 1 Bioinformatic analysis databases and tools

In silico causality prediction

Causality ascertainment is challenging, particularly for missense variants [27]: despite improvements, in silico pathogenicity prediction tools reach an accuracy of about 80% [78]. In the work by Rauch et al. [27], two NAA10 variants were classified as pathogenic based on the expected protein effect and the patient’s phenotype, yet were predicted to be benign by in silico tools. A putative effect on splicing can be screened using tools such as SpliceSiteFinder-like (normal score threshold ≥70 for SDS and SAS) [127], MaxEntScan (normal score threshold ≥0 for SDS and SAS) [128], NNSPLICE (normal score threshold ≥0.4 for SDS and SAS) [129] or GeneSplicer (normal score threshold ≥0 for SDS and SAS) [130], while a Combined Annotation Dependent Depletion (CADD, http://cadd.gs.washington.edu/score) score ≥15 [131] can be used to predict gene disruption.
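The thresholds listed above can be collected in a small lookup table. The Python sketch below assumes that a splice-site score below the "normal" threshold of a tool suggests a weakened site, which is one reading of the text rather than a rule stated by the tools themselves; the function names and example scores are hypothetical.

```python
# Normal-score thresholds for splice donor/acceptor sites, as listed in the text.
NORMAL_SPLICE_THRESHOLDS = {
    "SpliceSiteFinder-like": 70.0,
    "MaxEntScan": 0.0,
    "NNSPLICE": 0.4,
    "GeneSplicer": 0.0,
}

CADD_CUTOFF = 15.0  # CADD score cut-off quoted in the text


def tools_below_normal(scores: dict) -> list:
    """Return, in alphabetical order, the tools whose score for the
    variant allele falls below their normal threshold."""
    return sorted(
        tool for tool, score in scores.items()
        if score < NORMAL_SPLICE_THRESHOLDS[tool]
    )


def cadd_flags_disruption(cadd_score: float) -> bool:
    """True when the CADD score reaches the cut-off."""
    return cadd_score >= CADD_CUTOFF


# Hypothetical scores for one variant allele.
variant_scores = {"SpliceSiteFinder-like": 62.0, "MaxEntScan": 3.1, "NNSPLICE": 0.1}
print(tools_below_normal(variant_scores))
print(cadd_flags_disruption(22.4))
```

Requiring agreement between several tools, and comparing the variant score with the reference-allele score, is a common way to reduce false positives from any single predictor.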

Replication studies

Gathering unrelated patients with a similar phenotype who carry putative deleterious variants in the same gene, i.e. replication, is crucial to identify new ID genes. Nevertheless, assembling patients who meet these criteria is problematic, particularly in rare ID syndromes. To overcome this bottleneck, several open-access online platforms allow data sharing:

(i) GeneMatcher (https://genematcher.org) [132, 133],
(ii) Human Disease Genes website series (http://humandiseasegenes.info) [134],
(iii) PhenomeCentral (https://www.phenomecentral.org) [135],
(iv) Leiden Open Variation Database (LOVD, https://www.lovd.nl) [136],
(v) ClinVar (https://www.ncbi.nlm.nih.gov/clinvar) [117], and
(vi) Solve-RD - solving the unsolved rare diseases (https://solve-rd.eu) [137], among others.

Model organisms

In vivo and in vitro studies are particularly important to disclose the deleteriousness of ambiguous or novel variants, as well as to implicate new genes in ID phenotypes. The implementation of ID functional studies, using model organisms or patient-derived tissues or cells, is, however, complex in a diagnostic facility [78]. Since the 1980s and 1990s, models have been used to understand the mechanisms of monogenic ID disorders, as orthologous genes are involved in evolutionarily conserved biological processes [138]. Simple organisms with short life cycles that allow genetic manipulation can easily give insights into several biological processes [139]. Several model organisms and the corresponding seminal ID studies are described next.

Yeast

Yeast has been considered a valuable ID model following the advances in knowledge of “autophagy”, a mechanism compromised in neurological disorders [140]. Saccharomyces cerevisiae, with 23% homology with human genes [141], shares evolutionarily conserved key elements with neurons, e.g. budding or mating in yeast parallels neurite outgrowth or spinogenesis in neurons [142]. Yeast models were used to define (i) the function of septin in the differentiation and compartmentalization of neurons [143], (ii) the role of the MED12-complex in transcriptional regulation [144], and (iii) the mechanisms underlying mitochondrial disorders [145]. Furthermore, yeast has been used to study aging mechanisms and age-associated neurodegenerative disorders (reviewed by Ruetenik et al. [146]).

Caenorhabditis elegans

The nematode Caenorhabditis elegans has also been widely used as a model for neurodevelopmental disorders [147]. It has approximately 41% homology with human genes, a short life cycle, easy cultivation and an entirely accessible nervous system structure [148]. C. elegans has proved to be a very valuable model to study crucial processes, such as cell birth and diversification, cell migration, morphogenesis and pathfinding, synapse formation, and neurite/synapse sorting maintenance and plasticity (reviewed by Rapti [147]). The use of C. elegans has brought important insights into human nervous system disorders, such as epilepsy, autism spectrum disorder (ASD) and ID (reviewed by Bessa et al. [149]).

Drosophila melanogaster

The identification of conserved genes and pathways in Drosophila melanogaster (with 75% homology to human genes) goes back to the end of the 1970s [150, 151]. The genes involved in wing development and patterning contributed to the characterization of pathways and mechanisms responsible for skeletal and craniofacial abnormalities in humans [138]. Drosophila is a reference model in ID and ASD, as its neuromuscular junction shows structural, morphologic, and functional similarities to human synapses [152]. Drosophila allows the study of subcellular events, such as synapses and dendritic complexity, neurotransmission and circuit connectivity, neuronal activity and physiology, brain morphology, and behaviour alterations such as learning and social interaction issues [153], which makes it a valuable and complete model to understand those disorders. Some human genes do not have a homologue in Drosophila, in which case vertebrate models, such as zebrafish and mice, are useful.

Zebrafish

Zebrafish, which shares 70% of its genomic content with humans [154] and has similar CNS structures, such as the hippocampus, diencephalon, tectum and tegmentum, and cerebellum, has emerged as an important model both to study disease and to test potential therapeutic solutions [155]. Zebrafish has a short reproductive cycle, transparent embryos and larvae, and an easily accessible central nervous system [156]. It has been used to recapitulate some of the clinical features of neurodevelopmental disorders: (i) behaviour, such as hypoactivity and hyperactivity, hyperexcitability, impulsiveness, aggressiveness, circadian disturbances, and schizophrenia; (ii) cognitive, learning and memory deficits, and structural abnormalities; or (iii) physical, such as microcephaly, macrocephaly and microphthalmia. Zebrafish has been widely used as a model for ASD, attention deficit hyperactivity disorder (ADHD), ID and schizophrenia-like phenotypes (reviewed by De Abreu et al. [157]).

Mice

ID and NDD research, including the development of innovative therapies, is anchored in mouse studies, due to the similarity (90%) between the mouse and human genomes [158]. Pivotal studies have addressed: (i) biochemical alterations, such as the Mecp2-related deficit in the gamma-aminobutyric acid (GABA) and glutamate synthesis pathway [159], and the imbalance of brain metabolites in the hippocampus of Fmr1 KO mice during the developmental period of synaptogenesis and early myelination [160]; (ii) changes in synaptic morphology and function, such as the early maturation of spines associated with Syngap1 [161], and the decrease of dopamine autoreceptors in Mecp2 KO mice [162]; and (iii) behavioural issues, such as social impairment, communication problems, repetitive behaviour and resistance to change in routine, cognition, memory, and learning.

Patient-derived cellular models

The brain is an inaccessible organ in living humans, whereas post-mortem tissue gives information mostly on the end-stage of a disease, contributing little to the understanding of early brain development or impairment [163]. ID genes are differentially expressed during brain development, and thus the impact of variants in such genes should be assessed at the suitable stage of maturity [164]. Cellular models to study monogenic ID disorders, such as human-induced pluripotent stem cells (hiPSCs), have emerged as an alternative to animal models [165].

Human-induced pluripotent stem cells

The differentiation of hiPSCs allows the generation of somatic cells, including human neurons at early developmental stages. Patient-derived fibroblasts can be reprogrammed into iPSCs using the “OSKM” factors (Oct3/4, Sox2, Klf4, and c-Myc), and then differentiated into highly pure populations of glutamatergic, GABAergic, dopaminergic, serotonergic or motor neurons, astrocytes, or oligodendrocytes, depending on the transcription factor used [163, 166]. The simultaneous culture of two or more cell types is possible, allowing a physiological contextualization and recapitulation of human biological systems [167]. hiPSC models have been used in ID-related disorders, such as Rett, Fragile-X, Dravet, Phelan-McDermid, Miller Dieker, Angelman, Prader-Willi, Timothy, Williams-Beuren and Lowe syndromes, Friedreich’s ataxia, Alexander and Pelizaeus-Merzbacher diseases, primary microcephaly and X-linked adrenoleukodystrophy (reviewed by Sabitha et al. [168]). The duration of the procedure and the expertise needed are some of the limitations [169]. Additionally, phenomena such as genetic instability and epigenetic alterations leading to changes in gene expression can occur during the reprogramming procedure and hamper the interpretation of results [170]. Furthermore, due to their incomplete maturation, hiPSCs do not recapitulate behavioural phenotypes, the influence of environmental factors or late-onset diseases [171].

Induced neurons

Induced neurons (iNeurons) have been shown to be a promising alternative to hiPSCs, as they preserve the age-related epigenetic landscape of the original somatic cell. iNeurons were first developed in 2010, generated from mouse embryonic fibroblasts using the transcription factors Ascl1, Brn2, and Myt1l (BAM pool) [172]. To avoid invasive sample collection, such as a skin biopsy, Tanabe et al. [173] described a method to generate neurons by reprogramming blood nuclear cells (blood iNs). Nevertheless, the required co-culture with mouse glia complicates the interpretation of the results, as these cells can distort neuronal morphologies [171].

Genome editing using CRISPR platforms

Genome editing systems such as Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) are indispensable tools in biological research [174, 175]. The key to the CRISPR mechanism is the association of a guide RNA (gRNA) with the Cas9 protein. While the gRNA, which carries a 20-nt targeting sequence, recognizes DNA sites by base pairing, Cas9 cleaves the DNA by inducing double-strand breaks (DSBs), which activate DNA repair mechanisms such as nonhomologous end joining (NHEJ) or homology-directed repair (HDR) [174, 176]. Several CRISPR/Cas9-based studies have been carried out in hiPSCs, showing their efficiency and potential (reviewed by Ben Jehuda et al. [176]). Combining CRISPR with hiPSCs allows analysis in a donor-independent manner, overcoming the heterogeneity often observed among hiPSC lines due to genetic background, epigenetic factors or other inter-individual differences [164, 168]. One limitation of the CRISPR/Cas9 editing system is off-target effects, i.e. Cas9 binds and cleaves unintended genomic sites [164, 177]. “Prime editing”, which combines Cas9 with a reverse transcriptase, allows genome editing without the collateral effects of double-strand DNA breaks [178]. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) have been developed as alternatives to earlier genome editing platforms. CRISPRi/a are based on a catalytically inactive (“dead”) nuclease (dCas9), allowing gene expression to be repressed or activated at the transcriptional level [179]. These tools, so far applicable to Mendelian disorders, call for recommendations and guidelines to ensure that human genome editing is used ethically and safely.
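The base-pairing recognition and the off-target problem described above can be illustrated with a minimal sketch. It assumes the commonly used Streptococcus pyogenes Cas9, which requires an NGG protospacer-adjacent motif (PAM) immediately 3' of the 20-nt target; the PAM requirement is not discussed in this review, and the function names and the example sequence are hypothetical. Only the forward strand is scanned, and real guide design tools apply far more elaborate scoring.

```python
import re


def find_target_sites(sequence: str, guide_len: int = 20) -> list[tuple[int, str, str]]:
    """Return (start, protospacer, pam) for every guide_len-nt site that is
    directly followed by an NGG PAM on the forward strand.

    Assumption: SpCas9-style NGG PAM; other Cas variants use other PAMs.
    """
    seq = sequence.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


def count_mismatches(guide: str, site: str) -> int:
    """Count positions at which a guide and a genomic site differ.

    Sites with few mismatches to the guide are candidate off-targets.
    """
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(a != b for a, b in zip(guide.upper(), site.upper()))


if __name__ == "__main__":
    # Hypothetical sequence: one 20-nt protospacer followed by a TGG PAM.
    example = "ACGTACGTACGTACGTACGT" + "TGG"
    for start, protospacer, pam in find_target_sites(example):
        print(start, protospacer, pam)
    print(count_mismatches("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA"))
```

In this simplified view, an off-target search amounts to scanning the genome for PAM-adjacent sites whose mismatch count to the chosen guide falls below some threshold.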

Concluding remarks

The diagnostic approach in a medical genetics setting begins with the observation and categorization of the clinical features [180]. The Human Phenotype Ontology Project (HPO; https://hpo.jax.org/app/) provides a standardized set of terms and codes for signs, symptoms, and other phenotypic manifestations, contributing to accurate phenotyping. By adopting this terminology, clinical data can be shared and integrated across the scientific and medical communities [121, 181], guiding geneticists towards the definition of the ID diagnosis strategy and the identification of the molecular defect. Although genotype-phenotype correlation remains complex, new ID classification systems have emerged. Kochinke et al. [4] developed a phenotype-based bipartite clinical classification system that interprets the phenotypic heterogeneity characteristic of monogenic ID. Recently, Biesecker et al. [182] suggested a syndrome definition based on the affected gene and the phenotypic description. Using the ID gene GLI3 as an example, a clear and simple description of several heterogeneous diseases would be GLI3-related Pallister-Hall syndrome or GLI3-related Greig cephalopolysyndactyly syndrome.
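As a minimal sketch of why a shared terminology makes clinical data comparable, two patients can be represented as sets of HPO term identifiers and compared by set overlap. The Jaccard index used here is a deliberately simple stand-in and is not a method described in this review; ontology-aware semantic similarity measures are more informative in practice. The two patient profiles are hypothetical, and the identifiers should be checked against the current HPO release.

```python
def jaccard(terms_a: set[str], terms_b: set[str]) -> float:
    """Jaccard index of two sets of HPO term identifiers (0.0 if both are empty)."""
    if not terms_a and not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


# Hypothetical patients, coded with HPO identifiers (labels given as comments).
patient_a = {
    "HP:0001249",  # Intellectual disability
    "HP:0001250",  # Seizure
    "HP:0000252",  # Microcephaly
}
patient_b = {
    "HP:0001249",  # Intellectual disability
    "HP:0000252",  # Microcephaly
    "HP:0000717",  # Autism
}

if __name__ == "__main__":
    shared = sorted(patient_a & patient_b)
    print("shared terms:", shared)
    print("Jaccard similarity:", jaccard(patient_a, patient_b))
```

Here the two profiles share two of four distinct terms, giving a similarity of 0.5. Free-text descriptions of the same features would not be comparable in this way.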

The literature indicates that high ID diagnostic yields are attained when a detailed clinical evaluation is followed by a sequential testing strategy using validated methods: analysis of numeric and structural chromosomal abnormalities, FXS testing, investigation of the MECP2 (in females) and PTEN (in the presence of ASDs with macrocephaly) genes [183], CNV screening by CMA [184] and exome sequencing [87, 185]. As illustrated in Fig. 6, the ID diagnostic yield depends on the technology used, the presence and variability of other clinical features, and the inheritance pattern. De Brouwer et al. [186] demonstrated the importance of deep, accurate and homogeneous phenotyping, diagnosing 42% of patients with XLID. Najmabadi et al. [76] combined microarray analysis and massively parallel sequencing, with a diagnostic yield of 57% in consanguineous families. Nevertheless, caution is warranted when comparing data from different studies, and special attention should be paid to the heterogeneity of the clinical descriptions and to putative bias in patient ascertainment.

Fig. 6

ID diagnostic yield. Rate of ID diagnosis in different studies, indicated by the name of the first author and the year of publication. Coloured rectangles correspond to the methodology used in each study [2, 5, 21, 24, 26,27,28, 67, 76, 77, 85,86,87, 91, 92, 186,187,188,189,190,191,192]. FISH - fluorescence in situ hybridization; CMA - chromosomal microarray; TNGS - targeted NGS; ES - exome sequencing; GS - genome sequencing

The ID diagnosis strategy should also include systematic reanalysis of previously generated data in light of new knowledge [193, 194], e.g. database updates, newly identified disease genes, new clinical features and molecular information [195]. This is a clear advantage of ES and GS over TNGS. Reanalysis of ES data from 1133 children with severe developmental disorders and their parents increased the diagnostic yield from 27% (2015) to 40% (2018) [85, 86].
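The size of this reanalysis gain can be made explicit with a short calculation. The sketch below uses only the figures quoted above (1133 children, 27% and 40%); the case counts it derives are back-calculated from the rounded percentages and are therefore approximations, not numbers reported by the cited studies. The function names are hypothetical.

```python
def yield_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100
    return absolute, relative


def approx_cases(cohort_size: int, yield_pct: float) -> int:
    """Approximate number of diagnosed cases implied by a rounded yield."""
    return round(cohort_size * yield_pct / 100)


if __name__ == "__main__":
    absolute, relative = yield_gain(27.0, 40.0)
    print(f"absolute gain: {absolute:.0f} percentage points")
    print(f"relative gain: {relative:.0f}%")
    # Approximate counts only: derived from rounded percentages.
    extra = approx_cases(1133, 40.0) - approx_cases(1133, 27.0)
    print(f"approximate additional diagnoses: {extra}")
```

The yield thus rose by 13 percentage points, a relative increase of roughly 48%, without any new sequencing being performed.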

To date, the ID diagnostic yield remains low, and the identification by GS of previously undetected variants in non-coding regions is expected to clarify some hitherto molecularly undiagnosed ID cases. Moreover, the recent development of long-read sequencing (LRS), namely single-molecule real-time (SMRT) sequencing, using PacBio sequencing (Pacific BioSciences, Menlo Park, CA, USA) [196], and nanopore sequencing, using the MinION instrument (Oxford Nanopore Technologies, Oxford, UK) [197], will fill the gaps left by massively parallel sequencing, providing long reads (over 10 kb) and reducing alignment and mapping errors. LRS improves the identification of structural variants, such as large inversions and translocations, and of pseudogenes; it allows precise sequencing of long tandem repeat expansions and GC-rich regions; and it improves variant phasing, allowing the simultaneous establishment of parental origin, inheritance patterns, and disease risk haplotypes [198].

While the current variant classification guidelines combine functional and clinical data, the stepwise ABC system proposed by Houge et al. [199] suggests a sequential combination of (A) functional and (B) clinical grades and, optionally, (C) selection of the standard comment(s) that best address the clinical question. To guide clinicians in assessing variant significance, the ABC system can be used on its own or as an add-on to the ACMG/AMP classification. This highlights the need for, and the importance of, crosstalk between clinicians and laboratory geneticists to guide genetic investigation, to establish (novel) genotype-phenotype correlations and ultimately to understand the mechanisms underlying the diseases.

The current challenge is the evaluation of the pathogenicity of variants, rather than their identification. For this purpose, multidisciplinary international research collaborations must be established. Ideally, a “rapid” functional test able to study several genes in a diagnostic setting might help to overcome this issue. This could represent an important step in translating these insights into future applications that will improve personalized patient support, care and treatment.